1
Chapter 5
Profile HMMs for Sequence Families
2
What have we done?
• So far, we have concentrated on the intrinsic properties of single sequences (CpG islands), or on pairwise alignment of sequences
• Functional biological sequences typically come in families
• Many of the most powerful sequence analysis methods are based on identifying the relationship of an individual sequence to a sequence family
3
Sequence Families
• Sequences in a family have diverged from each other in their primary sequence during evolution, but normally maintain the same or a related function
• Thus, identifying that a sequence belongs to a family, and aligning it to the other members, often tells us about its function
4
Sequence Family
• If we have a set of sequences belonging to a family, we can perform a database search for more members using pairwise alignment with one of the known family members as the query sequence
• We could even search with the known members one by one
• However, pairwise searching with any one of the members may not find sequences distantly related to the ones we have already
5
Sequence Family
• An alternative approach is to use statistical features of the whole set of sequences in the search
• Similarly, even when family membership is clear, accurate alignment can often be improved significantly by concentrating on features that are conserved in the whole family
6
Sequence Family
• A pairwise alignment captures much of the relationship between two sequences
• A multiple alignment can show how the sequences in a family relate to each other
7
Sequences From A Globin Family
8
Globin Family
• It is clear that some positions in the globin alignment are more conserved than others
• The helices are more conserved than the loop regions between them
• Certain residues are particularly strongly conserved (two conserved histidines H)
• When identifying a new sequence as a globin, it would be desirable to concentrate on checking that these more conserved features are present
9
HMM Profile
• Consensus modeling of the family using a probabilistic model
• Built from a given multiple alignment (assumed to be correct!)
• Develop a particular type of hidden Markov model well suited to modelling multiple alignments
• We call these profile HMMs
10
HMM Profile
• We will assume that we are given a correct multiple alignment, from which we will build a model that can be used to find and score potential matches to new sequences
11
Ungapped Score Matrices
• A natural probabilistic model for a conserved region would be to specify independent probabilities e_i(a) of observing amino acid a in position i
• The probability of a new sequence x according to this model is

$$P(x \mid M) = \prod_{i=1}^{L} e_i(x_i)$$
12
Log-odds ratio
• We are interested in the ratio of this probability to the probability of x under the random model
• The values log(e_i(a)/q_a) behave like elements of a score matrix s(a,b), where the second index is the position i rather than an amino acid b

$$S = \sum_{i=1}^{L} \log \frac{e_i(x_i)}{q_{x_i}}$$

Position-specific score matrix (PSSM)
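As an illustration (not from the slides), here is a minimal Python sketch of PSSM log-odds scoring; the emission probabilities and background frequencies are made-up toy values.

```python
import math

# Toy position-specific emission probabilities e_i(a) for a length-3 model,
# and background frequencies q_a (hypothetical values for illustration).
emissions = [
    {"A": 0.6, "C": 0.1, "G": 0.2, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def pssm_score(x):
    """Log-odds score S = sum_i log(e_i(x_i) / q_{x_i}) of an ungapped match."""
    assert len(x) == len(emissions)
    return sum(math.log(emissions[i][a] / background[a]) for i, a in enumerate(x))

print(pssm_score("ACG"))  # positive: fits the model better than the background
print(pssm_score("TGA"))  # negative: fits the background better
```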
13
Adding Insert and Delete States
• While a PSSM captures some conservation information, it is clearly an inadequate representation of all the information in a multiple alignment of a protein family
• We have to find some way to take account of gaps
14
Components of profile HMMs
• Consideration of gaps
– Henikoff & Henikoff [1991]
• Combining the multiple ungapped block models
– Allowing gaps at each position, but using the same gap score (g) at every position (where are gaps more or less likely?)
• Profile HMMs
– Repetitive structure of states
– Different probabilities in each position
– A full probabilistic model for sequences in the sequence family
15
Components of profile HMMs
• The PSSM can be viewed as a trivial HMM with a series of identical states that we will call match states, separated by transitions of probability 1
• Alignment is trivial since there is no choice of transitions
• Match states
– Emission probabilities e_{M_i}(a)
[Diagram: Begin → M_1 → … → M_j → … → End]
16
Ungapped profiles and the corresponding HMMs
[Diagram: ungapped model Beg → M_1 → … → M_j → … → End]

Example: three training sequences

AGAAACT
AGGAATT
TGAATCT

P(AGAAACT) = 16/81, P(TGGATTT) = 1/81

      1    2    3    4    5    6    7
A    2/3   0   2/3   1   2/3   0    0
T    1/3   0    0    0   1/3  1/3   1
C     0    0    0    0    0   2/3   0
G     0    1   1/3   0    0    0    0

Each blue square represents a match state that “emits” each letter with a certain probability e_j(a), defined by the frequency of a at position j.
Typically, pseudo-counts are added in HMMs to avoid zero probabilities.
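A minimal Python check of the slide’s numbers (my own sketch, not from the source); it rebuilds the emission table from the three training sequences and reproduces P(AGAAACT) = 16/81:

```python
from fractions import Fraction

# Emission table e_j(a) estimated from the three training sequences on the
# slide (no pseudocounts here, so unseen residues get probability zero).
training = ["AGAAACT", "AGGAATT", "TGAATCT"]
L = len(training[0])

e = [{a: Fraction(sum(s[j] == a for s in training), len(training))
      for a in "ACGT"} for j in range(L)]

def prob(x):
    """P(x | M) = product over positions of the match emission e_j(x_j)."""
    p = Fraction(1)
    for j, a in enumerate(x):
        p *= e[j][a]
    return p

print(prob("AGAAACT"))  # 16/81, as on the slide
print(prob("TGGATTT"))  # 1/81
```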
17
Insertions and deletions in profile HMMs
[Diagram: Beg → M_j → End with insert states I_j above the match states]

Insert states emit symbols just like the match states; however, the emission probabilities are typically assumed to follow the background distribution and thus do not contribute to log-odds scores.

Transitions I_j → I_j are allowed and account for an arbitrary number of inserted residues that are effectively unaligned (their order within an inserted region is arbitrary).
18
Components of profile HMMs
• Insert states
– Emission probabilities e_{I_i}(a)
• Usually the background distribution q_a
– Transition probabilities
• M_i to I_i, I_i to itself, I_i to M_{i+1}
– Log-odds score of a gap of length k (no log-odds contribution from the emissions):

$$\log a_{M_j I_j} + \log a_{I_j M_{j+1}} + (k-1)\log a_{I_j I_j}$$

[Diagram: Begin → M_j → End with insert states I_j]
19
Insertions and deletions in profile HMMs
[Diagram: Beg → M_j → End with delete states D_j above the match states]

Deletions are represented by silent states which do not emit any letters. A sequence of deletions (with D → D transitions) may be used to connect any two match states, accounting for segments of the multiple alignment that are not aligned to any symbol in a query sequence (string).

The total cost of a deletion is the sum of the costs of the individual transitions (M→D, D→D, D→M) that define this deletion. As in the case of insertions, both linear and affine gap penalties can be easily incorporated in this scheme.
20
Components of profile HMMs
• Delete states
– No emission probabilities
– Cost of a deletion
• M→D, D→D, D→M
• Each D→D transition might have a different cost, whereas each I→I within one insert state has the same cost
[Diagram: Begin → M_j → End with delete states D_j]
21
Gap penalties: evolutionary and computational considerations
• Linear gap penalties:

$$\gamma(k) = -kd$$

for a gap of length k and constant d
• Affine gap penalties:

$$\gamma(k) = -[\,d + (k-1)e\,]$$

where d is the gap-open penalty and e the gap-extension penalty.
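The correspondence between these penalties and HMM transition probabilities can be made concrete: the log-odds score of a length-k insertion from slide 18 is affine in k. A minimal Python check (my own sketch, with hypothetical transition probabilities):

```python
import math

# Hypothetical transition probabilities around one insert state.
a_MI, a_II, a_IM = 0.1, 0.4, 0.6

def insert_score(k):
    """Log-odds score of a length-k insertion (emissions cancel against
    the background): log a_MI + (k-1) log a_II + log a_IM."""
    return math.log(a_MI) + (k - 1) * math.log(a_II) + math.log(a_IM)

# The equivalent affine gap penalty gamma(k) = -[d + (k-1) e]:
d = -(math.log(a_MI) + math.log(a_IM))  # gap-open cost
e = -math.log(a_II)                      # gap-extension cost
for k in (1, 2, 5):
    assert math.isclose(insert_score(k), -(d + (k - 1) * e))
```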
22
Components of profile HMMs
• Combining all parts
[Diagram: Begin → M_j → End, with insert states I_j and delete states D_j at every position]
Figure 5.2 The transition structure of a profile HMM.
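As a sketch (not from the slides), the transition structure of Figure 5.2 can be enumerated programmatically; the function name and state encoding below are my own.

```python
def profile_hmm_edges(L):
    """Enumerate the transition structure of Figure 5.2 for a length-L model.
    States are (kind, position) pairs; Begin is treated as M_0 and End as
    M_{L+1}, following the usual convention."""
    edges = []
    for j in range(L + 1):                 # j = 0 is Begin, j = L is the last column
        for kind in ("M", "I", "D"):
            if kind == "D" and j == 0:
                continue                   # there is no D_0 state
            edges.append(((kind, j), ("I", j)))          # into the insert state (I->I loops)
            edges.append(((kind, j), ("M", j + 1)))      # to the next match state / End
            if j + 1 <= L:
                edges.append(((kind, j), ("D", j + 1)))  # to the next delete state
    return edges

# Each edge needs a transition probability a_{kl}; each M_j / I_j state also
# needs an emission distribution e(a).
for edge in profile_hmm_edges(3)[:6]:
    print(edge)
```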
23
Profile HMMs as a model for multiple alignments
[Diagram: full model with Beg, M_j, I_j, D_j, End]

Example (match columns starred):
AG---C
A-AG-C
AG-AA-
--AAAC
AG---C
**   *
24
25
Profile HMMs generalize pairwise alignment
• M_j corresponds to the pair-HMM match state M, I_j to X (insertion), D_j to Y (deletion)
• e_{M_i}(a) = p_{y_i a}/q_{y_i}: the conditional probability of seeing a given y_i in a pairwise alignment
• a_{M_i I_i} = a_{M_i D_{i+1}} = δ,  a_{I_i I_i} = a_{D_i D_{i+1}} = ε
26
Deriving Profile HMMs From Multiple Alignments
• The key idea behind profile HMMs is that we can use the same structure as shown in Figure 5.2, but set the transition and emission probabilities to capture specific information about each position in the multiple alignment of the whole family
• Essentially, we want to build a model representing the consensus sequence for a family, rather than the sequence of any particular member
• Non-probabilistic profiles and profile HMMs
27
Non-probabilistic Profiles
• A model similar to the profile HMM was first introduced by Gribskov, McLachlan and Eisenberg 1987
• No underlying probabilistic model; instead, position-specific scores were assigned for each match state, along with gap penalties
• The score for each consensus position is set to the average of the standard substitution scores from all the residues in the corresponding multiple sequence alignment column
28
HMMs from multiple alignments
• Key idea behind profile HMMs– Model representing the consensus for the family– Not the sequence of any particular member
HBA_HUMAN  ...VGA--HAGEY...
HBB_HUMAN  ...V----NVDEV...
MYG_PHYCA  ...VEA--DVAGH...
GLB3_CHITP ...VKG------D...
GLB5_PETMA ...VYS--TYETS...
LGB2_LUPLU ...FNA--NIPKH...
GLB1_GLYDI ...IAGADNGAGV...
              ***  *****

Figure 5.3 Ten columns from the multiple alignment of seven globin protein sequences shown in Figure 5.1. The starred columns are the ones that will be treated as ‘matches’ in the profile HMM.
29
Non-probabilistic Profiles
s(a,b) : standard substitution matrix
The score for residue ‘a’ in column 1
30
HMMs from multiple alignments
• Non-probabilistic profiles
– Gribskov, McLachlan & Eisenberg [1987]
• Score for residue a in column 1 (which contains five V, one F, and one I):

$$\frac{5}{7}\, s(a,\mathrm{V}) + \frac{1}{7}\, s(a,\mathrm{F}) + \frac{1}{7}\, s(a,\mathrm{I})$$

– Disadvantages
• Strongly conserved regions may be corrupted by the averaging
• The intuition of a likelihood cannot be maintained
• The scores for gaps do not behave as expected
31
Non-probabilistic Profiles
• They also set gap penalties for each column using a heuristic equation that decreases the cost of a gap according to the length of the longest gap observed in the multiple alignment spanning the column
32
Problem With The Approach
• If we had an alignment with 100 sequences, all with a cysteine (C) at some position, the probability distribution for that column of an “average” profile would be exactly the same as one derived from a single sequence
• Doesn’t correspond to our expectation that the likelihood of a cysteine should go up as we see more confirming examples
33
Similar Problem With Gaps
Scores for a deletion in columns 2 and 4 would be set to the same value
More reasonable to set the probability of a new gap opening to be higher in column 4
34
Basic Profile HMM Parameterization
• HMM profiles have emission and transition probabilities
• Assuming that these probabilities are non-zero, a profile HMM can model any possible sequence of residues from the given alphabet
• A profile HMM defines a probability distribution over the whole space of sequences
• The aim of parameterization is to make this distribution peak around members of the family
• Parameters: probabilities and the length of the model (control the shape of the distribution)
35
Model Length
• The choice of length of the model corresponds more precisely to a decision on which multiple alignment columns to assign to match states, and which to assign to insert states
• The consensus sequence in Figure 5.3 should only have 8 residues, and the two non-starred residues in GLB1_GLYDI should be treated as an insertion with respect to the consensus
• We should decide which columns correspond to match states, and which to inserts
• A simple rule that works well in practice is that columns with more than half gap characters should be modeled by inserts
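A minimal Python sketch of this rule (my own illustration), applied to the Figure 5.3 columns:

```python
def match_columns(alignment, gap="-"):
    """Heuristic model construction: a column is assigned to a match state
    unless more than half of its characters are gaps."""
    n = len(alignment)
    L = len(alignment[0])
    return [j for j in range(L)
            if sum(seq[j] == gap for seq in alignment) <= n / 2]

# The ten globin columns of Figure 5.3.
alignment = ["VGA--HAGEY", "V----NVDEV", "VEA--DVAGH", "VKG------D",
             "VYS--TYETS", "FNA--NIPKH", "IAGADNGAGV"]
print(match_columns(alignment))
# [0, 1, 2, 5, 6, 7, 8, 9] -- the 8 starred columns; columns 3 and 4
# (0-based), which are mostly gaps, become insert columns.
```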
36
Probability Values
$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad\qquad e_k(a) = \frac{E_k(a)}{\sum_{a'} E_k(a')}$$

k, l : indices over states
a_{kl} and e_k : transition and emission probabilities
A_{kl} and E_k : transition and emission frequencies
37
Problem With The Approach
• Transitions and emissions that don’t appear in the training dataset would acquire zero probability (would never be allowed)
• Solution: add pseudocounts to the observed frequencies
• The simplest pseudocount method is Laplace’s rule: add one to each frequency
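A minimal sketch of count-based estimation with Laplace’s rule (my own illustration; the example column is column 1 of Figure 5.3):

```python
from collections import Counter

def emission_probs(column, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1):
    """Estimate e(a) for one match column from observed counts, adding a
    pseudocount to every residue type (Laplace's rule for pseudocount=1)."""
    counts = Counter(column)
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

# Column 1 of the Figure 5.3 alignment: five V, one F, one I.
probs = emission_probs("VVVVVFI")
print(round(probs["V"], 3))  # (5 + 1) / (7 + 20) ~ 0.222
print(round(probs["W"], 3))  # (0 + 1) / 27 ~ 0.037 -- never zero
```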
38
Example
39
Example: Full Profile HMM
40
Searching With Profile HMMs
• One of the main purposes of developing profile HMMs is to use them to detect potential membership in a family by obtaining significant matches of a sequence to the profile HMM
• We assume that we are looking for global matches
• We can use either the Viterbi algorithm to get the most probable alignment or the forward algorithm to calculate the full probability of the sequence summed over all possible paths
41
Searching with profile HMMs
• Maintaining the log-odds ratio compared with the random model

$$P(x \mid R) = \prod_i q_{x_i}$$
42
Viterbi Equations
• V^M_j(i) is the log-odds score of the best path matching subsequence x_{1..i} to the submodel up to state j, ending with x_i being emitted by state M_j
• V^I_j(i) is the score of the best path ending in x_i being emitted by I_j
• V^D_j(i) is the score of the best path ending in state D_j
43
Viterbi Algorithm
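The recurrences on this slide survive only as an image; reconstructed here in the standard form given by Durbin et al. (their eq. 5.4):

$$V^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max \begin{cases} V^M_{j-1}(i-1) + \log a_{M_{j-1}M_j} \\ V^I_{j-1}(i-1) + \log a_{I_{j-1}M_j} \\ V^D_{j-1}(i-1) + \log a_{D_{j-1}M_j} \end{cases}$$

$$V^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max \begin{cases} V^M_j(i-1) + \log a_{M_j I_j} \\ V^I_j(i-1) + \log a_{I_j I_j} \\ V^D_j(i-1) + \log a_{D_j I_j} \end{cases}$$

$$V^D_j(i) = \max \begin{cases} V^M_{j-1}(i) + \log a_{M_{j-1}D_j} \\ V^I_{j-1}(i) + \log a_{I_{j-1}D_j} \\ V^D_{j-1}(i) + \log a_{D_{j-1}D_j} \end{cases}$$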
44
Viterbi Equations
• In a typical case, there is no emission score e_{I_j}(x_i) in the equation for V^I_j(i), since we assume that the emission distribution from the insert state I_j is the same as the background distribution, so the probabilities cancel in the log-odds form
• Also, the D→I and I→D transition terms may not be present
45
Forward Algorithm
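This slide’s equations also survive only as an image; in the standard form of Durbin et al. (their eq. 5.5), the forward recurrences replace the max of the Viterbi case with the log of a sum:

$$F^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \log\!\left[ a_{M_{j-1}M_j} e^{F^M_{j-1}(i-1)} + a_{I_{j-1}M_j} e^{F^I_{j-1}(i-1)} + a_{D_{j-1}M_j} e^{F^D_{j-1}(i-1)} \right]$$

$$F^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \log\!\left[ a_{M_j I_j} e^{F^M_j(i-1)} + a_{I_j I_j} e^{F^I_j(i-1)} + a_{D_j I_j} e^{F^D_j(i-1)} \right]$$

$$F^D_j(i) = \log\!\left[ a_{M_{j-1}D_j} e^{F^M_{j-1}(i)} + a_{I_{j-1}D_j} e^{F^I_{j-1}(i)} + a_{D_{j-1}D_j} e^{F^D_{j-1}(i)} \right]$$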
46
Variants for non-global alignments
• We can generalize profile HMMs for local, repeat and overlap alignments
• The profile HMM for local alignment specifies a new model for the complete sequence x, which incorporates the original profile HMM together with one or more copies of a simple self-looping model that is used to account for the regions of unaligned sequence
47
Variants for non-global alignments
• Notice that as well as specifying the emission probabilities of the new states, which will normally be the background probabilities q_a, we must specify a number of new transition probabilities
• The looping probability on the flanking states should be close to 1, since they must account for long stretches of sequence
• Let us set these looping probabilities to (1−η)
48
Variants for non-global alignments
• For the transition probabilities from the left flanking state to different start points in the model, we can give them equal probabilities, η/L
• Another option is to assign more probability to starting at the beginning of the model; this is the option used in the HMMER package (Eddy 1996)
49
Variants for non-global alignments
• Local alignments (flanking model)
– Emission probabilities in the flanking states use background values q_a
– Looping probability close to 1, e.g. (1−η) for some small η
[Diagram: flanking states Q before Begin and after End of the profile HMM (M_j, I_j, D_j)]
50
Variants for non-global alignments
• Overlap alignments
– Only transitions to the first model state are allowed
– Used when the motif is expected to be either present as a whole or absent
– A transition into the first delete state allows a missing first residue
[Diagram: overlap model with flanking states Q entering the first and leaving the last model state]
51
Variants for non-global alignments
• Repeat alignments
– Transition from the right flanking state back to the random model
– Can find multiple matching segments in the query string
[Diagram: repeat model with a single flanking state Q that loops back into the profile HMM]
52
More on estimation of probabilities
• Maximum likelihood (ML) estimation
– given observed frequencies c_{ja} of residue a in position j:

$$e_{M_j}(a) = \frac{c_{ja}}{\sum_{a'} c_{ja'}}$$

• Problems with ML estimation
– What if observed cases are absent?
– Especially when observed examples are somewhat few
53
More on estimation of probabilities
• Simple pseudocounts
– q_a: background distribution
– A: weight factor
– Laplace’s rule: Aq_a = 1

$$e_{M_j}(a) = \frac{c_{ja} + A q_a}{\sum_{a'} c_{ja'} + A}$$

• Bayesian framework
– Dirichlet prior (α_a = Aq_a), where θ are the model probabilities:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$
54
Dirichlet Distribution
• The Dirichlet distribution of order K ≥ 2 with parameters α_1, ..., α_K > 0 has a probability density function on R^{K−1} given by
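The density itself did not survive in the transcript; the standard form is

$$f(\theta_1, \dots, \theta_K; \alpha_1, \dots, \alpha_K) = \frac{\Gamma\!\left( \sum_{i=1}^K \alpha_i \right)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K \theta_i^{\alpha_i - 1}$$

on the simplex θ_i ≥ 0, Σ_i θ_i = 1, which has K−1 free dimensions (hence “on R^{K−1}”).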
55
More on estimation of probabilities
• Dirichlet mixtures
– A mixture of Dirichlet priors: better than a single Dirichlet prior
– With K pseudocount priors α^1, ..., α^K:

$$e_{M_j}(a) = \sum_k P(k \mid c_j)\, \frac{c_{ja} + \alpha^k_a}{\sum_{a'} \left( c_{ja'} + \alpha^k_{a'} \right)}$$

– where the posterior weight of component k given the column counts c_j is

$$P(k \mid c_j) = \frac{p_k\, P(c_j \mid k)}{\sum_{k'} p_{k'}\, P(c_j \mid k')}$$

– and P(c_j | k) is the Dirichlet-multinomial likelihood

$$P(c_j \mid k) = \frac{\left( \sum_a c_{ja} \right)!}{\prod_a c_{ja}!}\; \frac{\Gamma\!\left( \sum_a \alpha^k_a \right)}{\prod_a \Gamma(\alpha^k_a)}\; \frac{\prod_a \Gamma\!\left( c_{ja} + \alpha^k_a \right)}{\Gamma\!\left( \sum_a \left( c_{ja} + \alpha^k_a \right) \right)}$$
56
Optimal model construction
• In order to estimate the probability parameters of a profile HMM, we need to consider:
– Which columns should be assigned to match states and which to insert states? (called model construction)
– If the marked multiple alignment has no errors, the optimal model can be constructed
– In the profile HMM formalism, it is assumed that an aligned column of symbols corresponds either to emissions from the same match state or to emissions from the same insert state
57
Optimal model construction
• It therefore suffices to mark which columns come from match states to specify a profile HMM architecture and the state paths for all the sequences in the alignment, as shown in Figure 5.7
• In a marked column, symbols are assigned to match states and gaps are assigned to delete states
• In an unmarked column, symbols are assigned to insert states and gaps are ignored
• State transition and symbol emission counts are obtained from the state paths, and these counts can be used to estimate probability parameters
58
Optimal model construction
• There are 2^L possible markings of L columns, and hence 2^L different profile HMMs to choose from
• A manual construction is shown in Figure 5.7
• Other ways to determine the marking include heuristic construction and a maximum a posteriori (MAP) choice
59
Optimal model construction
(a) Multiple alignment:

          x x . . . x
  bat     A G - - - C
  rat     A - A G - C
  cat     A G - A A -
  gnat    - - A A A C
  goat    A G - - - C
          1 2 . . . 3

(b) Profile-HMM architecture: [diagram: beg, three match states M with their insert (I) and delete (D) states, end]

(c) Observed emission/transition counts:

                      0  1  2  3
  match        A      -  4  0  0
  emissions    C      -  0  0  4
               G      -  0  3  0
               T      -  0  0  0
  insert       A      0  0  6  0
  emissions    C      0  0  0  0
               G      0  0  1  0
               T      0  0  0  0
  state        M-M    4  3  2  4
  transitions  M-D    1  1  0  0
               M-I    0  0  1  0
               I-M    0  0  2  0
               I-D    0  0  1  0
               I-I    0  0  4  0
               D-M    -  0  0  1
               D-D    -  1  0  0
               D-I    -  0  2  0
60
Optimal Model Construction
• MAP match-insert assignment
– Recursive calculation of a number S_j
• S_j: log probability of the optimal model for the alignment up to and including column j, assuming column j is marked
• S_j is calculated from S_i and the summed log probability between i and j
• T_{ij}: summed log probability of all the state transitions between marked columns i and j:

$$T_{ij} = \sum_{x,y \in \{M,D,I\}} c_{xy} \log a_{xy}$$

– the counts c_{xy} are obtained from the partial state paths implied by marking i and j
61
Optimal Model Construction
• Algorithm: MAP model construction
– Initialization: S_0 = 0, M_{L+1} = 0
– Recurrence: for j = 1, ..., L+1:

$$S_j = \max_{0 \le i < j}\; S_i + T_{ij} + M_j + I_{i+1,j-1};$$

$$\sigma_j = \operatorname{argmax}_{0 \le i < j}\; S_i + T_{ij} + M_j + I_{i+1,j-1};$$

– Traceback: from j = σ_{L+1}, while j > 0:
• Mark column j as a match column
• Set j = σ_j
62
Weighting Training Sequences
• One issue that we have avoided completely so far is that of weighting sequences when estimating parameters
• In a typical alignment, there are often some sequences that are very closely related to each other
• Intuitively, some of the information from these sequences is shared, so we should not give them each the same influence in the estimation process as a single sequence that is more highly diverged from all the others
63
Weighting Training Sequences
• In the extreme case where two sequences are identical, it makes sense that they should each get half the weight of the other sequences
• There have been a large number of proposals for different ways to assign weights to sequences
64
Weighting Training Sequences
• Do we have a good random sample?
• The assumption that all examples are independent samples may be incorrect
• Solution
– Weight sequences based on similarity
65
Simple Weighting Scheme Derived from a Tree
• Many weighting approaches based on building a tree relating the sequences
• Since sequences in a family are related by an evolutionary tree, a very natural approach is to try to reconstruct this tree and use it when estimating the independent contribution of each of the observed sequences, down-weighting sequences that have only recently diverged
66
Simple Weighting Scheme Derived from a Tree
• Here we assume that we are given a tree connecting the sequences, with branch lengths indicating the relative degrees of divergence for each edge in the tree
67
Weighting Training Sequences
• Simple weighting schemes derived from a tree
– A phylogenetic tree is given
– Thompson, Higgins & Gibson [1994b]: Kirchhoff’s law (the tree treated as an electrical network)
– Gerstein, Sonnhammer & Chothia [1994]: working up the tree, increment the weight of each leaf i below the current node n (branch length t_n) by

$$\Delta w_i = \frac{t_n\, w_i}{\sum_{k\ \text{leaves below}\ n} w_k}$$
68
Weighting Training Sequences
• We are given a tree made of a conducting wire of constant thickness and apply a voltage V to the root
• All the leaves are set to zero potential and the currents flowing from them are measured and taken to be the weights
• Clearly, the currents will be smaller in the highly divided parts of the tree so these weights have the right qualitative properties
69
Weighting Training Sequences
[Figure: example tree with leaves 1–4 and internal nodes 5–7; branch lengths t_1 = 2, t_2 = 2, t_3 = 5, t_4 = 8, t_5 = 3, t_6 = 3. A voltage V is applied at the root node 7; current I_1 + I_2 flows up from node 5, I_1 + I_2 + I_3 from node 6, joining I_4 at the root.
Resulting currents: I_1 : I_2 : I_3 : I_4 = 20 : 20 : 32 : 47
Gerstein, Sonnhammer & Chothia weights: w_1 : w_2 : w_3 : w_4 = 35 : 35 : 50 : 64]
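A small Python sketch (my own, not from the slides) reproducing the currents by treating each branch length as a resistance and applying Kirchhoff’s laws:

```python
def effective_resistance(tree, node):
    """Effective resistance looking down from `node`; each edge's resistance
    equals its branch length, and sibling branches combine in parallel."""
    children = tree.get(node, [])
    if not children:          # a leaf: grounded, zero internal resistance
        return 0.0
    inv = sum(1.0 / (t + effective_resistance(tree, c)) for c, t in children)
    return 1.0 / inv

def leaf_currents(tree, node, current):
    """Split the incoming current among the branches, inversely to their
    resistance; the currents reaching the leaves are the weights."""
    children = tree.get(node, [])
    if not children:
        return {node: current}
    branch_r = [(c, t + effective_resistance(tree, c)) for c, t in children]
    inv_total = sum(1.0 / r for _, r in branch_r)
    out = {}
    for c, r in branch_r:
        out.update(leaf_currents(tree, c, current * (1.0 / r) / inv_total))
    return out

# The example tree: node 7 is the root; entries are (child, branch length).
tree = {7: [(6, 3), (4, 8)], 6: [(5, 3), (3, 5)], 5: [(1, 2), (2, 2)]}
total = 1.0 / effective_resistance(tree, 7)   # total current for unit voltage
print(leaf_currents(tree, 7, total))          # ratios I1:I2:I3:I4 = 20:20:32:47
```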
70
Weighting Training Sequences
• Root weights from Gaussian parameters
– Influence of the leaves on the root distribution
– Altschul–Carroll–Lipman weights
• Model the residues with Gaussian distributions
• Mean: a linear combination of the x_i
• The combination weights represent the influences of the leaves
71
Weighting Training Sequences
[Figure: tree with leaves x_1, x_2, x_3; branches t_1 and t_2 join x_1 and x_2 at node 4, which is connected through t_3, together with x_3, to node 5]

$$P(x \text{ at node } 4 \mid x_1, x_2) = K L_1 L_2\, \exp\!\left( -\frac{(x - v_1 x_1 - v_2 x_2)^2}{2\, t_{12}} \right)$$

where

$$v_1 = \frac{t_2}{t_1 + t_2}, \qquad v_2 = \frac{t_1}{t_1 + t_2}, \qquad t_{12} = \frac{t_1 t_2}{t_1 + t_2}$$
72
Weighting Training Sequences
• Voronoi weights
– Proportional to the volume of empty space around each sequence
– The sequence family sits in sequence space
– Algorithm
• Random sample: choose the kth position uniformly from the set of residues occurring at the kth position in the family
• n_i: count of samples closest to the ith family member
• The ith weight:

$$w_i = n_i \Big/ \sum_k n_k$$
73
Weighting Training Sequences
• Maximum discrimination weights
– Focus: the decision on whether sequences are members of the family or not
– Discrimination:

$$D = \prod_k P(M \mid x^k)$$

$$P(M \mid x) = \frac{P(x \mid M)\, P(M)}{P(x \mid M)\, P(M) + P(x \mid R)\,(1 - P(M))}$$

– Weight: 1 − P(M | x^i)
– Effect: difficult members are given large weights
74
Weighting Training Sequences
• Maximum entropy weights (1)
– Intuition
• k_{ia}: number of residues of type a in column i of a multiple alignment
• m_i: number of different types of residues in column i
• Make the weighted residue distribution as uniform as possible
– Weight for sequence k:

$$w_k = \sum_i \frac{1}{m_i\, k_{i x_i^k}}$$

– ML estimation under these weights gives p_{ia} = 1/m_i
– Averaging over all columns [Henikoff 1994]
75
Weighting Training Sequences
• Maximum entropy weights (2)
– Entropy: a measure of the ‘uniformity’ [Krogh & Mitchison 1995]
– Maximize

$$H(w) = \sum_i H_i(w), \qquad H_i(w) = -\sum_a p_{ia} \log p_{ia}$$

subject to the sum-to-one constraint on the weights
– Example
• x^1 = AFA, x^2 = AAC, x^3 = DAC

$$H_1(w) = -(w_1 + w_2)\log(w_1 + w_2) - w_3 \log w_3$$

$$H_2(w) = -w_1 \log w_1 - (w_2 + w_3)\log(w_2 + w_3)$$

$$H_3(w) = -w_1 \log w_1 - (w_2 + w_3)\log(w_2 + w_3)$$

• Solution: w_1 = w_3 = 0.5, w_2 = 0
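A quick numerical check of this example (my own sketch, not from the slides); a simple grid search over the weight simplex recovers the stated optimum:

```python
import math
from itertools import product

def total_entropy(w1, w2, w3):
    """H(w) = H1 + H2 + H3 for the example x1 = AFA, x2 = AAC, x3 = DAC."""
    def h(*ps):
        # entropy of a distribution given as probabilities (0 log 0 := 0)
        return -sum(p * math.log(p) for p in ps if p > 0)
    return (h(w1 + w2, w3)      # column 1: A, A, D
            + h(w1, w2 + w3)    # column 2: F, A, A
            + h(w1, w2 + w3))   # column 3: A, C, C

# Grid search over the simplex w1 + w2 + w3 = 1 in steps of 0.01.
candidates = ((a / 100, b / 100, (100 - a - b) / 100)
              for a, b in product(range(101), repeat=2) if a + b <= 100)
best = max(candidates, key=lambda w: total_entropy(*w))
print(best)  # (0.5, 0.0, 0.5): w1 = w3 = 0.5, w2 = 0, as on the slide
```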