1 chapter 5 profile hmms for sequence families. 2 what have we done? so far, we have concentrated on...

75
1 Chapter 5 Profile HMMs for Seque nce Families

Upload: rebecca-shonda-casey

Post on 21-Jan-2016

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

1

Chapter 5

Profile HMMs for Sequence Families

Page 2: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

2

What have we done?

• So far, we have concentrated on the intrinsic properties of single sequences (CpG islands), or on pairwise alignment of sequences

• Functional biological sequences typically come in families

• Many of the most powerful sequence analysis methods are based on identifying the relationship of an individual sequence to a sequence family

Page 3: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

3

Sequence Families

• Sequences in a family have diverged from each other in their primary sequence during evolution, but normally maintain the same or a related function

• Thus, identifying that a sequence belongs to a family, and aligning it to the other members, often tells about its function

Page 4: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

4

Sequence Family

• If we have a set of sequences belonging to a family, we can perform a database search for more members using pairwise alignment with one of the known family members as the query sequence

• We could even search with the known members one by one

• However, pairwise searching with any one of the members may not find sequences distantly related to the ones we have already

Page 5: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

5

Sequence Family

• An alternative approach is to use statistical features of the whole set of sequences in the search

• Similarly, even when family membership is clear, accurate alignment can be often improved significantly by concentrating on features that are conserved in the whole family

Page 6: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

6

Sequence Family

• A pairwise alignment captures much of the relationship between two sequences

• A multiple alignment can show how the sequences in a family relate to each other

Page 7: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

7

Sequences From A Globin Family

Page 8: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

8

Globin Family• It is clear that some positions in the globin a

lignment are more conserved than others• The helices are more conserved than the loo

p regions between them• Certain residues are particularly strongly co

nserved (two conserved histidines H)• When identifying a new sequence as a globi

n, it would be desirable to concentrate on checking that these more conserved features are present

Page 9: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

9

HMM Profile• Consensus modeling of the family using a p

robabilistic model

• Built from a given multiple alignment (assumed to be correct!)

• Develop a particular type of hidden Markov model well suited to modelling multiple alignments

• We call these profile HMMs

Page 10: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

10

HMM Profile

• We will assume that we are given a correct multiple alignment, from which we will build a model that can be used to find and score potential matches to new sequences

Page 11: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

11

Ungapped Score Matrices

• A natural probabilistic model for a conserved region would be to specify independent probabilities ei

(a) of observing amino acid a in position i

• The probability of a new sequence x according to this model is

P(x | M) ei(x i)i1

L

Page 12: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

12

Log-odds ratio

• We are interested in the ratio of the probability to the probability of x under the random model

• The values log(ei(a)/qa) behave like elements in a score matrix s(a,b), where the second index is position i, rather than amino acid b

S logei(x i)

qxii1

L

Position specific score matrix (PSSM)

Page 13: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

13

Adding Insert and Delete States

• A PSSM might capture some conservation information, it is clearly an inadequate representation of all the information in a multiple alignment of a protein family.

• We have to find some way to take account of gaps

Page 14: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

14

Components of profile HMMs

• Consideration of gaps– Henikoff & Henikoff [1991]

• Combining the multiple ungapped block models

– Allowing gaps at each position using the same gap scores (g) at each position (where gaps are more or less likely?)

• Profile HMMs– Repetitive structure of states– Different probabilities in each position– Full probabilistic model for sequences in the sequence

family

Page 15: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

15

Components of profile HMMs

• The PSSM can be viewed as a trival HMM with a series of identical states that we will call match states, separated by transitions of probability 1

• Alignment is trivial since there is no choice of transitions.

• Match states– Emission probabilities

Begin Mj End....

..

..

)(aeiM

Page 16: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

16

Ungapped profiles and the corresponding HMMs

Beg Mj End… …

Example

AGAAACTAGGAATTTGAATCT

P(AGAAACT)=16/81P(TGGATTT)=1/81

1 2 3 4 5 6 7

A 2/3 0 2/3 1 2/3 0 0

T 1/3 0 0 0 1/3 1/3 1

C 0 0 0 0 0 2/3 0

G 0 1 1/3 0 0 0 0

Each blue square represents a match state that “emits” each letter withcertain probability ej(a) which is defined by frequency of a at position j:

Typically, pseudo-counts are added in HMMs to avoid zero probabilities.

Page 17: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

17

Insertions and deletions in profile HMMs

Beg Mj End

Ij

Insert states emit symbols just like the match states, however, theemission probabilities are typically assumed to follow the backgrounddistribution and thus do not contribute to log-odds scores.

Transitions Ij -> Ij are allowed and account for an arbitrary numberof inserted residues that are effectively unaligned (their order withinan inserted region is arbitrary).

Page 18: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

18

Components of profile HMMs

• Insert states– Emission prob.

• Usually back ground distribution qa.

– Transition prob.• Mi to Ii, Ii to itself, Ii to Mi+1

– Log-odds score of a gap of length k (no logg-odds from emission)

Begin Mj End

Ij

)(I aei

jjjjjjakaa II1MIIM log)1(loglog

Page 19: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

19

Insertions and deletions in profile HMMs

Beg Mj End

Dj

Deletions are represented by silent states which do not emit any letters.A sequence of deletions (with D -> D transitions) may be used to connectany two match states, accounting for segments of the multiple alignmentthat are not aligned to any symbol in a query sequence (string).

The total cost of a deletion is the sum of the costs of individual transitions(M->D, D->D, D->M) that define this deletion. As in case of insertions, bothlinear and affine gap penalties can be easily incorporated in this scheme.

Page 20: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

20

Components of profile HMMs

• Delete states– No emission prob.

– Cost of a deletion• M→D, D→D, D→M

• Each D→D might be different, each I→I will have the same cost

Begin Mj End

Dj

Page 21: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

21

Gap penalties: evolutionary and computational considerations

• Linear gap penalties:

(k) = - k d

for a gap of length k and constant d

• Affine gap penalties:

(k) = - [ d + (k -1) e ]

where d is opening gap penalty and e an extension gap penalty.

Page 22: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

22

Components of profile HMMs

• Combining all parts

Begin Mj End

Ij

Dj

Figure 5.2 The transition structure of a profile HMM.

Page 23: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

23

Profile HMMs as a model for multiple alignments

Beg Mj End

Ij

Dj

ExampleAG---CA-AG-CAG-AA---AAACAG---C** *

Page 24: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

24

Page 25: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

25

Profile HMMs generalize pairwise alignment

• Mj→M, Ij→X (insertion), Dj→Y (deletion)

• eMi(a)=pyia /qa the conditional probabilities of seeing a given yi in a pairwise alignment

• aMiIi= aMiDi+1=δ, aIiIi= aDiDi+1=ε

Page 26: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

26

Deriving Profile HMMs From Multiple Alignments

• The key idea behind profile HMMs is that we can use the same structure as shown in Figure 5.2, but set the transition and emission probabilities to capture specific information about each position in the multiple alignment of the whole family

• Essentially, we want to build a model representing the consensus sequence for a family, rather than the sequence of any particular member

• Non-probabilistic profiles and profile HMMs

Page 27: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

27

Non-probabilistic Profiles

• A model similar to the profile HMM was first introduced by Gribskov, McLachlan and Eisenberg 1987

• No underlying probabilistic model, but rather assigned position specific scores for each match state and gap penalty

• The score for each consensus position is set to the average of the standard substitution scores from all the residues in the corresponding multiple sequence alignment column

Page 28: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

28

HMMs from multiple alignments

• Key idea behind profile HMMs– Model representing the consensus for the family– Not the sequence of any particular member

HBA_HUMAN ...VGA--HAGEY...HBB_HUMAN ...V----NVDEV...MYG_PHYCA ...VEA--DVAGH...GLB3_CHITP ...VKG------D...GLB5_PETMA ...VYS--TYETS...LGB2_LUPLU ...FNA--NIPKH...GLB1_GLYDI ...IAGADNGAGV... *** *****

Figure 5.3 Ten columns from the multiple alignment of seven globin protein sequences shown in Figure 5.1 The starred columns are ones that will be treated as ‘matches’ in the profile HMM.

Page 29: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

29

Non-probabilistic Profiles

s(a,b) : standard substitution matrix

The score for residue ‘a’ incolumn 1

Page 30: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

30

HMMs from multiple alignments

• Non-probabilistic profiles– Gribskov, Mclachlan & Eisenberg [1987]

• Score for residue a in column 1

– Disadvantages• More conserved region might be corrupted.

• Intuition about the likelihood can’t be maintained.

• The score for gaps do not behave as expected.

),I(7

1),F(

7

1),V(

7

5asasas

Page 31: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

31

Non-probabilistic Profiles

• They also set gap penalties for each column using a heuristic equation that decrease the cost of a gap according to the length of the longest gap observed in the multiple alignment spanning the column

Page 32: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

32

Problem With The Approach

• If we had an alignment with 100 sequences, all with a cysteine (C), at some position, the probability distribution for that column for an “average” profile would be exactly the same as would be derived from a single sequence

• Doesn’t correspond to our expectation that the likelihood of a cysteine should go up as we see more confirming examples

Page 33: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

33

Similar Problem With Gaps

Scores for a deletion in columns 2 and 4 would be set to the same value

More reasonable to set the probability of a new gap opening to be higher in column 4

Page 34: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

34

Basic Profile HMM Parameterization

• HMM profiles have emission and transition probabilities

• Assuming that these probabilities are non-zero, a profile HMM can model any possible sequence of residues from the given alphabet

• A profile HMM defines a probability distribution over the whole space of sequences

• The aim of parameterization is to make this distribution peak around members of the family

• Parameters: probabilities and the length of the model (control the shape of the distribution)

Page 35: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

35

Model Length• The choice of length of the model corresponds more

precisely to a decision on which multiple alignment columns to assign to match states, and which to assign to insert states

• The consensus sequence at Figure 5.3 should only have 8 residues, and that the two non-starred residues in GLB1_GLYDI should be treated as an insertion with respect to the consensus

• Should decide which columns should correspond to match states, and which to inserts

• A simple rule that works well in practice is that columns that are more than half gap characters should be modeled by inserts

Page 36: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

36

Probability Values

akl Akl

Akl 'l '

ek (a) Ek (a)

Ek (a')a '

k,l : indices over states

akl and ek : transition and emission probabilities

Akl and Ek : transition and emission frequencies

Page 37: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

37

Problem With The Approach

• Transitions and emissions that don’t appear in the training dataset would acquire zero probability (would never be allowed)

• Solution: add pseudocounts to the observed frequencies

• Simplest pseudocount method is Laplace’s rule: add one to each frequency

Page 38: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

38

Example

Page 39: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

39

Example: Full Profile HMM

Page 40: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

40

Searching With Profile HMMs• One of the main purposes of developing profile H

MMs is to use them to detect potential membership in a family by obtaining significant matches of a sequence to the profile HMM

• We assume that we are looking for global matches

• We can use either the Viterbi algorithm to get the most probable alignment or the forward algorithm to calculate the full probability of the sequence summed over all possible paths

Page 41: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

41

Searching with profile HMMs• Maintaining log-odd ratio compared with random

model

i

xiqRxP )|(

Page 42: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

42

Viterbi Equations

• VjM(i) is the log-odds score of the best path

matching subsequence x1…i to the submodel up to state j, ending with xi being emitted by state Mj

• VjI(i) is the score of the best path ending in

xi being emitted by Ij, and VjD(i) for the best

path ending in state Dj

Page 43: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

43

Viterbi Algorithm

Page 44: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

44

Viterbi Equations

• In a typical case, there is no emission score eIj(xi) in the equation for Vj

I(i) since we assu

me that the emission distribution from the insert state Ij is the same as the background distribution, so the probabilities cancel in the log-odds form

• Also the D→I and I→D trsnsition terms may not be present

Page 45: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

45

Forward Algorithm

Page 46: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

46

Variants for non-global alignments

• We can generalize profile HMMs for local, repeat and overlap alignments

• The profile HMM for local algorithm is to specify a new model for the complete sequence x ,which incorporates the original profile HMM together with one or more copies of a simple selflooping model that is used to account for the regions of unaligned sequence

Page 47: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

47

Variants for non-global alignments

• Notice that as well as specifying the emission probabilities of the new states, which will normally of course be qa , we must specify a number of new transition probabilities

• The looping probability on the flanking states should be close to 1, since they must account for long stretches of sequence

• Let us set these to (1-η)

Page 48: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

48

Variants for non-global alignments

• For the transition probabilities from the left flanking state to different start points in the model, we can give them equal probabilities, η/L

• Another option is to assign more probability to starting at the beginning of the model. That is the option used in HMMER package (Eddy 1996)

Page 49: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

49

Variants for non-global alignments

• Local alignments (flanking model)– Emission prob. in flanking states use background values qa.

– Looping prob. close to 1, e.g. (1- ) for some small .

Mj

Ij

Dj

Begin End

Q Q

Page 50: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

50

Variants for non-global alignments

• Overlap alignments– Only transitions to the first model state are allowed.

– When expecting to find either present as a whole or absent

– Transition to first delete state allows missing first residue

Begin Mj End

IjQ

Dj

Q

Page 51: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

51

Variants for non-global alignments

• Repeat alignments– Transition from right flanking state back to random model

– Can find multiple matching segments in query string

Mj

Ij

Dj

Begin EndQ

Page 52: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

52

More on estimation of prob. • Maximum likelihood (ML) estimation

– given observed freq. cja of residue a in position j.

• Problem of ML estimation– If observed cases are absent?– Specially when observed examples are somewhat few.

' 'M )(

a ja

ja

c

cae

j

Page 53: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

53

More on estimation of prob. • Simple pseudocounts

– qa: background distribution

– A: weight factor

– Laplace’s rule: Aqa = 1

• Bayesian framework– Dirichlet prior (αa= Aqa)

(θare model probabilities)

' '

M )(a ja

aja

cA

Aqcae

j

)(

)()|()|(

DP

PDPDP

Page 54: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

54

Dirichlet Distribution

• The Dirichlet distribution of order K ≥ 2 with parameters α1, ..., αK > 0 has a pdf on RK–1 given by

Page 55: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

55

More on estimation of prob. • Dirichlet mixtures

– Mixtures of dirichlet prior: better than single dirichlet prior

– With K pseudocount priors,

)()|()(

'' 'M k

aa ja

kaja

kj c

ckPae

j

c

' ' )'|(

)|()|(

k jk

jkj kPp

kPpkP

c

cc

a

ka

kaa jaa ja

a

kaa

kajaa ja

j cc

cckP

)()(!

)()()!()|(

c

Page 56: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

56

Optimal model construction• In order to estimate the probability parameters

from profile HMMs, we need to consider– Which columns to insert states or which to match

states? (called model construction)– If marked multiple alignments have no errors, the

optimal model can be constructed– In the profile HMM formalism, it is assumed that an

aligned column of symbols corresponds either to emissions from the same match state or to emissions from the same insert state

Page 57: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

57

Optimal model construction• It therefore suffices to mark which columns come

from match states to specify a profile HMM architecture and the state paths for all the sequences in the alignment, as shown in Figure 5.7

• In a marked column, symbols are assigned to match states and gaps are assigned to delete states

• In an unmarked column, symbols are assigned to insert states and gaps are ignored

• State transition and symbol emission counts are obtained from the state paths, and these counts can be used to estimate probability parameters

Page 58: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

58

Optimal model construction

• 2L combinations for markings of L columns, and hence 2L different profile HMMs to choose from

• A manual construction is shown in Figure 5.7

• Other ways to determine the marking include heuristic construction and a maximum a posteriori (MAP) choice

Page 59: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

59

Optimal model construction

beg M M M end

II II

D DD

x x . . . xbat A G - - - Crat A - A G - Ccat A G - A A -gnat - - A A A Cgoat A G - - - C 1 2 . . . 3

(a) Multiple alignment:

(b) Profile-HMM architecture:

0 1 2 3 4

0 1 2 3A - 4 0 0C - 0 0 4G - 0 3 0T - 0 0 0A 0 0 6 0C 0 0 0 0G 0 0 1 0T 0 0 0 0M-M 4 3 2 4M-D 1 1 0 0M-I 0 0 1 0I-M 0 0 2 0I-D 0 0 1 0I-I 0 0 4 0D-M - 0 0 1D-D - 1 0 0D-I - 0 2 0

(c) Observed emission/transition counts

matchemissions

insertemissions

statetransitions

Page 60: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

60

Optimal Model Construction • MAP match-insert assignment

– Recursive calculation of a number Sj

• Sj: log prob. of the optimal model for alignment up to and including column j, assuming j is marked.

• Sj is calculated from Si and summed log prob. between i and j.

• Tij: summed log prob. of all the state transitions between marked i and j.

– cxy are obtained from partial state paths implied by marking i and j.

ID,M,,

logyx

xyxyij acT

Page 61: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

61

Optimal Model Construction • Algorithm: MAP model construction

– Initialization:• S0 = 0, ML+1 = 0.

– Recurrence: for j = 1,..., L+1:

– Traceback: from j = L+1, while j > 0:• Mark column j as a match column

• j = j.

;maxarg

;max

1,10

1,10

jijijiji

j

jijijiji

j

IMTS

IMTSS

Page 62: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

62

Weighting Training Sequences

• One issue that we have avoided completely so far is that of weighting sequences when estimating parameters

• In a typical alignment, there are often some sequences that very closely related to each other

• Intuitively, some of the information from these sequences is shared, so we should not give them each the same influence in the estimation process as a single sequence that is more highly diverged from all the others

Page 63: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

63

Weighting Training Sequences

• In the extreme that two sequences are identical, it makes sense that they should each get half the weight of other sequences

• There have been a large number of proposals for different ways to assign weights to sequences

Page 64: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

64

Weighting Training Sequences

• Good random sample do you have?

• “Assumption : all examples are independent samples” might be incorrect

• Solutions– Weight sequences based on similarity

Page 65: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

65

Simple Weighting Scheme Derived from a Tree

• Many weighting approaches based on building a tree relating the sequences

• Since sequences in a family are related by an evolutionary tree, a very natural approach is to try to reconstruct this tree and use it when estimating the independent contribution of each of the observed sequences, down weighting sequences that have only recently diverged

Page 66: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

66

Simple Weighting Scheme Derived from a Tree

• Here we assume that we are given a tree connecting the sequences, with branch lengths indicating the relative degrees of divergence for each edge in the tree

Page 67: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

67

Weighting Training Sequences

• Simple weighting schemes derived from a tree– Phylogenetic tree is given.

• [Thompson, Higgins & Gibson 1994b]– Kirchohoff’s law

• [Gerstein, Sonnhammer & Chothia 1994]

nk k

ini w

wtw

below leaves

Page 68: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

68

Weighting Training Sequences

• We are given a tree made of a conducting wire of constant thickness and apply a voltage V to the root

• All the leaves are set to zero potential and the currents flowing from them are measured and taken to be the weights

• Clearly, the currents will be smaller in the highly divided parts of the tree so these weights have the right qualitative properties

Page 69: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

69

Weighting Training Sequences

t4 = 8t3 = 5

t2 = 2t1 = 2

t5 = 3

t6 = 3

5

6

7

1 2 3 4

I4I1+I2

I1+I2+I3

V5

V6

V7

I1 I2

I3

I1:I2:I3:I4 = 20:20:32:47w1:w2:w3:w4 = 35:35:50:64

Page 70: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

70

Weighting Training Sequences• Root weights from Gaussian parameters

– Influence of leaves on the root distr.– Altchul-Carroll-Lipman wieghts

• Make gaussian distr.

• Mean : linearly combination of xi.

• Combination weights represent the influences of leaves.

Page 71: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

71

Weighting Training Sequences

t3

t2t1

4

x1 x2 x3

5

12

22211

2

)(

121 ),|4 nodeat ( t

xvxvx

eKLLxP

2211

21211221112121 )/(),/(),/(

xvxv

ttttttttvtttv

Page 72: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

72

Weighting Training Sequences • Voronoi weights

– Proportional to the volume of empty space– Sequence family in sequence space– Algorithm

• Random sample: choosing at kth position uniformly from the set of residues occurring kth position

• ni: count of samples closest to the ith family

• ith weight

k ki nn /

Page 73: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

73

Weighting Training Sequences • Maximum discrimination weights

– Focus: decision on whether sequences are members of the family or not

– discrimination

– weight: 1-P(M|xi)

– effect: difficult members are given big weight

k

kxMPD )|(

))(1)(|()()|(

)()|()|(

MPRxPMPMxP

MPMxPxMP

Page 74: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

74

Weighting Training Sequences • Maximum entropy weights (1)

– Intuition• kia: number of residues of type a in column i of a multiple

alignment

• mi: number of different types of residues in column i

• As uniform as possible– weight for sequence k:

– ML estimation under the weights: pia = 1/mi

– Averaging over all columns [Henikoff 1994]

)/(1 kiixikm

i ixi

kki

kmw

1

Page 75: 1 Chapter 5 Profile HMMs for Sequence Families. 2 What have we done? So far, we have concentrated on the intrinsic properties of single sequences (CpG

75

Weighting Training Sequences • Maximum entropy weights (2)

– entropy: an measure of the ‘uniformity’ [Krogh & Mitchison 1995]

– maximize

– example• x1 = AFA, x2 = AAC, x3 = DAC

• w1 = w3 =0.5, w2 = 0

k ki i wwH )(

iaa iai ppwH log)(

(sum to one constraints)

)log()(log)(

)log()(log)(

log)log()()(

3232113

3232112

3321211

wwwwwwwH

wwwwwwwH

wwwwwwwH