Chapter 6 Profiles and Hidden Markov Models

Upload: trista

Post on 12-Jan-2016



TRANSCRIPT

Page 1: Chapter 6 Profiles and Hidden Markov Models

Chapter 6

Profiles and Hidden Markov Models

Page 2: Chapter 6 Profiles and Hidden Markov Models

•The following approaches can also be used to identify distantly related members of a family of protein (or DNA) sequences:

•Position-specific scoring matrix (PSSM)
•Profile
•Hidden Markov Model

•These methods provide a statistical framework in which the probability of residues or nucleotides at specific positions is tested
•Thus, in multiple alignments, information on all the members in the alignment is retained.

Page 3: Chapter 6 Profiles and Hidden Markov Models

Position-specific scoring matrices

Aligned sequences:

Position      1    2    3    4    5    6
Sequence 1    A    T    G    T    C    G
Sequence 2    A    A    G    A    C    T
Sequence 3    T    A    C    T    C    A
Sequence 4    C    G    G    A    G    G
Sequence 5    A    A    C    C    T    G

Frequencies of observations in a position:

Pos    1      2      3      4      5      6      Overall Freq.
A      0.6    0.6    -      0.4    -      0.2    0.3
T      0.2    0.2    -      0.4    0.2    0.2    0.2
G      -      0.2    0.6    -      0.2    0.6    0.27
C      0.2    -      0.4    0.2    0.6    -      0.23

Normalised to overall frequencies:

Pos    1      2      3      4      5      6
A      2.0    2.0    -      1.33   -      0.67
T      1.0    1.0    -      2.0    1.0    1.0
G      -      0.74   2.22   -      0.74   2.22
C      0.87   -      1.74   0.87   2.61   -

Converted to log2:

Pos    1      2      3      4      5      6
A      1.0    1.0    -      0.41   -      -0.58
T      0.0    0.0    -      1.0    0.0    0.0
G      -      -0.43  1.15   -      -0.43  1.15
C      -0.2   -      0.8    -0.2   1.38   -

Page 4: Chapter 6 Profiles and Hidden Markov Models

Pos    1      2      3      4      5      6
A      1.0    1.0    -      0.41   -      -0.58
T      0.0    0.0    -      1.0    0.0    0.0
G      -      -0.43  1.15   -      -0.43  1.15
C      -0.2   -      0.8    -0.2   1.38   -

Match AACTCG to the PSSM matrix:

•1.0 + 1.0 + 0.8 + 1.0 + 1.38 + 1.15 = 6.33
•2^6.33 = ~80
•Thus, the sequence AACTCG is about 80 times more likely to fit the PSSM than a random 6-nucleotide sequence
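The three tables and the AACTCG score can be reproduced with a short sketch. It assumes Python 3 and uses the five aligned sequences from the slide; the function names `build_pssm` and `log_odds_score` are made up for this illustration. Because the code does not round intermediate values, it gives a score of about 6.31 (a ratio of about 79) rather than the 6.33 (about 80) obtained from the rounded table entries.

```python
import math

ALPHABET = "ATGC"
sequences = ["ATGTCG", "AAGACT", "TACTCA", "CGGAGG", "AACCTG"]

def build_pssm(seqs):
    """Return one dict per position holding log2(column frequency / overall frequency)."""
    length = len(seqs[0])
    total = len(seqs) * length
    # Overall (background) frequency of each base across the whole alignment
    background = {b: sum(s.count(b) for s in seqs) / total for b in ALPHABET}
    pssm = []
    for pos in range(length):
        column = [s[pos] for s in seqs]
        scores = {}
        for base in ALPHABET:
            freq = column.count(base) / len(seqs)
            # Bases never seen at this position have no score ("-" in the tables)
            scores[base] = math.log2(freq / background[base]) if freq > 0 else None
        pssm.append(scores)
    return pssm

def log_odds_score(pssm, seq):
    """Sum the per-position scores; return None if a base was never observed there."""
    total = 0.0
    for pos, base in enumerate(seq):
        value = pssm[pos][base]
        if value is None:
            return None
        total += value
    return total

pssm = build_pssm(sequences)
score = log_odds_score(pssm, "AACTCG")
print(round(score, 2), round(2 ** score, 1))
```

A real PSSM would normally add pseudocounts so that unobserved bases get a low score instead of no score at all; that step is left out here to stay close to the slide.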

Page 5: Chapter 6 Profiles and Hidden Markov Models

Profiles

•Profiles are PSSMs that include gap penalty information
•This is not a trivial problem, and is incorporated in Position-Specific Iterated (PSI) BLAST
•A normal BLASTP is performed with the query sequence, homologs are obtained, and a multiple alignment is performed
•A profile is based on this alignment
•The profile is used to search the database again, and a new profile is created by adding in newly identified homologs
•This process is repeated until no new homologs are identified
•PSI-BLAST is a very sensitive approach to search for distant relatives of a family
•High sensitivity can generate a high false positive count
•Inclusion of false positives can lead to profile drift
•The user can visually inspect each iteration result to decide on inclusion of sequences
•Typically 3-5 iterations are sufficient to identify distant homologs

Page 6: Chapter 6 Profiles and Hidden Markov Models

Markov Model and Hidden Markov Model

•A Markov chain describes a series of events or states
•There is a certain probability to move from one state to the next state
•This is known as the transition probability
•Sequences can also be seen as Markov chains where the occurrence of a given nucleotide may depend on the preceding nucleotide
•A zero-order Markov model describes a state that is independent of the previous state
•In a first-order Markov model, a state depends on its direct predecessor (i.e., di-nucleotide sequences)
•In a second-order Markov model, a state depends on the two preceding states, so three nucleotides are considered together, for example codons
•Thus the frequency of transitions in tri-mers may be different in coding and non-coding regions of the genome
•The Markov model is therefore applicable to finding genes in genomes
•In a Markov model all states are observable

[Figure: Markov chain of states 1 → 2 → 3 → 4 → 5 with transition probabilities P12, P23, P34 and P45]
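As an illustration of a first-order Markov model, the transition probabilities can be estimated by counting how often each nucleotide follows each other nucleotide. This is a minimal sketch in Python 3; the function name `transition_probabilities` and the example sequence are assumptions made for the illustration.

```python
from collections import Counter, defaultdict

def transition_probabilities(seq):
    """Estimate P(next nucleotide | current nucleotide) from dinucleotide counts."""
    counts = defaultdict(Counter)
    for current, following in zip(seq, seq[1:]):
        counts[current][following] += 1
    probabilities = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probabilities[current] = {nt: n / total for nt, n in followers.items()}
    return probabilities

# In "AACA" the observed transitions are A->A, A->C and C->A
example = transition_probabilities("AACA")
print(example)
```

Comparing such tables estimated separately from coding and non-coding regions is one way to see the difference in transition frequencies mentioned above.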

Page 7: Chapter 6 Profiles and Hidden Markov Models

Hidden Markov model

[Figure: hidden Markov model with states 1 to 5 and primed states 2', 3' and 4', joined by transition probabilities P12, P23, P34, P45 and P12', P23', P34', P45'. Labels: Begin state, End state, Observable states, Hidden states]

•A Markov model may consist of observable states and unobservable or "hidden" states
•The hidden states also affect the outcome of the observed states
•In a sequence alignment, a gap is an unobserved state that influences the probability of the next nucleotide
•The probability of going from one state to the next state is called the transition probability
•In DNA, there are four symbols or states: G, A, T and C (20 in proteins)
•The probability value associated with each symbol is the emission probability
•The probability of a particular path is the product of the transition and emission probabilities along it; to find the most probable path, all possible paths have to be considered

Page 8: Chapter 6 Profiles and Hidden Markov Models

A simple two state example

                         State 1    State 2
Emission probability
A                        0.80       0.11
C                        0.02       0.08
G                        0.10       0.32
T                        0.08       0.49

Transition probability (State 1 to State 2): 0.40

•This particular Markov model has a probability of 0.80 × 0.40 × 0.32 = 0.102 to generate the sequence AG
•This particular model shows that the sequence AT has the highest probability to occur
•Where do these numbers come from?
•A Markov model has to be "trained" with examples
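The arithmetic of the two-state example can be checked by enumerating every two-letter sequence. This is a sketch in Python 3 that uses the emission and transition values from the slide; the variable and function names are chosen for the illustration only.

```python
from itertools import product

emission_state1 = {"A": 0.80, "C": 0.02, "G": 0.10, "T": 0.08}
emission_state2 = {"A": 0.11, "C": 0.08, "G": 0.32, "T": 0.49}
transition_1_to_2 = 0.40

def dinucleotide_probability(first, second):
    """Emit `first` in state 1, move to state 2, then emit `second`."""
    return emission_state1[first] * transition_1_to_2 * emission_state2[second]

probabilities = {
    a + b: dinucleotide_probability(a, b) for a, b in product("ACGT", repeat=2)
}
most_likely = max(probabilities, key=probabilities.get)
print(round(probabilities["AG"], 3), most_likely)
```

The most probable sequence is simply the pair of the most probable emission in each state, because this small model has only one possible path.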

Page 9: Chapter 6 Profiles and Hidden Markov Models

Training

•The frequencies of occurrence of nucleotides in a multiple sequence alignment are used to calculate the emission and transition probabilities of each symbol at each state
•The trained HMM is then used to test how well a new sequence fits the model
•To use an HMM for gapped sequence alignments, a state can be either a

•match/mismatch (a mismatch is a low-probability match) (observable)
•insertion (hidden)
•deletion (hidden)

[Figure: profile HMM with begin state B, match states M1 to M3, insert states I1 to I4, delete states D1 to D3 and end state E]

There is one optimal path from B to E that describes the most probable sequence and the optimal alignment to the multiply aligned sequence family

Page 10: Chapter 6 Profiles and Hidden Markov Models

Viterbi algorithm
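The Viterbi algorithm finds the single most probable path through an HMM by keeping, for every state and every position, only the best path that ends there. Below is a minimal sketch in Python 3. The two states ("AT-rich" and "GC-rich") and all probabilities are invented for the illustration and do not come from the slides; a profile HMM with match, insert and delete states follows the same principle but needs extra handling for the silent delete states.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its probability."""
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               the state that precedes s on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for symbol in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][symbol], prev)
        best.append(column)
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]

# Hypothetical toy model (assumed values, for illustration only)
states = ["AT-rich", "GC-rich"]
start_p = {"AT-rich": 0.5, "GC-rich": 0.5}
trans_p = {
    "AT-rich": {"AT-rich": 0.9, "GC-rich": 0.1},
    "GC-rich": {"AT-rich": 0.1, "GC-rich": 0.9},
}
emit_p = {
    "AT-rich": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "GC-rich": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

path, probability = viterbi("AATGCG", states, start_p, trans_p, emit_p)
print(path, probability)
```

For long sequences the products become very small, so practical implementations usually add log probabilities instead of multiplying probabilities.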

Page 11: Chapter 6 Profiles and Hidden Markov Models

Bubblesort

A brief interlude, looking at algorithms…

index    0  1  2  3  4  5  6  7  8  9
value    6  4  2  5  1  8  7  9  3  0

The algorithm

•Two loops
•The outer loop starts at index max-1 and decrements by 1 with every loop
•The inner loop starts at 0 and increments by 1 up to (but not including) the value of the outer loop
•Compare values at index and at index+1 in the inner loop
•If value[index] < value[index+1], swap them
•Continue until outer loop is 1

Page 12: Chapter 6 Profiles and Hidden Markov Models

Outer loop = 9

Inner loop    Values at index 0 to 9 after the step
(start)       6  4  0  5  1  8  7  9  3  2
0             6  4  0  5  1  8  7  9  3  2
1             6  4  0  5  1  8  7  9  3  2
2             6  4  5  0  1  8  7  9  3  2
3             6  4  5  1  0  8  7  9  3  2
4             6  4  5  1  8  0  7  9  3  2
5             6  4  5  1  8  7  0  9  3  2
6             6  4  5  1  8  7  9  0  3  2
7             6  4  5  1  8  7  9  3  0  2
8             6  4  5  1  8  7  9  3  2  0

Smallest number is now at the bottom

Page 13: Chapter 6 Profiles and Hidden Markov Models

Outer loop = 8

Inner loop    Values at index 0 to 9 after the step
(start)       6  4  5  1  8  7  9  3  2  0
0             6  4  5  1  8  7  9  3  2  0
1             6  5  4  1  8  7  9  3  2  0
2             6  5  4  1  8  7  9  3  2  0
3             6  5  4  8  1  7  9  3  2  0
4             6  5  4  8  7  1  9  3  2  0
5             6  5  4  8  7  9  1  3  2  0
6             6  5  4  8  7  9  3  1  2  0
7             6  5  4  8  7  9  3  2  1  0

Next smallest number is now at the bottom-1

Page 14: Chapter 6 Profiles and Hidden Markov Models

import random

def bubblesort(list_of_numbers):
    # Sorts the list in place into descending order (largest value first)
    for outer_loop in range(len(list_of_numbers) - 1, 0, -1):
        for index in range(outer_loop):
            if list_of_numbers[index] < list_of_numbers[index + 1]:
                # Swap the two neighbouring values
                list_of_numbers[index], list_of_numbers[index + 1] = (
                    list_of_numbers[index + 1], list_of_numbers[index])
    return list_of_numbers

numbers = list(range(10))   # get a list of numbers from 0 to 9
random.shuffle(numbers)     # shuffle the numbers
print("In random order: ", numbers)
print("In order: ", bubblesort(numbers))

Python code for Bubblesort algorithm
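To follow the pass-by-pass behaviour traced on the earlier slides, the sketch below records the list after every pass of the outer loop. It uses the same comparison as the bubblesort code, starts from the values 6 4 0 5 1 8 7 9 3 2 used in the trace, and is written for Python 3; the function name `bubblesort_passes` is made up for this illustration.

```python
def bubblesort_passes(values):
    """Return a snapshot of the list after each pass of the outer loop."""
    values = list(values)  # work on a copy so the input is left untouched
    snapshots = []
    for outer_loop in range(len(values) - 1, 0, -1):
        for index in range(outer_loop):
            if values[index] < values[index + 1]:
                values[index], values[index + 1] = values[index + 1], values[index]
        snapshots.append(list(values))
    return snapshots

passes = bubblesort_passes([6, 4, 0, 5, 1, 8, 7, 9, 3, 2])
for number, snapshot in enumerate(passes, start=1):
    print("After pass", number, ":", snapshot)
```

After the first pass the smallest value (0) sits at the last index, and after the second pass the next smallest (1) sits just above it, which matches the two traces.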

Page 15: Chapter 6 Profiles and Hidden Markov Models

def qsort2(L):
    if len(L) <= 1:
        return L
    pivot = L[0]
    less = [x for x in L if x < pivot]
    equal = [x for x in L if x == pivot]
    greater = [x for x in L if x > pivot]
    return qsort2(less) + equal + qsort2(greater)

Quicksort

Page 16: Chapter 6 Profiles and Hidden Markov Models

Applications of HMMs

•HMMs include predictive information of insertions and deletions separately
•Not arbitrary "gap penalties"
•Once HMMs are trained, they can be used to identify distant family members in a database
•Can be used for protein family classification
•Advanced gene and promoter prediction
•Transmembrane protein prediction
•Protein fold recognition
•Nucleosome positions
•HMMer (http://hmmer.wustl.edu/) suite of Linux programs

•hmmalign, aligns sequences to an HMM profile.
•hmmbuild, build a hidden Markov model from an alignment.
•hmmcalibrate, calibrate HMM search statistics.
•hmmconvert, convert between profile HMM file formats.
•hmmemit, generate sequences from a profile HMM.
•hmmfetch, retrieve specific HMM from an HMM database.
•hmmindex, create SSI index for an HMM database.
•hmmpfam, search one or more sequences against HMM database.
•hmmsearch, search a sequence database with a profile HMM.

Page 17: Chapter 6 Profiles and Hidden Markov Models

Chapter 7

Protein Motif and Domain Prediction

Page 18: Chapter 6 Profiles and Hidden Markov Models

•A motif is a conserved sequence 10-20 aa long
•E.g. Zn-finger motif
•A domain is 40-700 aa in length
•E.g. transmembrane domain
•Motifs and domains are often evolutionarily conserved
•Useful to identify functions of proteins that show little homology over the full sequence
•Motifs and domains are often identified by PSSMs and HMMs
•Motifs or domains can be stored in a database
•Unknown proteins can be matched to this database to identify motifs and domains and illuminate possible protein functions
•Motifs and domains can be stored as

•regular expression ([ST]-X-[RK])
•or as PSSMs or HMMs

Page 19: Chapter 6 Profiles and Hidden Markov Models

Regular expressions

E-X(2)-[FHM]-X(4)-{P}-L

•Invariant residues are written as plain letters (E and L above)
•Conserved alternatives in square [] brackets
•Disallowed residues in curly {} brackets
•Nonspecific positions shown by X
•Repetitions by number in round () brackets

•PROSITE (http://expasy.org/prosite/)
•High number of false negatives
•Database must be continually updated
•PSSMs, profiles and HMMs incorporate statistical information and are much more accurate
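The notation rules above map almost directly onto ordinary regular expressions. The sketch below, in Python 3, converts the example pattern E-X(2)-[FHM]-X(4)-{P}-L into a pattern for the standard `re` module. The function name `prosite_to_regex` and the test peptides are assumptions for this illustration, and only the notation elements listed on the slide are handled (no ranges such as X(2,4) and no anchors).

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Separate an optional repeat count such as "(2)" from the element itself
        match = re.fullmatch(r"(.+?)(?:\((\d+)\))?", element)
        core, count = match.group(1), match.group(2)
        if core in ("X", "x"):
            core = "."                          # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"      # disallowed residues
        # Plain letters and [..] alternatives are already valid regex syntax
        if count:
            core += "{" + count + "}"
        parts.append(core)
    return "".join(parts)

regex = prosite_to_regex("E-X(2)-[FHM]-X(4)-{P}-L")
print(regex)
print(bool(re.search(regex, "AAEKKHGGGGALAA")))  # residue before L is A: allowed
print(bool(re.search(regex, "AAEKKHGGGGPLAA")))  # residue before L is P: disallowed
```

A pattern of this kind either matches or does not, which is why the slide contrasts it with PSSMs, profiles and HMMs that score how well a sequence fits.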

Page 20: Chapter 6 Profiles and Hidden Markov Models
Page 21: Chapter 6 Profiles and Hidden Markov Models

PRINTS
Matches smaller regions of a motif, called "fingerprints", to the query
http://www.bioinf.manchester.ac.uk/dbbrowser/

BLOCKS
PSSMs or aligned sequences are used to define blocks that are larger than motifs
http://blocks.fhcrc.org/blocks/

ProDom
Database generated with PSI-BLAST
http://prodom.prabi.fr/prodom/current/html/home.php

Pfam
Contains HMMs built from smaller seed alignments of sequences from SWISSPROT and TrEMBL
http://pfam.sanger.ac.uk/

SMART
Database of HMMs based on manual structural alignments or PSI-BLAST profiles
http://smart.embl-heidelberg.de/

Page 22: Chapter 6 Profiles and Hidden Markov Models

Protein family databases

COG (Cluster of orthologous groups)
All-against-all comparison of all sequenced genomes
If the best fit is obtained in prokaryotes, archaea and eukaryotes, it is defined as a cluster
Clusters can be searched to identify the possible function of an unknown protein
http://www.ncbi.nlm.nih.gov/COG/

ProtoNet
Pairwise BLAST alignment of all protein sequences in SWISSPROT
Query sequence is searched against this database
http://www.protonet.cs.huji.ac.il/

Page 23: Chapter 6 Profiles and Hidden Markov Models

Finding distant/little conserved motifs

Expectation Maximization
Use predicted alignment of sequences
Calculate PSSM
Iterate over used sequences and modify PSSM to better fit each in turn

Gibbs Motif sampling
Use estimated alignment of all but one sequence
Calculate PSSM
Recalculate PSSM with one left-out sequence
Iterate process to convergence setting
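The Gibbs sampling steps can be sketched as follows in Python 3. This is a simplified illustration under stated assumptions, not a reference implementation: it expects DNA sequences over A, C, G and T that are at least as long as the motif, uses a pseudocount of 1, ignores background frequencies, and runs a fixed number of iterations instead of testing for convergence. The function name, parameters and example sequences are all invented for the illustration.

```python
import random

def gibbs_motif_sampler(seqs, width, iterations=200, seed=0):
    """Return one motif start position per sequence."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Begin with a random motif position in every sequence
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for left_out, seq in enumerate(seqs):
            # Build the position-specific probabilities from all other sequences
            others = [s[p:p + width]
                      for i, (s, p) in enumerate(zip(seqs, starts)) if i != left_out]
            columns = []
            for pos in range(width):
                counts = {base: 1 for base in alphabet}  # pseudocount of 1
                for motif in others:
                    counts[motif[pos]] += 1
                total = sum(counts.values())
                columns.append({base: counts[base] / total for base in alphabet})
            # Weight every window of the left-out sequence by its probability
            weights = []
            for start in range(len(seq) - width + 1):
                weight = 1.0
                for pos in range(width):
                    weight *= columns[pos][seq[start + pos]]
                weights.append(weight)
            # Sample the new position in proportion to the weights
            starts[left_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

example_seqs = ["TTGACGTACCA", "CCGACGTATTG", "AGGACGTAACC", "GACGTATTTCC"]
positions = gibbs_motif_sampler(example_seqs, width=6)
print(positions)
```

Because positions are sampled rather than chosen greedily, separate runs with different seeds can give different answers, and the sampler is usually run several times.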

Page 24: Chapter 6 Profiles and Hidden Markov Models

Weblogo

Graphical representation of the motif sequence
Highly conserved residues are shown as larger symbols
Ambiguity indicated

http://weblogo.berkeley.edu/logo.cgi

Helix-turn-helix motif of E. coli CAP family protein
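In a sequence logo, the height of each column is its information content in bits, and each symbol takes a share of that height proportional to its frequency. The sketch below, in Python 3, computes both for one alignment column. The function names are made up for this illustration, and the small-sample correction that logo programs normally apply is left out.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=4):
    """Information content in bits: log2(alphabet size) minus the column entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

def logo_heights(column, alphabet_size=4):
    """Height of each symbol in the column: its frequency times the information."""
    info = column_information(column, alphabet_size)
    total = len(column)
    return {symbol: info * n / total for symbol, n in Counter(column).items()}

# A fully conserved DNA column reaches the maximum of 2 bits
print(column_information("AAAAA"))
# A column with A, A, A, T, C is much less informative
print(logo_heights("AAATC"))
```

For protein logos such as the helix-turn-helix example, `alphabet_size=20` would be used, giving a maximum of log2(20), about 4.32 bits per column.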