hidden markov modeling, multiple alignments and structure bioinformatic modeling techniques student:...

Hidden Markov Modeling, Multiple Alignments and Structure

Bioinformatic Modeling Techniques

Student: Patricia Pearl

The basic notion of a hidden Markov model was covered during the class lectures and in our midterm.

There are more issues about its history, development, and future that we’ll discuss tonight.

There was a time when scientists started to think about using hidden Markov models for multiple protein alignments.

When was that?

Which professional field was using it already?

This is the bibliographic reference for the article that protein scientists used when they got started.

Rabiner, L. R. “A tutorial on hidden Markov models and selected applications in speech recognition.” Proceedings of the IEEE, 77 (2), 257-286. 1989.

This work was sophisticated, and a group of scientists at the University of California at Santa Cruz was able to draw an analogy between computer speech recognition and protein multiple alignments.

How did they make the analogy between speech recognition and multiple protein and DNA alignments?

              Speech Recognition                    Multiple Alignments
Alphabet      phonemes                              amino acids
Observation   words or strings of phonemes          primary sequence
Good model    assigns high probability to sounds    assigns high probability to
              that are real words                   sequences in the set
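To make the last row of the analogy concrete, here is a minimal sketch of the forward algorithm, which sums over all hidden paths to give the total probability of an observed string. The two-state model, its state names (X, Y), its two-letter alphabet, and all the numbers are made up for illustration; they do not come from the Rabiner or Krogh papers.

```python
def forward_probability(obs, states, start, trans, emit):
    """Total probability of an observation string under an HMM (forward algorithm)."""
    # Initialise with the probability of starting in each state and emitting obs[0].
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    # Extend one symbol at a time, summing over every possible previous state.
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model: state X prefers "A", state Y prefers "B",
# and the two states strictly alternate.
states = ["X", "Y"]
start = {"X": 1.0, "Y": 0.0}
trans = {"X": {"X": 0.0, "Y": 1.0}, "Y": {"X": 1.0, "Y": 0.0}}
emit = {"X": {"A": 0.9, "B": 0.1}, "Y": {"A": 0.1, "B": 0.9}}

p_member = forward_probability("ABAB", states, start, trans, emit)
p_other = forward_probability("AAAA", states, start, trans, emit)
print(p_member, p_other)
```

The string that fits the pattern the model was built around scores much higher than the one that does not, which is the sense in which a good model "assigns high probability" to real words or to sequences in the family.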

The paper they published is:

Krogh, A., Brown, M., Mian, I.S., Sjölander, K., and Haussler, D. “Hidden Markov Models in Computational Biology: Applications to Protein Modeling.” Journal of Molecular Biology, 1994, 235:1501-1531.

Sean Eddy was a student at UCSC then. In a 1996 article of his, he describes the paper referenced above as:

“The paper that introduced the use of HMM methods for protein and DNA sequence profiles.”

The software was then developed separately by two groups of scientists and grad students: one at the University of California at Santa Cruz, and one at Washington University in St. Louis, Missouri, led by UCSC’s former student Sean Eddy and his research group.

Many researchers in the subject are not at these two labs.

Two suites of software have been developed. Their differences are non-trivial.

SAM, the Sequence Alignment and Modeling System, is at UCSC. HMMER is at Washington University. Both suites can be downloaded. SAM needs UNIX. HMMER runs on many systems.

As has been emphasized in lecture, the advantage of the HMM approach is that it does not guess about gap penalties, nor about amino acids, nor about states. It bases those values on actual data: Bayesian probabilities estimated from the observed sequences.
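As a small illustration of "values based on actual data", the sketch below estimates one match state's emission probabilities and one deletion probability from a single alignment column. The column, the add-one pseudocount, and the function names are assumptions made for this example; they are not taken from SAM or HMMER.

```python
from collections import Counter


def column_emission_probs(column, alphabet, pseudocount=1.0):
    """Estimate emission probabilities from the residues seen in one column."""
    counts = Counter(c for c in column if c in alphabet)  # gap characters are skipped
    total = sum(counts.values()) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


def column_delete_prob(column, gap="-", pseudocount=1.0):
    """Estimate how often a sequence skips this column (a data-driven 'gap penalty')."""
    gaps = column.count(gap)
    # Two outcomes (residue present or deleted), so the pseudocount is added twice.
    return (gaps + pseudocount) / (len(column) + 2 * pseudocount)


# Hypothetical RNA alignment column: four A's, one G, one gap.
column = "AAAGA-"
probs = column_emission_probs(column, "ACGU")
p_delete = column_delete_prob(column)
print(probs, p_delete)
```

The pseudocount keeps unseen residues from getting probability zero; real packages use more careful priors for the same purpose.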

SAM at UCSC

Sequence Alignment and Modeling System.

<http://www.cse.ucsc.edu/research/compbio/>

Their software is based on HMMs.

They also use a mathematical approach called Dirichlet mixtures to improve detection of weak homologies and to derive hidden Markov models for protein families.
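The idea behind a Dirichlet mixture can be sketched as follows: each mixture component is a Dirichlet prior describing one typical kind of column, the observed counts decide how much weight each component gets, and the final estimate is the weighted average of the per-component estimates. The four-letter alphabet, the two components, and their parameters below are invented for illustration; published mixtures for proteins use 20 letters and more components, and this is not SAM's code.

```python
import math


def log_marginal(counts, alpha):
    """log P(counts | alpha), dropping the term that is the same for every component."""
    n_total, a_total = sum(counts), sum(alpha)
    value = math.lgamma(a_total) - math.lgamma(n_total + a_total)
    for n_i, a_i in zip(counts, alpha):
        value += math.lgamma(n_i + a_i) - math.lgamma(a_i)
    return value


def dirichlet_mixture_estimate(counts, weights, alphas):
    """Posterior mean residue probabilities under a mixture of Dirichlet priors."""
    logs = [math.log(q) + log_marginal(counts, a) for q, a in zip(weights, alphas)]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    post = [math.exp(v - top) for v in logs]
    norm = sum(post)
    post = [p / norm for p in post]  # posterior weight of each component
    n_total = sum(counts)
    probs = [0.0] * len(counts)
    for w, alpha in zip(post, alphas):
        a_total = sum(alpha)
        for i, (n_i, a_i) in enumerate(zip(counts, alpha)):
            probs[i] += w * (n_i + a_i) / (n_total + a_total)
    return probs, post


# Hypothetical mixture: component 0 favours letters 0 and 1, component 1 favours 2 and 3.
weights = [0.5, 0.5]
alphas = [[2.0, 2.0, 0.1, 0.1], [0.1, 0.1, 2.0, 2.0]]
counts = [3, 1, 0, 0]  # a column with few observations

probs, post = dirichlet_mixture_estimate(counts, weights, alphas)
print(probs, post)
```

With only four observations, the estimate already gives letter 1 much more probability than letters 2 and 3, because the counts point to the component that groups letters 0 and 1 together. That borrowing of strength from similar columns is what helps with weak homologies.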

HMMER at Washington University

Sean Eddy’s Lab Home Page

http://www.genetics.wustl.edu/eddy/publications/

This page and related pages have many articles that are available

to download.

URL for User’s Guide

http://www.psc.edu/general/software/packages/hmmer/manual/main.html

If HMMER were installed for us at Brandeis, we could all use it with the help of this manual.


HMMER

One of the approaches that Sean Eddy has taken to improve HMMER is simulated annealing, a technique from computational physical chemistry and x-ray diffraction protein crystallography. The probability values of the fundamental recursive HMM algorithm are varied by an exponential factor taken from the Boltzmann formula for physical entropy.

S = kb ln Ω

The Boltzmann constant, kb, is multiplied by t, for temperature. The temperature starts high and is gradually decreased. The product “kt” is used in an exponent, P^(1/kt). Eddy reports that it improves accuracy. (Eddy, S., 1995)
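The effect of the P^(1/kt) exponent can be seen on a toy distribution. The sketch below is only an illustration of the exponent under a made-up cooling schedule; the three "path" probabilities, the halving schedule, and the function names are assumptions, and this is not Eddy's implementation.

```python
import random


def anneal_distribution(probs, kt):
    """Raise each probability to the power 1/kt and renormalize."""
    powered = [p ** (1.0 / kt) for p in probs]
    total = sum(powered)
    return [x / total for x in powered]


def sample_index(probs, rng):
    """Draw one index at random according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical probabilities of three alternative alignment paths.
path_probs = [0.7, 0.2, 0.1]

rng = random.Random(0)
kt = 5.0  # start hot
while kt >= 0.1:
    dist = anneal_distribution(path_probs, kt)
    choice = sample_index(dist, rng)
    print(f"kt={kt:.3f} dist={[round(d, 3) for d in dist]} sampled path {choice}")
    kt *= 0.5  # assumed cooling schedule: halve kt at each step

hot = anneal_distribution(path_probs, 100.0)
unchanged = anneal_distribution(path_probs, 1.0)
cold = anneal_distribution(path_probs, 0.1)
```

At high kt the choices are nearly uniform, so the search can wander away from a poor starting alignment. At kt = 1 the original probabilities are used, and as kt falls toward zero the best path is chosen almost every time.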

Many people are developing the HMM approach for use on RNA sequences. It is worth briefly describing a recent paper that makes extensive use of RNA alignments done primarily by hand, using both primary sequence and RNA secondary structure. It produces evidence toward resolving a problem in systematic, or evolutionary, biology.

With HMMER, or any similar software, for RNA alignments, much of this work may become easier in the future and come with measurable probabilistic statistics.

“However, accurate alignment is only possible for proteins of known structure – at least for an identifiable core of residues that comprises the secondary structure elements and active site of the molecule.”

S. Eddy (1995), quoting Chothia and Lesk (1986)

[Diagram: two alternative trees, each descending from a common ancestor to Crocodile, Bird, and Mammal. Labels: “Anatomical evidence and more” OR “rRNA multiple alignments without secondary structure”.]

             10        20        30        40
     ----|----|----|----|----|----|----|----|
Seq1 A-CC-----GC--------GA--CUUG--GA-CC-CG--G
Seq2 A-CC-----GU--------GA--CUUG--GA-CC-CG--G
Seq3 AACCCCGGUGUAGGGGGAAGAACCUUGAUGAACCUCGAUG
Seq4 AACCCCGGUGCAGGGGGAAGAACCUUCAUGAACCUCGAUG

Figure 1. The problem of aligning short and long sequences.

Sequences 1 and 2 are like the reptilian and bird 18S ribosomal RNA. Sequences 3 and 4 are like mammals.
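The size of the problem in Figure 1 can be counted directly. The short sketch below takes the four rows of the figure as given and reports how many residues each sequence really has and how many alignment columns contain a gap. The variable names are only for illustration.

```python
# The four aligned rows from Figure 1.
alignment = {
    "Seq1": "A-CC-----GC--------GA--CUUG--GA-CC-CG--G",
    "Seq2": "A-CC-----GU--------GA--CUUG--GA-CC-CG--G",
    "Seq3": "AACCCCGGUGUAGGGGGAAGAACCUUGAUGAACCUCGAUG",
    "Seq4": "AACCCCGGUGCAGGGGGAAGAACCUUCAUGAACCUCGAUG",
}

# Every row of an alignment has the same number of columns.
width = len(next(iter(alignment.values())))

# Number of real residues in each sequence, ignoring gap characters.
residue_counts = {name: len(row.replace("-", "")) for name, row in alignment.items()}

# Columns in which at least one sequence has a gap.
gapped_columns = sum(
    1 for col in range(width) if any(row[col] == "-" for row in alignment.values())
)

print(width, residue_counts, gapped_columns)
```

More than half of the columns have to be filled with gaps in the short sequences, and where those gaps are placed is exactly the choice that secondary structure helps to settle.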

Reference: Xia, X., Xie, Z., Kjer, K.M. “18S ribosomal RNA and tetrapod phylogeny.” Systematic Biology. Washington: Jun 2003. Vol. 52, Iss. 3; pg 283.

Phylogenetic tree

From: Xia et al., 2003

They produced several phylogenetic trees, using different methods, with the careful manual alignments that took secondary structure into account. In all of them, the birds are closer to the crocodiles than to the mammals.

“Our research indicates that the previous discrepancy of phylogenetic results between the 18S rRNA gene and other genes is caused mainly by:

1.) misalignment of sequences

2.) the inappropriate use of the frequency parameters

3.) poor sequence quality.

When the sequences are aligned with the aid of the secondary structure of the 18S rRNA molecule and when the frequency parameters are estimated either from all sites or from the variable domains where substitutions have occurred, the 18S rRNA sequences no longer support the grouping of the avian species with the mammalian species.” Xia, X., et al., 2003

If there were more time, this presentation would also include discussions of PSI-BLAST and of SuperFam.

PSI-BLAST is a BLAST program at NCBI that searches iteratively with a position-specific profile built from a multiple alignment of its hits, an approach closely related to HMMs.

<http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html> a tutorial

<http://www.ncbi.nlm.nih.gov/BLAST/> the site




SuperFam is a relatively new website. It combines the HMM approach, 59 genomes, and all the publicly available solved structures from those genomes.

<http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/>

The head scientist of SuperFam, Prof. Cyrus Chothia, also supervised a web site called SCOP, or Structural Classification of Proteins. You might find it interesting that all of the protein structures that are “solved” are actually organized and classified.

<http://scop.mrc-lmb.cam.ac.uk/scop/>

Bibliography

Eddy, S.R. “Multiple alignment using hidden Markov models.” Proc. Int. Conf. Intell. Syst. Mol Biol. 1995;3:114-120.

Eddy, S.R. “Hidden Markov Models.” Curr Opin Struct Biol. 1996 Jun;6(3):361-5. Review.

Eddy, S.R., “Profile hidden Markov models.” Bioinformatics, 1998;14(9): 755-763. Review.

Gough, J., and Chothia, C., “SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments.” Nucleic Acids Research, 2002, Vol 30:1.

Krogh, A., Brown, M., Mian, I.S., Sjölander, K., and Haussler, D. “Hidden Markov models in computational biology: Applications to protein modeling.” Journal of Molecular Biology, 235:1501-1531, February 1994.

Rabiner, L. R. “A tutorial on hidden Markov models and selected applications in speech recognition.” Proceedings of the IEEE, 77 (2), 257-286. 1989.

Xia, X., Xie, Z., Kjer, K.M. “18S ribosomal RNA and tetrapod phylogeny.” Systematic Biology. Washington: Jun 2003. Vol. 52, Iss. 3; pg 283.
