Hidden Markov Modeling,
Multiple Alignments
and Structure
Bioinformatic Modeling Techniques
Student: Patricia Pearl
The basic notion of a hidden Markov model was
covered during the class lectures and in our midterm.
Tonight we’ll discuss more about its
history,
development,
and future.
There was a time
when scientists started to think about
using hidden Markov models
for multiple protein alignments.
When was that?
Which professional field was using it already?
This is the bibliographic reference for the article that protein scientists used when they got started.
Rabiner, L. R.
“A tutorial on hidden Markov models and
selected applications in speech recognition.”
Proceedings of the IEEE, 77 (2), 257-286. 1989.
This work was sophisticated, and a group of scientists at the University of California at Santa Cruz was able to draw an analogy between computer speech recognition and protein multiple alignments.
How did they make the analogy between speech recognition and multiple protein and DNA alignments?
                Speech Recognition               Multiple Alignments
Alphabet        phonemes                         amino acids
Observation     words or strings of phonemes     primary sequence
Good model      assigns high probability to      assigns high probability to
                sounds that are real words       sequences in the set
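To make the last row of the analogy concrete, here is a minimal sketch of the forward algorithm, which sums over all hidden state paths to give the probability a model assigns to an observed sequence. The two-state model, its state names, and all of its numbers are hypothetical values chosen for illustration; they do not come from the Krogh et al. paper or from any real protein family.

```python
def forward_probability(obs, states, start_p, trans_p, emit_p):
    """Return P(obs | model), summed over every hidden state path."""
    # alpha[s] = probability of the symbols seen so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model: two "sticky" hidden states over a 2-letter alphabet
STATES = ("hydrophobic", "polar")
START = {"hydrophobic": 0.5, "polar": 0.5}
TRANS = {
    "hydrophobic": {"hydrophobic": 0.8, "polar": 0.2},
    "polar": {"hydrophobic": 0.2, "polar": 0.8},
}
EMIT = {
    "hydrophobic": {"L": 0.9, "K": 0.1},
    "polar": {"L": 0.1, "K": 0.9},
}

# A sequence made of runs fits this model; a strictly alternating one does not.
p_blocky = forward_probability("LLLKKK", STATES, START, TRANS, EMIT)
p_alternating = forward_probability("LKLKLK", STATES, START, TRANS, EMIT)
print(p_blocky, p_alternating)
```

In this toy setting the "good model" gives the run-like sequence a clearly higher probability than the alternating one, which is the same sense in which a family model should favor sequences in the set.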
The paper they published is:
Krogh, A., Brown, M., Mian, I.S., Sjölander, K., and Haussler, D.
“Hidden Markov Models in Computational Biology:
Applications to Protein Modeling.”
Journal of Molecular Biology, 1994, 235:1501-1531.
In a 1996 review article, Sean Eddy
describes the paper referenced above as:
“The paper that introduced the use of HMM methods for protein
and DNA sequence profiles.”
The software was then developed separately by two groups of scientists and graduate students. Many other researchers in the field work outside these two labs.
The two groups are at the University of California at Santa Cruz and at Washington University in St. Louis, Missouri, where Sean Eddy leads a research group.
Two suites of software have been developed. Their differences are non-trivial.
SAM (Sequence Alignment and Modeling System) at UCSC. HMMER at Washington University. Both suites can be downloaded. SAM needs UNIX. HMMER runs on many systems.
As has been emphasized in lecture, the advantage of the HMM
approach is that it does not guess about gap penalties, nor about
amino acids nor states. It bases those values on actual data:
Bayesian probabilities estimated from observed sequences.
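As a rough illustration of "basing those values on actual data", the sketch below estimates per-column emission probabilities and a per-column gap probability by counting symbols in a small alignment. The alignment, the four-letter alphabet, the add-one pseudocount, and the function names are all assumptions made for this example; real packages use richer priors and a full set of state transitions.

```python
from collections import Counter

# Hypothetical toy alignment; "-" marks a gap.
ALIGNMENT = ["ACG-T", "ACGAT", "AC--T", "TCGAT"]
ALPHABET = "ACGT"


def column_emissions(column, alphabet=ALPHABET, pseudocount=1.0):
    """Estimate symbol probabilities for one column from observed counts."""
    counts = Counter(ch for ch in column if ch != "-")
    total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}


def gap_probability(column, pseudocount=1.0):
    """Estimate how often a sequence skips this column (simplified)."""
    gaps = sum(1 for ch in column if ch == "-")
    # Two outcomes (gap or residue), so the pseudocount is added twice below.
    return (gaps + pseudocount) / (len(column) + 2.0 * pseudocount)


columns = list(zip(*ALIGNMENT))  # transpose rows into columns
emissions = [column_emissions(col) for col in columns]
gap_probs = [gap_probability(col) for col in columns]

for i, (em, gp) in enumerate(zip(emissions, gap_probs), start=1):
    print(i, {a: round(p, 3) for a, p in em.items()}, round(gp, 3))
```

The point of the sketch is only that the gap cost differs from column to column because it is counted, not chosen by hand.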
SAM at UCSC
Sequence Alignment and Modeling System.
<http://www.cse.ucsc.edu/research/compbio/>
Their software is based on HMMs.
They also use a mathematical approach called
Dirichlet mixtures to improve detection of weak
homologies and to derive hidden Markov models
for protein families.
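A minimal sketch of how a Dirichlet mixture can turn a few observed counts into smoothed probabilities is given below. It uses the standard posterior-mean estimate: each mixture component is weighted by how well it explains the counts, and the components' pseudocount estimates are then averaged. The three-letter alphabet, the two components, and their weights are hypothetical, and this is not SAM's code.

```python
from math import exp, lgamma, log


def log_marginal(counts, alpha):
    """log P(counts | one Dirichlet component).

    The multinomial coefficient is omitted because it is identical for
    every component and cancels when the weights are normalized.
    """
    result = lgamma(sum(alpha)) - lgamma(sum(counts) + sum(alpha))
    for n_i, a_i in zip(counts, alpha):
        result += lgamma(n_i + a_i) - lgamma(a_i)
    return result


def mixture_posterior_mean(counts, weights, alphas):
    """Posterior-mean symbol probabilities under a Dirichlet mixture prior."""
    logs = [log(w) + log_marginal(counts, a) for w, a in zip(weights, alphas)]
    top = max(logs)  # subtract the maximum for numerical stability
    post = [exp(v - top) for v in logs]
    norm = sum(post)
    post = [p / norm for p in post]  # P(component | counts)
    total = sum(counts)
    return [
        sum(p * (n_i + a[i]) / (total + sum(a)) for p, a in zip(post, alphas))
        for i, n_i in enumerate(counts)
    ]


# Hypothetical mixture over a 3-letter alphabet
WEIGHTS = [0.5, 0.5]
ALPHAS = [[5.0, 0.5, 0.5],   # component that favors the first letter
          [0.5, 3.0, 3.0]]   # component that favors the other two

estimate = mixture_posterior_mean([2, 0, 0], WEIGHTS, ALPHAS)
print([round(p, 3) for p in estimate])
```

Even with only two observations, the unseen letters keep a nonzero probability, which is what helps with weak homologies when few sequences are available.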
HMMER at Washington University in St. Louis
Sean Eddy’s Lab Home Page
http://www.genetics.wustl.edu/eddy/publications/
This page and related pages have many articles that are available
to download.
URL for User’s Guide
http://www.psc.edu/general/software/packages/hmmer/manual/main.html
If HMMER were installed for us at Brandeis, we could all
use it with the help of this manual.
HMMER
One of the approaches that Sean Eddy has taken to improve
HMMER is simulated annealing, a technique from computational
physical chemistry and X-ray protein crystallography. The
probability values of the fundamental recursive HMM algorithm
are varied by an exponential factor taken from the Boltzmann
formula for physical entropy.
S = k_B ln Ω
The Boltzmann constant, k_B, is multiplied by T, the temperature.
T starts at a high value and is decreased step by step. The product
kT is used in an exponent: each probability P becomes P^(1/kT).
Eddy reports that this improves accuracy. (Eddy, S., 1995)
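The effect of the exponent 1/kT can be sketched in a few lines. In this illustration a set of probabilities is raised to the power 1/kT and renormalized, then one choice is sampled at each temperature while kT is lowered. The starting temperature, the cooling factor, the example probabilities, and the function names are assumptions for the sketch and are not taken from HMMER.

```python
import random


def tempered(probs, kT):
    """Raise each probability to the power 1/kT and renormalize."""
    powered = [p ** (1.0 / kT) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def anneal_choices(probs, kT_start=5.0, kT_end=0.1, cooling=0.5, seed=0):
    """Sample one index per temperature step while kT is lowered."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    kT = kT_start
    picks = []
    while kT >= kT_end:
        weights = tempered(probs, kT)
        choice = rng.choices(range(len(probs)), weights=weights)[0]
        picks.append((kT, choice))
        kT *= cooling
    return picks


PROBS = [0.6, 0.3, 0.1]  # hypothetical probabilities of three alternatives
hot = tempered(PROBS, 10.0)   # nearly uniform: exploration
cold = tempered(PROBS, 0.1)   # nearly all mass on the best alternative
picks = anneal_choices(PROBS)
print([round(p, 3) for p in hot], [round(p, 3) for p in cold])
```

At high kT the choices are close to uniform, so poor alternatives are still explored; as kT falls the sampling concentrates on the most probable alternative, and at kT = 1 the original probabilities are recovered.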
Many people are developing the HMM approach for
use on RNA sequences. It is worthwhile to briefly
describe a recent paper that makes extensive use of
RNA alignments done primarily by hand, using both primary
sequence and secondary RNA structure. It produces
evidence toward resolving a problem in systematic biology,
or evolutionary biology.
With HMMER, or any similar software for RNA
alignments, much of this work may become much easier in the
future and may come with measurable probabilistic statistics.
“However, accurate alignment is only possible for proteins of known structure – at least for an identifiable core of residues that comprises the secondary structure elements and active site of the molecule.”
S. Eddy (1995), quoting Chothia and Lesk (1986)
[Diagram: two alternative trees relating crocodile, bird, and mammal to a common ancestor. One is supported by anatomical evidence and more; the other (OR) by rRNA multiple alignments made without secondary structure.]
             10        20        30        40
     ----|----|----|----|----|----|----|----|
Seq1 A-CC-----GC--------GA--CUUG--GA-CC-CG--G
Seq2 A-CC-----GU--------GA--CUUG--GA-CC-CG--G
Seq3 AACCCCGGUGUAGGGGGAAGAACCUUGAUGAACCUCGAUG
Seq4 AACCCCGGUGCAGGGGGAAGAACCUUCAUGAACCUCGAUG
Figure 1. The problem of aligning short and long sequences.
Sequences 1 and 2 are like the reptilian and bird ribosomal 18S RNA. Sequences 3 and 4 are like mammals.
Reference: Xia, X., Xie, Z., Kjer, K.M. “18S ribosomal RNA and tetrapod phylogeny.” Systematic Biology. Washington: Jun 2003. Vol. 52, Iss. 3; pg 283.
Phylogenetic tree
From: Xia et al., 2003
They produced several phylogenetic trees, using different methods, with the careful manual alignments that took secondary structure into account. In all of them, the birds are closer to the crocodiles than to the mammals.
“Our research indicates that the previous discrepancy of phylogenetic results between the 18S rRNA gene and other genes is caused mainly by:
1.) misalignment of sequences
2.) the inappropriate use of the frequency parameters
3.) poor sequence quality.
When the sequences are aligned with the aid of the secondary structure of the 18S rRNA molecule and when the frequency parameters are estimated either from all sites or from the variable domains where substitutions have occurred, the 18S rRNA sequences no longer support the grouping of the avian species with the mammalian species.” Xia, X., et al., 2003
If there were more time, this presentation would also
include discussions of PSI-BLAST and of SuperFam.
PSI-BLAST is a BLAST program at NCBI that iteratively builds
a position-specific scoring matrix (a profile, closely related
to a profile HMM) from a multiple alignment of its hits.
<http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html> a tutorial
<http://www.ncbi.nlm.nih.gov/BLAST/> the site
SuperFam (SUPERFAMILY) is a relatively new website. It applies the HMM
approach to 59 genomes, together with all the publicly available solved
structures from those genomes.
<http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY/>
The head scientist of SuperFam, Prof. Cyrus Chothia,
also supervised a web site called SCOP, or Structural
Classification of Proteins. You might find it interesting that all of the
protein structures that are “solved” are actually organized and classified.
<http://scop.mrc-lmb.cam.ac.uk/scop/>
Bibliography
Eddy, S.R. “Multiple alignment using hidden Markov models.” Proc. Int. Conf. Intell. Syst. Mol Biol. 1995;3:114-120.
Eddy, S.R. “Hidden Markov Models.” Curr Opin Struct Biol. 1996 Jun;6(3):361-5. Review.
Eddy, S.R., “Profile hidden Markov models.” Bioinformatics, 1998;14(9): 755-763. Review.
Gough, J., and Chothia, C., “SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments.” Nucleic Acids Research, 2002, Vol 30:1.
Krogh, A., Brown, M., Mian, I.S., Sjölander, K., and Haussler, D. “Hidden Markov models in computational biology: Applications to protein modeling.” Journal of Molecular Biology, 235:1501-1531, February 1994.
Rabiner, L. R. “A tutorial on hidden Markov models and selected applications in speech recognition.” Proceedings of the IEEE, 77 (2), 257-286. 1989.
Xia, X., Xie, Z., Kjer, K.M. “18S ribosomal RNA and tetrapod phylogeny.” Systematic Biology. Washington: Jun 2003. Vol. 52, Iss. 3; pg 283.