Doug Raiford, Lesson 5. Dynamic programming methods, Needleman-Wunsch (global alignment) ...



Upload: alexandrina-thornton

Post on 08-Jan-2018


TRANSCRIPT

Doug Raiford
Lesson 5

Dynamic programming methods
  Needleman-Wunsch (global alignment)
  Smith-Waterman (local alignment)
BLAST

Fixed: best
Linear: next best
Polynomial (n^2): not bad
Exponential (3^n): very bad

BLAST
  Fast (linear)
  But not as sensitive
  Speed versus sensitivity

Similarity matrix
  Especially with amino acids
  Some amino acids have similar chemical characteristics
  Similarity to all 8,000 possible 3-mers is calculated
  Usually ~50 are above a threshold
  All of these ~50 are considered hits when searching

Matrices
  PAM (Point Accepted Mutation)
    Built from observed substitution rates in closely related proteins
  BLOSUM (BLOck SUbstitution Matrix)
    Built from observed substitution rates in evolutionarily divergent proteins

PSI-BLAST (Position-Specific Iterative)
  Align using the default similarity matrix
  At each query location, build a Position-Specific Scoring Matrix (PSSM) based on observed search and alignment results
  Repeat with the new matrix until results no longer change
  Builds sensitivity by specifying allowed similarity at each position
  Slower, but still faster than local alignment

Central to bioinformatics
  Needed for
    Phylogeny
    Protein function
    Protein structure
    Structure/function
    Drug discovery

Some parts of proteins are very important to maintain function
  Must be similar from species to species
  Can we spot these regions through alignment?

atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctcgatacgtgccgcaggagatcaggactttcacct--tggatcatgcgaccgtacctac

Often conserved regions are near active sites
  Ligand binding sites (docking)
  Protein-to-protein interface
  Important regions for tertiary structure
  Ligand: small molecule, target of protein, e.g. O2 is the ligand for hemoglobin
  Substrate: a molecule upon which an enzyme acts

What if we look at more proteins?
  Increase our confidence?
  But how to go about performing multiple sequence alignment?
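As a small illustration of spotting conserved regions in the two-sequence alignment shown earlier, the following Python sketch scans the aligned pair column by column and reports runs of identical, ungapped columns. The function name `conserved_runs` and the `min_len` cutoff of 8 are assumptions made for this example; the slides do not specify how a conserved region should be detected.

```python
def conserved_runs(a, b, min_len=8):
    """Return (start, end, text) for runs of identical, ungapped columns.

    `a` and `b` are two rows of an existing alignment (same length,
    '-' marks a gap). `min_len` is an illustrative cutoff, not a
    value taken from the slides.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    start = None
    for i, (x, y) in enumerate(zip(a, b)):
        match = x == y and x != "-"
        if match and start is None:
            start = i                      # a new run begins here
        elif not match and start is not None:
            if i - start >= min_len:       # keep only long enough runs
                runs.append((start, i, a[start:i]))
            start = None
    # Close a run that extends to the end of the alignment.
    if start is not None and len(a) - start >= min_len:
        runs.append((start, len(a), a[start:]))
    return runs


# The aligned pair from the slide.
seq1 = "atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag"
seq2 = "acctcgatacgtgccgcaggagatcaggactttcacct--tggatcatgcgaccgtacctac"

for start, end, text in conserved_runs(seq1, seq2):
    print(start, end, text)
```

Exact identity is the simplest possible criterion. For proteins, a similarity matrix score per column would be a more natural choice than strict equality.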
atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctccatacgtgccccaggagatctggactttcacc---tggatcatgcgaccgtacctac
t-atgg-t-cgtgccgcaggagatcaggactttca-gt--g-aatcatctgg-cgc--c-aa
t--tcgt-ac-tgccccaggagatctggactttcaaa---ca-atcatgcgcc-g-tc-tat
aattccgtacgtgccgcaggagatcaggactttcag-t--a-tatcatctgtc-ggc--tag

Hyper-dimensional dynamic programming
  Becomes exponential with respect to the number of sequences
  O(n^L) with L = number of sequences

Determine all pair-wise distances
  Fast: number of l-mer matches
  Slower: full global alignments
Start with the closest pair and align them
Then align the next closest to those two
And so on...

ClustalW: cluster-alignment

Profile: matrix of real values representing the probability of amino acids at each position in a corresponding multiple sequence alignment
  A modification of the Smith-Waterman algorithm
  Degree to which an aa is preferred is the degree of match between the profile and the sequence

Consensus1 M.ERS.HLPEG.PFAAALSGARFAAQSSGN.ASVL..DWNVLP.E 38
| : : : || : ::::: : |: | ::|: : | :
OPSD_XENLA 1 MNG.GTE..EGPN.NFYVP.PMS...SN.NKTGVVRSP.P..PFD 33

Mistakes made early in a progressive approach are propagated throughout the process
  Once aligned, not revisited
  Iterative methods devised to revisit
  Newest version of ClustalW (version 2) includes iteration

Other MSA apps
  T-Coffee
  PSalign
  DIALIGN
  MUSCLE

Height of letter represents how prevalent that letter is at that position

Database Searches

Scores are affected by sequence lengths
  If you want scores that can be compared across different query lengths, you need to normalize
  The term "bit" comes from the fact that probabilities are stored as log2 values (binary, bit)
  Done so that scores can be added across the length of the sequence instead of multiplied
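The point about storing probabilities as log2 values can be sketched in a few lines of Python. The per-position probabilities below are made-up illustrative numbers, and the helper names `log2_score` and `product` are hypothetical; this is not BLAST's actual scoring code, only a demonstration that summing log2 values is equivalent to multiplying the probabilities and stays stable over long sequences.

```python
import math


def log2_score(probs):
    """Sum of log2 probabilities: a score measured in bits."""
    return sum(math.log2(p) for p in probs)


def product(probs):
    """Plain product of the same probabilities, for comparison."""
    result = 1.0
    for p in probs:
        result *= p
    return result


# Hypothetical per-position probabilities (powers of two for clarity).
short = [0.5, 0.25, 0.125]
print(product(short))        # product of the raw probabilities
print(log2_score(short))     # same information as a sum of bits

# For a long sequence the raw product underflows to zero in floating
# point, while the sum of log2 values remains a usable finite number.
long_run = [0.05] * 2000
print(product(long_run))
print(log2_score(long_run))
```

Working in log space is also what makes the score additive across positions, which is why the sum can be accumulated position by position along the sequence.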