Multiple Alignment
Stuart M. Brown
NYU School of Medicine
Pairwise Alignment
The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
The best solution seems to be an approach called Dynamic Programming.
Dynamic Programming
Dynamic Programming is a very general programming technique. It is applicable when a large search space can be structured into a succession of stages, such that:
the initial stage contains trivial solutions to sub-problems
each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions in an earlier stage
the final stage contains the overall solution
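The three stages can be illustrated with a minimal sketch of global alignment scoring by dynamic programming. The scoring values (match, mismatch, and a single linear gap penalty) are illustrative assumptions, not values taken from any particular program.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best end-to-end alignment of a and b (assumed scoring)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initial stage: trivial sub-problems (a prefix aligned against nothing).
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Later stages: each cell depends on exactly three earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Final stage: the last cell holds the overall solution.
    return score[-1][-1]
```

Recovering the alignment itself would require a traceback through the table, which this sketch leaves out.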
Global vs. Local Alignments
Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.
Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
GAP
The GCG program GAP implements the Needleman and Wunsch global alignment algorithm.
Global algorithms are often not effective for highly diverged sequences and do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.
Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.
GAP is useful when you want to force two sequences to align over their entire length
BESTFIT
The GCG program BESTFIT implements the Smith-Waterman local alignment algorithm.
FASTA and BLAST are also local alignment algorithms (fast heuristic ones).
NCBI has a “BLAST 2 Sequences” feature on its website:
http://www.ncbi.nlm.nih.gov/gorf/bl2.html
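For comparison with global alignment, here is a minimal sketch of Smith-Waterman style local scoring. The scoring values are illustrative assumptions and are not BESTFIT's defaults. The key difference is that a cell score is never allowed to drop below zero, so an alignment can start and stop anywhere.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (assumed scoring)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a fresh local alignment here
                prev[j - 1] + pair,  # extend diagonally
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Two sequences that share only one short conserved region still get a positive score from that region alone, which is the behavior the slides describe for diverged sequences.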
Pairwise Alignment on the Web
The ALIGN global alignment program is available at several servers:
http://molbiol.soton.ac.uk/compute/align.html
http://www2.igh.cnrs.fr/bin/align-guess.cgi
The LALIGN local alignment program is available at several servers:
http://www2.igh.cnrs.fr/bin/lalign-guess.cgi
http://www.ch.embnet.org/software/LALIGN_form.html
LFASTA uses FASTA for local alignment of 2 sequences:
http://pbil.univ-lyon1.fr/lfasta.html
BLAST 2 Sequences (NCBI):
http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html
Multiple Alignments
In theory, making an optimal alignment between two sequences is computationally straightforward (Smith-Waterman algorithm), but aligning a large number of sequences using the same method is almost impossible.
The problem increases exponentially with the number of sequences involved (the search space is the product of the sequence lengths).
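To make the growth concrete, the short sketch below counts the cells in a full dynamic programming lattice, assuming one axis per sequence with length + 1 positions on each axis. The function name is hypothetical and the count is only a rough measure of the work involved.

```python
from math import prod


def dp_lattice_cells(lengths):
    """Cells in an N-dimensional DP lattice: product of (length + 1)."""
    return prod(n + 1 for n in lengths)


# Two, three, and six protein-sized sequences of 300 residues each.
for count in (2, 3, 6):
    print(count, dp_lattice_cells([300] * count))
```

Each added sequence multiplies the lattice size by roughly its length, which is why the exact method stops being practical after a handful of sequences.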
Optimal Alignment
For a given group of sequences, there is no single "correct" alignment, only an alignment that is "optimal" according to some set of calculations.
Determining what alignment is best for a given set of sequences is really up to the judgement of the investigator.
Progressive Pairwise Methods
Most of the available multiple alignment programs use some sort of incremental or progressive method that makes pairwise alignments, then adds new sequences one at a time to these aligned groups.
This is an approximate method!
PILEUP
PILEUP is the multiple alignment program in the GCG package.
CLUSTAL is another popular program (also available on the RCR server) that uses a similar algorithm.
The PILEUP Algorithm
First, PILEUP calculates approximate pairwise similarity scores between all sequences to be aligned, and they are clustered into a dendrogram (tree structure).
Then the most similar pairs of sequences are aligned.
Averages (similar to consensus sequences) are calculated for the aligned pairs.
New sequences and clusters of sequences are added one by one, according to the branching order in the dendrogram.
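The clustering step can be sketched as follows. This is not PILEUP's actual code: the similarity measure (shared 2-letter words) and the average-linkage joining rule are stand-in assumptions, and the sketch only reports the order in which sequences and clusters would be joined, without performing the alignments.

```python
from itertools import combinations


def kmer_similarity(a, b, k=2):
    """Approximate similarity: Jaccard index of k-letter word sets (assumed measure)."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def joining_order(seqs):
    """seqs maps name -> sequence. Returns the list of cluster pairs in join order."""
    clusters = [(name,) for name in seqs]
    merges = []

    def average_similarity(c1, c2):
        total = sum(kmer_similarity(seqs[x], seqs[y]) for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Join the most similar pair of clusters first.
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: average_similarity(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges
```

In a real progressive aligner each join in this list would trigger a pairwise alignment of the two clusters, which is where errors made early get locked in.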
PILEUP Considerations
Since the alignment is calculated on a progressive basis, the order of the initial sequences can affect the final alignment.
PILEUP parameters: 2 gap penalties (gap insert and gap extend) and an amino acid comparison matrix.
PILEUP will refuse to align sequences that require too many gaps or mismatches.
PILEUP will take quite a while to align more than about 10 sequences
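The two gap penalties listed among the PILEUP parameters are commonly combined into an affine gap cost: one charge for opening a gap plus a smaller charge per gap position. The sketch below assumes that common form. The numeric values are illustrative only, and programs differ on whether the first gap position is also charged the extension penalty.

```python
def affine_gap_cost(length, gap_insert=8, gap_extend=2):
    """Penalty for one gap of the given length (assumed affine form)."""
    if length <= 0:
        return 0
    return gap_insert + gap_extend * length


# One gap of five positions costs less than five separate one-position gaps.
print(affine_gap_cost(5), 5 * affine_gap_cost(1))
```

Lowering the insert penalty lets the program open gaps more freely, while lowering the extend penalty favors fewer, longer gaps.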
Instructions for running PILEUP
PILEUP uses a list of sequence files as input
You can use output from a FASTA or LOOKUP search as a list or make your own list in a text editor
A list file can include files from your own directory and/or GCG database files.
LIST file format
List files always begin with two dots (..)
..
gp:S31321
gp:Yno3_Yeast
S51900.pep
Yan2_Schpo
Ypd1_Caeel
A36205
Mpp1_Rat begin:100 end:345
B46665.pep
Ymxg_Bacsu begin:150 end:464
A48043.pep
List files can also include Begin and End positions within a sequence
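As a rough illustration of the list file layout, the sketch below reads such a file into a list of entries. It is not part of GCG. It assumes one entry per line, that the list starts after a line ending in two dots, and that begin: and end: are the only attributes of interest.

```python
def parse_list_file(text):
    """Parse list-file text into dicts with 'name' and optional 'begin'/'end'."""
    entries = []
    in_list = False
    for line in text.splitlines():
        line = line.strip()
        if not in_list:
            # Everything up to the ".." divider is skipped (assumption).
            in_list = line.endswith("..")
            continue
        if not line:
            continue
        name, *attributes = line.split()
        entry = {"name": name}
        for attribute in attributes:
            key, _, value = attribute.partition(":")
            if key.lower() in ("begin", "end") and value.isdigit():
                entry[key.lower()] = int(value)
        entries.append(entry)
    return entries


example = """..
gp:S31321
Mpp1_Rat begin:100 end:345
A48043.pep
"""
for entry in parse_list_file(example):
    print(entry)
```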
PILEUP @myseqs.list
Now at the > prompt, type PILEUP and the name of the file that is your list of sequence names.
However, GCG requires that you must precede the name of your list file with the @ character.
So the command looks like this:
> PILEUP @myseqs.list
PILEUP Output
> more myseqs.msf
          1501                                              1550
Hsirf2    SERPSKKGKK PKTEKEDKVK HIKQEPVESS LGLSNGVSDL SPEYAVLTST
Muirf2    SERPSKKGKK PKTEKEERVK HIKQEPVESS LGLSNGVSGF SPEYAVLTSA
Chirf2    SERPSKKGKK TKSEKDDKFK QIKQEPVESS FGI.NGLNDV TSDY.FLSSS
Muirf1    LTRNQRKERK SKSSRDTKSK TKRKLCGDVS PDTFS..DGL SSSTLPDDHS
Ratirf1   LTKNQRKERK SKSSRDTKSK TKRKLCGDSS PDTLS..DGL SSSTLPDDHS
Hsirf1    LTKNQRKERK SKSSRDAKSK AKRKSCGDSS PDTFS..DGL SSSTLPDDHS
Chkirf1a  LTKDQKKERK SKSSREARNK SKRKLYEDMR MEESA..ERL TSTPLPDDHS
Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Hsirf5    GPAPTDSQPP EDYSFGAGEE EEEEEELQRM LPSLSLTDAV QSGPHMTPYS
Mmuirf6   IPQPQGS.VI NPGSTGSAPW DEKDNDVDED EEEDELEQSQ HHVPIQDTFP
Hump48    ...PPGIVSG QPGTQKVPSK RQHSSVSSER KEEEDAMQNC TLSPSVLQDS
Mup48     ...PAGTLPN QPRNQKSPCK RSISCVSPER EEN...MENG RTNGVVNHSD
Hsirf4    ...PEGAKKG AKQLTLEDPQ MSMSHPYTMT TPYPSLPA.Q VHNYMMPPLD
Mupip     ...PEGAKKG AKQLTLDDTQ MAMGHPYPMT APYGSLPAQQ VHNYMMPPHD
Huicsbp   ...PEEDQK. .......... .......... CKLGVATAGC VNEVTEMECG
Muicsbp   ...PEEEQK. .......... .......... CKLGVAPAGC MSEVPEMECG
Chkicsbp  ...PEEEQK. .......... .......... CKIGVGNGSS LTDVGDMDCS

          1551                                              1600
Hsirf2    IKNEVDSTVN IIVVGQSHLD SNIENQEIVT NPPDICQVVE VTTESDEQPV
Muirf2    IKNEVDSTVN IIVVGQSHLD SNIEDQEIVT NPPDICQVVE VTTESDDQPV
Chirf2    IKNEVDSTVN IVVVGQPHLD GSSEEQVIVA NPPDVCQVVE VTTESDEQPL
Muirf1    SYTTQGYLGQ DLDMER.DIT PALSPCVVSS SLSEWHMQMD I.IPDSTTDL
Ratirf1   SYTAQGYLGQ DLDMDR.DIT PALSPCVVSS SLSEWHMQMD I.MPDSTTDL
Hsirf1    SYTVPGYM.Q DLEVEQ.ALT PALSPCAVSS TLPDWHIPVE V.VPDSTSDL
Chkirf1a  SYTAHDYTGQ EVEVENTSIT LDLSSCEVSG SLTDWRMPME IAMADSTNDI
Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Hsirf5    LLKEDVKWPP TLQPPTLQPP VVLGPPAPDP SPLAPPPGNP AGFRELLSEV
Mmuirf6   FL........ NINGSPMAPA SVGNCSVGNC SPESVWP... ......KTEP
Hump48    LNNEEEGASG GAVHSDIGSS SSSSSPEPQE VTDTTEAPFQ ........GD
Mup48     SGSNIGGGGN GSNRSD...S NSNCNSELEE GAGTTEATIR ........ED
Hsirf4    RSWRDYVPDQ PHPEIPYQCP MTFGPRGHHW QGPACENGCQ VTGTFYACAP
Mupip     RSWRDYAPDQ SHPEIPYQCP VTFGPRGHHW QGPSCENGCQ VTGTFYACAP
Huicsbp   RSEIDELIKE .PSVDDYMGM IKRSPSP... P.DACRS..Q LLPDWWAHEP
Muicsbp   RSEIEELIKE .PSVDEYMGM TKRSPSP... P.EACRS..Q ILPDWWVQQP
Chkicsbp  PSAIDDLMKE PPCVDEYLGI IKRSPSP... PQETCRN..P PIPDWWMQQP
PILEUP options
For a first try, take the default options, but give the output file a meaningful name.
If you don’t get a good alignment, try a less stringent matrix and/or gap penalties.
> PILEUP -matr=oldpep.cmp
It is a good idea to run PILEUP in batch mode if you have more than 10 sequences to align:
> PILEUP -bat
CLUSTAL
CLUSTAL is a stand-alone (i.e. not integrated into GCG) multiple alignment program that is superior in some respects to PILEUP.
Gap penalties can be adjusted based on specific amino acid residues, regions of hydrophobicity, proximity to other gaps, or secondary structure.
It can re-align just selected sequences or selected regions in an existing alignment.
It can compute phylogenetic trees from a set of aligned sequences.
There are also Mac and PC versions with a nice graphical interface (CLUSTALX).
Using CLUSTAL
On mcrcr0 type: clustal
CLUSTAL can only work with sequences in multi-sequence FASTA format.
The GCG program TOFASTA can convert lists of file names into FASTA multi-sequence format.
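For readers without access to TOFASTA, multi-sequence FASTA is simple enough to write by hand: each record is a header line starting with ">" followed by the sequence on one or more lines. The sketch below is a hypothetical helper, not a replacement for TOFASTA, and the 60-character line width is just a common convention.

```python
def to_multi_fasta(records, width=60):
    """Format (name, sequence) pairs as multi-sequence FASTA text."""
    lines = []
    for name, sequence in records:
        lines.append(f">{name}")
        # Wrap the sequence into fixed-width lines.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


print(to_multi_fasta([("Hsirf2", "SERPSKKGKKPKTEKEDKVK"),
                      ("Muirf2", "SERPSKKGKKPKTEKEERVK")], width=10))
```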
Multiple Alignment tools on the Web
There are a variety of multiple alignment tools available for free on the web.
CLUSTAL is available from a number of sites (with a variety of restrictions)
Other algorithms are available too.
Watch out for “experimental” algorithms; there may be a good reason why you have never heard of some oddball program.
Some URLs
EMBL-EBI
http://www.ebi.ac.uk/clustalw/
BCM Search Launcher: Multiple Alignment
http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html
Multiple Sequence Alignment for Proteins (Wash. U. St. Louis)
http://www.ibc.wustl.edu/service/msa/
Editing Multiple Alignments
There are a variety of tools that can be used to modify a multiple alignment.
These programs can be very useful in formatting and annotating an alignment for publication.
An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by one of the automated alignment programs.
GCG alignment editors
Alignments produced with PILEUP (or CLUSTAL) can be adjusted with LINEUP.
Nicely shaded printouts can be produced with PRETTYBOX
GCG's SeqLab X-Windows interface has a superb multiple sequence editor - the best editor of any kind.
Other editors
The MACAW and SeqVu programs for Macintosh and GeneDoc and DCSE for PCs are free and provide excellent editor functionality.
Many “comprehensive” molecular biology programs include multiple alignment functions: MacVector, OMIGA, Vector NTI, and GeneTool/PepTool all include a built-in version of CLUSTAL.
SeqVu
Editors on the Web
Check out CINEMA (Colour INteractive Editor for Multiple Alignments).
It is an editor created completely in JAVA (old browsers beware).
It includes a fully functional version of CLUSTAL, BLAST, and a DotPlot module.
http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/