multiple alignment stuart m. brown nyu school of medicine

Multiple Alignment

Stuart M. Brown

NYU School of Medicine

Pairwise Alignment

The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.

The best solution seems to be an approach called Dynamic Programming.

Dynamic Programming

Dynamic Programming is a very general programming technique. It is applicable when a large search space can be structured into a succession of stages, such that:

the initial stage contains trivial solutions to sub-problems

each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions in an earlier stage

the final stage contains the overall solution
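The three stages above can be sketched for the pairwise alignment case. This is a minimal illustration, not the code of any particular program: the function name and the scoring values (match, mismatch, gap) are assumptions chosen for readability, and only the score is returned, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of strings a and b.

    Scoring values are illustrative assumptions, not program defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initial stage: trivial sub-problems (a prefix aligned against nothing).
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Later stages: each cell is computed from three earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)

    # Final stage: the last cell holds the overall solution.
    return score[-1][-1]


print(global_alignment_score("ACGT", "AGT"))
```

The table has (len(a) + 1) x (len(b) + 1) cells and each cell looks at exactly three neighbours, which is why the two-sequence problem stays tractable.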

Global vs. Local Alignments

Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.

Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
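A local alignment score can be sketched with a small dynamic programming table in which no cell is allowed to fall below zero, so a region of similarity can start and stop anywhere. This is a simplified sketch in the spirit of local alignment; the function name and scoring values are assumptions for illustration only.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the score of the best-scoring local region shared by a and b.

    Scoring values are illustrative assumptions, not program defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # Resetting to 0 lets a new local region begin at any cell.
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best


# The shared region GATTACA is found even though the flanks are unrelated.
print(local_alignment_score("CCCCGATTACA", "GATTACAGGGG"))
```

A global method would be forced to pay for the unrelated flanking residues; the local score reflects only the conserved region.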

GAP

The GCG program GAP implements the Needleman and Wunsch global alignment algorithm.

Global algorithms are often not effective for highly diverged sequences and do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.

Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.

GAP is useful when you want to force two sequences to align over their entire length

BESTFIT

The GCG program BESTFIT implements the Smith-Waterman local alignment algorithm.

FASTA and BLAST are local alignment algorithms

NCBI has a “BLAST 2 Sequences” feature on its website:

http://www.ncbi.nlm.nih.gov/gorf/bl2.html

Pairwise Alignment on the Web

The ALIGN global alignment program is available at several servers:

http://molbiol.soton.ac.uk/compute/align.html

http://www2.igh.cnrs.fr/bin/align-guess.cgi

The LALIGN local alignment program is available at several servers:

http://www2.igh.cnrs.fr/bin/lalign-guess.cgi

http://www.ch.embnet.org/software/LALIGN_form.html

LFASTA uses FASTA for local alignment of 2 sequences:

http://pbil.univ-lyon1.fr/lfasta.html

BLAST 2 Sequences (NCBI):

http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html

Multiple Alignments

In theory, making an optimal alignment between two sequences is computationally straightforward (Smith-Waterman algorithm), but aligning a large number of sequences using the same method is almost impossible.

The problem grows exponentially with the number of sequences involved (the work is proportional to the product of the sequence lengths).
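A rough feel for this growth comes from counting the cells a full dynamic programming table would need if the pairwise method were extended directly to N sequences. The function name and the example length of 300 residues are assumptions for illustration.

```python
from math import prod


def dp_cells(lengths):
    """Number of cells in a full dynamic programming table.

    One axis per sequence, each of size (length + 1).
    """
    return prod(n + 1 for n in lengths)


# Hypothetical case: N protein sequences of 300 residues each.
for n_seqs in (2, 3, 5, 10):
    print(n_seqs, "sequences:", dp_cells([300] * n_seqs), "cells")
```

Two sequences need about ninety thousand cells, while every additional sequence multiplies the table by roughly 300 again, which is why exact methods are abandoned beyond a handful of sequences.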

Optimal Alignment

For a given group of sequences, there is no single "correct" alignment, only an alignment that is "optimal" according to some set of calculations.

Determining what alignment is best for a given set of sequences is really up to the judgement of the investigator.

Progressive Pairwise Methods

Most of the available multiple alignment programs use some sort of incremental or progressive method that makes pairwise alignments, then adds new sequences one at a time to these aligned groups.

This is an approximate method!

PILEUP

PILEUP is the multiple alignment program in the GCG package.

CLUSTAL is another popular program (also available on the RCR server) that uses a similar algorithm.

The PILEUP Algorithm

First, PILEUP calculates approximate pairwise similarity scores between all sequences to be aligned and clusters the sequences into a dendrogram (tree structure).

Then the most similar pairs of sequences are aligned.

Averages (similar to consensus sequences) are calculated for the aligned pairs.

New sequences and clusters of sequences are added one by one, according to the branching order in the dendrogram.
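The clustering step that decides the order of alignment can be sketched as follows. This is not PILEUP's actual code: the similarity measure (fraction of identical positions without gaps) and the average-linkage clustering rule are simplifying assumptions, and the sequence names and function names are hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Crude similarity: fraction of identical positions, no gaps allowed."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def average_similarity(c1, c2, sim):
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(sim[frozenset((x, y))] for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def join_order(seqs):
    """Return the list of cluster merges, most similar first."""
    sim = {frozenset((p, q)): identity(seqs[p], seqs[q])
           for p, q in combinations(seqs, 2)}
    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the highest average similarity.
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: average_similarity(pair[0], pair[1], sim))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


example = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTACCA", "D": "TTGTTCCC"}
for left, right in join_order(example):
    print(left, "+", right)
```

Each merge corresponds to one alignment step: first pairs of single sequences, then clusters against clusters, following the branching order of the tree.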

PILEUP Considerations

Since the alignment is calculated on a progressive basis, the order of the initial sequences can affect the final alignment.

PILEUP parameters: two gap penalties (gap insert and gap extend) and an amino acid comparison matrix.

PILEUP will refuse to align sequences that require too many gaps or mismatches.

PILEUP will take quite a while to align more than about 10 sequences
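The effect of having two separate gap penalties can be shown with a small calculation. The convention assumed here (one insert charge per gap plus an extend charge per gapped residue) is one common form; the exact formula and default values used by PILEUP are not given in these slides, so the numbers below are purely hypothetical.

```python
def gap_cost(length, gap_insert=8, gap_extend=2):
    """Penalty for a single gap spanning `length` residues.

    Assumed convention: insert penalty once, extend penalty per residue.
    The default values are hypothetical, not PILEUP defaults.
    """
    if length <= 0:
        return 0
    return gap_insert + gap_extend * length


one_long_gap = gap_cost(5)          # a single gap of 5 residues
five_short_gaps = 5 * gap_cost(1)   # five separate gaps of 1 residue
print(one_long_gap, five_short_gaps)
```

With these values one long gap is much cheaper than several scattered short gaps, so lowering the insert penalty makes the program more willing to open new gaps.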

Instructions for running PILEUP

PILEUP uses a list of sequence files as input

You can use output from a FASTA or LOOKUP search as a list or make your own list in a text editor

A list file can include files from your own directory and/or GCG database files.

LIST file format

List files always begin with two dots:

..
gp:S31321
gp:Yno3_Yeast
S51900.pep
Yan2_Schpo
Ypd1_Caeel
A36205
Mpp1_Rat begin:100 end:345
B46665.pep
Ymxg_Bacsu begin:150 end:464
A48043.pep

List files can also include Begin and End positions within a sequence
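If you prefer to generate a list file from a script rather than a text editor, a small helper along these lines may do. It is only a sketch based on the layout shown on this slide (two dots, then one entry per line, with optional begin: and end: attributes); the function name is hypothetical and other list-file features are not covered.

```python
def format_list_file(entries):
    """Build list-file text from (name, begin, end) tuples.

    begin and end may be None when the whole sequence should be used.
    """
    lines = [".."]  # list files always begin with two dots
    for name, begin, end in entries:
        line = name
        if begin is not None:
            line += f" begin:{begin}"
        if end is not None:
            line += f" end:{end}"
        lines.append(line)
    return "\n".join(lines) + "\n"


text = format_list_file([
    ("gp:S31321", None, None),
    ("S51900.pep", None, None),
    ("Mpp1_Rat", 100, 345),
])
print(text)
```

The resulting text can be saved to a file such as myseqs.list and passed to PILEUP.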

PILEUP @myseqs.list

Now at the > prompt, type PILEUP and the name of the file that is your list of sequence names.

However, GCG requires that you must precede the name of your list file with the @ character.

So the command looks like this:

> PILEUP @myseqs.list

PILEUP Output

> more myseqs.msf

          1501                                              1550
Hsirf2    SERPSKKGKK PKTEKEDKVK HIKQEPVESS LGLSNGVSDL SPEYAVLTST
Muirf2    SERPSKKGKK PKTEKEERVK HIKQEPVESS LGLSNGVSGF SPEYAVLTSA
Chirf2    SERPSKKGKK TKSEKDDKFK QIKQEPVESS FGI.NGLNDV TSDY.FLSSS
Muirf1    LTRNQRKERK SKSSRDTKSK TKRKLCGDVS PDTFS..DGL SSSTLPDDHS
Ratirf1   LTKNQRKERK SKSSRDTKSK TKRKLCGDSS PDTLS..DGL SSSTLPDDHS
Hsirf1    LTKNQRKERK SKSSRDAKSK AKRKSCGDSS PDTFS..DGL SSSTLPDDHS
Chkirf1a  LTKDQKKERK SKSSREARNK SKRKLYEDMR MEESA..ERL TSTPLPDDHS
Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Hsirf5    GPAPTDSQPP EDYSFGAGEE EEEEEELQRM LPSLSLTDAV QSGPHMTPYS
Mmuirf6   IPQPQGS.VI NPGSTGSAPW DEKDNDVDED EEEDELEQSQ HHVPIQDTFP
Hump48    ...PPGIVSG QPGTQKVPSK RQHSSVSSER KEEEDAMQNC TLSPSVLQDS
Mup48     ...PAGTLPN QPRNQKSPCK RSISCVSPER EEN...MENG RTNGVVNHSD
Hsirf4    ...PEGAKKG AKQLTLEDPQ MSMSHPYTMT TPYPSLPA.Q VHNYMMPPLD
Mupip     ...PEGAKKG AKQLTLDDTQ MAMGHPYPMT APYGSLPAQQ VHNYMMPPHD
Huicsbp   ...PEEDQK. .......... .......... CKLGVATAGC VNEVTEMECG
Muicsbp   ...PEEEQK. .......... .......... CKLGVAPAGC MSEVPEMECG
Chkicsbp  ...PEEEQK. .......... .......... CKIGVGNGSS LTDVGDMDCS

          1551                                              1600
Hsirf2    IKNEVDSTVN IIVVGQSHLD SNIENQEIVT NPPDICQVVE VTTESDEQPV
Muirf2    IKNEVDSTVN IIVVGQSHLD SNIEDQEIVT NPPDICQVVE VTTESDDQPV
Chirf2    IKNEVDSTVN IVVVGQPHLD GSSEEQVIVA NPPDVCQVVE VTTESDEQPL
Muirf1    SYTTQGYLGQ DLDMER.DIT PALSPCVVSS SLSEWHMQMD I.IPDSTTDL
Ratirf1   SYTAQGYLGQ DLDMDR.DIT PALSPCVVSS SLSEWHMQMD I.MPDSTTDL
Hsirf1    SYTVPGYM.Q DLEVEQ.ALT PALSPCAVSS TLPDWHIPVE V.VPDSTSDL
Chkirf1a  SYTAHDYTGQ EVEVENTSIT LDLSSCEVSG SLTDWRMPME IAMADSTNDI
Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Hsirf5    LLKEDVKWPP TLQPPTLQPP VVLGPPAPDP SPLAPPPGNP AGFRELLSEV
Mmuirf6   FL........ NINGSPMAPA SVGNCSVGNC SPESVWP... ......KTEP
Hump48    LNNEEEGASG GAVHSDIGSS SSSSSPEPQE VTDTTEAPFQ ........GD
Mup48     SGSNIGGGGN GSNRSD...S NSNCNSELEE GAGTTEATIR ........ED
Hsirf4    RSWRDYVPDQ PHPEIPYQCP MTFGPRGHHW QGPACENGCQ VTGTFYACAP
Mupip     RSWRDYAPDQ SHPEIPYQCP VTFGPRGHHW QGPSCENGCQ VTGTFYACAP
Huicsbp   RSEIDELIKE .PSVDDYMGM IKRSPSP... P.DACRS..Q LLPDWWAHEP
Muicsbp   RSEIEELIKE .PSVDEYMGM TKRSPSP... P.EACRS..Q ILPDWWVQQP
Chkicsbp  PSAIDDLMKE PPCVDEYLGI IKRSPSP... PQETCRN..P PIPDWWMQQP

PILEUP options

For a first try, take the default options, but give the output file a meaningful name.

If you don’t get a good alignment, try a less stringent matrix and/or gap penalties.

> PILEUP -matr=oldpep.cmp

It is a good idea to run PILEUP in batch mode if you have more than 10 sequences to align:

> PILEUP -bat

CLUSTAL

CLUSTAL is a stand-alone (i.e. not integrated into GCG) multiple alignment program that is superior in some respects to PILEUP.

Gap penalties can be adjusted based on specific amino acid residues, regions of hydrophobicity, proximity to other gaps, or secondary structure.

It can re-align just selected sequences or selected regions in an existing alignment.

It can compute phylogenetic trees from a set of aligned sequences.

There are also Mac and PC versions with a nice graphical interface (CLUSTALX).

Using CLUSTAL

On mcrcr0 type: clustal

CLUSTAL can only work with sequences in multi-sequence FASTA format.

The GCG program TOFASTA can convert lists of file names into FASTA multi-sequence format.
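If your sequences are not in GCG files, a multi-sequence FASTA file is simple enough to write yourself: each record is a header line starting with > followed by the sequence on one or more lines. The helper below is a minimal sketch; the function name, the 60-character line width, and the example sequences are assumptions, and it is not a replacement for TOFASTA.

```python
def to_multi_fasta(seqs, width=60):
    """Format a {name: sequence} mapping as multi-sequence FASTA text."""
    out = []
    for name, seq in seqs.items():
        out.append(f">{name}")
        # Wrap long sequences onto lines of at most `width` characters.
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


# Hypothetical input: two short protein fragments.
print(to_multi_fasta({
    "seq1": "SERPSKKGKKPKTEKEDKVK",
    "seq2": "LTRNQRKERKSKSSRDTKSK",
}))
```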

Multiple Alignment tools on the Web

There are a variety of multiple alignment tools available for free on the web.

CLUSTAL is available from a number of sites (with a variety of restrictions)

Other algorithms are available too.

Watch out for “experimental” algorithms; there may be a good reason why you have never heard of some oddball program.

Some URLs

EMBL-EBI

http://www.ebi.ac.uk/clustalw/

BCM Search Launcher: Multiple Alignment

http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html

Multiple Sequence Alignment for Proteins (Wash. U. St. Louis)

http://www.ibc.wustl.edu/service/msa/

Editing Multiple Alignments

There are a variety of tools that can be used to modify a multiple alignment.

These programs can be very useful in formatting and annotating an alignment for publication.

An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by one of the automated alignment programs.

GCG alignment editors

Alignments produced with PILEUP (or CLUSTAL) can be adjusted with LINEUP.

Nicely shaded printouts can be produced with PRETTYBOX

GCG's SeqLab X-Windows interface has a superb multiple sequence editor - the best editor of any kind.

Other editors

The MACAW and SeqVu programs for Macintosh and GeneDoc and DCSE for PCs are free and provide excellent editor functionality.

Many “comprehensive” molecular biology programs include multiple alignment functions: MacVector, OMIGA, Vector NTI, and GeneTool/PepTool all include a built-in version of CLUSTAL.

Editors on the Web

Check out CINEMA (Colour INteractive Editor for Multiple Alignments).

It is an editor created completely in JAVA (old browsers beware).

It includes a fully functional version of CLUSTAL, BLAST, and a DotPlot module.

http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
