Multiple Alignment Stuart M. Brown NYU School of Medicine


Multiple Alignment

Stuart M. Brown

NYU School of Medicine


Pairwise Alignment

The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.

The best solution seems to be an approach called Dynamic Programming.


Dynamic Programming

Dynamic Programming is a very general programming technique.

It is applicable when a large search space can be structured into a succession of stages, such that:

the initial stage contains trivial solutions to sub-problems

each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions in an earlier stage

the final stage contains the overall solution
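
The three stages can be seen in a minimal sketch of global alignment scoring in the Needleman and Wunsch style. The scoring values (match, mismatch, and a single linear gap cost) and the function name are illustrative assumptions, not settings taken from any particular program.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (linear gap cost).

    The scoring values are arbitrary illustrative defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initial stage: trivial sub-problems (a prefix aligned against nothing).
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Later stages: each cell is computed from three earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)

    # Final stage: the last cell holds the score for the full sequences.
    return score[-1][-1]


print(global_alignment_score("ACGT", "ACT"))
```

Only the score is returned here; recovering the alignment itself would require a traceback through the table, which is omitted to keep the sketch short.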


Global vs. Local Alignments

Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.

Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
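
One common way to obtain a local alignment is Smith-Waterman scoring, where a cell score is never allowed to drop below zero, so an alignment can start and end anywhere in either sequence. The sketch below assumes illustrative scoring values and a linear gap cost; it returns only the best local score.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman style).

    The scoring values are arbitrary illustrative defaults.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 lets a new local alignment start at any cell.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared core GATTACA dominates; the unrelated flanks are ignored.
print(local_alignment_score("TTTTGATTACATTTT", "CCCGATTACACCC"))
```

A global method would be forced to pair the unrelated flanking residues as well, which is why local methods suit sequences that share only a limited region.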


GAP

The GCG program GAP implements the Needleman and Wunsch global alignment algorithm.

Global algorithms are often not effective for highly diverged sequences and do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.

Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.

GAP is useful when you want to force two sequences to align over their entire length


BESTFIT

The GCG program BESTFIT implements the Smith-Waterman local alignment algorithm.

FASTA and BLAST are local alignment algorithms

NCBI has a “BLAST 2 Sequences” feature on its website:

http://www.ncbi.nlm.nih.gov/gorf/bl2.html


Pairwise Alignment on the Web

The ALIGN global alignment program is available at several servers:

http://molbiol.soton.ac.uk/compute/align.html

http://www2.igh.cnrs.fr/bin/align-guess.cgi

The LALIGN local alignment program is available at several servers:

http://www2.igh.cnrs.fr/bin/lalign-guess.cgi

http://www.ch.embnet.org/software/LALIGN_form.html

LFASTA uses FASTA for local alignment of 2 sequences:

http://pbil.univ-lyon1.fr/lfasta.html

BLAST 2 Sequences (NCBI):

http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html


Multiple Alignments

In theory, making an optimal alignment between two sequences is computationally straightforward (Smith-Waterman algorithm), but aligning a large number of sequences using the same method is almost impossible.

The size of the problem increases exponentially with the number of sequences involved (it is proportional to the product of the sequence lengths).
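
A rough feel for this growth comes from counting the cells a full dynamic-programming table would need, with one axis per sequence. The sketch below assumes sequences of 300 residues each, an arbitrary illustrative length, and counts cells only; it does not attempt an alignment.

```python
from math import prod


def dp_cells(lengths):
    """Cells in a full dynamic-programming table with one axis per sequence."""
    # Each axis has length + 1 positions (the extra one is the empty prefix).
    return prod(n + 1 for n in lengths)


# Table size for 2, 3, 5 and 10 sequences of 300 residues each.
for k in (2, 3, 5, 10):
    print(k, dp_cells([300] * k))
```

Every added sequence multiplies the table size by its length, which is why exact methods are practical for pairs but not for large sets.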


Optimal Alignment

For a given group of sequences, there is no single "correct" alignment, only an alignment that is "optimal" according to some set of calculations.

Determining what alignment is best for a given set of sequences is really up to the judgement of the investigator.


Progressive Pairwise Methods

Most of the available multiple alignment programs use some sort of incremental or progressive method that makes pairwise alignments, then adds new sequences one at a time to these aligned groups.

This is an approximate method!


PILEUP

PILEUP is the multiple alignment program in the GCG package.

CLUSTAL is another popular program (also available on the RCR server) that uses a similar algorithm.


The PILEUP Algorithm

First, PILEUP calculates approximate pairwise similarity scores between all sequences to be aligned, and they are clustered into a dendrogram (tree structure).

Then the most similar pairs of sequences are aligned.

Averages (similar to consensus sequences) are calculated for the aligned pairs.

New sequences and clusters of sequences are added one by one, according to the branching order in the dendrogram.
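
The clustering step that fixes the order of alignment can be sketched as follows. This is not PILEUP's actual code: the k-mer similarity measure, the average-linkage rule, and the function names are stand-in assumptions chosen to keep the example short, and the alignment of the clusters themselves is left out.

```python
from itertools import combinations


def kmer_similarity(a, b, k=2):
    """Crude stand-in similarity: Jaccard index of the two k-mer sets."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(ka & kb) / len(ka | kb) if ka | kb else 0.0


def guide_tree(seqs):
    """Cluster a non-empty {name: sequence} dict into a nested-tuple dendrogram."""
    sim = {frozenset(p): kmer_similarity(seqs[p[0]], seqs[p[1]])
           for p in combinations(seqs, 2)}

    def average_similarity(x, y):
        # Average linkage: mean similarity over all cross-cluster pairs.
        pairs = [(m, n) for m in x[1] for n in y[1]]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    clusters = [(name, [name]) for name in seqs]  # (subtree, member names)
    while len(clusters) > 1:
        # Join the two most similar clusters; this is the next alignment step.
        x, y = max(combinations(clusters, 2),
                   key=lambda pair: average_similarity(*pair))
        clusters = [c for c in clusters if c is not x and c is not y]
        clusters.append(((x[0], y[0]), x[1] + y[1]))
    return clusters[0][0]


example = {"a": "ACDEFGHIK", "b": "ACDEFGHIR", "c": "WWYYPPQQN"}
print(guide_tree(example))
```

Reading the nested tuples from the innermost pair outward gives the order in which sequences and clusters would be aligned.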


PILEUP Considerations

Since the alignment is calculated on a progressive basis, the order of the initial sequences can affect the final alignment.

PILEUP parameters: 2 gap penalties (gap insert and gap extend) and an amino acid comparison matrix.

PILEUP will refuse to align sequences that require too many gaps or mismatches.

PILEUP will take quite a while to align more than about 10 sequences
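
The effect of having two separate gap penalties can be illustrated with a generic affine gap model, in which one charge is paid for inserting a gap and a further charge for each position it spans. The formula and the numeric values below are assumptions for illustration, not PILEUP defaults.

```python
def gap_cost(length, gap_insert=8, gap_extend=2):
    """Penalty for a single gap spanning `length` positions (affine model).

    gap_insert and gap_extend are hypothetical values, not program defaults.
    """
    if length <= 0:
        return 0
    return gap_insert + gap_extend * length


# One long gap is cheaper than several short gaps of the same total length.
print(gap_cost(4), 4 * gap_cost(1))
```

Raising the insert penalty relative to the extend penalty favours a few long gaps over many scattered short ones.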


Instructions for running PILEUP

PILEUP uses a list of sequence files as input

You can use output from a FASTA or LOOKUP search as a list or make your own list in a text editor

A list file can include files from your own directory and/or GCG database files.


LIST file format

List files always begin with two dots ..

..
gp:S31321
gp:Yno3_Yeast
S51900.pep
Yan2_Schpo
Ypd1_Caeel
A36205
Mpp1_Rat begin:100 end:345
B46665.pep
Ymxg_Bacsu begin:150 end:464
A48043.pep

List files can also include Begin and End positions within a sequence
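
A list file in this layout could be generated rather than typed by hand. The sketch below reproduces only the features shown on this slide (the leading two dots, one entry per line, optional begin and end positions); the function name is hypothetical and real list files may support more than is modelled here.

```python
def format_list_file(entries):
    """Build list-file text from (name, begin, end) tuples.

    begin and end may be None when the whole sequence should be used.
    """
    lines = [".."]  # list files begin with two dots
    for name, begin, end in entries:
        line = name
        if begin is not None:
            line += f" begin:{begin}"
        if end is not None:
            line += f" end:{end}"
        lines.append(line)
    return "\n".join(lines) + "\n"


print(format_list_file([
    ("gp:S31321", None, None),
    ("Mpp1_Rat", 100, 345),
    ("B46665.pep", None, None),
]))
```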


PILEUP @myseqs.list

Now at the > prompt, type PILEUP and the name of the file that is your list of sequence names.

However, GCG requires that you precede the name of your list file with the @ character.

So the command looks like this:

> PILEUP @myseqs.list


PILEUP Output

> more myseqs.msf

          1501                                              1550
Hsirf2    SERPSKKGKK PKTEKEDKVK HIKQEPVESS LGLSNGVSDL SPEYAVLTST
Muirf2    SERPSKKGKK PKTEKEERVK HIKQEPVESS LGLSNGVSGF SPEYAVLTSA
Chirf2    SERPSKKGKK TKSEKDDKFK QIKQEPVESS FGI.NGLNDV TSDY.FLSSS
Muirf1    LTRNQRKERK SKSSRDTKSK TKRKLCGDVS PDTFS..DGL SSSTLPDDHS
Ratirf1   LTKNQRKERK SKSSRDTKSK TKRKLCGDSS PDTLS..DGL SSSTLPDDHS
Hsirf1    LTKNQRKERK SKSSRDAKSK AKRKSCGDSS PDTFS..DGL SSSTLPDDHS
Chkirf1a  LTKDQKKERK SKSSREARNK SKRKLYEDMR MEESA..ERL TSTPLPDDHS
Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Hsirf5    GPAPTDSQPP EDYSFGAGEE EEEEEELQRM LPSLSLTDAV QSGPHMTPYS
Mmuirf6   IPQPQGS.VI NPGSTGSAPW DEKDNDVDED EEEDELEQSQ HHVPIQDTFP
Hump48    ...PPGIVSG QPGTQKVPSK RQHSSVSSER KEEEDAMQNC TLSPSVLQDS
Mup48     ...PAGTLPN QPRNQKSPCK RSISCVSPER EEN...MENG RTNGVVNHSD
Hsirf4    ...PEGAKKG AKQLTLEDPQ MSMSHPYTMT TPYPSLPA.Q VHNYMMPPLD
Mupip     ...PEGAKKG AKQLTLDDTQ MAMGHPYPMT APYGSLPAQQ VHNYMMPPHD
Huicsbp   ...PEEDQK. .......... .......... CKLGVATAGC VNEVTEMECG
Muicsbp   ...PEEEQK. .......... .......... CKLGVAPAGC MSEVPEMECG
Chkicsbp  ...PEEEQK. .......... .......... CKIGVGNGSS LTDVGDMDCS

          1551                                              1600
Hsirf2    IKNEVDSTVN IIVVGQSHLD SNIENQEIVT NPPDICQVVE VTTESDEQPV
Muirf2    IKNEVDSTVN IIVVGQSHLD SNIEDQEIVT NPPDICQVVE VTTESDDQPV
Chirf2    IKNEVDSTVN IVVVGQPHLD GSSEEQVIVA NPPDVCQVVE VTTESDEQPL
Muirf1    SYTTQGYLGQ DLDMER.DIT PALSPCVVSS SLSEWHMQMD I.IPDSTTDL
Ratirf1   SYTAQGYLGQ DLDMDR.DIT PALSPCVVSS SLSEWHMQMD I.MPDSTTDL
Hsirf1    SYTVPGYM.Q DLEVEQ.ALT PALSPCAVSS TLPDWHIPVE V.VPDSTSDL
Chkirf1a  SYTAHDYTGQ EVEVENTSIT LDLSSCEVSG SLTDWRMPME IAMADSTNDI
Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Hsirf5    LLKEDVKWPP TLQPPTLQPP VVLGPPAPDP SPLAPPPGNP AGFRELLSEV
Mmuirf6   FL........ NINGSPMAPA SVGNCSVGNC SPESVWP... ......KTEP
Hump48    LNNEEEGASG GAVHSDIGSS SSSSSPEPQE VTDTTEAPFQ ........GD
Mup48     SGSNIGGGGN GSNRSD...S NSNCNSELEE GAGTTEATIR ........ED
Hsirf4    RSWRDYVPDQ PHPEIPYQCP MTFGPRGHHW QGPACENGCQ VTGTFYACAP
Mupip     RSWRDYAPDQ SHPEIPYQCP VTFGPRGHHW QGPSCENGCQ VTGTFYACAP
Huicsbp   RSEIDELIKE .PSVDDYMGM IKRSPSP... P.DACRS..Q LLPDWWAHEP
Muicsbp   RSEIEELIKE .PSVDEYMGM TKRSPSP... P.EACRS..Q ILPDWWVQQP
Chkicsbp  PSAIDDLMKE PPCVDEYLGI IKRSPSP... PQETCRN..P PIPDWWMQQP
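
Rows from an alignment like this one can be compared numerically, for example by the percentage of identical residues. The sketch below assumes that `.` and `~` mark gap or padding positions and skips any column where either row has one; the function name is hypothetical and this is not a GCG tool.

```python
GAP_CHARS = set(".~-")  # assumed gap and padding symbols


def percent_identity(row_a, row_b):
    """Percent identical residues between two aligned rows, ignoring gaps."""
    a = row_a.replace(" ", "")
    b = row_b.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    compared = same = 0
    for x, y in zip(a, b):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # skip columns where either row has a gap
        compared += 1
        same += x == y
    return 100.0 * same / compared if compared else 0.0


# Positions 1501-1550 of two rows from the PILEUP output.
hsirf2 = "SERPSKKGKK PKTEKEDKVK HIKQEPVESS LGLSNGVSDL SPEYAVLTST"
muirf2 = "SERPSKKGKK PKTEKEERVK HIKQEPVESS LGLSNGVSGF SPEYAVLTSA"
print(percent_identity(hsirf2, muirf2))
```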


PILEUP options

For a first try, take the default options, but give the output file a meaningful name.

If you don’t get a good alignment, try a less stringent matrix and/or gap penalties.

> PILEUP -matr=oldpep.cmp

It is a good idea to run PILEUP in batch mode if you have more than 10 sequences to align:

> PILEUP -bat


CLUSTAL

CLUSTAL is a stand-alone (i.e. not integrated into GCG) multiple alignment program that is superior in some respects to PILEUP.

Gap penalties can be adjusted based on specific amino acid residues, regions of hydrophobicity, proximity to other gaps, or secondary structure.

It can re-align just selected sequences or selected regions in an existing alignment.

It can compute phylogenetic trees from a set of aligned sequences.

There are also Mac and PC versions with a nice graphical interface (CLUSTALX).


Using CLUSTAL

On mcrcr0 type: clustal

CLUSTAL can only work with sequences in multi-sequence FASTA format.

The GCG program TOFASTA can convert lists of file names into FASTA multi-sequence format.
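
For sequences that are not in GCG, multi-sequence FASTA text is simple enough to produce directly: each record is a `>` header line followed by the sequence, and the records follow one another in a single file. The sketch below is a generic formatter with a hypothetical function name and an assumed line width of 60; it is not a replacement for TOFASTA.

```python
def to_multi_fasta(records, width=60):
    """Format (name, sequence) pairs as multi-sequence FASTA text."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        # Wrap each sequence to a fixed line width.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


print(to_multi_fasta([
    ("Hsirf2", "SERPSKKGKKPKTEKEDKVKHIKQEPVESS"),
    ("Muirf2", "SERPSKKGKKPKTEKEERVKHIKQEPVESS"),
]))
```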


Multiple Alignment tools on the Web

There are a variety of multiple alignment tools available for free on the web.

CLUSTAL is available from a number of sites (with a variety of restrictions)

Other algorithms are available too.

Watch out for “experimental” algorithms; there may be a good reason why you have never heard of some oddball program.


Some URLs

EMBL-EBI

http://www.ebi.ac.uk/clustalw/

BCM Search Launcher: Multiple Alignment

http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html

Multiple Sequence Alignment for Proteins (Wash. U. St. Louis)

http://www.ibc.wustl.edu/service/msa/


Editing Multiple Alignments

There are a variety of tools that can be used to modify a multiple alignment.

These programs can be very useful in formatting and annotating an alignment for publication.

An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by one of the automated alignment programs.


GCG alignment editors

Alignments produced with PILEUP (or CLUSTAL) can be adjusted with LINEUP.

Nicely shaded printouts can be produced with PRETTYBOX

GCG's SeqLab X-Windows interface has a superb multiple sequence editor - the best editor of any kind.


Other editors

The MACAW and SeqVu programs for Macintosh, and GeneDoc and DCSE for PCs, are free and provide excellent editor functionality.

Many “comprehensive” molecular biology programs include multiple alignment functions: MacVector, OMIGA, Vector NTI, and GeneTool/PepTool all include a built-in version of CLUSTAL.


SeqVu


Editors on the Web

Check out CINEMA (Colour INteractive Editor for Multiple Alignments)

It is an editor created completely in JAVA (old browsers beware)

It includes a fully functional version of CLUSTAL, BLAST, and a DotPlot module

http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
