woods hole, massachusetts july 25, 2006, 7 to 10 pm marine biological laboratory — workshop on...


Upload: alicia-welch

Post on 04-Jan-2016


Page 1: Woods Hole, Massachusetts July 25, 2006, 7 to 10 PM Marine Biological Laboratory — Workshop on Molecular Evolution

Woods Hole, Massachusetts

July 25, 2006, 7 to 10 PM

Marine Biological Laboratory, Workshop on Molecular Evolution

Page 2

More data yields stronger analyses, if done carefully!

Mosaic ideas and evolutionary ‘importance.’

Multiple Sequence Alignment & Analysis thru GCG’s SeqLab

Steven M. Thompson

Florida State University School of Computational Science (SCS)

Page 3

But first a prelude: My definitions

Biocomputing and computational biology are synonymous and describe the use of computers and computational techniques to analyze any biological system, from molecules, through cells, tissues, organisms, and populations, to complete ecologies.

Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any of the available online biological databases.

Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, mechanism, interactions, evolution, and perhaps structure of biological molecules.

Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within and across genomes.

Proteomics is a subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

Page 4

And a ‘way’ to think about it: The reverse biochemistry analogy

from a ‘virtual’ DNA sequence to actual molecular physical characterization, not the other way ‘round.

Using bioinformatics tools, you can infer all sorts of functional, evolutionary, and structural insights into a gene product, without the need to isolate and purify massive amounts of protein! Eventually you can go on to clone and express the gene based on that analysis using PCR techniques.

The computer and molecular databases are an essential part of this process.

Page 5

The exponential growth of molecular sequence databases & cpu power

Year    BasePairs      Sequences
1982    680338         606
1983    2274029        2427
1984    3368765        4175
1985    5204420        5700
1986    9615371        9978
1987    15514776       14584
1988    23800000       20579
1989    34762585       28791
1990    49179285       39533
1991    71947426       55627
1992    101008486      78608
1993    157152442      143492
1994    217102462      215273
1995    384939485      555694
1996    651972984      1021211
1997    1160300687     1765847
1998    2008761784     2837897
1999    3841163011     4864570
2000    11101066288    10106023
2001    15849921438    14976310
2002    28507990166    22318883
2003    36553368485    30968418
2004    44575745176    40604319
2005    56037734462    52016762

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

Doubling time ~ 1 year!
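The doubling-time claim can be checked against the table with a few lines of Python. This is only a rough sketch: it assumes steady exponential growth between the first and last rows, and the function name is illustrative. Averaged over the whole 1982 to 2005 span it gives roughly 1.4 years per doubling, so the slide's "~ 1 year" is best read as a rule of thumb.

```python
import math

# Endpoints copied from the table: (year, base pairs).
FIRST = (1982, 680_338)
LAST = (2005, 56_037_734_462)


def doubling_time_years(first, last):
    """Average doubling time, assuming steady exponential growth between two points."""
    (year0, size0), (year1, size1) = first, last
    doublings = math.log2(size1 / size0)
    return (year1 - year0) / doublings


if __name__ == "__main__":
    print(f"{doubling_time_years(FIRST, LAST):.2f} years per doubling")
```

Any other pair of rows can be passed in the same way to see how the rate changes over shorter intervals.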


Page 6

Back to multiple sequence alignment: Applicability?

So what; why even bother?

Applications:

Probe/primer, and motif/profile design;

Graphical illustrations;

Comparative ‘homology’ inference;

Molecular evolutionary analysis.

OK, well, how do you do it?

Page 7

Dynamic programming’s complexity increases exponentially with the number of sequences being compared:

N-dimensional matrix . . . . complexity = [sequence length]^(number of sequences)
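To get a feel for that formula, here is a tiny sketch that counts the cells in a full N-dimensional matrix. The function name is made up for illustration, and it uses the slide's simplification that every sequence has the same length (a real matrix has length + 1 positions per axis).

```python
def dp_cells(sequence_length: int, number_of_sequences: int) -> int:
    """Cells in a full N-dimensional dynamic programming matrix: length ** N."""
    return sequence_length ** number_of_sequences


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} sequences of length 300: {dp_cells(300, n):,} cells")
```

Two sequences of 300 residues need 90,000 cells; three already need 27 million, which is why exact multiple alignment is quickly abandoned for heuristics.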

Page 8

‘Global’ heuristic solutions

See MSA (‘global’ within ‘bounding box’) and PIMA (‘local’ portions only) on the multiple alignment page at the Baylor College of Medicine’s Search Launcher, http://searchlauncher.bcm.tmc.edu/, but they come with severely limiting restrictions!

Page 9

Multiple Sequence Dynamic Programming

Therefore, pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time.

All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.
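The first step of that procedure, comparing all sequences pairwise and finding the most similar partners, can be sketched as follows. This is not PileUp's or ClustalW's actual code: the scoring values (match +1, mismatch -1, gap -2) and the function names are assumptions chosen to keep the example short, and a real program would use a substitution matrix and affine gap penalties.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score of two sequences, linear gap cost."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def most_similar_pair(seqs):
    """Names of the best-scoring pair: the pair a progressive method would join first."""
    return max(
        combinations(sorted(seqs), 2),
        key=lambda pair: nw_score(seqs[pair[0]], seqs[pair[1]]),
    )


if __name__ == "__main__":
    toy = {"a": "GATTACA", "b": "GATTTACA", "c": "CCCGGG"}
    print(most_similar_pair(toy))
```

A progressive aligner repeats this idea: the closest pair is aligned first, then the next closest sequence or group is aligned to the growing result.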

Page 10

Reliability and the Comparative Approach:

explicit homologous correspondence;

manual adjustments should be encouraged, based on knowledge, especially structural, regulatory, and functional sites.

Therefore, editors like SeqLab and the Ribosomal Database Project:

http://rdp.cme.msu.edu/index.jsp

Page 11

Structural & Functional correspondence in the Wisconsin Package’s SeqLab

Page 12

Work with proteins! If at all possible.

Twenty match symbols versus four, plus similarity! Way better signal to noise.

Also guarantees no indels are placed within codons. So translate, then align.

Nucleotide sequences will only reliably align if they are very similar to each other. And they will require extensive hand editing and careful consideration.
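"Translate, then align" can be illustrated with a short sketch: translate the coding DNA, align the proteins with whatever tool you like, then thread the gapped protein back onto the DNA so that every gap spans a whole codon. The function names are hypothetical, the standard genetic code is assumed, and the input is assumed to be in frame.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate an in-frame coding sequence; unrecognized codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def thread_codons(aligned_protein, dna, gap="-"):
    """Map a gapped protein back onto its DNA: one codon, or three gaps, per column."""
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join(gap * 3 if aa == gap else next(codons) for aa in aligned_protein)


if __name__ == "__main__":
    dna = "ATGGCTAAA"
    print(translate(dna))              # protein to hand to the aligner
    print(thread_codons("M-AK", dna))  # gapped DNA, gaps in multiples of three
```

Because gaps are introduced at the protein level, the resulting nucleotide alignment can never split a codon.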

Page 13

Beware of aligning apples and oranges [and grapefruit]!

Paralogous versus orthologous;

genomic versus cDNA;

mature versus precursor.

Page 14

Mask out uncertain areas.
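One way to picture a mask is as a string of 1s and 0s, one per alignment column, that tells downstream analyses which columns to use. The sketch below builds such a mask automatically from gap content. That criterion, the 50% threshold, and the function names are all assumptions for illustration; the slide does not say how the uncertain areas are chosen, and in practice this is often a judgment made by eye.

```python
def gap_mask(alignment, max_gap_fraction=0.5, gap_chars="-.~"):
    """Return a 0/1 string: '1' keeps a column, '0' masks it out as too gappy."""
    return "".join(
        "1" if sum(ch in gap_chars for ch in col) / len(col) <= max_gap_fraction else "0"
        for col in zip(*alignment)
    )


def apply_mask(alignment, mask):
    """Keep only the columns whose mask character is '1'."""
    return ["".join(ch for ch, keep in zip(row, mask) if keep == "1") for row in alignment]


if __name__ == "__main__":
    rows = ["MK-LV", "MR-LV", "MKALI"]
    mask = gap_mask(rows)
    print(mask)
    print(apply_mask(rows, mask))
```

The original alignment is left untouched; only the masked copy is passed on to the analysis.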

Page 15

Complications:

Order dependence.

Not that big of a deal.

Substitution matrices and gap penalties.

A very big deal!

Regional ‘realignment’ becomes incredibly important, especially with sequences that have areas of high and low similarity (GCG’s PileUp -InSitu option).

Page 16

Complications cont.:

Format hassles!

Specialized format conversion tools such as GCG’s SeqConv+ program and PAUPSearch, and Don Gilbert’s public domain ReadSeq program.

Page 17

Still more complications:

Indels and missing data symbols (i.e. gaps) designation discrepancy headaches:

., -, ~, ?, N, or X

. . . . . Help!
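A small sketch of how the symbol mess might be tamed before moving an alignment between programs. It rewrites the alternate gap characters to a single symbol and reports what else is present. The choice to treat only "." and "~" as gaps, and to leave ?, N, and X alone as missing or ambiguous data, is an assumption, and the function names are illustrative. Be careful with ".": some formats use it to mean "same as the first sequence" rather than a gap.

```python
def normalize_gaps(aligned, gap="-", extra_gap_chars=".~"):
    """Rewrite alternate gap characters to one symbol; leave ?, N, and X alone."""
    return aligned.translate(str.maketrans({ch: gap for ch in extra_gap_chars}))


def symbol_report(aligned):
    """Count the troublesome symbols so discrepancies are visible before conversion."""
    return {ch: aligned.count(ch) for ch in ".-~?NX" if ch in aligned}


if __name__ == "__main__":
    row = "~~ACG..T-?N"
    print(symbol_report(row))
    print(normalize_gaps(row))
```

Running the report first makes it obvious which conventions a file is using before anything is rewritten.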

Page 18

Web resources for pairwise, progressive multiple alignment:

http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html

http://pbil.univ-lyon1.fr/alignment.html

http://www.ebi.ac.uk/clustalw/

http://searchlauncher.bcm.tmc.edu/

However, problems with very large datasets and huge multiple alignments make doing multiple sequence alignment on the Web impractical after your dataset has reached a certain size. You’ll know it when you’re there!

Page 19

If large datasets become intractable for analysis on the Web, what other resources are available?

Desktop software solutions: public domain programs are available, but . . . complicated to install, configure, and maintain. User must be pretty computer savvy. So,

commercial software packages are available, e.g. MacVector, DS Gene, DNAsis, DNAStar, etc., but . . . license hassles, big expense per machine, and Internet and/or CD database access all complicate matters!

Page 20

Therefore, UNIX server-based solutions

Public domain solutions also exist, but now a very cooperative systems manager needs to maintain everything for users, so,

commercial products, e.g. the Accelrys GCG Wisconsin Package and the SeqLab Graphical User Interface, simplify matters for administrators and users. One format, one ‘look-and-feel.’ One license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere!

Operating system: UNIX command line operation hassles; communications software (telnet, ssh, and terminal emulation); X graphics; file transfer (ftp, and scp/sftp); and editors (vi, emacs, pico, or desktop word processing followed by file transfer [save as "text only!"]). See my supplement pdf file.

Page 21

The Genetics Computer Group:

The Accelrys Wisconsin Package for Sequence Analysis

GCG began in 1982 in Oliver Smithies’ Genetics Dept. lab at the University of Wisconsin, Madison; and then starting in 1990 it became a private company; which was acquired by the Oxford Molecular Group, U.K., in 1997; and then by Pharmacopeia Inc., U.S.A., in 2000; and then in 2004 Accelrys, San Diego, California, left Pharmacopeia to become an independent entity.

The suite contains around 150 programs designed to work in a “toolbox” fashion. Several simple programs used in succession can lead to very sophisticated results.

Also ‘internal compatibility,’ i.e. once you learn to use one program, all programs can be run similarly, and the output from many programs can be used as input for other programs.

Used all over the world at over 950 institutions, so learning it will likely be useful at other research institutions as well.

Page 22

To answer the always perplexing GCG question: “What sequence(s)? . . . .”

Specifying sequences, GCG style; in order of increasing power and complexity:

The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and SeqConv+ programs)

The sequence is in a local GCG database, in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:,” always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression, and they are case insensitive.

The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{},” containing the sequence specification, e.g. a wildcard: {*}.

Finally, the most powerful method of specifying sequences is in a GCG “list” file. It is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, “@.” Furthermore, you can supply attribute information within list files to specify something special about the sequence such as begin and end constraints.
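The four styles differ only in their punctuation, which a short sketch can make concrete. This is not GCG code: the function name and the returned labels are made up for illustration, and the example file name "my.list" is hypothetical. It simply applies the rules stated here: a leading "@" means a list file, braces mean a multiple sequence file, a colon separates a logical database name from an entry, and anything else is a single sequence file.

```python
import re


def classify_spec(spec):
    """Classify a GCG-style sequence specification by its punctuation."""
    spec = spec.strip()
    if spec.startswith("@"):
        return ("list file", spec[1:])
    braces = re.fullmatch(r"(?P<file>[^{}]+)\{(?P<entries>[^{}]*)\}", spec)
    if braces:
        return ("multiple sequence file", braces["file"], braces["entries"])
    if ":" in spec:
        logical, _, name = spec.partition(":")
        # Logical names are case insensitive, so normalize them to upper case.
        return ("database entry", logical.upper(), name)
    return ("single sequence file", spec)


if __name__ == "__main__":
    for example in ("my-special.pep", "SwissProt:EfTu_Ecoli", "Ef1a-Tu.msf{*}", "@my.list"):
        print(example, "->", classify_spec(example))
```

The order of the checks matters: a list file or a braced specification may itself contain a colon, so those two cases are tested first.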

Page 23

!!NA_SEQUENCE 1.0

This is a small example of GCG single sequence format.

Always put some documentation on top, so in the future

you can figure out what it is you're dealing with! The

line with the two periods is converted to the checksum line.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA

      51  GATTTAATAG CATGCGATCC CATGGGA

‘Clean’ GCG format single sequence file after ‘reformat’ (or the SeqConv+ program)

SeqLab’s Editor mode can also “Import” native GenBank format and ABI or LI-COR trace files!
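The layout of the single sequence file is simple enough to read with a few lines of Python. This is a minimal sketch, not GCG's own reader: the function name is hypothetical, it treats the first line ending in two periods as the divider, and it keeps only letters from the data lines, so any gap characters would be dropped. The checksum is not verified.

```python
EXAMPLE = """\
!!NA_SEQUENCE 1.0
This is a small example of GCG single sequence format.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA
"""


def read_gcg_sequence(text):
    """Return (documentation, sequence) from GCG single sequence text."""
    lines = text.splitlines()
    # The first line ending in two periods closes the documentation section.
    divider = next(i for i, line in enumerate(lines) if line.rstrip().endswith(".."))
    documentation = "\n".join(lines[:divider + 1])
    # Position numbers and spaces are layout only; keep just the letters.
    sequence = "".join(ch for line in lines[divider + 1:] for ch in line if ch.isalpha())
    return documentation, sequence


if __name__ == "__main__":
    doc, seq = read_gcg_sequence(EXAMPLE)
    print(len(seq), seq)
```

For the example, the recovered sequence length agrees with the "Length: 77" field on the checksum line.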

Page 24

Logical terms for the Wisconsin Package

Sequence databases, nucleic acids:

GENBANKPLUS     all of GenBank plus EST, HTC & GSS subdivisions
GBP             all of GenBank plus EST, HTC & GSS subdivisions
GENBANK         all of GenBank except EST, HTC & GSS subdivisions
GB              all of GenBank except EST, HTC & GSS subdivisions
BA              GenBank bacterial subdivision
BACTERIAL       GenBank bacterial subdivision
EST             GenBank EST (Expressed Sequence Tags) subdivision
GSS             GenBank GSS (Genome Survey Sequences) subdivision
HTC             GenBank High Throughput cDNA
HTG             GenBank High Throughput Genomic
IN              GenBank invertebrate subdivision
INVERTEBRATE    GenBank invertebrate subdivision
OM              GenBank other mammalian subdivision
OTHERMAMM       GenBank other mammalian subdivision
OV              GenBank other vertebrate subdivision
OTHERVERT       GenBank other vertebrate subdivision
PAT             GenBank patent subdivision
PATENT          GenBank patent subdivision
PH              GenBank phage subdivision
PHAGE           GenBank phage subdivision
PL              GenBank plant subdivision
PLANT           GenBank plant subdivision
PR              GenBank primate subdivision
PRIMATE         GenBank primate subdivision
RO              GenBank rodent subdivision
RODENT          GenBank rodent subdivision
STS             GenBank (Sequence Tagged Sites) subdivision
SY              GenBank synthetic subdivision
SYNTHETIC       GenBank synthetic subdivision
TAGS            GenBank EST, HTC & GSS subdivisions
UN              GenBank unannotated subdivision
UNANNOTATED     GenBank unannotated subdivision
VI              GenBank viral subdivision
VIRAL           GenBank viral subdivision

Sequence databases, amino acids:

GENPEPT         GenBank CDS translations
GP              GenBank CDS translations
UNIPROT or UNI  all of Swiss-Prot and all of SPTrEMBL
SWISSPROTPLUS   all of Swiss-Prot and all of SPTrEMBL
SWP             all of Swiss-Prot and all of SPTrEMBL
UNISPROT        all of Swiss-Prot (fully annotated)
SWISSPROT       all of Swiss-Prot (fully annotated)
SWISS           all of Swiss-Prot (fully annotated)
SW              all of Swiss-Prot (fully annotated)
UNITREMBL       Swiss-Prot preliminary EMBL translations
SPTREMBL        Swiss-Prot preliminary EMBL translations
SPT             Swiss-Prot preliminary EMBL translations
P               all of PIR Protein
PIR             all of PIR Protein
PIR1            PIR fully annotated subdivision
PIR2            PIR preliminary subdivision
PIR3            PIR unverified subdivision
PIR4            PIR unencoded subdivision

Note: not all GCG installations support the PIR database

General data files:

GENMOREDATA     path to GCG optional data files
GENRUNDATA      path to GCG default data files

These are easy: they make sense and you’ll have a vested interest.

Page 25

GCG MSF & RSF format

The trick is to not forget the Braces and ‘wild card,’ e.g. filename{*}, when specifying!

!!RICH_SEQUENCE 1.0
..
{
name ef1a_giala
descrip PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type PROTEIN
longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID Q08046
checksum 7342
offset 23
creation-date 07/11/2001 16:51:19
strand 1
comments ////////////////////////////////////////////////////////////

This is SeqLab’s native format

!!AA_MULTIPLE_ALIGNMENT 1.0

small.pfs.msf MSF: 735 Type: P July 20, 2001 14:53 Check: 6619 ..

Name: a49171 Len: 425 Check: 537 Weight: 1.00
Name: e70827 Len: 577 Check: 21 Weight: 1.00
Name: g83052 Len: 718 Check: 9535 Weight: 1.00
Name: f70556 Len: 534 Check: 3494 Weight: 1.00
Name: t17237 Len: 229 Check: 9552 Weight: 1.00
Name: s65758 Len: 735 Check: 111 Weight: 1.00
Name: a46241 Len: 274 Check: 3514 Weight: 1.00

//
//////////////////////////////////////////////////
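The Name: lines in an MSF header follow a fixed pattern, so they are easy to pull apart with a regular expression. The sketch below is an illustration only: the function name and the dictionary keys are my own, and it reads just the header entries, not the interleaved sequence data after the "//" divider.

```python
import re

NAME_LINE = re.compile(
    r"Name:\s*(?P<name>\S+)\s+Len:\s*(?P<len>\d+)\s+"
    r"Check:\s*(?P<check>\d+)\s+Weight:\s*(?P<weight>[\d.]+)"
)


def msf_entries(header_text):
    """Return one dict per 'Name:' entry found in MSF header text."""
    return [
        {
            "name": m["name"],
            "len": int(m["len"]),
            "check": int(m["check"]),
            "weight": float(m["weight"]),
        }
        for m in NAME_LINE.finditer(header_text)
    ]


if __name__ == "__main__":
    header = (
        "Name: a49171 Len: 425 Check: 537 Weight: 1.00\n"
        "Name: e70827 Len: 577 Check: 21 Weight: 1.00\n"
    )
    for entry in msf_entries(header):
        print(entry)
```

The names recovered this way are exactly what goes inside the braces when specifying individual entries, e.g. filename{a49171}.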

Page 26

The List File Format

!!SEQUENCE_LIST 1.0

An example GCG list file of many elongation

1a and Tu factors follows. As with all GCG

data files, two periods separate

documentation from data. ..

my-special.pep begin:24 end:134

SwissProt:EfTu_Ecoli

Ef1a-Tu.msf{*}

/usr/accounts/test/another.rsf{ef1a_*}

@[email protected]

The ‘way’ SeqLab works!

remember the @ sign!
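A list file can be read with very little code, which helps show how the pieces fit together. This is a sketch under stated assumptions, not GCG's parser: the function name is hypothetical, the documentation is assumed to end at the first pair of periods, and attributes are assumed to be whitespace-separated key:value fields after the specification.

```python
def read_list_file(text):
    """Parse GCG list file text into (specification, attributes) pairs."""
    # Everything up to the first two periods is documentation.
    _, _, data = text.partition("..")
    entries = []
    for line in data.splitlines():
        fields = line.split()
        if not fields:
            continue
        spec, attrs = fields[0], {}
        for field in fields[1:]:
            key, sep, value = field.partition(":")
            if sep:
                attrs[key.lower()] = value
        entries.append((spec, attrs))
    return entries


if __name__ == "__main__":
    example = """\
!!SEQUENCE_LIST 1.0
An example GCG list file. ..

my-special.pep  begin:24  end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
"""
    for spec, attrs in read_list_file(example):
        print(spec, attrs)
```

Each specification returned here is one of the simpler styles (file, database entry, braced multiple sequence file, or another list file), which is what makes list files the most general way to name sequences.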

Page 27

SeqLab: GCG’s X-based GUI!

SeqLab is the merger of Steve Smith’s Genetic Data Environment and GCG’s Wisconsin Package Interface:

GDE + WPI = SeqLab

Requires an X-Windowing environment, either native on UNIX computers (including LINUX, but not installed by default on Mac OS X [v.10+] systems; however, see Apple’s free X11 package or XDarwin), or emulated with X-Server Software on personal computers.

Page 28

Conclusions

FOR MORE INFO...

Explore my Web Home: http://bio.fsu.edu/~stevet/cv.html.

Contact me (stevet@[email protected]) for specific long-distance bioinformatics assistance and collaboration.

Gunnar von Heijne in his old but quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:

“Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.”

He continues:

“. . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”

Page 29

AND FOR EVEN MORE INFO...

Many texts are now available in the field. To ‘honk-my-own-horn’ a bit, check out:

Current Protocols in Bioinformatics from John Wiley & Sons, Inc. (http://www.does.org/cp/bioinfo.html);

Humana Press’ Introduction to Bioinformatics: A Theoretical And Practical Approach (http://www.humanapress.com/Product.pasp?txtCatalog=HumanaBooks&txtCategory=&txtProductID=1-58829-241-X&isVariant=0);

and Horizon Scientific Press’ Computational Genomics: Theory and Application (http://www.horizonpress.com/hsp/books/com.html).

They all asked me to contribute chapters on multiple sequence alignment and analysis using GCG software.

Page 30

On to a demonstration of some of SeqLab’s multiple sequence dataset capabilities:

some of my prebuilt alignments, and . . .

Elongation Factor 1α/Tu, how to do it.