The “Nuts and Bolts” of ‘doing’ bioinformatics with the Wisconsin Package at FSU

Steve Thompson

Florida State University School of Computational Science (SCS)

BCH 5425 Molecular Biology

Dr. Hong Li

February 16, 2005

Given nucleotide or amino acid sequence data, what can we learn about biological molecules, using the popular Accelrys Wisconsin Package?

But first, some of my definitions (with lots of overlap):

Biocomputing and computational biology are synonyms and describe the use of computers and computational techniques to analyze any type of biological system, from individual molecules to organisms to overall ecology.

Bioinformatics describes using computational techniques to access, analyze, and interpret the biological information in any type of biological database.

Sequence analysis is the study of molecular sequence data for the purpose of inferring the function, interactions, evolution, and perhaps structure of biological molecules.

Genomics analyzes the context of genes or complete genomes (the total DNA content of an organism) within the same and/or across different genomes.

Proteomics is the subdivision of genomics concerned with analyzing the complete protein complement, i.e. the proteome, of organisms, both within and between different organisms.

And one way to think about the field: the reverse biochemistry analogy.

Biochemists no longer have to begin a research project by isolating and purifying massive amounts of a protein from its native organism in order to characterize a particular gene product. Rather, scientists can now amplify a section of some genome based on its similarity to other genomes, sequence that piece of DNA and, using sequence analysis tools, infer all sorts of functional, evolutionary, and, perhaps, structural insight into that stretch of DNA! They can then clone and express it.

The computer and molecular databases are a necessary, integral part of this entire process.

The exponential growth of molecular sequence databases & cpu power:

Year    BasePairs       Sequences
1982    680338          606
1983    2274029         2427
1984    3368765         4175
1985    5204420         5700
1986    9615371         9978
1987    15514776        14584
1988    23800000        20579
1989    34762585        28791
1990    49179285        39533
1991    71947426        55627
1992    101008486       78608
1993    157152442       143492
1994    217102462       215273
1995    384939485       555694
1996    651972984       1021211
1997    1160300687      1765847
1998    2008761784      2837897
1999    3841163011      4864570
2000    11101066288     10106023
2001    15849921438     14976310
2002    28507990166     22318883
2003    36553368485     30968418

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

doubling time ~one year
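As a rough check on the doubling-time figure, the first and last rows of the GenBank table can be plugged into the standard exponential-growth formula. This is only a sketch: it assumes steady exponential growth across the whole period, and the function name `doubling_time` is made up for illustration.

```python
import math

# First and last rows of the GenBank growth table (year, base pairs).
first_year, first_bp = 1982, 680338
last_year, last_bp = 2003, 36553368485


def doubling_time(t0, n0, t1, n1):
    """Average doubling time, assuming steady exponential growth between two points."""
    doublings = math.log2(n1 / n0)  # how many times the database doubled
    return (t1 - t0) / doublings


years = doubling_time(first_year, first_bp, last_year, last_bp)
print(f"average doubling time: {years:.2f} years")
```

Averaged over the full 21 years this comes out somewhat above one year, which is in line with the "~one year" rule of thumb on the slide.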

Another perspective on size, and some organization:

Nucleic Acid DB’s
    GenBank/EMBL/DDBJ
        all Taxonomic categories + HTC’s, HTG’s, & STS’s
        “Tags”: EST’s, GSS’s

Amino Acid DB’s
    SWISS-PROT
    TrEMBL
    PIR: PIR1, PIR2, PIR3, PIR4
    NRL_3D
    Genpept

As of February 2005 the sequences in GenBank also include over 240 complete genomes, not including viruses! Nucleic acid sequence databases (and TrEMBL) are split into subdivisions based on taxonomy (historical rankings: the Fungi and Archaea warning!). PIR is split into subdivisions based on level of annotation. TrEMBL sequences are merged into SWISS-PROT as they receive increased levels of annotation.

So how do you access and manipulate all this data? Often on the Internet over the World Wide Web, with a World Wide Web browser and tools like NCBI’s Entrez & EMBL’s SRS:

Site                          URL (Uniform Resource Locator)              Content
Nat’l Center Biotech' Info'   http://www.ncbi.nlm.nih.gov/                databases/analysis/software
PIR/NBRF                      http://www-nbrf.georgetown.edu/             protein sequence database
IUBIO Biology Archive         http://iubio.bio.indiana.edu/               database/software archive
Univ. of Montreal             http://megasun.bch.umontreal.ca/            database/software archive
Japan's GenomeNet             http://www.genome.ad.jp/                    databases/analysis/software
European Mol' Bio' Lab'       http://www.embl-heidelberg.de/              databases/analysis/software
European Bioinformatics       http://www.ebi.ac.uk/                       databases/analysis/software
The Sanger Institute          http://www.sanger.ac.uk/                    databases/analysis/software
Univ. of Geneva BioWeb        http://www.expasy.ch/                       databases/analysis/software
ProteinDataBank               http://www.rcsb.org/pdb/                    3D mol' structure database
Molecules to Go               http://molbio.info.nih.gov/cgi-bin/pdb/     3D protein/nuc' visualization
The Genome DataBase           http://www.gdb.org/                         The Human Genome Project
Stanford Genomics             http://genome-www.stanford.edu/             various genome projects
Inst. for Genomic Res’rch     http://www.tigr.org/                        esp. microbial genome projects
HIV Sequence Database         http://hiv-web.lanl.gov/                    HIV epidemiology seq' DB
The Tree of Life              http://tolweb.org/tree/phylogeny.html       overview of all phylogeny
Ribosomal Database Proj’      http://rdp.cme.msu.edu/index.jsp            databases/analysis/software
PUMA2 at Argonne              http://compbio.mcs.anl.gov/puma2/cgi-bin/   metabolic reconstruction
Harvard Bio' Laboratories     http://golgi.harvard.edu/                   nice bioinformatics links list

But ‘doing’ bioinformatics on the Web has both its pros and its cons:

Advantages: Accesses the very latest database updates. It’s fun and very fast. It can be very powerful and efficient, if you know what you’re doing. In most cases relational links between different databases ease navigation, and in some cases neighboring concepts link similar entries.

Disadvantages: Can be very inefficient, if you don’t know what you’re doing. Reformatting downloaded sequence data is usually essential, if the sequence is to be used in any other software. And, it’s very easy to get lost and distracted in cyberspace! Also, problems sometimes arise with the World Wide Web itself, like dropped or slow connections . . . .

So what are the alternatives?

Personal computer software solutions: public domain programs are available, but . . . a bit complicated to install, configure, and maintain. User must be pretty computer savvy. So, good commercial software packages are also available, e.g. Sequencher, MacVector, DNAStar, DNAsis, etc., but . . . license hassles, especially big expense per machine, and Internet and/or CD database access all complicate matters!

Therefore, UNIX server-based, non-Web solutions are available as an alternative.

Public domain solutions also exist for UNIX servers, but now a very cooperative systems manager needs to maintain everything for users. So, commercial products, e.g. the Accelrys GCG Wisconsin Package [a Pharmacopeia Co.] and the SeqLab Graphical User Interface, simplify matters for administrators and users. One commercial license fee for an entire institution and very fast, convenient database access on local server disks. Connections from any networked terminal or workstation anywhere, anytime!

Mendel (mendel.csit.fsu.edu), FSU’s UNIX (Linux) Biocomputing Server

Operating system: UNIX command line;

communications software: telnet vs. ssh; X graphics;

    ssh -X [email protected]

file transfer: ftp vs. scp/sftp;

and editors: vi, emacs, pico (or word processing followed by file transfer [save as "text only!"]).

How do I get an account? Just ask me! I am the contact person for Mendel. It usually takes a couple of days for the SCS system administrator to act on my request. Anybody associated with FSU is entitled to an account and there are NO fees associated with it.

The Genetics Computer Group: the Wisconsin Package for Sequence Analysis.

Begun in 1982 in Oliver Smithies’ lab at the Genetics Dept. at the University of Wisconsin, Madison, then a private company for over 10 years, then acquired by the Oxford Molecular Group U.K., and now owned by Pharmacopeia U.S.A. under the new name Accelrys, Inc.

The suite contains almost 150 programs designed to work in a "toolbox" fashion. Several simple programs used in succession can lead to sophisticated results.

Also 'internal compatibility,' i.e. once you learn to use one program, all programs can be run similarly, and the output from many programs can be used as input for other programs.

Used all over the world by more than 30,000 scientists at over 530 institutions in 35 countries, so learning it here will most likely be useful anywhere else you may end up.

To answer the always perplexing GCG question, “What sequence(s)? . . . .”: specifying sequences, GCG style, in order of increasing power and complexity:

1) The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and all From- & To- programs)

2) The sequence is in a local GCG database, in which case you ‘point’ to it by using any of the GCG database logical names. A colon, “:,” always sets the logical name apart from either an accession number or a proper identifier name or a wildcard expression, and they are case insensitive.

3) The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{},” containing the sequence specification, e.g. a wildcard: {*}.

4) Finally, the most powerful method of specifying sequences is in a GCG “list” file. This is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, “@.” Furthermore, attribute information within list files can specify particular sequence aspects.
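The four conventions above can be told apart purely by their punctuation. The following is a minimal sketch of that idea, not GCG's own parser: the function name `classify_spec` and the tuple layout are invented for illustration, and it ignores edge cases such as colons inside file paths.

```python
import re


def classify_spec(spec):
    """Label a GCG-style sequence specification by the four cases above."""
    if spec.startswith("@"):
        # Way 4: a list file, marked by a leading at sign.
        return ("list file", spec[1:])
    multi = re.fullmatch(r"(.+)\{(.*)\}", spec)
    if multi:
        # Way 3: MSF/RSF file name followed by braces holding the selection.
        return ("multiple sequence file", multi.group(1), multi.group(2))
    if ":" in spec:
        # Way 2: logical database name, a colon, then an entry or wildcard.
        logical, entry = spec.split(":", 1)
        return ("database entry", logical.upper(), entry)
    # Way 1: a plain single sequence file.
    return ("single sequence file", spec)


for example in ["my-special.pep", "SwissProt:EfTu_Ecoli", "Ef1a-Tu.msf{*}", "@my.list"]:
    print(example, "->", classify_spec(example))
```

The logical name is upper-cased only to reflect that these names are case insensitive; `@my.list` is a hypothetical file name.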

The first way: a ‘clean’ GCG format single sequence file after ‘reformat’ (or any of the From… programs)

This is a small example of GCG single sequence format.
Always put some documentation on top, so in the future
you can figure out what it is you're dealing with! The
line with the two periods is converted to the checksum line.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA

SeqLab’s Editor mode can also “Import” native GenBank format and ABI or LI-COR trace files!
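To make the layout concrete, here is a small sketch that splits such a file at the line ending in two periods and strips the position numbers from the sequence lines. It is an illustration under simple assumptions (one sequence per file, residues are letters only), not a replacement for GCG's own Reformat; `read_gcg` is a made-up name, and the checksum is not verified.

```python
import re

EXAMPLE = """\
This is a small example of GCG single sequence format.
Always put some documentation on top.

example.seq  Length: 77  July 21, 1999 09:30  Type: N  Check: 4099  ..

       1  ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA
      51  GATTTAATAG CATGCGATCC CATGGGA
"""


def read_gcg(text):
    """Return (declared_length, residues) from GCG single sequence text."""
    lines = text.splitlines()
    for n, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break  # this is the divider (checksum) line
    else:
        raise ValueError("no '..' divider line found")
    declared = int(re.search(r"Length:\s*(\d+)", line).group(1))
    # Everything after the divider is data; keep letters, drop numbers and spaces.
    residues = "".join(re.findall(r"[A-Za-z]", " ".join(lines[n + 1:])))
    return declared, residues


declared, residues = read_gcg(EXAMPLE)
print(declared, len(residues), residues[:10])
```

Comparing the declared length with the number of residues actually read is a cheap sanity check that nothing was lost in transfer.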

The logical terms for the second way of running the Wisconsin Package

Sequence databases, nucleic acids:

GENBANKPLUS, GBP      all of GenBank plus EST and GSS subdivisions
GENBANK, GB           all of GenBank except EST and GSS subdivisions
BACTERIAL, BA         GenBank bacterial subdivision
EST                   GenBank EST (Expressed Sequence Tags) subdivision
GSS                   GenBank GSS (Genome Survey Sequences) subdivision
HTC                   GenBank High Throughput cDNA
HTG                   GenBank High Throughput Genomic
INVERTEBRATE, IN      GenBank invertebrate subdivision
OTHERMAMM, OM         GenBank other mammalian subdivision
OTHERVERT, OV         GenBank other vertebrate subdivision
PATENT, PAT           GenBank patent subdivision
PHAGE, PH             GenBank phage subdivision
PLANT, PL             GenBank plant subdivision
PRIMATE, PR           GenBank primate subdivision
RODENT, RO            GenBank rodent subdivision
STS                   GenBank (sequence tagged sites) subdivision
SYNTHETIC, SY         GenBank synthetic subdivision
TAGS                  GenBank EST and GSS subdivisions
UNANNOTATED, UN       GenBank unannotated subdivision
VIRAL, VI             GenBank viral subdivision

Sequence databases, amino acids:

GENPEPT, GP           GenBank CDS translations
SWISSPROTPLUS, SWP    all of Swiss-Prot and all of SPTrEMBL
SWISSPROT, SW         all of Swiss-Prot (fully annotated)
SPTREMBL, SPT         Swiss-Prot preliminary EMBL translations
PIR, P                all of PIR Protein
PROTEIN, PIR1         PIR fully annotated subdivision
PIR2                  PIR preliminary subdivision
PIR3                  PIR unverified subdivision
PIR4                  PIR unencoded subdivision
NRL_3D, NRL           PDB 3D protein sequences

Genome databases:

HOMO                  NCBI human refseq
DANIO                 Sanger Zebrafish build

General data files:

GENMOREDATA           path to GCG optional data files
GENRUNDATA            path to GCG default data files
GENTRAINDATA          path to GCG training datasets

These are easy: they make sense and you’ll have a vested interest.

The third way: multiple sequence formats, GCG MSF & RSF format

The trick is to not forget the braces and ‘wild card,’ e.g. filename{*}!

RSF (this is SeqLab’s native format):

!!RICH_SEQUENCE 1.0
..
{
name ef1a_giala
descrip PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type PROTEIN
longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID Q08046
checksum 7342
offset 23
creation-date 07/11/2001 16:51:19
strand 1
comments ////////////////////////////////////////////////////////////

MSF:

!!AA_MULTIPLE_ALIGNMENT 1.0

small.pfs.msf  MSF: 735  Type: P  July 20, 2001 14:53  Check: 6619 ..

Name: a49171  Len: 425  Check: 537   Weight: 1.00
Name: e70827  Len: 577  Check: 21    Weight: 1.00
Name: g83052  Len: 718  Check: 9535  Weight: 1.00
Name: f70556  Len: 534  Check: 3494  Weight: 1.00
Name: t17237  Len: 229  Check: 9552  Weight: 1.00
Name: s65758  Len: 735  Check: 111   Weight: 1.00
Name: a46241  Len: 274  Check: 3514  Weight: 1.00

//

And the fourth way, the most powerful way by far: the List File format

An example GCG list file of many elongation
1a and Tu factors follows. As with all GCG
data files, two periods separate
documentation from data. ..

my-special.pep  begin:24  end:134

SwissProt:EfTu_Ecoli

Ef1a-Tu.msf{*}

/usr/accounts/test/another.rsf{ef1a_*}

@[email protected]
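A list file is simple enough to read by hand, and a short script can do the same. The sketch below is one plausible reading of the layout described here, not GCG's implementation: everything up to the line ending in two periods is treated as documentation, each remaining non-blank line is one sequence specification, and trailing `key:value` fields are attributes. The name `read_list_file` and the nested list `@more-factors.list` are hypothetical.

```python
EXAMPLE_LIST = """\
An example list file of elongation factors.
Two periods separate documentation from data. ..

my-special.pep  begin:24  end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@more-factors.list
"""


def read_list_file(text):
    """Return [(specification, attributes)] for each data line of a list file."""
    lines = text.splitlines()
    start = 0
    for n, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            start = n + 1  # data begins after the divider line
            break
    entries = []
    for line in lines[start:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        attributes = {}
        for field in fields[1:]:
            key, _, value = field.partition(":")
            attributes[key.lower()] = value
        entries.append((fields[0], attributes))
    return entries


for spec, attributes in read_list_file(EXAMPLE_LIST):
    print(spec, attributes)
```

Here the `begin` and `end` attributes on the first entry restrict it to residues 24 through 134, which is the kind of "particular sequence aspect" a list file can carry.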

The ‘way’ SeqLab works!

LookUp, a Sequence Retrieval System (SRS) derivative, is used to find sequences of interest based on text words, and database similarity searches find sequences from local GCG server databases.

Advantages: Search output is a legitimate GCG list file, appropriate input to other GCG programs; no need to download and then reformat, since it’s all GCG.

Disadvantage: DB’s are only as new as the GCG administrator (me) maintains them. I update every two months to coincide with NCBI’s full releases.

Within the GCG suite, let’s build two list files with LookUp:

One, elongation factor 1 alpha from humans, and two, all proteins in the SwissProt database from the so-called non-crown ‘primitive’ eukaryotes.

I’ll use the following search strings:

“elongation & factor & alpha” in the “Definition” category and “Homo” in the “Organism” field for the first search, and

“eukaryota ! ( fungi | metazoa | viridiplantae )” in the “Organism” category for the second search.

These two searches illustrate LookUp’s syntax rules, in particular its Boolean qualifiers.
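The second search string can be read as set arithmetic, assuming the SRS-style meanings of the operators: `&` for AND, `|` for OR, and `!` for "but not". The sketch below mimics that logic with Python sets. The entry names and organism words are hypothetical stand-ins, not real database annotations, and `having` is an invented helper.

```python
# Hypothetical entries: name -> words found in the Organism field.
organism_words = {
    "ef1a_giala": {"eukaryota", "diplomonadida", "giardia"},
    "ef1a_human": {"eukaryota", "metazoa", "homo"},
    "ef1a_yeast": {"eukaryota", "fungi", "saccharomyces"},
    "ef1a_arath": {"eukaryota", "viridiplantae", "arabidopsis"},
    "eftu_ecoli": {"bacteria", "escherichia"},
}


def having(word):
    """All entry names whose Organism field contains the given word."""
    return {name for name, words in organism_words.items() if word in words}


# eukaryota ! ( fungi | metazoa | viridiplantae )
crown = having("fungi") | having("metazoa") | having("viridiplantae")
primitive = having("eukaryota") - crown

# eukaryota & homo, the AND operator used in the first search
human = having("eukaryota") & having("homo")

print(sorted(primitive), sorted(human))
```

The parentheses in the query matter in the same way they do here: the three crown groups are united first, and only then removed from the eukaryotes.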

SeqLab: GCG’s X-based GUI!

The SeqLab graphical user interface is the merger of Steve Smith’s Genetic Data Environment and GCG’s Wisconsin Package Interface:

GDE + WPI = SeqLab

Requires an X-Windowing environment, either native on UNIX computers (including LINUX, but not included by Apple in Mac OS X [v.10+]; see Apple’s free X11 package), or emulated with X-Server Software on other personal computers.

[Figure: SeqLab, Editor mode, residue display. Structural & functional correspondence.]

So let’s see what it looks like, SeqLab in action:

From an X ‘aware’ terminal window I launch the GUI with the command:

seqlab &

The ampersand is not required, but it allows you to continue to use the terminal window for system level commands by running SeqLab as a background process.

OK then, how can we see if two sequences are similar enough to belong in alignments? So first, homology and similarity:

Don’t confuse homology with similarity: there is a huge difference! Similarity is a statistic that describes how much two (sub)sequences are alike according to some set scoring criteria. It can be normalized to ascertain statistical significance, but it’s still just a number.

Homology, in contrast and by definition, implies an evolutionary relationship, more than just everything evolving from the same primordial ‘slime.’ To demonstrate homology, reconstruct the phylogeny of the organisms or genes of interest. Better yet, show some experimental evidence (structural, morphological, genetic, and/or fossil) that corroborates your assertion.

Percent homology is an invalid concept; something is either homologous or it is not. Walter Fitch is credited with the joke “homology is like pregnancy: you can’t be 45% pregnant, just like something can’t be 45% homologous.” Highly significant similarity can argue for homology; however, the inverse does not hold.

One way — Dot Matrices.

Provide a ‘Gestalt’ of all possible alignments between two sequences.

To begin — a very simple 1, 0 (match, no match) identity scoring function.

Put a dot wherever symbols match.

So, to introduce the concept of sequence comparison, a graphical method . . .

Identities and insertion/deletion events (indels) identified (zero:one match score matrix, no window).

Noise due to random composition contributes to confusion. To ‘clean up’ the plot consider a filtered windowing approach. A dot is placed at the middle of a window if some ‘stringency’ is met within that defined window size. Then the window is shifted one position and the entire process is repeated (zero:one match score, window of size three and a stringency level of two out of three).
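The windowing idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how Compare or DotPlot are implemented; the function names `dot_matrix` and `render` and the 0-based coordinates are assumptions made for the example.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) positions (0-based) that receive a dot.

    A window of the given size slides along both sequences; a dot is
    placed at the middle of the window when at least `stringency`
    symbols match within it (1 for a match, 0 for no match).
    """
    dots = set()
    half = window // 2
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= stringency:
                dots.add((i + half, j + half))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_b across the top, seq_a down the side."""
    lines = ["  " + seq_b]
    for i, symbol in enumerate(seq_a):
        row = "".join("*" if (i, j) in dots else " " for j in range(len(seq_b)))
        lines.append(symbol + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    a, b = "TCTATATAAGG", "TCGTATAATT"
    print(render(a, b, dot_matrix(a, b)))                          # no window
    print(render(a, b, dot_matrix(a, b, window=3, stringency=2)))  # filtered
```

With `window=1` and `stringency=1` this reduces to the plain identity plot; raising both values removes most of the isolated dots caused by random composition.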

Dot matrix analysis requires two programs in the Wisconsin Package —

Compare generates the data that serves as input to DotPlot, which actually draws the matrix.

Let’s see how a couple of the elongation factors that we found earlier look using this method.

SW:EF11_Human vs. SW:EF1a_Schco

Exact alignment — but how can we ‘see’ the correspondence of individual residues?

We can compare one molecule against another by aligning them. However, a ‘brute force’ approach just won’t work. Even without considering the introduction of gaps, the computation required to compare all possible alignments between two sequences requires time proportional to the product of the lengths of the two sequences. Therefore, if the two sequences are approximately the same length (N), this is an $N^2$ problem. To include gaps, we would have to repeat the calculation 2N times to examine the possibility of gaps at each possible position within the sequences, now an $N^{4N}$ problem. There’s no way! We need an algorithm.

But — Just what the heck is an algorithm!?

Merriam-Webster’s says: “A rule of procedure for solving a problem [often mathematical] that frequently involves repetition of an operation.”

So, you could write an algorithm for tying your shoe! It’s just a set of explicit instructions for doing some routine task.

Enter the Dynamic Programming Algorithm!

Computer scientists figured it out long ago; Needleman and Wunsch applied it to the alignment of the full lengths of two sequences in 1970. An optimal alignment is defined as an arrangement of two sequences, 1 of length i and 2 of length j, such that:

1) you maximize the number of matching symbols between 1 and 2;

2) you minimize the number of indels within 1 and 2; and

3) you minimize the number of mismatched symbols between 1 and 2.

Therefore, the actual solution can be represented by:

$$
S_{ij} = s_{ij} + \max
\begin{cases}
S_{i-1,\,j-1} & \text{or} \\
\max\limits_{2 < x < i} \left( S_{i-x,\,j-1} + w_{x-1} \right) & \text{or} \\
\max\limits_{2 < y < j} \left( S_{i-1,\,j-y} + w_{y-1} \right)
\end{cases}
$$

Where $S_{ij}$ is the score for the alignment ending at i in sequence 1 and j in sequence 2,

$s_{ij}$ is the score for aligning i with j,

$w_x$ is the score for making an x long gap in sequence 1,

$w_y$ is the score for making a y long gap in sequence 2,

allowing gaps to be any length in either sequence.
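To make the recurrence concrete, here is a minimal sketch of global alignment scoring in Python. It assumes a simplified scheme that is not the general form given above: every gap position costs the same (a linear gap penalty), so only the three neighboring cells need to be examined. The function name and the default scores (match 1, mismatch 0, gap -1) are illustrative choices, not the Wisconsin Package's Gap program or its defaults.

```python
def needleman_wunsch_score(seq1, seq2, match=1, mismatch=0, gap=-1):
    """Best global alignment score under a linear gap penalty."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # S[i][j] holds the best score for aligning seq1[:i] with seq2[:j].
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        S[i][0] = i * gap          # seq1 prefix aligned against gaps only
    for j in range(1, cols):
        S[0][j] = j * gap          # seq2 prefix aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            s_ij = match if seq1[i - 1] == seq2[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + s_ij,  # align residue i with residue j
                S[i - 1][j] + gap,       # gap in sequence 2
                S[i][j - 1] + gap,       # gap in sequence 1
            )
    return S[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))
```

Filling the matrix takes time proportional to the product of the two lengths, which is why dynamic programming is practical where brute-force enumeration is not. Recovering the alignment itself would need a trace-back through the filled matrix, which is left out here.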

An oversimplified example —

total penalty = gap opening penalty {zero here} + ([length of gap][gap extension penalty {one here}])

Optimum Alignments —

There may be more than one best path through the matrix, and optimum doesn’t guarantee biologically correct. Starting at the top and working down, then tracing back, I found one best alignment:

    cTATAtAagg
    |  |||||
    cg.TAtAaT.

With our example’s scoring scheme this alignment’s final score is 5, the highest bottom-right score in the trace-back path graph, and the sum of six matches minus one interior gap. This is the number optimized by the algorithm, not any type of a percentage! Only one optimal solution will be reported. Do you have any ideas about how others can be discovered, besides alternate trace back paths? Answer — Often if you reverse the solution of the entire process, other solutions will be found!

This was a global solution. Smith-Waterman style local solutions (1981) use negative numbers in the match matrix and pick the best-scoring diagonal segment within the overall graph, which gives a local alignment.
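The difference between the global and local approaches is small in code. The following sketch assumes a linear gap penalty and illustrative scores (match 1, mismatch -1, gap -1); it is not BestFit's implementation. Each cell is never allowed to drop below zero, and the answer is the best cell anywhere in the matrix rather than the bottom-right corner.

```python
def smith_waterman_score(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Best local alignment score under a linear gap penalty."""
    best = 0
    prev = [0] * (len(seq2) + 1)   # previous matrix row
    for a in seq1:
        curr = [0]                 # first column is always zero
        for j, b in enumerate(seq2, start=1):
            s = match if a == b else mismatch
            cell = max(
                0,                  # start a fresh local alignment here
                prev[j - 1] + s,    # extend along the diagonal
                prev[j] + gap,      # gap in seq2
                curr[j - 1] + gap,  # gap in seq1
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```

Negative mismatch scores are what make the zero floor meaningful: with only non-negative scores a local alignment would never have a reason to stop.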

What about proteins — conservative replacements and similarity as opposed to identity. The nitrogenous bases, A, C, T, G, are either the same or they’re not, but amino acids can be similar, genetically, evolutionarily, and structurally! Enter log-odds scoring matrices.

Notice that positive values for identity range from 4 to 11 and negative values for those substitutions that rarely occur go as low as –4. The most conserved residue is tryptophan with a score of 11; cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity.

BLOSUM62 amino acid substitution matrix (the default in many sequence analysis programs).

     A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  X  Y
  A  4  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -1 -2
  C  0  9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -1 -2
  D -2 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3
  E -1 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2
  F -2 -2 -3 -3  6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1 -1  3
  G  0 -3 -1 -2 -3  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -1 -3
  H -2 -3 -1  0 -1 -2  8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2 -1  2
  I -1 -1 -3 -3  0 -4 -3  4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1 -1
  K -1 -3 -1  1 -3 -2 -1 -3  5 -2 -1  0 -1  1  2  0 -1 -2 -3 -1 -2
  L -1 -1 -4 -3  0 -4 -3  2 -2  4  2 -3 -3 -2 -2 -2 -1  1 -2 -1 -1
  M -1 -1 -3 -2  0 -3 -2  1 -1  2  5 -2 -2  0 -1 -1 -1  1 -1 -1 -1
  N -2 -3  1  0 -3  0  1 -3  0 -3 -2  6 -2  0  0  1  0 -3 -4 -1 -2
  P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7 -1 -2 -1 -1 -2 -4 -1 -3
  Q -1 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5  1  0 -1 -2 -2 -1 -1
  R -1 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5 -1 -1 -3 -3 -1 -2
  S  1 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4  1 -2 -3 -1 -2
  T  0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5  0 -2 -1 -2
  V  0 -1 -3 -2 -1 -3 -3  3 -2  1  1 -3 -2 -2 -3 -2  0  4 -3 -1 -1
  W -3 -2 -4 -3  1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 -1  2
  X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
  Y -2 -2 -3 -2  3 -3  2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1  2 -1  7

We can imagine screening databases for sequences similar to ours using the concepts of dynamic programming and log-odds scoring matrices and yet to be described algorithmic tricks.

But why even bother? Inference through homology is a fundamental principle of biology!

When a sequence is found to fall into a preexisting family we may be able to infer function, mechanism, evolution, perhaps even structure, based on homology with its neighbors.

Independent of all that, what is a ‘good’ alignment?

So, first — Significance: when is any alignment worth anything biologically?

An old statistics trick — Monte Carlo simulations:

$$
Z\ \text{score} = \frac{(\text{actual score}) - (\text{mean of randomized scores})}{\text{standard deviation of randomized score distribution}}
$$
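The Monte Carlo trick is easy to sketch: shuffle one sequence many times, rescore each shuffled version against the other sequence, and compare the real score with that randomized distribution. The code below is an illustration only. The names `shuffle_z_score` and `identity_score` are made up for the example, and the toy scoring function (ungapped identities, position by position) stands in for a real alignment score.

```python
import random
import statistics


def identity_score(seq1, seq2):
    """Toy score: count identical symbols at matching positions (no gaps)."""
    return sum(1 for a, b in zip(seq1, seq2) if a == b)


def shuffle_z_score(seq1, seq2, score_fn=identity_score, shuffles=100, seed=0):
    """Z score of the real score against scores of shuffled versions of seq2.

    Shuffling keeps the length and composition of seq2 but destroys
    its order, which is the null model assumed here.
    """
    rng = random.Random(seed)      # fixed seed so runs are repeatable
    actual = score_fn(seq1, seq2)
    letters = list(seq2)
    randomized = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        randomized.append(score_fn(seq1, "".join(letters)))
    mean = statistics.mean(randomized)
    sd = statistics.stdev(randomized)
    if sd == 0:
        raise ValueError("randomized scores have no spread; Z is undefined")
    return (actual - mean) / sd


if __name__ == "__main__":
    protein = "ACDEFGHIKLMNPQRSTVWY"
    print(shuffle_z_score(protein, protein))
```

More shuffles give a steadier estimate of the mean and standard deviation, at the cost of one extra scoring run per shuffle.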

The Wisconsin Package dynamic programming tools —

BestFit — Smith Waterman local alignments,

Gap — Needleman Wunsch global alignments,

FrameAlign — nucleotide to protein, either local or global.

I’ll illustrate in SeqLab with the same previous example, but at the command line:

bestfit sw:ef11_human sw:ef1a_schco -shuffle=100

The Normal distribution —

Many Z scores measure the distance from the mean using this simplistic Monte Carlo model assuming a Gaussian distribution, a.k.a. the Normal distribution (http://mathworld.wolfram.com/NormalDistribution.html), in spite of the fact that ‘sequence-space’ actually follows what is known as the ‘Extreme Value distribution.’

Regardless, Monte Carlo methods approximate significance estimates pretty well.

[Figure: histogram of database search scores, with score bins running from < 20 to > 120 and observed (=) versus expected (*) counts per bin.]

‘Sequence-space’ (Huh, what’s that?) actually follows the ‘Extreme Value distribution’ (http://mathworld.wolfram.com/ExtremeValueDistribution.html).

Based on this known statistical distribution, and robust statistical methodology, a realistic Expectation function, the E Value, can be calculated from database searches.

The ‘take-home’ message is . . .

The Expectation Value?

The higher the E value is, the more probable that the observed match is due to chance in a search of the same size database, and the lower its Z score will be, i.e. the match is NOT significant. Therefore, the smaller the E value, i.e. the closer it is to zero, the more significant it is and the higher its Z score will be! The E value is the number that really matters.
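For a feel of how an E value behaves, here is a small sketch using the Karlin-Altschul form of the extreme value statistics, E = K·m·n·e^(−λS), which is the form BLAST reports. This is an assumption brought in for illustration: the programs discussed here may estimate E differently, and the parameter values in the example call (`lam`, `k`, and the lengths) are placeholders rather than values taken from any real search.

```python
import math


def expect_value(score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least `score`.

    query_len * db_len is the size of the search space; lam and k
    depend on the scoring matrix and gap penalties in use.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one chance hit, given its E value."""
    return -math.expm1(-e_value)   # 1 - exp(-E), accurate for tiny E


if __name__ == "__main__":
    # Placeholder parameters, chosen only to show the trend.
    for s in (30, 60, 120):
        e = expect_value(s, query_len=460, db_len=5_000_000, lam=0.27, k=0.04)
        print(s, e, p_value(e))
```

Two things stand out: E falls exponentially as the score rises, and it grows in direct proportion to the database size, which is why the same alignment looks less significant in a bigger database.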

Rules of thumb for a protein search —

The Z score represents the number of standard deviations some particular alignment is from a distribution of random alignments (often the Normal distribution).

They very roughly correspond to the listed E Values (based on the Extreme Value distribution) for a typical protein sequence similarity search. But remember probabilities are dependent on the size and composition of the database and even on how often you search!

On to the searches —

How can you search the databases for similar sequences, if pair-wise alignments take $N^2$ time?!

Database searching programs use the two concepts of dynamic programming and log-odds scoring matrices; however, dynamic programming takes far too long when used against most sequence databases with a ‘normal’ computer. Remember how huge the databases are!

Therefore, the programs use tricks to make things happen faster. These tricks fall into two main categories, hashing and heuristics.

Corn beef hash? Huh . . .

Hashing is the process of breaking your sequence into small ‘words’ or ‘k-tuples’ (think all chopped up, just like corn beef hash) of a set size and creating a ‘look-up’ table with those words keyed to position numbers. Computers can deal with numbers way faster than they can deal with strings of letters, and this preprocessing step happens very quickly.

Then when any of the word positions match part of an entry in the database, that match, the ‘offset,’ is saved. In general, hashing reduces the complexity of the search problem from $N^2$ for dynamic programming to N, the length of all the sequences in the database.
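A short Python sketch shows the look-up table and the saved offsets. The function names and the word size of two are illustrative assumptions; real programs choose their own word sizes and table layouts.

```python
from collections import Counter, defaultdict


def build_word_table(query, k=2):
    """Map every k-tuple ('word') in the query to its start positions."""
    table = defaultdict(list)
    for pos in range(len(query) - k + 1):
        table[query[pos:pos + k]].append(pos)
    return table


def diagonal_hits(table, subject, k=2):
    """Count word hits per offset (query position minus subject position).

    Hits that share an offset lie on the same diagonal of a dot plot,
    so a diagonal with many hits marks a candidate region of similarity.
    """
    hits = Counter()
    for s_pos in range(len(subject) - k + 1):
        for q_pos in table.get(subject[s_pos:s_pos + k], ()):
            hits[q_pos - s_pos] += 1
    return hits


if __name__ == "__main__":
    table = build_word_table("ACGTAC")
    print(dict(table))
    print(diagonal_hits(table, "GTAC"))
```

The query is scanned once to build the table and each database entry is scanned once to look words up, so the work grows with the total database length rather than with the product of the lengths.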

OK. Heuristics . . . What’s that?

Approximation techniques are collectively known as ‘heuristics.’ Webster’s defines heuristic as “serving to guide, discover, or reveal; . . . but unproved or incapable of proof.”

In database similarity searching techniques the heuristic usually restricts the necessary search space by calculating some sort of a statistic that allows the program to decide whether further scrutiny of a particular match should be pursued. This statistic may miss things depending on the parameters set — that’s what makes it heuristic. ‘Worthwhile’ results at the end are compiled and the longest alignment within the program’s restrictions is created.

The exact implementation varies between the different programs, but the basic idea follows in most all of them.

Two predominant versions exist: BLAST and Fast

Both return local alignments, and are not a single program, but rather a family of programs with implementations designed to compare a sequence to a database in about every which way imaginable.

These include:

1) a DNA sequence against a DNA database (not recommended unless forced to do so because you are dealing with a non-translated region of the genome — DNA is just too darn noisy, only identity & four bases!),

2) a translated (where the translation is done ‘on-the-fly’ in all six frames) version of a DNA sequence against a translated (‘on-the-fly’ six-frame) version of the DNA database (not available in the Fast package),

3) a translated (‘on-the-fly’ six-frame) version of a DNA sequence against a protein database,

4) a protein sequence against a translated (‘on-the-fly’ six-frame) version of a DNA database,

5) or a protein sequence against a protein database.

Many implementations allow for the possibility of frame shifts in translated comparisons and don’t penalize the score for doing so.
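The ‘on-the-fly’ six-frame translation mentioned in the list is simple to sketch: three reading frames on the given strand plus three on its reverse complement. The code assumes the standard genetic code and plain A, C, G, T input; the function names and the frame labels (+1 to -3) are conventions chosen for this example, and any codon containing another symbol is shown as X.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate all six reading frames, keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

Searching with all six translations lets a protein-level comparison find a coding region without knowing in advance which strand or frame it lies in.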

The BLAST and Fast programs — some generalities

BLAST — Basic Local Alignment Search Tool, developed at NCBI.

1) Normally NOT a good idea to use for DNA against DNA searches w/o translation (not optimized);

2) Pre-filters repeat and “low complexity” sequence regions;

4) Can find more than one region of gapped similarity;

5) Very fast heuristic and parallel implementation;

6) Restricted to precompiled, specially formatted databases;

FastA — and its family of relatives, developed by Bill Pearson at the University of Virginia.

1) Works well for DNA against DNA searches (within limits of possible sensitivity);

2) Can find only one gapped region of similarity;

3) Relatively slow, should often be run in the background;

4) Does not require specially prepared, preformatted databases.

The algorithms, in brief —

BLAST:

Two word hits on the same diagonal above some similarity threshold trigger ungapped extension until the score isn’t improved enough above another threshold: the HSP.

Initiate gapped extensions using dynamic programming for those HSPs above a third threshold up to the point where the score starts to drop below a fourth threshold: yields alignment.

Fast:

Find all ungapped exact word hits; maximize the ten best continuous regions’ scores: init1.

Combine non-overlapping init regions on different diagonals: initn.

Use dynamic programming ‘in a band’ for all regions with initn scores better than some threshold: opt score.

I’ll illustrate with FastA —

FastA of human elongation factor 1 alpha searched against that list file of primitive organism proteins from SwissProt.

I’ll show SeqLab’s implementation, but at the command line it would be:

fasta sw:ef11_human @primitive.list

Multiple Sequence Analysis:

Multiple Sequence Alignment.

Dynamic programming’s complexity increases exponentially with the number of sequences being compared. N-dimensional matrix . . . .

Therefore — pairwise, progressive dynamic programming restricts the solution to the neighborhood of only two sequences at a time.

All sequences are compared, pairwise, and then each is aligned to its most similar partner or group of partners. Each group of partners is then aligned to finish the complete multiple sequence alignment.

PileUp is the Wisconsin Package’s implementation of pairwise progressive multiple sequence alignment.

Let’s run PileUp on our ‘primitive’ dataset in SeqLab. At the command line this would be:

pileup @primitive.list

The consensus and motifs:

Conserved regions can be visualized with a sliding window approach and appear as peaks.

[Figure: conservation plot with a peak labeled “P-Loop”]

Let’s concentrate on the first peak seen here to simplify matters.
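The sliding window idea can be sketched as follows. This is a simplified Python illustration, not the plotting method used by the Wisconsin Package: the per-column score (fraction of sequences sharing the most common residue), the window size, and the three-sequence toy alignment are all assumptions.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common character in each column.

    Gap characters, if present, are counted like any other character.
    """
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / n)
    return scores


def sliding_window(scores, window=3):
    """Average the per-column scores over each window of adjacent columns."""
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Hypothetical toy alignment; real input would be a PileUp result.
alignment = ["GHVDHGKSTL", "GHIDAGKSAM", "GHVDSGKTQV"]
scores = column_conservation(alignment)
smoothed = sliding_window(scores, window=3)
for start, value in enumerate(smoothed, start=1):
    print(start, round(value, 2))
```

Windows covering well conserved columns give values near 1 and show up as the peaks; windows over variable columns fall toward 1/N.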

Motifs (a.k.a. signatures)

GHVDHGKS

A consensus isn’t necessarily the biologically “correct” combination. Therefore, build one-dimensional ‘pattern descriptors.’

PROSITE Database of protein families and domains: over 1,000 motifs.

This motif, the P-loop, is defined: (A,G)x4GK(S,T), i.e. either an Alanine or a Glycine, followed by four of anything, followed by an invariant Glycine-Lysine pair, followed by either a Serine or a Threonine.
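A pattern descriptor like this maps directly onto a regular expression. The Python sketch below is only an illustration of the idea, not how the Motifs program works internally; the function name and the two short test peptides are made up, with the GHVDHGKS segment taken from the slide above.

```python
import re

# (A,G)x4GK(S,T): A or G, any four residues, then G, K, and S or T.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_p_loops(sequence):
    """Return (1-based start position, matched text) for each non-overlapping hit."""
    return [(m.start() + 1, m.group()) for m in P_LOOP.finditer(sequence)]


# Hypothetical peptides for illustration.
print(find_p_loops("MKTGHVDHGKSLLA"))  # contains GHVDHGKS
print(find_p_loops("MKTGHVDHGRSLLA"))  # lysine replaced, so no hit
```

Note that the pattern is all or nothing: a single change in the invariant Glycine-Lysine pair removes the hit entirely.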

Discover motifs in ‘ungapped’ sequences with the program Motifs in the Wisconsin Package:

Again I’ll show you in SeqLab, but at the command line:

motifs sw:ef11_human

Enter the Profile

But motifs cannot convey any degree of the ‘importance’ of the residues. Use a position specific, two-dimensional matrix where conserved areas of the alignment receive the most importance and variable regions hardly matter!

The threonine at position 27 is absolutely conserved; it gets the highest score, 150! The aspartate at position 22 substituted with a tryptophan would never happen, -87. Tryptophan is the most conserved residue in all matrix series, and aspartate 22 is conserved throughout the alignment: the negative matrix score of any substitution to tryptophan times the high conservation at that position for aspartate equals the most negative score in the profile. Position 16 has a valine assigned because it has the highest score, 37, but glycine also occurs several times, a score of 20. However, other residues are ranked in the substitution matrices as being quite similar to valine; therefore isoleucine and leucine also get similar scores, 24 and 14, and alanine occurs some of the time in the alignment so it gets a comparable score, 15.

Cons    A    B    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y    Z  Gap  Len
E      11   20  -11   27   33  -21   16   10   -4   10   -9   -6   16    6   18    0    8   17   -3  -29  -15   26   12   12
K       0   27  -40   21   22  -47   -6    7  -13  100  -20   13   27    7   27   53   14   13  -13    5  -40   28   12   12
! 11
P      13    3    4    3    3  -13    9    2    3    3   -2   -1    1   28    4    3   11   20    9  -21  -16    4   12   12
H      -7   26   -6   26   26   -6  -14   99  -18    6  -12  -19   33   13   46   33  -13   -6  -19   -7   20   33   12   12
I       3   -7    2   -7   -6   19   -6   -9   43   -7   29   22  -10   -4   -6  -10   -4    6   38  -17    1   -5   12   12
N      14   73  -19   47   33  -34   27   33  -20   27  -27  -20  100    0   26    7   22   14  -20  -20   -7   27   12   12
I       1  -10   -1  -10   -8   26   -9  -10   46   -8   34   27  -12   -6   -8  -12   -6    5   40  -12    4   -7   12   12
V      15    2    7    3    1   -1   20   -9   24   -6   14   11   -3    6   -3  -11    4   10   37  -30   -9   -1   12   12
V       9   -4    7   -5   -4    5    7   -8   29   -4   20   15   -6    4   -7   -9    0   19   36  -21   -2   -5   12   12
I       0  -16   16  -16  -16   55  -24  -24  118  -16   63   47  -24  -16  -24  -24   -8   16   87  -39    8  -16   12   12
G      55   47   16   55   39  -47  118  -16  -24   -8  -39  -24   31   24   16  -24   47   31   16  -79  -55   24   12   12
H      -6   27   -7   27   27   -8  -13  100  -20    7  -13  -20   34   14   48   34  -13   -7  -20   -7   19   34   12   12
! 21
V      11  -12   12  -12  -12   13   11  -18   67  -12   48   36  -18    5  -12  -18   -6   12   89  -47   -6  -12   12   12
D      24   87  -39  118   79  -79   55   31  -16   24  -39  -31   55    8   55    0   16   16  -16  -87  -39   71   12   12
S       9   12   11   11   11   -8    8   22   -7    5  -10  -10   14   11   11    9   23    4   -6    1   -2    9   12   12
G      55   47   16   55   39  -47  118  -16  -24   -8  -39  -24   31   24   16  -24   47   31   16  -79  -55   24   12   12
K       0   27  -40   20   20  -47   -7    7  -14  100  -20   13   27    7   27   55   13   13  -14    8  -40   27   12   12
S      19   14   30   10   10  -14   27   -9   -2   10  -17  -12   14   19   -5    3   63   24   -2    7  -19    1  100  100
T      40   20   20   20   20  -30   40  -10   20   20  -10    0   20   30  -10  -10   30  150   20  -60  -30   10  100  100
T       8   -4   -9   -4    0   13    1   -6   18    0   23   22   -2    2   -4   -9    0   34   18   -6   -2   -1  100  100
T      19    8   10    8    8  -12   19   -6   16    8    1    4    7   14   -6   -6   13   69   18  -32  -14    3  100  100
G      40   24   10   28   21  -27   61   -8  -11   -4  -19  -11   16   16    9  -14   26   18    9  -44  -28   13  100  100
! 31
H      10   11   -1   11   11  -10    1   34   -8    7   -8   -5   13   11   19   18    0    1   -6   -1    0   14  100  100
L      -4  -20  -27  -20  -13   50  -21  -10   43  -13   62   53  -17  -13   -7  -17  -15   -2   40   13   12   -9  100  100
*      20    0    0   27   12    3   73   70   65   46   38    0   24   11    5    6   33   85   65    0    0    0
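To see how such a matrix rewards and punishes individual residues, here is a minimal Python sketch that scores a short peptide against four rows copied from the profile excerpt above (consensus G, K, S, T, ending with the absolutely conserved threonine that scores 150). It is a deliberate simplification and an assumption on my part: it sums one score per position with no gaps, whereas real profile searching uses dynamic programming with the position-specific Gap and Len penalties.

```python
# Column order of the profile, as in its header line.
COLUMNS = "ABCDEFGHIKLMNPQRSTVWYZ"

# Four consecutive rows from the profile excerpt (Gap and Len columns omitted).
PROFILE_ROWS = [
    # consensus G
    [55, 47, 16, 55, 39, -47, 118, -16, -24, -8, -39, -24,
     31, 24, 16, -24, 47, 31, 16, -79, -55, 24],
    # consensus K
    [0, 27, -40, 20, 20, -47, -7, 7, -14, 100, -20, 13,
     27, 7, 27, 55, 13, 13, -14, 8, -40, 27],
    # consensus S
    [19, 14, 30, 10, 10, -14, 27, -9, -2, 10, -17, -12,
     14, 19, -5, 3, 63, 24, -2, 7, -19, 1],
    # consensus T (absolutely conserved, score 150)
    [40, 20, 20, 20, 20, -30, 40, -10, 20, 20, -10, 0,
     20, 30, -10, -10, 30, 150, 20, -60, -30, 10],
]


def score_ungapped(peptide, rows=PROFILE_ROWS):
    """Sum the profile score of each residue at its position (no gaps allowed)."""
    if len(peptide) != len(rows):
        raise ValueError("peptide must be as long as the profile segment")
    return sum(row[COLUMNS.index(res)] for res, row in zip(peptide, rows))


for peptide in ("GKST", "GKTT", "WKST"):
    print(peptide, score_ungapped(peptide))
```

The consensus peptide GKST gets the top score; swapping the serine for the similar threonine costs little, while putting a tryptophan on the conserved glycine is heavily penalized.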

Advanced methodologies: wondrous stuff based on combinations of the previous techniques, e.g.

PSI-BLAST uses profile methods to iterate database searches.

Profiles can be optimized with hidden Markov models (HMMs) or even discovered in unaligned sequences using expectation maximization (MEME).

Exon and intron structure can be predicted. See e.g. the genefinder at http://genomic.sanger.ac.uk/gf/gf.html and GrailEXP at http://grail.lsd.ornl.gov/grailexp/.

Secondary structure can often be predicted. See http://www.embl-heidelberg.de/predictprotein/predictprotein.html, which uses multiple sequence alignment profile techniques along with neural net technology. Even three-dimensional “homology modeling” will often lead to remarkably accurate results if the similarity is great enough between your protein and one whose structure has been solved through experimental means. See SwissModel at http://www.expasy.ch/swissmod/SWISS-MODEL.html.

Evolutionary relationships can be ascertained using a multiple sequence alignment and the methods of molecular phylogenetics. See the PAUP* and PHYLIP software packages. And if you’re really interested in this topic check out the Workshop on Molecular Evolution offered every August at the Woods Hole Marine Biological Laboratory and/or similar courses worldwide.

Finally, what’s the deal with DNA versus protein for searches and alignment?

All database similarity searching and sequence alignment, regardless of the algorithm, is far more sensitive at the amino acid level than with DNA. This is because proteins have twenty match criteria versus DNA’s four, and those four DNA bases can generally only be identical, not similar, to each other; and many DNA base changes (especially third position changes) do not change the encoded protein.

All of these factors drastically increase the ‘noise’ level of a DNA against DNA search, and give protein searches a much greater ‘look-back’ time, at least doubling it.

Therefore, whenever dealing with coding sequence, it is always prudent to work at the protein level!
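The third position effect is easy to demonstrate. The Python sketch below uses a small subset of the standard genetic code and two made-up coding sequences that differ at every third codon position; the sequences and function names are assumptions for illustration, chosen so that both encode the GHVDHGKS segment seen earlier.

```python
# Subset of the standard genetic code, enough for this example.
CODONS = {
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
    "CAT": "H", "CAC": "H",
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
    "GAT": "D", "GAC": "D",
    "AAA": "K", "AAG": "K",
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S",
}


def translate(dna):
    """Translate a coding sequence codon by codon (reading frame 1)."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences differing only at third codon positions.
dna1 = "GGTCATGTTGATCATGGTAAATCT"
dna2 = "GGCCACGTCGACCACGGCAAGTCC"

print("DNA identity:    ", round(identity(dna1, dna2), 2))
print("Protein identity:", identity(translate(dna1), translate(dna2)))
```

The DNA sequences are only two-thirds identical, yet the encoded peptides are identical, which is exactly the signal a DNA against DNA search throws away.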

FOR MORE INFO...

See http://bio.fsu.edu/~stevet/workshop.html and contact me ([email protected]) for further bioinformatics assistance.

Conclusions: A comprehensive sequence analysis software suite, such as the Wisconsin Package, expedites bioinformatics, putting a large assortment of tools all under one organizational model with one user interface.

The better you understand the chemical, physical, and biological system under study, the better your chance of success in its analysis. Certain strategies are inherently more appropriate than others. Making these types of subjective, discriminatory decisions is one of the most important ‘take-home’ messages I can offer!

Gunnar von Heijne, in his old but quite readable treatise, Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:

“Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and do not blindly accept everything the computer offers you.”

References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 403-410.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25, 3389-3402.

Bairoch, A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018.

Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A.

Genetics Computer Group (GCG), Inc. (Copyright 1982-2000) Program Manual for the Wisconsin Package, Version 10.1, Madison, Wisconsin, USA 53711.

Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, N.Y., U.S.A.

Gribskov, M., McLachlan, M., and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Sciences U.S.A. 84, 4355-4358.

Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89, 10915-10919.

Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443-453.

Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio 1994. Nucleic Acids Research 22, 3470-3473.

Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences U.S.A. 85, 2444-2448.

Rost, B. and Sander, C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. Journal of Molecular Biology 232, 584-599.

Smith, S.W., Overbeek, R., Woese, C.R., Gilbert, W., and Gillevet, P.M. (1994) The Genetic Data Environment, an Expandable GUI for Multiple Sequence Analysis. CABIOS 10, 671-675.

Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure (M.O. Dayhoff, editor) 5, Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C., U.S.A.

Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.

Sundaralingam, M., Mizuno, H., Stout, C.D., Rao, S.T., Liedman, M., and Yathindra, N. (1976) Mechanisms of Chain Folding in Nucleic Acids. The Omega Plot and its Correlation to the Nucleotide Geometry in Yeast tRNAPhe1. Nucleic Acids Research 10, 2471-2484.

Swofford, D.L., PAUP (Phylogenetic Analysis Using Parsimony) (1989-1993) Illinois Natural History Survey, (1994) personal copyright, and (1997) Smithsonian Institution, Washington D.C., U.S.A.

Thompson, J.D., Higgins, D.G., and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673-4680.

von Heijne, G. (1987) Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, California, U.S.A.

Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences U.S.A. 80, 726-730.

Zuker, M. (1989) On Finding All Suboptimal Foldings of an RNA Molecule. Science 244, 48-52.
