florida state university — bioinformatics workshop #1

30
Sept. 21, 2006, 5:30 Sept. 21, 2006, 5:30 PM PM Florida State University — Florida State University — Bioinformatics Workshop #1 Bioinformatics Workshop #1 An Introduction to Multiple An Introduction to Multiple Sequence Alignment & Analysis thru Sequence Alignment & Analysis thru GCG’s SeqLab GCG’s SeqLab Steven M. Thompson Steven M. Thompson Florida State Florida State University School of University School of Computational Science Computational Science (SCS) (SCS)

Upload: janna

Post on 23-Jan-2016

43 views

Category:

Documents


0 download

DESCRIPTION

An Introduction to Multiple Sequence Alignment & Analysis thru GCG’s SeqLab. Florida State University — Bioinformatics Workshop #1. Steven M. Thompson Florida State University School of Computational Science ( SCS ). Sept. 21, 2006, 5:30 PM. But first a prelude: My definitions —. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Florida State University — Bioinformatics Workshop #1

Sept. 21, 2006, 5:30 PMSept. 21, 2006, 5:30 PM

Florida State University — Florida State University — Bioinformatics Workshop #1Bioinformatics Workshop #1

An Introduction to Multiple Sequence An Introduction to Multiple Sequence Alignment & Analysis thru GCG’s SeqLabAlignment & Analysis thru GCG’s SeqLab

Steven M. ThompsonSteven M. Thompson

Florida State University Florida State University School of Computational School of Computational

Science (SCS)Science (SCS)

Page 2: Florida State University — Bioinformatics Workshop #1

But first a prelude: My definitions —

BiocomputingBiocomputing and and computational biologycomputational biology are synonymous and are synonymous and

describe the use of computers and computational techniques to describe the use of computers and computational techniques to

analyze any biological system, from molecules, through cells, analyze any biological system, from molecules, through cells,

tissues, and organisms, all the way to populations.tissues, and organisms, all the way to populations.

BioinformaticsBioinformatics describes using computational techniques to access, describes using computational techniques to access,

analyze, and interpret the biological information in any of the analyze, and interpret the biological information in any of the

available biological databases.available biological databases.

Sequence analysisSequence analysis is the study of molecular sequence data for the is the study of molecular sequence data for the

purpose of inferring the function, mechanism, interactions, purpose of inferring the function, mechanism, interactions,

evolution, and perhaps structure of biological molecules.evolution, and perhaps structure of biological molecules.

GenomicsGenomics analyzes the context of genes or complete genomes (the analyzes the context of genes or complete genomes (the

total DNA content of an organism) within and across genomes.total DNA content of an organism) within and across genomes.

ProteomicsProteomics is the subdivision of genomics concerned with analyzing is the subdivision of genomics concerned with analyzing

the complete protein complement, i.e. the proteome, of the complete protein complement, i.e. the proteome, of

organisms, both within and between different organisms.organisms, both within and between different organisms.

Page 3: Florida State University — Bioinformatics Workshop #1

from a ‘virtual’ DNA sequence to actual molecular from a ‘virtual’ DNA sequence to actual molecular physical characterization, not the other way ‘round.physical characterization, not the other way ‘round.

Using bioinformatics tools, you can infer all Using bioinformatics tools, you can infer all sorts of functional, evolutionary, and, sorts of functional, evolutionary, and, structural insights into a gene product, structural insights into a gene product, without the need to isolate and purify massive without the need to isolate and purify massive amounts of protein! Eventually you can go on amounts of protein! Eventually you can go on to clone and express the gene based on that to clone and express the gene based on that analysis using PCR techniques.analysis using PCR techniques.

The computer and molecular databases are an The computer and molecular databases are an essential part of this process.essential part of this process.

And a ‘way’ to think about it:And a ‘way’ to think about it:The reverse biochemistry analogy —The reverse biochemistry analogy —

Page 4: Florida State University — Bioinformatics Workshop #1

The exponential growth of molecular sequence databasesYearYear BasePairs BasePairs SequencesSequences

19821982 680338 680338 606 606

19831983 2274029 2274029 2427 2427

19841984 3368765 3368765 4175 4175

19851985 5204420 5204420 5700 5700

19861986 9615371 9615371 9978 9978

19871987 1551477615514776 1458414584

19881988 23800000 23800000 2057920579

19891989 34762585 34762585 2879128791

19901990 49179285 49179285 3953339533

19911991 71947426 71947426 55627 55627

19921992 101008486 101008486 78608 78608

19931993 157152442 157152442 143492143492

19941994 217102462 217102462 215273 215273

19951995 384939485 384939485 555694555694

19961996 651972984 651972984 10212111021211

19971997 1160300687 1160300687 17658471765847

19981998 2008761784 2008761784 28378972837897

19991999 3841163011 3841163011 4864570 4864570

20002000 1110106628811101066288 1010602310106023

20012001 1584992143815849921438 1497631014976310

20022002 2850799016628507990166 22318883 22318883

20032003 3655336848536553368485 3096841830968418

20042004 4457574517644575745176 4060431940604319

20052005 5603773446256037734462 52016762 52016762

http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.htmlhttp://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

& cpu power —& cpu power —

Doubling time ~ 1 year!Doubling time ~ 1 year!

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Page 5: Florida State University — Bioinformatics Workshop #1

So what; why even bother? So what; why even bother?

Applications:Applications:

Probe/primer, and motif/profile design;Probe/primer, and motif/profile design;

Graphical illustrations;Graphical illustrations;

Comparative ‘homology’ inference;Comparative ‘homology’ inference;

Molecular evolutionary analysis.Molecular evolutionary analysis.

OK — well, how do you do it?OK — well, how do you do it?

Back to multiple sequence Back to multiple sequence alignment — Applicability?alignment — Applicability?

Page 6: Florida State University — Bioinformatics Workshop #1

Dynamic programming’s complexity Dynamic programming’s complexity increases exponentially with the number of increases exponentially with the number of sequences being compared —sequences being compared —

N-dimensional matrix . . . .N-dimensional matrix . . . .complexity=[sequence length]complexity=[sequence length]number of sequencesnumber of sequences

Page 7: Florida State University — Bioinformatics Workshop #1

See:See:

MSA (‘global’ within ‘bounding box’) andMSA (‘global’ within ‘bounding box’) and

PIMA (‘local’ portions only) on the multiple PIMA (‘local’ portions only) on the multiple alignment page at thealignment page at the

Baylor College of Medicine’s Search Baylor College of Medicine’s Search Launcher —Launcher —

http://searchlauncher.bcm.tmc.edu/ — but, — but,

severely limiting restrictions!severely limiting restrictions!

‘‘Global’ heuristic solutions —Global’ heuristic solutions —

Page 8: Florida State University — Bioinformatics Workshop #1

Therefore — Therefore — pairwise, pairwise,

progressive dynamic progressive dynamic

programming restricts the programming restricts the

solution to the neighbor-hood solution to the neighbor-hood

of only two sequences at a of only two sequences at a

time.time.

All sequences are All sequences are

compared, pairwise, and compared, pairwise, and

then each is aligned to its then each is aligned to its

most similar partner or group most similar partner or group

of partners. Each group of of partners. Each group of

partners is then aligned to partners is then aligned to

finish the complete multiple finish the complete multiple

sequence alignment.sequence alignment.

Multiple Sequence Dynamic Programming —Multiple Sequence Dynamic Programming —

Page 9: Florida State University — Bioinformatics Workshop #1

Reliability and the Reliability and the Comparative Approach —Comparative Approach —

explicit homologous correspondence;explicit homologous correspondence;

manual adjustments based on manual adjustments based on knowledge,knowledge,

especially structural, regulatory, and especially structural, regulatory, and functional sites.functional sites.

Therefore, editors like SeqLab andTherefore, editors like SeqLab and

the Ribosomal Database Project:the Ribosomal Database Project:

http://rdp.cme.msu.edu/index.jsp.http://rdp.cme.msu.edu/index.jsp.

Page 10: Florida State University — Bioinformatics Workshop #1

Structural & Functional correspondence in Structural & Functional correspondence in the Wisconsin Package’s SeqLab —the Wisconsin Package’s SeqLab —

Page 11: Florida State University — Bioinformatics Workshop #1

Work with proteins!Work with proteins!If at all possible —If at all possible —

Twenty match symbols versus four, plus Twenty match symbols versus four, plus

similarity! Way better signal to noise.similarity! Way better signal to noise.

Also guarantees no indels are placed within Also guarantees no indels are placed within

codons. So translate, then align.codons. So translate, then align.

Nucleotide sequences will only reliably align if Nucleotide sequences will only reliably align if

they are they are veryvery similarsimilar to each other. And to each other. And

they will require extensive hand editing and they will require extensive hand editing and

careful consideration.careful consideration.

Page 12: Florida State University — Bioinformatics Workshop #1

Beware of aligning apples and Beware of aligning apples and oranges oranges [[and grapefruitand grapefruit]]!!

Parologous Parologous versus versus orthologous;orthologous;

genomic versus genomic versus cDNA;cDNA;

mature versus mature versus precursor.precursor.

Page 13: Florida State University — Bioinformatics Workshop #1

Mask out uncertain areas —Mask out uncertain areas —

Page 14: Florida State University — Bioinformatics Workshop #1

Complications —Complications —Order dependence.Order dependence.

Not that big of a deal.Not that big of a deal.

Substitution matrices and gap penalties.Substitution matrices and gap penalties.

A very big deal!A very big deal!

Regional ‘realignment’ becomes incredibly Regional ‘realignment’ becomes incredibly

important, especially with sequences that important, especially with sequences that

have areas of high and low similarity have areas of high and low similarity

(GCG’ PileUp -InSitu option).(GCG’ PileUp -InSitu option).

Page 15: Florida State University — Bioinformatics Workshop #1

Complications cont. —Complications cont. —

Format hassles!Format hassles!

Specialized format conversion Specialized format conversion tools such as GCG’s tools such as GCG’s SeqConv+ program and SeqConv+ program and PAUPSearch wrapper.PAUPSearch wrapper.

Don Gilbert’s public domain Don Gilbert’s public domain ReadSeq program.ReadSeq program.

Page 16: Florida State University — Bioinformatics Workshop #1

Still more complications —Still more complications —

Indels and missing Indels and missing

data symbols (i.e. data symbols (i.e.

gaps) designation gaps) designation

discrepancy discrepancy

headaches —headaches —

., -, ~, ?, N, or X., -, ~, ?, N, or X

. . . . . Help!. . . . . Help!

Page 17: Florida State University — Bioinformatics Workshop #1

Web resources for pairwise, Web resources for pairwise, progressive multiple alignment —progressive multiple alignment —http://www.techfak.uni-bielefeld.de/bcd/Curric/

MulAli/welcome.html..

http://pbil.univ-lyon1.fr/alignment.html

http://www.ebi.ac.uk/clustalw/

http://searchlauncher.bcm.tmc.edu/

However, problems with very large datasets and huge However, problems with very large datasets and huge

multiple alignments make doing multiple sequence multiple alignments make doing multiple sequence

alignment on the Web impractical after your dataset alignment on the Web impractical after your dataset

has reached a certain size. You’ll know it when has reached a certain size. You’ll know it when

you’re there!you’re there!

Page 18: Florida State University — Bioinformatics Workshop #1

If large datasets become intractable for analysis on the Web, what other resources are available?

Desktop software solutions — public domain Desktop software solutions — public domain

programs are available, but . . . complicated to programs are available, but . . . complicated to

install, configure, and maintain. User must be install, configure, and maintain. User must be

pretty computer savvy. So, pretty computer savvy. So,

commercial software packages are available, e.g. commercial software packages are available, e.g.

MacVector, DS Gene, DNAsis, DNAStar, etc.,MacVector, DS Gene, DNAsis, DNAStar, etc.,

but . . . license hassles, big expense per machine, but . . . license hassles, big expense per machine,

and Internet and/or CD database access all and Internet and/or CD database access all

complicate matters!complicate matters!

Page 19: Florida State University — Bioinformatics Workshop #1

Therefore, UNIX server-based solutions —

Public domain solutions also exist, but now a very cooperative Public domain solutions also exist, but now a very cooperative

systems manager needs to maintain everything for users, so,systems manager needs to maintain everything for users, so,

commercial products, e.g. the Accelrys commercial products, e.g. the Accelrys GCGGCG Wisconsin Package Wisconsin Package

and the and the SeqLabSeqLab Graphical User Interface, simplify matters for Graphical User Interface, simplify matters for

administrators and users.administrators and users.

One license fee for an entire institution and very fast, convenient One license fee for an entire institution and very fast, convenient

database access on local server disks database access on local server disks without the need to without the need to

download and/or reformat sequencesdownload and/or reformat sequences. Connections from any . Connections from any

networked terminal or workstation anywhere/anytime!networked terminal or workstation anywhere/anytime!

Operating system:Operating system: UNIX command line operation hassles; UNIX command line operation hassles;

communications software — ssh and terminal emulation; X11 communications software — ssh and terminal emulation; X11

graphics; file transfer via scp/sftp; and editors — vi, emacs, pico graphics; file transfer via scp/sftp; and editors — vi, emacs, pico

(or desktop word processing followed by file transfer [save as "text (or desktop word processing followed by file transfer [save as "text

only, UNIX line breaks!"]). See the lab tutorial Appendix II.only, UNIX line breaks!"]). See the lab tutorial Appendix II.

Page 20: Florida State University — Bioinformatics Workshop #1

The Genetics Computer Group — The Accelrys Wisconsin Package for Sequence AnalysisThe Accelrys Wisconsin Package for Sequence Analysis

GCG began in 1982 in Oliver Smithies’ Genetics Dept. lab at the GCG began in 1982 in Oliver Smithies’ Genetics Dept. lab at the

University of Wisconsin, Madison; and then starting in 1990 it University of Wisconsin, Madison; and then starting in 1990 it

became a private company; which was acquired by the Oxford became a private company; which was acquired by the Oxford

Molecular Group, U.K., in 1997; and then by Pharmacopeia Inc., Molecular Group, U.K., in 1997; and then by Pharmacopeia Inc.,

U.S.A., in 2000; and then in 2004 Accelrys, San Diego, U.S.A., in 2000; and then in 2004 Accelrys, San Diego,

California, left Pharmacopeia to become an independent entity.California, left Pharmacopeia to become an independent entity.

The suite contains around 150 programs designed to work in a The suite contains around 150 programs designed to work in a

“toolbox” fashion. Several simple programs used in succession “toolbox” fashion. Several simple programs used in succession

can lead to very sophisticated results.can lead to very sophisticated results.

Also ‘internal compatibility,’ i.e. once you learn to use one program, Also ‘internal compatibility,’ i.e. once you learn to use one program,

all programs can be run similarly, and, the output from many all programs can be run similarly, and, the output from many

programs can be used as input for other programs.programs can be used as input for other programs.

Used all over the world at over 950 institutions, so learning it will Used all over the world at over 950 institutions, so learning it will

likely be useful at other research institutions as well.likely be useful at other research institutions as well.

Page 21: Florida State University — Bioinformatics Workshop #1

To answer the always perplexing GCG question — “What To answer the always perplexing GCG question — “What

sequence(s)? . . . .” Specifying sequences, GCG style;sequence(s)? . . . .” Specifying sequences, GCG style;

in order of increasing power and complexity —in order of increasing power and complexity —

The sequence is in a local GCG format single sequence file in your UNIX The sequence is in a local GCG format single sequence file in your UNIX

account. (GCG Reformat and SeqConv+ programs)account. (GCG Reformat and SeqConv+ programs)

The sequence is in a local GCG database in which case you ‘point’ to it by using The sequence is in a local GCG database in which case you ‘point’ to it by using

any of the GCG database logical names. A colon, “any of the GCG database logical names. A colon, “::,” always sets the logical ,” always sets the logical

name apart from either an accession number or a proper identifier name or a name apart from either an accession number or a proper identifier name or a

wildcard expression and they are case insensitive.wildcard expression and they are case insensitive.

The sequence is in a GCG format multiple sequence file, either an MSF (multiple The sequence is in a GCG format multiple sequence file, either an MSF (multiple

sequence format) file or an RSF (rich sequence format) file. To specify sequence format) file or an RSF (rich sequence format) file. To specify

sequences contained in a GCG multiple sequence file, supply the file name sequences contained in a GCG multiple sequence file, supply the file name

followed by a pair of braces, “followed by a pair of braces, “{}{},” containing the sequence specification, e.g. a ,” containing the sequence specification, e.g. a

wildcard — {wildcard — {**}.}.

Finally, the most powerful method of specifying sequences is in a GCG “list” file. Finally, the most powerful method of specifying sequences is in a GCG “list” file.

This is merely a list of other sequence specifications and can even contain This is merely a list of other sequence specifications and can even contain

other list files within it. The convention to use a GCG list file in a program is to other list files within it. The convention to use a GCG list file in a program is to

precede it with an at sign, “precede it with an at sign, “@@.” Furthermore, attribute information within list .” Furthermore, attribute information within list

files can specify particular sequence aspects.files can specify particular sequence aspects.

Page 22: Florida State University — Bioinformatics Workshop #1

!!NA_SEQUENCE 1.0!!NA_SEQUENCE 1.0

This is a small example of GCG single sequence format.This is a small example of GCG single sequence format.

Always put some documentation on top, so in the futureAlways put some documentation on top, so in the future

you can figure out what it is you're dealing with! Theyou can figure out what it is you're dealing with! The

line with the two periods is converted to the checksum line.line with the two periods is converted to the checksum line.

example.seq Length: 77 July 21, 1999 09:30 Type: N Check: 4099 ..example.seq Length: 77 July 21, 1999 09:30 Type: N Check: 4099 ..

1 ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA1 ACTGACGTCA CATACTGGGA CTGAGATTTA CCGAGTTATA CAAGTATACA

51 GATTTAATAG CATGCGATCC CATGGGA51 GATTTAATAG CATGCGATCC CATGGGA

‘‘Clean’ GCG format single sequence file Clean’ GCG format single sequence file

after Reformat or SeqConv+after Reformat or SeqConv+

SeqLab’s Editor mode can also SeqLab’s Editor mode can also

“Import” native GenBank or FastA “Import” native GenBank or FastA

format and ABI or LI-COR trace files!format and ABI or LI-COR trace files!

Page 23: Florida State University — Bioinformatics Workshop #1

Quoting directly from the GCG Program Quoting directly from the GCG Program

Manual:Manual:

““Advantages of Plus “+” Programs:Advantages of Plus “+” Programs:

√√ Plus programs are enhanced to be Plus programs are enhanced to be

able to read sequences in a variety of able to read sequences in a variety of

native formats such as GCG RSF, GCG SSF, native formats such as GCG RSF, GCG SSF,

GCG MSF, GenBank, EMBL, FastA, SwissProt, GCG MSF, GenBank, EMBL, FastA, SwissProt,

PIR, and BSML without conversion.PIR, and BSML without conversion.

√√ Plus programs remove sequence Plus programs remove sequence

length restriction of 350,000 bp.length restriction of 350,000 bp.

If you do not need these features and wish If you do not need these features and wish

to have more interactivity, you might wish to have more interactivity, you might wish

to seek out and run the original program to seek out and run the original program

version.”version.”

Hey, what’s the deal with the new “+” programs?Hey, what’s the deal with the new “+” programs?

Page 24: Florida State University — Bioinformatics Workshop #1

Sequence databases, nucleic acids:Sequence databases, nucleic acids:

GENBANKPLUS:*GENBANKPLUS:* all of GenBank plus EST, HTC, and GSSall of GenBank plus EST, HTC, and GSS SYNTHETIC:*SYNTHETIC:* GenBank syntheticGenBank synthetic

GBP:*GBP:* all of GenBank plus EST, HTC, and GSSall of GenBank plus EST, HTC, and GSS SY:*SY:* GenBank syntheticGenBank synthetic

GENBANK:*GENBANK:* all of GenBank except EST, HTC, and GSSall of GenBank except EST, HTC, and GSS UNANNOTATED:*UNANNOTATED:* GenBank unannotatedGenBank unannotated

GB:*GB:* all of GenBank except EST, HTC, and GSSall of GenBank except EST, HTC, and GSS UN:*UN:* GenBank unannotatedGenBank unannotated

BACTERIAL:*BACTERIAL:* GenBank bacteria and archaeaGenBank bacteria and archaea REFSEQNUC:*REFSEQNUC:* NCBI RefSeq transcriptomesNCBI RefSeq transcriptomes

BA:*BA:* GenBank bacteria and archaeaGenBank bacteria and archaea RS_RNA:*RS_RNA:* NCBI RefSeq transcriptomesNCBI RefSeq transcriptomes

INVERTEBRATE:*INVERTEBRATE:* GenBank invertebrateGenBank invertebrate

IN:*IN:* GenBank invertebrateGenBank invertebrate Genome sequence databases, nucleic acidsGenome sequence databases, nucleic acids

OTHERMAMMAL:*OTHERMAMMAL:* GenBank other mammalGenBank other mammal

OM:*OM:* GenBank other mammalGenBank other mammal HOMO:*HOMO:* NCBI human RefSeq working draftNCBI human RefSeq working draft

OTHERVERTEBRATE:*OTHERVERTEBRATE:* GenBank other vertebrateGenBank other vertebrate PAN:*PAN:* NCBI chimpanzee RefSeq working draftNCBI chimpanzee RefSeq working draft

OV:*OV:* GenBank other vertebrateGenBank other vertebrate DANIO:*DANIO:* Sanger Zebrafish assemblySanger Zebrafish assembly

PHAGE:*PHAGE:* GenBank phageGenBank phage CELEGANS:*CELEGANS:* NCBI nematidode RefSeq assemblyNCBI nematidode RefSeq assembly

PH:*PH:* GenBank phageGenBank phage

PLANT:*PLANT:* GenBank plant and fungiGenBank plant and fungi Sequence databases, amino acids:Sequence databases, amino acids:

PL:*PL:* GenBank plant and fungiGenBank plant and fungi

PRIMATE:*PRIMATE:* GenBank primateGenBank primate UNIPROT:*UNIPROT:* all of Swiss-Prot and all of SPTREMBLall of Swiss-Prot and all of SPTREMBL

PR:*PR:* GenBank primate GenBank primate UNI: *UNI: * all of Swiss-Prot and all of SPTREMBLall of Swiss-Prot and all of SPTREMBL

RODENT:*RODENT:* GenBank rodentGenBank rodent SWISSPROTPLUS:*SWISSPROTPLUS:* all of Swiss-Prot and all of SPTREMBLall of Swiss-Prot and all of SPTREMBL

RO:*RO:* GenBank rodentGenBank rodent SWP:*SWP:* all of Swiss-Prot and all of SPTREMBLall of Swiss-Prot and all of SPTREMBL

VI:*VI:* GenBank viralGenBank viral SWISSPROT:*SWISSPROT:* all of Swiss-Prot (fully annotated)all of Swiss-Prot (fully annotated)

VIRAL:*VIRAL:* GenBank viralGenBank viral SWISS:*SWISS:* all of Swiss-Prot (fully annotated)all of Swiss-Prot (fully annotated)

TAGS:*TAGS:* GenBank EST, HTC, and GSSGenBank EST, HTC, and GSS SW:*SW:* all of Swiss-Prot (fully annotated)all of Swiss-Prot (fully annotated)

EST:*EST:* GenBank EST Expressed Sequence TagsGenBank EST Expressed Sequence Tags SPTREMBL:*SPTREMBL:* Swiss-Prot preliminary EMBL translationsSwiss-Prot preliminary EMBL translations

GSS:*GSS:* GenBank Genome Survey SequencesGenBank Genome Survey Sequences SPT:*SPT:* Swiss-Prot preliminary EMBL translationsSwiss-Prot preliminary EMBL translations

HTC:*HTC:* GenBank High Throughput cDNAGenBank High Throughput cDNA GENPEPT:*GENPEPT:* all of GenBank’s CDS translationsall of GenBank’s CDS translations

HTG:*HTG:* GenBank High Throughput GenomicGenBank High Throughput Genomic GP:*GP:* all of GenBank’s CDS translationsall of GenBank’s CDS translations

PATENT:*PATENT:* GenBank patentGenBank patent REFSEQPROT:*REFSEQPROT:* NCBI RefSeq proteomesNCBI RefSeq proteomes

PAT:*PAT:* GenBank patentGenBank patent RS_PROT:*RS_PROT:* NCBI RefSeq proteomesNCBI RefSeq proteomes

STS:*STS:* GenBank Sequence Tagged SitesGenBank Sequence Tagged Sites

OK, on to logical terms for GCG —OK, on to logical terms for GCG —T

hese are easy — they m

ake sense and you’ll have T

hese are easy — they m

ake sense and you’ll have a vested interest. B

ut beware B

A and P

L . . . .a vested interest. B

ut beware B

A and P

L . . . .

Page 25: Florida State University — Bioinformatics Workshop #1

GCG MSF & RSF format —

The trick is to not forget the Braces and ‘wild card,’ e.g. The trick is to not forget the Braces and ‘wild card,’ e.g.

filename{filename{**}, when specifying!}, when specifying!

!!RICH_SEQUENCE 1.0!!RICH_SEQUENCE 1.0....{{name ef1a_gialaname ef1a_gialadescrip PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.listdescrip PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.listtype PROTEINtype PROTEINlongname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}sequence-ID Q08046sequence-ID Q08046checksum 7342checksum 7342offset 23offset 23creation-date 07/11/2001 16:51:19creation-date 07/11/2001 16:51:19strand 1strand 1comments ////////////////////////////////////////////////////////////comments ////////////////////////////////////////////////////////////

!!AA_MULTIPLE_ALIGNMENT 1.0!!AA_MULTIPLE_ALIGNMENT 1.0

small.pfs.msf MSF: 735 Type: P July 20, 2001 14:53 Check: 6619 ..small.pfs.msf MSF: 735 Type: P July 20, 2001 14:53 Check: 6619 ..

Name: a49171 Len: 425 Check: 537 Weight: 1.00Name: a49171 Len: 425 Check: 537 Weight: 1.00 Name: e70827 Len: 577 Check: 21 Weight: 1.00Name: e70827 Len: 577 Check: 21 Weight: 1.00 Name: g83052 Len: 718 Check: 9535 Weight: 1.00Name: g83052 Len: 718 Check: 9535 Weight: 1.00 Name: f70556 Len: 534 Check: 3494 Weight: 1.00Name: f70556 Len: 534 Check: 3494 Weight: 1.00 Name: t17237 Len: 229 Check: 9552 Weight: 1.00Name: t17237 Len: 229 Check: 9552 Weight: 1.00 Name: s65758 Len: 735 Check: 111 Weight: 1.00Name: s65758 Len: 735 Check: 111 Weight: 1.00 Name: a46241 Len: 274 Check: 3514 Weight: 1.00Name: a46241 Len: 274 Check: 3514 Weight: 1.00

// //////////////////////////////////////////////////// //////////////////////////////////////////////////

This is SeqLab’s native formatThis is SeqLab’s native format

Page 26: Florida State University — Bioinformatics Workshop #1

The List File Format —

!!!SEQUENCE_LIST 1.0!SEQUENCE_LIST 1.0

An example GCG list file of many elongation 1a An example GCG list file of many elongation 1a

and Tu factors follows. As with all GCG data and Tu factors follows. As with all GCG data

files, two periods separate documentation from files, two periods separate documentation from

data. ..data. ..

my-special.pepmy-special.pep begin:24begin:24 end:134end:134

SwissProt:EfTu_EcoliSwissProt:EfTu_Ecoli

Ef1a-Tu.msf{*}Ef1a-Tu.msf{*}

/usr/accounts/test/another.rsf{ef1a_*}/usr/accounts/test/another.rsf{ef1a_*}

@[email protected] ‘way’ SeqLab works!The ‘way’ SeqLab works!

remember the @ sign!remember the @ sign!

Page 27: Florida State University — Bioinformatics Workshop #1

SeqLab — GCG’s X-based GUI!

SeqLab is the merger of Steve Smith’s Genetic Data SeqLab is the merger of Steve Smith’s Genetic Data

Environment and GCG’s Wisconsin Package Interface:Environment and GCG’s Wisconsin Package Interface:

GDE + WPI = SeqLabGDE + WPI = SeqLab

Requires an X11-Windowing environment — either Requires an X11-Windowing environment — either

native on UNIX computers (including LINUX, but not native on UNIX computers (including LINUX, but not

included in default Apple Mac OS X installs, see Apple’s included in default Apple Mac OS X installs, see Apple’s

free X11 package or XDarwin), or with X-server free X11 package or XDarwin), or with X-server

emulation software on Windows personal computers.emulation software on Windows personal computers.

Page 28: Florida State University — Bioinformatics Workshop #1

FOR MORE INFO...FOR MORE INFO...

Explore my Web Home: http://bio.fsu.edu/~stevet/cv.html and Explore my Web Home: http://bio.fsu.edu/~stevet/cv.html and

contact me (contact me (stevetstevet@[email protected]) for further bioinformatics ) for further bioinformatics

assistance and collaboration.assistance and collaboration.

Gunnar von Heijne in his old but quite readable treatise, Gunnar von Heijne in his old but quite readable treatise, Sequence Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987), provides a very appropriate conclusion:(1987), provides a very appropriate conclusion:

““Think about what you’re doing; use your knowledge of the molecular Think about what you’re doing; use your knowledge of the molecular system involved to guide both your interpretation of results and your system involved to guide both your interpretation of results and your direction of inquiry; use as much information as possible; and direction of inquiry; use as much information as possible; and do not do not blindly accept everything the computer offers youblindly accept everything the computer offers you.”.”

He continues:He continues:

““. . . if any lesson is to be drawn . . . it surely is that to be able to make a . . . if any lesson is to be drawn . . . it surely is that to be able to make a useful contribution one must first and foremost be a biologist, and only useful contribution one must first and foremost be a biologist, and only second a theoretician . . . . We have to develop better algorithms, we second a theoretician . . . . We have to develop better algorithms, we have to find ways to cope with the massive amounts of data, and above have to find ways to cope with the massive amounts of data, and above all we have to become better biologists. But that’s all it takes.”all we have to become better biologists. But that’s all it takes.”

Conclusions —Conclusions —

Page 29: Florida State University — Bioinformatics Workshop #1

Many texts are becoming Many texts are becoming

available in the field.available in the field.

To ‘honk-my-own-horn’ a bit, check out:To ‘honk-my-own-horn’ a bit, check out:

Current Protocols in BioinformaticsCurrent Protocols in Bioinformatics

from John Wiley & Sons, Inc.,from John Wiley & Sons, Inc.,

(http://www.does.org/cp/bioinfo.html);(http://www.does.org/cp/bioinfo.html);

and from Horizon and from Horizon

Scientific Press,Scientific Press,

Computational Computational

Genomics: Theory and Genomics: Theory and

ApplicationApplication

((http://http://

www.horizonpress.com/hsp/www.horizonpress.com/hsp/

books/com.html).books/com.html).

AND FOR EVEN MORE INFO...

From Humana Press,From Humana Press,

Introduction to Bioinformatics:Introduction to Bioinformatics:

A Theoretical And Practical ApproachA Theoretical And Practical Approach

(http://www.humanapress.com/(http://www.humanapress.com/

Product.pasp?Product.pasp?

txtCatalog=HumanaBooks&txtCategorytxtCatalog=HumanaBooks&txtCategory

=&txtProductID=1-58829-241-=&txtProductID=1-58829-241-

X&isVariant=0)X&isVariant=0);;

They all asked me to They all asked me to

contribute chapters on contribute chapters on

multiple sequence multiple sequence

alignment and analysis alignment and analysis

using GCG software.using GCG software.

Page 30: Florida State University — Bioinformatics Workshop #1

Now for some practical examples —Now for some practical examples —

Some of my favorite multiple sequence files Some of my favorite multiple sequence files

(RSF) in the (RSF) in the SeqLabSeqLab Editor: Editor:

Human G-Protein coupled TM7 receptorsHuman G-Protein coupled TM7 receptors

andand

Elongation Factor 1Elongation Factor 1..

Now it’s your turn — on to the tutorial.Now it’s your turn — on to the tutorial.