Nothing in (computational) biology makes sense except in the light of evolution

after Theodosius Dobzhansky (1973)

A brief timeline of genomics

Year | Event | Ref.
1962 | The first theory of molecular evolution; the Molecular Clock concept (Linus Pauling and Emile Zuckerkandl) | [940]
1965 | Atlas of Protein Sequences, the first protein database (Margaret Dayhoff and coworkers) | [169]
1970 | Needleman-Wunsch algorithm for global protein sequence alignment | [602]
1976 | First RNA genome sequence (MS2 phage) determined directly from RNA (Walter Fiers) |
1977 | New DNA sequencing methods (Fred Sanger, Walter Gilbert and coworkers); bacteriophage φX174 sequence | [549,739]

A brief history and some central principles of evolutionary (computational) genomics

Year | Event | Ref.
1977 | First software for sequence analysis (Roger Staden) | [792]
1977 | Phylogenetic taxonomy; archaea discovered; the notion of the three primary kingdoms of life introduced (Carl Woese and coworkers) | [899]
1981 | Smith-Waterman algorithm for local protein sequence alignment | [779]
1981 | Human mitochondrial genome sequenced | [28]
1981 | The concept of a sequence motif (Russell Doolittle) | [181]
1982 | GenBank Release 3 made public |
1982 | Phage λ genome sequenced (Fred Sanger and coworkers) | [738]
1983 | The first practical sequence database searching algorithm (John Wilbur and David Lipman) | [886]
1985 | FASTP/FASTN: fast sequence similarity searching (William Pearson and David Lipman) | [517]
1986 | Introduction of Markov models for DNA analysis (Mark Borodovsky and coworkers) | [105]
1987 | First profile search algorithm (Michael Gribskov, Andrew McLachlan, David Eisenberg) | [311]
1988 | National Center for Biotechnology Information (NCBI) created at NIH/NLM |
1988 | EMBnet network for database distribution created |
1990 | BLAST: fast sequence similarity searching with rigorous statistics (Stephen Altschul, David Lipman and coworkers) | [20]
1991 | EST: expressed sequence tag sequencing (Craig Venter and coworkers) | [4]
1994 | Hidden Markov Models of multiple alignments (David Haussler and coworkers; Pierre Baldi and coworkers) | [69,70,469]
1994 | SCOP classification of protein structures (Alexei Murzin, Cyrus Chothia and coworkers) | [586]
1995 | First bacterial genomes completely sequenced | [228,238]
1996 | First archaeal genome completely sequenced | [127]
1996 | First eukaryotic genome (yeast) completely sequenced | [286]
1997 | Introduction of gapped BLAST and PSI-BLAST | [22]
1997 | COGs: Evolutionary classification of proteins from complete genomes | [823]
1998 | Worm genome, the first multicellular genome, (nearly) completely sequenced | [834]
1999 | Fly genome (nearly) completely sequenced | [3]
2001 | Human genome (nearly) completely sequenced | [484,864]

J Mol Biol. 1982 Dec 25;162(4):729-73

Nucleotide sequence of bacteriophage lambda DNA.

Sanger F, Coulson AR, Hong GF, Hill DF, Petersen GB.

The DNA in its circular form contains 48,502 pairs of nucleotides. … Open reading frames were identified and, where possible, ascribed to genes by comparing with the previously determined genetic map. The reading frames for 46 genes were clearly identified … There are about 20 other unidentified reading frames that may code for proteins. …

Neither protein sequence comparison nor homology is mentioned in this paper...
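To make "open reading frames were identified" concrete, here is a minimal sketch of an ORF scan. It is not the procedure used in the 1982 paper: it assumes the standard stop codons, requires an ATG start, and scans only the given strand. The function name `find_orfs`, the `min_codons` threshold and the toy sequence are all hypothetical.

```python
# Minimal ORF scan on one strand (assumption: ATG start, standard stop codons).
# Real annotation also scans the reverse complement and uses longer cutoffs.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) pairs, 0-based with end exclusive.

    Each pair spans an ATG through the first in-frame stop codon, inclusive.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the frame and keep scanning
    return sorted(orfs)


toy = "CCATGAAATTTGGGTAACC"  # hypothetical sequence, not from phage lambda
print(find_orfs(toy))  # [(2, 17)]
```

The scan only reports where a protein could be encoded. Assigning a function to such a frame requires comparing it with known sequences.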

Table 1.2. Non-trivial evolutionary connections and functional predictions for bacteriophage λ proteins

Gene product | Evolutionary conservation | Structure, domain architecture (a) | Predicted function, reference
A (TerL) | Bacteriophages, herpesviruses | A modified P-loop ATPase domain, distantly related to a vast class of helicases | ATPase subunit of the terminase, involved in DNA packaging in phage head
C | Bacteria and archaea | ClpP protease domain | Minor capsid protein, cleaves the scaffold protein during maturation
K | Bacteria, archaea and eukaryotes | Consists of an N-terminal JAB/MPN domain (predicted metalloprotease) and a C-terminal NLPC domain (uncharacterized domain found in bacterial lipoproteins) | Tail subunit; predicted protease involved in tail assembly (based on the presence of the JAB/MPN domain) [675]
Ea31 | Scattered distribution in bacteria and archaea | Endo VII-colicin domain | Predicted nuclease of the McrA (HNH) family [49]
Ea59 | Bacteria, archaea and eukaryotes | P-loop ATPase domain of the ABC class | Predicted ATPase [292]
Exo (RedX) | Bacteria, archaea, eukaryotes, viruses | Exonuclease domain, distantly related to a broad variety of nucleases | A nuclease involved in phage recombination and late rolling-circle replication
CI | Bacteria, archaea | N-terminal helix-turn-helix DNA-binding domain fused to a C-terminal serine protease domain of the LexA/UmuD family | Transcription repressor of genes required for lytic development
Cro | Bacteria, archaea | Helix-turn-helix DNA-binding domain | Transcription repressor of early genes
O | Bacteria, archaea | Helix-turn-helix DNA-binding domain | DNA-binding protein involved in the initiation of replication
Ren | Bacteria, archaea | Helix-turn-helix DNA-binding domain | Protein involved in exclusion of replication of heterologous genomes in λ-infected bacteria
Nin290 | Bacteria, archaea, eukaryotes | PP-loop ATPase domain | Predicted ATP pyrophosphatase, role in phage replication unknown [100]
Nin221 | Bacteria, archaea, eukaryotes | Calcineurin-like serine/threonine protein phosphatase domain | Protein phosphatase, role in phage replication unknown [446]


[Figure: Growth of the number of completely sequenced genomes. x-axis: year (1994 to 2002); y-axis: number of genomes (0 to 100); series: Bacteria, Archaea, Eukaryotes, Total]

Figure 1.2. The current state of annotation of some genomes. The data were derived from the original genome sequencing papers


[Figure labels: Species 1, Species 2, Species 3]

Homology: common ancestry of genes or portions thereof (a qualitative notion, as opposed to similarity)
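The contrast between homology and similarity can be illustrated with a small sketch: similarity is a number computed from an alignment, whereas homology is a yes-or-no inference about ancestry that such numbers can only support. The function name `percent_identity`, the gap symbol and the two aligned sequences are hypothetical, and identity over gap-free columns is just one of several common similarity measures.

```python
def percent_identity(a, b, gap="-"):
    """Percent identical residues over alignment columns without gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where both sequences have a residue.
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


# Hypothetical aligned fragments, for illustration only.
seq1 = "MKT-AYIA"
seq2 = "MKSQAYLA"
print(round(percent_identity(seq1, seq2), 1))  # 71.4
```

A value such as 71.4% is a measurement. Whether the two sequences are homologous is a separate conclusion drawn from it.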

Evolution by Gene Duplication (Susumu Ohno, 1970)

Gene duplication with subsequent diversification: the principal path to innovation in evolution

Table 2.2. Expansion of signaling domains in C. elegans (a)

Species | Proteins | Ser/Thr/Tyr kinase | Ser/Thr/Tyr phosphatase | BRCT | SH3 | VWA | WD40
C. elegans | 19,100 | 435 | 112 | 26 | 58 | 65 | 127
S. cerevisiae | 6,500 | 116 | 14 | 10 | 24 | 3 | 110
E. coli | 4,289 | 3 | 1 | 1 | 1 | 4 | 0
B. subtilis | 4,100 | 4 | 0 | 1 | 6 | 5 | 0
M. tuberculosis | 3,918 | 13 | 1 | 1 | 0 | 4 | 4
Synechocystis | 3,169 | 12 | 0 | 1 | 3 | 4 | 2
A. fulgidus | 2,420 | 4 | 0 | 0 | 0 | 2 | 0
M. thermoautotrophicum | 1,869 | 4 | 0 | 0 | 0 | 2 | 0
M. jannaschii | 1,715 | 4 | 2 | 0 | 0 | 3 | 0
A. aeolicus | 1,522 | 2 | 0 | 1 | 0 | 1 | 0

(a) The data are from ref. [675]. Domain abbreviations are as in the SMART database (see §3.3): BRCT, BRCA1 C-terminal domain; SH3, Src homology 3 domain; VWA, von Willebrand factor A domain; WD40, Trp-Asp repeat domain.
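Raw domain counts in Table 2.2 grow with proteome size, so one way to compare species is to normalize them. The sketch below does this for the Ser/Thr/Tyr kinase column. The counts are copied from the table, but the per-1000-proteins measure and the names `TABLE_2_2` and `per_thousand` are illustrative choices, not part of the source.

```python
# (total proteins, Ser/Thr/Tyr kinase domains), copied from Table 2.2.
TABLE_2_2 = {
    "C. elegans": (19100, 435),
    "S. cerevisiae": (6500, 116),
    "E. coli": (4289, 3),
    "M. jannaschii": (1715, 4),
}


def per_thousand(domain_count, protein_count):
    """Domain occurrences per 1000 proteins encoded in the genome."""
    return 1000.0 * domain_count / protein_count


kinase_density = {
    species: round(per_thousand(kinases, proteins), 1)
    for species, (proteins, kinases) in TABLE_2_2.items()
}

for species, density in kinase_density.items():
    print(f"{species}: {density} kinase domains per 1000 proteins")
```

Under this normalization the two eukaryotes come out at roughly 18 to 23 kinase domains per 1000 proteins, against fewer than 3 for the two prokaryotes shown.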

[Figure: number of proteins in COGs and not in COGs for each genome. y-axis: Number of proteins (0 to 6000); genomes: Aae, Tma, Mge, Rpr, Hin, Mth, Afu, Ctr, Mja, Jhp, Tpa, Bsu, Eco, Hpy, Pho, Mpn, Cpn, Ssp, Mtu, Bbu, Sce]

The majority of the proteins in each prokaryote, but only ~1/3 of yeast proteins, belong to COGs (ancient conserved families)

MOST OF THE COGs ARE REPRESENTED ONLY IN A SMALL NUMBER OF CLADES

MAJOR ROLE OF HORIZONTAL GENE TRANSFER AND CLADE-SPECIFIC GENE LOSS IN EVOLUTION
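The observation about clade representation amounts to counting, for each COG, how many clades contain at least one member (its phyletic pattern). The following is a minimal sketch of that bookkeeping on made-up data. The COG names, clade names and helper functions are hypothetical and do not come from the COG database.

```python
from collections import Counter

# Hypothetical toy data: COG identifier -> set of clades with at least one member.
patterns = {
    "cogA": {"clade1", "clade2", "clade3", "clade4"},
    "cogB": {"clade1"},
    "cogC": {"clade2", "clade3"},
    "cogD": {"clade4"},
}


def clade_count_histogram(patterns):
    """Map number of clades -> number of COGs represented in that many clades."""
    return Counter(len(clades) for clades in patterns.values())


def fraction_narrow(patterns, max_clades):
    """Fraction of COGs found in at most max_clades clades."""
    narrow = sum(1 for clades in patterns.values() if len(clades) <= max_clades)
    return narrow / len(patterns)


print(dict(clade_count_histogram(patterns)))  # {4: 1, 1: 2, 2: 1}
print(fraction_narrow(patterns, 2))  # 0.75
```

A histogram skewed toward small clade counts is the pattern the slide attributes to horizontal gene transfer and clade-specific gene loss.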

[Figure labels: ancestor, speciation, descendants, gene loss]

Non-orthologous displacement: two unrelated (or distantly related) proteins for the same essential function

Figure 2.3. Structure-based sequence alignment of goose lysozyme (PDB code 153L), chicken egg white lysozyme (3LZT) and lysozymes from E. coli bacteriophages λ (1AM7) and T4 (1L92).

153L .GEKLC.VE.PAVIAGIISRESHAG..KVLK....NGWGD...R..........
3LZT gLDNYRgYS.LGNWVCAAKFESNFN.........tQATNR...N..........
1AM7 .mvEIN.NQrKAFLDMLAWSEGTDngrQKTRnhgyDVIVGgelftdysdhprkl
1L92 ..........MNIFEMLRIDEG...........lrlKIYKdteG..........

153L ........GNGFGLMQVDKRSH...............KP........QG..TWN
3LZT .....tdgsTDYGILQINSRWWcndgrtpgsrnlcniPC........SAllSSD
1AM7 vtlnpklkSTGAGRYQLLSRWW...............DayrkqlglkDF..SP.
1L92 ........YYTIG.IGHLLT.........kspslnaakseldkaigrntngvIT

153L .GEVHITQGTTILINF.IKTIQK...KFPS.WTKD..QQLKGGISAYNAGAGNVR
3LZT ITASVNCAKKIVSDG.N........................GMNAWV.......
1AM7 ..KSQDAVALQQIKERgALPM...........idR..GDIRQAIDRCSN....iw
1L92 .KDEAEKLFNQDVDAA.VRGILRnakLKPVyDSLDavRRAAIINMVFQMGETGVA

153L .SYARMDIGT....................THDDYANDVV....ARAQYYKQHGY
3LZT ................................awRNRCK...gTDVQAWIRGCr
1AM7 .aslpGAGY...................gqfEHKA.DSLI....AKFKEAGgtvr
1L92 .gftnslrmlqqkrwdeaavnlaksrwynqTPNRAkrvittfrtgtwDAYK....
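The first block of the alignment above can be scanned for columns in which all four lysozymes carry the same residue. The sketch below assumes the common convention that uppercase letters mark structurally aligned residues, while lowercase letters and dots mark unaligned residues and gaps; the source does not state this. The function name `conserved_columns` is hypothetical.

```python
def conserved_columns(rows):
    """Return (index, residue) for columns where every row has the same
    uppercase residue; indices are 0-based."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("alignment rows must have equal length")
    hits = []
    for index, column in enumerate(zip(*rows)):
        residues = set(column)
        if len(residues) == 1:
            (residue,) = residues
            if residue.isupper():  # skips gaps ('.') and lowercase residues
                hits.append((index, residue))
    return hits


# First block of the alignment: 153L, 3LZT, 1AM7, 1L92 (54 columns each).
block = [
    ".GEKLC.VE.PAVIAGIISRESHAG..KVLK....NGWGD...R..........",
    "gLDNYRgYS.LGNWVCAAKFESNFN.........tQATNR...N..........",
    ".mvEIN.NQrKAFLDMLAWSEGTDngrQKTRnhgyDVIVGgelftdysdhprkl",
    "..........MNIFEMLRIDEG...........lrlKIYKdteG..........",
]

print(conserved_columns(block))  # [(20, 'E')]
```

In this block only one of the 54 columns, a glutamate, is identical in all four proteins.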

Only a small fraction of amino acid residues is directly involved in protein function (including enzymatic activity); the rest of the protein serves largely as a structural scaffold

Significant sequence conservation is evidence of homology

Proteins with different structural folds can perform the same function - non-orthologous displacement

Proteins (domains) with the same fold are most likely to be homologous

Convergence does not produce significant sequence or structural similarity
