Nothing in (computational) biology makes sense except in the light of evolution


Nothing in (computational) biology makes sense except in the light of evolution

after Theodosius Dobzhansky (1970)

A brief history and some central principles of evolutionary (computational) genomics

A brief timeline of genomics

Year | Event | Ref.
1962 | The first theory of molecular evolution; the Molecular Clock concept (Linus Pauling and Emile Zuckerkandl) | [940]
1965 | Atlas of Protein Sequences, the first protein database (Margaret Dayhoff and coworkers) | [169]
1970 | Needleman-Wunsch algorithm for global protein sequence alignment | [602]
1976 | First RNA genome sequence (MS2 phage) determined directly from RNA (Walter Fiers) |
1977 | New DNA sequencing methods (Fred Sanger, Walter Gilbert and coworkers); bacteriophage φX174 sequence | [549,739]
1977 | First software for sequence analysis (Roger Staden) | [792]
1977 | Phylogenetic taxonomy; archaea discovered; the notion of the three primary kingdoms of life introduced (Carl Woese and coworkers) | [899]
1981 | Smith-Waterman algorithm for local protein sequence alignment | [779]
1981 | Human mitochondrial genome sequenced | [28]
1981 | The concept of a sequence motif (Russell Doolittle) | [181]
1982 | GenBank Release 3 made public |
1982 | Phage λ genome sequenced (Fred Sanger and coworkers) | [738]
1983 | The first practical sequence database searching algorithm (John Wilbur and David Lipman) | [886]
1985 | FASTP/FASTN: fast sequence similarity searching (William Pearson and David Lipman) | [517]
1986 | Introduction of Markov models for DNA analysis (Mark Borodovsky and coworkers) | [105]
1987 | First profile search algorithm (Michael Gribskov, Andrew McLachlan, David Eisenberg) | [311]
1988 | National Center for Biotechnology Information (NCBI) created at NIH/NLM |
1988 | EMBnet network for database distribution created |
1990 | BLAST: fast sequence similarity searching with rigorous statistics (Stephen Altschul, David Lipman and coworkers) | [20]
1991 | EST: expressed sequence tag sequencing (Craig Venter and coworkers) | [4]
1994 | Hidden Markov Models of multiple alignments (David Haussler and coworkers; Pierre Baldi and coworkers) | [69,70,469]
1994 | SCOP classification of protein structures (Alexei Murzin, Cyrus Chothia and coworkers) | [586]
1995 | First bacterial genomes completely sequenced | [228,238]
1996 | First archaeal genome completely sequenced | [127]
1996 | First eukaryotic genome (yeast) completely sequenced | [286]
1997 | Introduction of gapped BLAST and PSI-BLAST | [22]
1997 | COGs: Evolutionary classification of proteins from complete genomes | [823]
1998 | Worm genome, the first multicellular genome, (nearly) completely sequenced | [834]
1999 | Fly genome (nearly) completely sequenced | [3]
2001 | Human genome (nearly) completely sequenced | [484,864]

J. Mol. Biol. 1982 Dec 25;162(4):729-773

Nucleotide sequence of bacteriophage lambda DNA.

Sanger F, Coulson AR, Hong GF, Hill DF, Petersen GB.

The DNA in its circular form contains 48,502 pairs of nucleotides. … Open reading frames were identified and, where possible, ascribed to genes by comparing with the previously determined genetic map. The reading frames for 46 genes were clearly identified. … There are about 20 other unidentified reading frames that may code for proteins. …

Neither protein sequence comparison nor homology is mentioned in this paper.
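The step of identifying open reading frames can be illustrated with a minimal sketch. It is not the procedure used in the 1982 paper; it assumes the standard stop codons, ATG as the only start codon, and scans the forward strand only. The names `find_orfs` and `min_codons` are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the 0-based index of the ATG, end is exclusive and includes
    the stop codon. min_codons counts coding codons, excluding the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Toy sequence (hypothetical, not taken from the lambda genome).
print(find_orfs("CCATGAAATTTGGGTAACC"))
```

A scan like this only proposes candidate reading frames; assigning them to genes, as the abstract notes, relied on comparison with the genetic map.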

Table 1.2. Non-trivial evolutionary connections and functional predictions for bacteriophage λ proteins

Gene product | Evolutionary conservation | Structure, domain architecture^a | Predicted function, reference
A (TerL) | Bacteriophages, herpesviruses | A modified P-loop ATPase domain, distantly related to a vast class of helicases | ATPase subunit of the terminase, involved in DNA packaging in phage head
C | Bacteria and archaea | ClpP protease domain | Minor capsid protein, cleaves the scaffold protein during maturation
K | Bacteria, archaea and eukaryotes | Consists of an N-terminal JAB/MPN domain (predicted metalloprotease) and a C-terminal NLPC domain (uncharacterized domain found in bacterial lipoproteins) | Tail subunit; predicted protease involved in tail assembly (based on the presence of the JAB/MPN domain) [675]
Ea31 | Scattered distribution in bacteria and archaea | Endo VII-colicin domain | Predicted nuclease of the McrA (HNH) family [49]
Ea59 | Bacteria, archaea and eukaryotes | P-loop ATPase domain of the ABC class | Predicted ATPase [292]
Exo (RedX) | Bacteria, archaea, eukaryotes, viruses | Exonuclease domain, distantly related to a broad variety of nucleases | A nuclease involved in phage recombination and late rolling-circle replication
CI | Bacteria, archaea | N-terminal helix-turn-helix DNA-binding domain fused to a C-terminal serine protease domain of the LexA/UmuD family | Transcription repressor of genes required for lytic development
Cro | Bacteria, archaea | Helix-turn-helix DNA-binding domain | Transcription repressor of early genes
O | Bacteria, archaea | Helix-turn-helix DNA-binding domain | DNA-binding protein involved in the initiation of replication
Ren | Bacteria, archaea | Helix-turn-helix DNA-binding domain | Protein involved in exclusion of replication of heterologous genomes in λ-infected bacteria
Nin290 | Bacteria, archaea, eukaryotes | PP-loop ATPase domain | Predicted ATP pyrophosphatase, role in phage replication unknown [100]
Nin221 | Bacteria, archaea, eukaryotes | Calcineurin-like serine/threonine protein phosphatase domain | Protein phosphatase, role in phage replication unknown [446]


[Figure: Growth of the number of completely sequenced genomes, 1994 to 2002. Vertical axis: number of genomes (0 to 100). Series: Bacteria, Archaea, Eukaryotes, Total.]


Figure 1.2. The current state of annotation of some genomes. The data were derived from the original genome sequencing papers

Nothing in (computational) biology makes sense except in the light of evolution

after Theodosius Dobzhansky (1970)

[Diagram labels: Species 1, Species 2, Species 3]

Homology: common ancestry of genes or portions thereof (a qualitative notion, as opposed to similarity)

Evolution by Gene Duplication (Susumu Ohno, 1970)

Gene duplication with subsequent diversification: the principal path to innovation in evolution

Table 2.2. Expansion of signaling domains in C. elegans^a

Species | Proteins | Ser/Thr/Tyr kinase | Ser/Thr/Tyr phosphatase | BRCT | SH3 | VWA | WD40
C. elegans | 19,100 | 435 | 112 | 26 | 58 | 65 | 127
S. cerevisiae | 6,500 | 116 | 14 | 10 | 24 | 3 | 110
E. coli | 4,289 | 3 | 1 | 1 | 1 | 4 | 0
B. subtilis | 4,100 | 4 | 0 | 1 | 6 | 5 | 0
M. tuberculosis | 3,918 | 13 | 1 | 1 | 0 | 4 | 4
Synechocystis | 3,169 | 12 | 0 | 1 | 3 | 4 | 2
A. fulgidus | 2,420 | 4 | 0 | 0 | 0 | 2 | 0
M. thermoautotrophicum | 1,869 | 4 | 0 | 0 | 0 | 2 | 0
M. jannaschii | 1,715 | 4 | 2 | 0 | 0 | 3 | 0
A. aeolicus | 1,522 | 2 | 0 | 1 | 0 | 1 | 0

^a The data are from ref. [675]. Domain abbreviations are as in the SMART database (see §3.3): BRCT, BRCA1 C-terminal domain; SH3, Src homology 3 domain; VWA, von Willebrand factor A domain; WD40, Trp-Asp repeat domain.
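Because the genomes in Table 2.2 differ greatly in size, raw counts alone do not show whether a domain family is expanded. A minimal sketch, using only the numbers in the table, normalizes each count per 1,000 proteins. The names `TABLE_2_2`, `DOMAINS` and `per_thousand` are illustrative, and the normalization is one possible choice rather than the analysis of ref. [675].

```python
# Column order of the domain counts, as in Table 2.2.
DOMAINS = ("kinase", "phosphatase", "BRCT", "SH3", "VWA", "WD40")

# species -> (total proteins, domain counts in DOMAINS order); values copied from Table 2.2.
TABLE_2_2 = {
    "C. elegans": (19100, (435, 112, 26, 58, 65, 127)),
    "S. cerevisiae": (6500, (116, 14, 10, 24, 3, 110)),
    "E. coli": (4289, (3, 1, 1, 1, 4, 0)),
    "B. subtilis": (4100, (4, 0, 1, 6, 5, 0)),
    "M. tuberculosis": (3918, (13, 1, 1, 0, 4, 4)),
    "Synechocystis": (3169, (12, 0, 1, 3, 4, 2)),
    "A. fulgidus": (2420, (4, 0, 0, 0, 2, 0)),
    "M. thermoautotrophicum": (1869, (4, 0, 0, 0, 2, 0)),
    "M. jannaschii": (1715, (4, 2, 0, 0, 3, 0)),
    "A. aeolicus": (1522, (2, 0, 1, 0, 1, 0)),
}


def per_thousand(species, domain):
    """Number of proteins with the given domain per 1,000 proteins in the genome."""
    proteins, counts = TABLE_2_2[species]
    return 1000 * counts[DOMAINS.index(domain)] / proteins


# Compare kinase density across all species in the table.
for name in TABLE_2_2:
    print(f"{name:24s} {per_thousand(name, 'kinase'):6.2f} kinases per 1,000 proteins")
```

Normalizing this way separates the effect of a larger proteome from a genuine enrichment of signaling domains.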

[Figure: bar chart of the number of proteins per genome (0 to 6000), split into "in COGs" and "not in COGs", for Aae, Tma, Mge, Rpr, Hin, Mth, Afu, Ctr, Mja, Jhp, Tpa, Bsu, Eco, Hpy, Pho, Mpn, Cpn, Ssp, Mtu, Bbu, Sce.]

The majority of the proteins in each prokaryote, but only ~1/3 of yeast proteins, belong to COGs (ancient conserved families)

MOST OF THE COGs ARE REPRESENTED ONLY IN A SMALL NUMBER OF CLADES:

MAJOR ROLE OF HORIZONTAL GENE TRANSFER AND CLADE-SPECIFIC GENE LOSS IN EVOLUTION
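The observation that most COGs occur in only a few clades comes from tabulating phyletic patterns, that is, the set of clades in which each COG has a member. A minimal sketch of that tabulation follows. The data are hypothetical and purely illustrative (they are not taken from the COG database), and the names `PATTERNS`, `clade_count_histogram` and `fraction_narrow` are assumptions.

```python
from collections import Counter

# Hypothetical phyletic patterns: COG identifier -> clades with at least one member.
PATTERNS = {
    "COG_a": {"clade1", "clade2", "clade3", "clade4"},
    "COG_b": {"clade1", "clade2"},
    "COG_c": {"clade3"},
    "COG_d": {"clade2", "clade4"},
    "COG_e": {"clade1"},
}


def clade_count_histogram(patterns):
    """Map 'number of clades' to 'number of COGs represented in that many clades'."""
    return Counter(len(clades) for clades in patterns.values())


def fraction_narrow(patterns, max_clades):
    """Fraction of COGs represented in at most max_clades clades."""
    narrow = sum(1 for clades in patterns.values() if len(clades) <= max_clades)
    return narrow / len(patterns)


print(sorted(clade_count_histogram(PATTERNS).items()))
print(fraction_narrow(PATTERNS, 2))
```

A histogram skewed toward small clade counts is the kind of pattern that horizontal gene transfer and clade-specific gene loss would produce.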

[Diagram labels: ancestor, speciation, descendants, gene loss]

Non-orthologous displacement: two unrelated (or distantly related) proteins perform the same essential function

Figure 2.3. Structural alignment of goose lysozyme (PDB code 153L), chicken egg white lysozyme (3LZT) and lysozymes from E. coli bacteriophages λ (1AM7) and T4 (1L92).

153L .GEKLC.VE.PAVIAGIISRESHAG..KVLK....NGWGD...R..........
3LZT gLDNYRgYS.LGNWVCAAKFESNFN.........tQATNR...N..........
1AM7 .mvEIN.NQrKAFLDMLAWSEGTDngrQKTRnhgyDVIVGgelftdysdhprkl
1L92 ..........MNIFEMLRIDEG...........lrlKIYKdteG..........

153L ........GNGFGLMQVDKRSH...............KP........QG..TWN
3LZT .....tdgsTDYGILQINSRWWcndgrtpgsrnlcniPC........SAllSSD
1AM7 vtlnpklkSTGAGRYQLLSRWW...............DayrkqlglkDF..SP.
1L92 ........YYTIG.IGHLLT.........kspslnaakseldkaigrntngvIT

153L .GEVHITQGTTILINF.IKTIQK...KFPS.WTKD..QQLKGGISAYNAGAGNVR
3LZT ITASVNCAKKIVSDG.N........................GMNAWV.......
1AM7 ..KSQDAVALQQIKERgALPM...........idR..GDIRQAIDRCSN....iw
1L92 .KDEAEKLFNQDVDAA.VRGILRnakLKPVyDSLDavRRAAIINMVFQMGETGVA

153L .SYARMDIGT....................THDDYANDVV....ARAQYYKQHGY
3LZT ................................awRNRCK...gTDVQAWIRGCr
1AM7 .aslpGAGY...................gqfEHKA.DSLI....AKFKEAGgtvr
1L92 .gftnslrmlqqkrwdeaavnlaksrwynqTPNRAkrvittfrtgtwDAYK....

Structure-based sequence alignment of goose lysozyme (153L), chicken egg white lysozyme (3LZT) and lysozymes from E. coli bacteriophages λ (1AM7) and T4 (1L92).
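How little sequence these four lysozymes share can be checked directly on the alignment. The sketch below takes the second block of the alignment and lists the columns in which all four sequences carry the same residue. It assumes that uppercase letters mark structurally aligned residues, lowercase letters mark unaligned ones, and dots mark gaps; that reading of the notation is an assumption, as are the names `BLOCK` and `invariant_columns`.

```python
# Second block of the structure-based alignment, copied from the slide.
BLOCK = {
    "153L": "........GNGFGLMQVDKRSH...............KP........QG..TWN",
    "3LZT": ".....tdgsTDYGILQINSRWWcndgrtpgsrnlcniPC........SAllSSD",
    "1AM7": "vtlnpklkSTGAGRYQLLSRWW...............DayrkqlglkDF..SP.",
    "1L92": "........YYTIG.IGHLLT.........kspslnaakseldkaigrntngvIT",
}


def invariant_columns(rows):
    """Return 0-based indices of columns where every row has the same uppercase residue.

    Gaps ('.') and lowercase (assumed unaligned) residues never count as invariant.
    """
    rows = list(rows)
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all alignment rows must have the same length")
    hits = []
    for index, column in enumerate(zip(*rows)):
        first = column[0]
        if first.isupper() and all(residue == first for residue in column):
            hits.append(index)
    return hits


columns = invariant_columns(BLOCK.values())
width = len(BLOCK["153L"])
print(f"invariant columns: {columns} ({len(columns)} of {width})")
```

Strict invariance is a deliberately harsh criterion; a substitution matrix would give a more graded view of conservation.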

Only a small fraction of amino acid residues is directly involved in protein function (including enzymatic activity); the rest of the protein serves largely as a structural scaffold

Significant sequence conservation is evidence of homology

Proteins with different structural folds can perform the same function: non-orthologous displacement

Proteins (domains) with the same fold are most likely to be homologous

Convergence does not produce significant sequence or structural similarity