Nothing in (computational) biology makes sense except in the light of evolution
after Theodosius Dobzhansky (1970)
A brief history and some central principles of evolutionary (computational) genomics
A brief timeline of genomics
Year | Event | Ref.
1962 | The first theory of molecular evolution; the Molecular Clock concept (Linus Pauling and Emile Zuckerkandl) | [940]
1965 | Atlas of Protein Sequences, the first protein database (Margaret Dayhoff and coworkers) | [169]
1970 | Needleman-Wunsch algorithm for global protein sequence alignment | [602]
1976 | First RNA genome sequence (MS2 phage) determined directly from RNA (Walter Fiers) |
1977 | New DNA sequencing methods (Fred Sanger, Walter Gilbert and coworkers); bacteriophage φX174 sequence | [549,739]
1977 | First software for sequence analysis (Roger Staden) | [792]
1977 | Phylogenetic taxonomy; archaea discovered; the notion of the three primary kingdoms of life introduced (Carl Woese and coworkers) | [899]
1981 | Smith-Waterman algorithm for local protein sequence alignment | [779]
1981 | Human mitochondrial genome sequenced | [28]
1981 | The concept of a sequence motif (Russell Doolittle) | [181]
1982 | GenBank Release 3 made public |
1982 | Phage λ genome sequenced (Fred Sanger and coworkers) | [738]
1983 | The first practical sequence database searching algorithm (John Wilbur and David Lipman) | [886]
1985 | FASTP/FASTN: fast sequence similarity searching (William Pearson and David Lipman) | [517]
1986 | Introduction of Markov models for DNA analysis (Mark Borodovsky and coworkers) | [105]
1987 | First profile search algorithm (Michael Gribskov, Andrew McLachlan, David Eisenberg) | [311]
1988 | National Center for Biotechnology Information (NCBI) created at NIH/NLM |
1988 | EMBnet network for database distribution created |
1990 | BLAST: fast sequence similarity searching with rigorous statistics (Stephen Altschul, David Lipman and coworkers) | [20]
1991 | EST: expressed sequence tag sequencing (Craig Venter and coworkers) | [4]
1994 | Hidden Markov Models of multiple alignments (David Haussler and coworkers; Pierre Baldi and coworkers) | [69,70,469]
1994 | SCOP classification of protein structures (Alexei Murzin, Cyrus Chothia and coworkers) | [586]
1995 | First bacterial genomes completely sequenced | [228,238]
1996 | First archaeal genome completely sequenced | [127]
1996 | First eukaryotic genome (yeast) completely sequenced | [286]
1997 | Introduction of gapped BLAST and PSI-BLAST | [22]
1997 | COGs: Evolutionary classification of proteins from complete genomes | [823]
1998 | Worm genome, the first multicellular genome, (nearly) completely sequenced | [834]
1999 | Fly genome (nearly) completely sequenced | [3]
2001 | Human genome (nearly) completely sequenced | [484,864]
J. Mol Biol 1982 Dec 25;162(4):729-73
Nucleotide sequence of bacteriophage lambda DNA.
Sanger F, Coulson AR, Hong GF, Hill DF, Petersen GB.
The DNA in its circular form contains 48,502 pairs of nucleotides. … Open reading frames were identified and, where possible, ascribed to genes by comparing with the previously determined genetic map. The reading frames for 46 genes were clearly identified … There are about 20 other unidentified reading frames that may code for proteins. …
Neither protein sequence comparison nor homology is mentioned in this paper.
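To make the notion of an open reading frame in the abstract above concrete, here is a minimal sketch of a forward-strand ORF scan. It is not the procedure used in the 1982 paper, which is not described in these slides. The function name `find_orfs`, the `min_codons` threshold and the toy sequence are assumptions for illustration, and the sketch assumes the standard genetic code with ATG as the only start codon.

```python
# Stop codons of the standard genetic code (an assumption; phage and
# organelle genomes may use other conventions).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    start is inclusive, end is exclusive and includes the stop codon.
    min_codons counts every codon from ATG through the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical toy sequence, not taken from the lambda genome.
toy = "CCATGAAATAGCC"
for start, end, frame in find_orfs(toy):
    print(frame, start, end, toy[start:end])
```

A real annotation pipeline would also scan the reverse complement and, as the abstract notes, compare candidate frames with independent evidence such as a genetic map.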
Table 1.2. Non-trivial evolutionary connections and functional predictions for bacteriophage λ proteins
Gene product | Evolutionary conservation | Structure, domain architecture | Predicted function, reference
A (TerL) | Bacteriophages, herpesviruses | A modified P-loop ATPase domain, distantly related to a vast class of helicases | ATPase subunit of the terminase, involved in DNA packaging in phage head
C | Bacteria and archaea | ClpP protease domain | Minor capsid protein, cleaves the scaffold protein during maturation
K | Bacteria, archaea and eukaryotes | Consists of an N-terminal JAB/MPN domain (predicted metalloprotease) and a C-terminal NLPC domain (uncharacterized domain found in bacterial lipoproteins) | Tail subunit; predicted protease involved in tail assembly (based on the presence of the JAB/MPN domain) [675]
Ea31 | Scattered distribution in bacteria and archaea | Endo VII-colicin domain | Predicted nuclease of the McrA (HNH) family [49]
Ea59 | Bacteria, archaea and eukaryotes | P-loop ATPase domain of the ABC class | Predicted ATPase [292]
Exo (RedX) | Bacteria, archaea, eukaryotes, viruses | Exonuclease domain, distantly related to a broad variety of nucleases | A nuclease involved in phage recombination and late rolling-circle replication
CI | Bacteria, archaea | N-terminal helix-turn-helix DNA-binding domain fused to a C-terminal serine protease domain of the LexA/UmuD family | Transcription repressor of genes required for lytic development
Cro | Bacteria, archaea | Helix-turn-helix DNA-binding domain | Transcription repressor of early genes
O | Bacteria, archaea | Helix-turn-helix DNA-binding domain | DNA-binding protein involved in the initiation of replication
Ren | Bacteria, archaea | Helix-turn-helix DNA-binding domain | Protein involved in exclusion of replication of heterologous genomes in λ-infected bacteria
Nin290 | Bacteria, archaea, eukaryotes | PP-loop ATPase domain | Predicted ATP pyrophosphatase, role in phage replication unknown [100]
Nin221 | Bacteria, archaea, eukaryotes | Calcineurin-like serine/threonine protein phosphatase domain | Protein phosphatase, role in phage replication unknown [446]
[Figure: Growth of the number of completely sequenced genomes. x-axis: year, 1994 to 2002; y-axis: number of genomes, 0 to 100; series: Bacteria, Archaea, Eukaryotes, Total.]
Figure 1.2. The current state of annotation of some genomes. The data were derived from the original genome sequencing papers.
[Figure: diagram labeled Species 1, Species 2, Species 3.]
Homology: common ancestry of genes or portions thereof (a qualitative notion, as opposed to similarity)
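The contrast between homology and similarity can be illustrated with a small sketch. Similarity is something one measures on an alignment, for example as percent identity, whereas homology is a yes-or-no conclusion about common ancestry that is inferred from such measurements. The function name `percent_identity`, the choice of denominator (columns where neither row has a gap; other conventions exist) and the toy fragments are assumptions for illustration.

```python
def percent_identity(a, b, gap="-"):
    """Percent of gap-free aligned columns in which both rows carry the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    # Keep only columns where both sequences have a residue.
    compared = [(x, y) for x, y in zip(a.upper(), b.upper())
                if x != gap and y != gap]
    if not compared:
        return 0.0
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)


# Hypothetical aligned fragments, for illustration only.
row1 = "MKT-LLVAG"
row2 = "MRTALL-AG"
print(f"{percent_identity(row1, row2):.1f}% identity")
```

The number returned is a degree of similarity. Whether the two sequences are homologous is a separate question: they either share a common ancestor or they do not.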
Evolution by gene duplication, 1970
Gene duplication with subsequent diversification: the principal path to innovation in evolution
Table 2.2. Expansion of signaling domains in C. elegans (a)
Species | Proteins | Ser/Thr/Tyr kinase | Ser/Thr/Tyr phosphatase | BRCT | SH3 | VWA | WD40
C. elegans | 19,100 | 435 | 112 | 26 | 58 | 65 | 127
S. cerevisiae | 6,500 | 116 | 14 | 10 | 24 | 3 | 110
E. coli | 4,289 | 3 | 1 | 1 | 1 | 4 | 0
B. subtilis | 4,100 | 4 | 0 | 1 | 6 | 5 | 0
M. tuberculosis | 3,918 | 13 | 1 | 1 | 0 | 4 | 4
Synechocystis | 3,169 | 12 | 0 | 1 | 3 | 4 | 2
A. fulgidus | 2,420 | 4 | 0 | 0 | 0 | 2 | 0
M. thermoautotrophicum | 1,869 | 4 | 0 | 0 | 0 | 2 | 0
M. jannaschii | 1,715 | 4 | 2 | 0 | 0 | 3 | 0
A. aeolicus | 1,522 | 2 | 0 | 1 | 0 | 1 | 0
(a) The data are from ref. [675]. Domain abbreviations are as in the SMART database (see §3.3): BRCT, BRCA1 C-terminal domain; SH3, Src homology 3 domain; VWA, von Willebrand factor A domain; WD40, Trp-Asp repeat domain.
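Because the proteomes in Table 2.2 differ greatly in size, one way to read the table is to normalize each domain count by the total number of proteins. The sketch below does this for the counts transcribed from the table. The normalization (domains per 1,000 proteins) and the names `TABLE_2_2` and `per_thousand` are assumptions for illustration, not an analysis from the source.

```python
# Counts transcribed from Table 2.2 (data from ref. [675]).
# Column order: Ser/Thr/Tyr kinase, Ser/Thr/Tyr phosphatase, BRCT, SH3, VWA, WD40.
DOMAINS = ["kinase", "phosphatase", "BRCT", "SH3", "VWA", "WD40"]

TABLE_2_2 = {
    "C. elegans": (19100, [435, 112, 26, 58, 65, 127]),
    "S. cerevisiae": (6500, [116, 14, 10, 24, 3, 110]),
    "E. coli": (4289, [3, 1, 1, 1, 4, 0]),
    "B. subtilis": (4100, [4, 0, 1, 6, 5, 0]),
    "M. tuberculosis": (3918, [13, 1, 1, 0, 4, 4]),
    "Synechocystis": (3169, [12, 0, 1, 3, 4, 2]),
    "A. fulgidus": (2420, [4, 0, 0, 0, 2, 0]),
    "M. thermoautotrophicum": (1869, [4, 0, 0, 0, 2, 0]),
    "M. jannaschii": (1715, [4, 2, 0, 0, 3, 0]),
    "A. aeolicus": (1522, [2, 0, 1, 0, 1, 0]),
}


def per_thousand(species):
    """Domain counts per 1,000 proteins for one species (illustrative normalization)."""
    total, counts = TABLE_2_2[species]
    return {name: round(1000 * count / total, 2)
            for name, count in zip(DOMAINS, counts)}


for species in TABLE_2_2:
    print(species, per_thousand(species))
```

Absolute counts and normalized densities answer different questions, so it is worth looking at both before drawing conclusions about the expansion of a particular domain family.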
[Figure: bar chart of the number of proteins (y-axis, 0 to 6000) in COGs and not in COGs for each genome: Aae, Tma, Mge, Rpr, Hin, Mth, Afu, Ctr, Mja, Jhp, Tpa, Bsu, Eco, Hpy, Pho, Mpn, Cpn, Ssp, Mtu, Bbu, Sce.]
The majority of the proteins in each prokaryote, but only ~1/3 of yeast proteins, belong to COGs, which are ancient conserved families.
MOST OF THE COGs ARE REPRESENTED ONLY IN A SMALL NUMBER OF CLADES
MAJOR ROLE OF HORIZONTAL GENE TRANSFER AND CLADE-SPECIFIC GENE LOSS IN EVOLUTION
[Figure: speciation diagram (ancestor, descendants) illustrating gene loss.]
Non-orthologous displacement: two unrelated (or distantly related) proteins for the same essential function
Figure 2.3. Structure-based sequence alignment of goose lysozyme (PDB code 153L), chicken egg white lysozyme (3LZT) and lysozymes from E. coli bacteriophages λ (1AM7) and T4 (1L92).
153L .GEKLC.VE.PAVIAGIISRESHAG..KVLK....NGWGD...R..........
3LZT gLDNYRgYS.LGNWVCAAKFESNFN.........tQATNR...N..........
1AM7 .mvEIN.NQrKAFLDMLAWSEGTDngrQKTRnhgyDVIVGgelftdysdhprkl
1L92 ..........MNIFEMLRIDEG...........lrlKIYKdteG..........

153L ........GNGFGLMQVDKRSH...............KP........QG..TWN
3LZT .....tdgsTDYGILQINSRWWcndgrtpgsrnlcniPC........SAllSSD
1AM7 vtlnpklkSTGAGRYQLLSRWW...............DayrkqlglkDF..SP.
1L92 ........YYTIG.IGHLLT.........kspslnaakseldkaigrntngvIT

153L .GEVHITQGTTILINF.IKTIQK...KFPS.WTKD..QQLKGGISAYNAGAGNVR
3LZT ITASVNCAKKIVSDG.N........................GMNAWV.......
1AM7 ..KSQDAVALQQIKERgALPM...........idR..GDIRQAIDRCSN....iw
1L92 .KDEAEKLFNQDVDAA.VRGILRnakLKPVyDSLDavRRAAIINMVFQMGETGVA

153L .SYARMDIGT....................THDDYANDVV....ARAQYYKQHGY
3LZT ................................awRNRCK...gTDVQAWIRGCr
1AM7 .aslpGAGY...................gqfEHKA.DSLI....AKFKEAGgtvr
1L92 .gftnslrmlqqkrwdeaavnlaksrwynqTPNRAkrvittfrtgtwDAYK....
Only a small fraction of amino acid residues is directly involved in protein function (including enzymatic); the rest of the protein serves largely as structural scaffold
Significant sequence conservation is evidence of homology
Proteins with different structural folds can perform the same function - non-orthologous displacement
Proteins (domains) with the same fold are most likely to be homologous
Convergence does not produce significant sequence or structural similarity