UC Davis EVE161 Lecture 10 by @phylogenomics

Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014

EVE 161: Microbial Phylogenomics

Lecture #10: Era III: Genome Sequencing

UC Davis, Winter 2014
Instructor: Jonathan Eisen


Where we are going and where we have been

• Previous lecture: 9: rRNA Case Study - Built Environment

• Current lecture: 10: Genome Sequencing

• Next lecture: 11: Genome Sequencing II


1st Genome Sequence

Fleischmann et al.



insight progress

NATURE | VOL 406 | 17 AUGUST 2000 | www.nature.com 801

analysis of the genomes of two thermophilic bacterial species, Aquifex aeolicus and Thermotoga maritima, revealed that 20–25% of the genes in these species were more similar to genes from archaea than those from bacteria13,14. This led to the suggestion of possible extensive gene exchanges between these species and archaeal lineages. But before one jumps to this conclusion it is important to consider the difficulties in inferring the occurrence of gene transfer. For example, the high percentage of genes with best matches to archaea in A. aeolicus and T. maritima could also be due to a high rate of evolution in the mesophilic bacteria (which would cause thermophilic and archaeal genes to have high levels of similarity despite their not having a common ancestry) or the loss of these genes from mesophilic bacteria15. For T. maritima, many lines of additional evidence support the assertion of gene transfer, including the observation that many of the archaeal-like genes occur in clusters in the genome, are in regions of unusual nucleotide composition, and branch in phylogenetic trees most closely to archaeal genes14. Most of the lines of evidence leading to assertions of horizontal gene transfer can have other causes. For example, unusual nucleotide composition can also arise from selection16, and differences in phylogenetic trees can be caused by convergence, inaccurate alignments17, long-branch attraction18 or sampling of different species19. It is therefore important to assess the evidence carefully and to find multiple types of evidence. This has yet to be done systematically, so we believe that it is too early to assign quantitative values to the extent of gene exchange between species.

Despite the apparent occurrence of extensive gene transfers in the history of microbes, it does seem that there might be a ‘core’ to each evolutionary lineage that retains some phylogenetic signal. The best evidence for this comes from the construction of ‘whole genome trees’ based on the presence and absence of particular homologues or orthologues in different complete genomes20. It is important to note that gene-content trees are averages of patterns produced by phylogeny, gene duplication and loss, and horizontal transfer; they are therefore not real phylogenetic trees. Nevertheless, the fact that these trees are very similar to phylogenetic trees of genes such as ribosomal RNA and RecA suggests that although horizontal gene transfer might be extensive, it is somehow constrained by phylogenetic relationships. Other evidence for a ‘core’ of particular lineages comes from the finding of a conserved core of euryarchaeal genomes21,22 and another finding that some types of gene might be more prone to gene transfer than others23. It therefore seems likely that horizontal gene transfer has not completely obliterated the phylogenetic signal in microbial genomes. Careful studies in which the phylogenetic trees of some of these core genes are compared across all genomes need to be done to see whether or not the core has a consistent phylogeny. Initial studies suggest that it does, at least for the major microbial groups14.

Although our ability to resolve patterns of the relationships among microbes is still limited, analysis of the genomes of closely related species is revealing much about genome evolution24,25. For example, a comparison of the genomes of four chlamydial species has revealed the occurrence of frequent tandem gene duplication and gene loss, as well as large chromosomal inversions25. Comparisons of closely related species should also reveal much about mutation processes, codon usage and other features that evolve rapidly16.

Design of new antimicrobial agents and vaccines

One of the expected benefits of genome analysis of pathogenic bacteria is in the area of human health, particularly in the design of more rapid diagnostic reagents and the development of new vaccines and antimicrobial agents. These goals have become more urgent with the continuing spread of antibiotic resistance in important human pathogens. Moreover, results from the whole-genome analysis of human pathogens has suggested that there are mechanisms for generating antigenic variation in proteins expressed on the cell surface that are encoded within the genomes of these organisms. These mechanisms include the following: (1) slipped-strand mispairing within DNA sequence repeats found in 5′-intergenic regions and coding sequences as described for H. influenzae2, Helicobacter pylori26 and M. tuberculosis27, (2) recombination between homologous genes encoding outer-surface proteins as described for Mycoplasma genitalium28, Mycoplasma pneumoniae29 and Treponema pallidum30, and (3) clonal variability in surface-expressed proteins as described for Plasmodium falciparum31 and possibly Borrelia burgdorferi32.

1. Library construction
(i) Isolate DNA
(ii) Fragment DNA
(iii) Clone DNA

2. Random sequencing phase
(i) Sequence DNA (15,000 sequences per Mb)

3. Closure phase
(i) Assemble sequences
(ii) Close gaps
(iii) Edit
(iv) Annotation

4. Complete genome sequence

Figure 1 Diagram depicting the steps in a whole-genome shotgun sequencing project.
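The "15,000 sequences per Mb" in the figure can be turned into a rough depth of coverage. This is a minimal sketch, assuming an average read length of 500 bp (an assumption, not stated in the figure) and the idealized Lander-Waterman model in which reads land uniformly at random on the genome.

```python
import math

def coverage(num_reads, read_len_bp, genome_len_bp):
    """Average depth of coverage c = N * L / G."""
    return num_reads * read_len_bp / genome_len_bp

def fraction_uncovered(c):
    """Expected fraction of bases hit by no read under a Poisson model: e^-c."""
    return math.exp(-c)

# 15,000 reads per Mb is from the figure; the 500 bp read length is assumed.
c = coverage(15_000, 500, 1_000_000)
print(f"coverage = {c:.1f}x, expected uncovered fraction = {fraction_uncovered(c):.2e}")
```

Even at this depth the model predicts that a small fraction of bases is missed, which is one reason a separate closure phase follows the random sequencing phase.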

© 2000 Macmillan Magazines Ltd


Complete Genome/Chromosome Progress


From http://genomesonline.org



Why Completeness is Important

• Improves characterization of genome features

• Gene order, replication origins

• Better comparative genomics

• Genome duplications, inversions

• Presence and absence of particular genes can be very important

• Missing sequence might be important (e.g., centromere)

• Allows researchers to focus on biology not sequencing

• Facilitates large scale correlation studies


General Steps in Analysis of Complete Genomes

• Identification/prediction of genes

• Characterization of gene features

• Characterization of genome features

• Prediction of gene function

• Prediction of pathways

• Integration with known biological data

• Comparative genomics


General Steps in Analysis of Complete Genomes

• Structural Annotation
– Identification/prediction of genes
– Characterization of gene features
– Characterization of genome features

• Functional Annotation
– Prediction of gene function
– Prediction of pathways
– Integration with known biological data

• Evolutionary Annotation
– Comparative genomics


Structural Annotation I: Genes in Genomes

• Protein coding genes
– In long open reading frames
– ORFs interrupted by introns in eukaryotes
– Take up most of the genome in prokaryotes, but only a small portion of the eukaryotic genome

• RNA-only genes
– Transfer RNA
– Ribosomal RNA
– snoRNAs (guide ribosomal and transfer RNA maturation)
– Intron splicing
– Guiding mRNAs to the membrane for translation
– Gene regulation (this is a growing list)


Structural Annotation II: Other Features to Find

• Gene control sequences
– Promoters
– Regulatory elements

• Transposable elements, both active and defective
– DNA transposons and retrotransposons
– Many types and sizes

• Other repeated sequences
– Centromeres and telomeres
– Many with unknown (or no) function

• Unique sequences that have no obvious function


How to Find ncRNAs

• The most universal genes, such as tRNA and rRNA, are very conserved and thus easy to detect. Finding them first removes some areas of the genome from further consideration.

• One easy approach to finding common RNA genes is just looking for sequence homology with related species: a BLAST search will find most of them quite easily

• Functional RNAs are characterized by secondary structure caused by base pairing within the molecule.

• Determining the folding pattern is a matter of testing many possibilities to find the one with the minimum free energy, which is the most stable structure.

• The free energy calculations are in turn based on experiments where short synthetic RNA molecules are melted

• Related to this is the concept that paired regions (stems) will be conserved across species lines even if the individual bases aren’t conserved. That is, if there is an A-U pairing in one species, the same position might be occupied by a G-C in another species.

• This is an example of compensatory evolution (covariation): a deleterious mutation at one site is cancelled by a compensating mutation at another site.
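The covariation idea can be sketched as a simple column test on an alignment: two columns support a stem if they can pair in every sequence and the identity of the pair changes at least once. The function name and the toy alignment are hypothetical, and real tools score covariation statistically rather than with a yes/no rule.

```python
# Watson-Crick pairs plus the G-U wobble pair found in RNA.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def supports_stem(alignment, i, j):
    """True if columns i and j pair in every sequence and the pair
    identity differs between at least two sequences."""
    observed = [(seq[i], seq[j]) for seq in alignment]
    all_paired = all(pair in PAIRS for pair in observed)
    return all_paired and len(set(observed)) > 1

# Toy alignment (made-up sequences): columns 0 and 6 covary.
aln = ["AGGAACU",
       "GGGAACC",
       "CGGAACG"]
print(supports_stem(aln, 0, 6))  # pairs A-U, G-C, C-G: covarying
print(supports_stem(aln, 1, 5))  # always G-C: paired but invariant
```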


RNA Structure

• RNA differs from DNA in having fairly common G-U base pairs. Also, many functional RNAs have unusual modified bases such as pseudouridine and inosine.

• The pseudoknot, pairing between a loop and a sequence outside its stem, is especially difficult to detect: it is computationally intensive to predict and breaks the usual rule that RNA base pairs follow a nested pattern

– But pseudoknots seem to be fairly rare.

• Essentially, RNA folding programs start with all possible short sequences, then build to larger ones, adding the contribution of each structural element.

– There is an element of dynamic programming here as well.

– And, “stochastic context-free grammars”, something I really don’t want to approach right now!
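The "start short, build to larger" idea can be illustrated with a Nussinov-style dynamic program. Note the substitution: this sketch maximizes the number of nested base pairs instead of minimizing free energy, so it is a simplified stand-in for real folding programs, and it cannot represent pseudoknots. The `min_loop` value of 3 unpaired bases is an assumption.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def max_nested_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in seq (Nussinov-style DP)."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = best pair count for the subsequence seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # short subsequences first
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]               # case: base j is unpaired
            for k in range(i, j - min_loop):     # case: base j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

print(max_nested_pairs("GGGAAAUCC"))  # a small hairpin
```

Energy-based programs follow the same table-filling pattern but add the measured energy contribution of each stack and loop instead of counting pairs.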


Finding tRNAs

• tRNAs have a highly conserved structure, with 3 main stem-and-loop structures that form a cloverleaf structure, and several conserved bases. Finding such sequences is a matter of looking in the DNA for the proper features located the proper distance apart.

• Looking for such sequences is well-suited to a decision tree, a series of steps that the sequence must pass.

• In addition, a score is kept, rating how well the sequence passed each step. This allows a more stringent analysis later on, to eliminate false positives.


Bacteria / Archaeal Protein Coding Genes

• Bacteria use ATG as their main start codon, but GTG and TTG are also fairly common, and a few others are occasionally used.

– Remember that start codons are also used internally: the actual start codon may not be the first one in the ORF.

• The stop codons are the same as in eukaryotes: TGA, TAA, TAG

– Stop codons are (almost) absolute: except for a few cases of programmed frameshifts and the use of TGA for selenocysteine, the stop codon at the end of an ORF is the end of protein translation.

• Genes can overlap by a small amount. Not much, but a few codons of overlap is common enough that you can’t just eliminate overlaps as impossible.

• Cross-species homology works well for many genes. It is very unlikely that non-coding sequence will be conserved.

– But a significant minority of genes (say 20%) are unique to a given species.

• Translation start signals (ribosome binding sites; Shine-Dalgarno sequences) are often found just upstream from the start codon

– However, some aren’t recognizable

– Genes in operons don’t always have a separate ribosome binding site for each gene
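The start and stop codon rules above can be sketched as a naive ORF scan. This is a minimal illustration, not a gene finder: it scans only the forward strand, reports the most upstream candidate start in each ORF (which, as noted above, may not be the real start), and the `min_codons` cutoff is an arbitrary assumption.

```python
STARTS = {"ATG", "GTG", "TTG"}   # common bacterial start codons
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.
    start is the first base of the start codon, end is one past the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon in STARTS:
                start = pos                      # most upstream candidate start
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None                     # look for the next ORF
    return orfs

print(find_orfs("CCATGAAACCCGGGTAACC"))
```

A real pipeline would also scan the reverse complement and weigh candidate starts using ribosome binding sites and homology.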


Composition Methods

• The frequency of various codons is different in coding regions as compared to non-coding regions.

– This extends to G-C content, dinucleotide frequencies, and other measures of composition. Dicodons (groups of 6 bases) are often used.

– Well documented experimentally.

• The composition varies between different proteins of course, and it is affected within a species by the amounts of the various tRNAs present

– Horizontally transferred genes can also confuse things: they tend to have compositions that reflect their original species.

– A second group with unusual compositions are highly expressed genes.
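Two of the composition measures named above, G-C content and in-frame dicodon counts, can be computed in a few lines. This is a sketch with hypothetical function names; a gene finder would go further and compare such frequencies between known coding and non-coding training sets.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def dicodon_counts(seq, frame=0):
    """Count in-frame hexamers (dicodons), advancing one codon at a time."""
    seq = seq.upper()
    return Counter(seq[i:i + 6] for i in range(frame, len(seq) - 5, 3))

print(gc_content("GGCCAATT"))
print(dicodon_counts("ATGAAATTT"))
```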


Eukaryotic Genes Harder to Find

• Some fundamental differences between prokaryotes and eukaryotes:

• There is lots of non-coding DNA in eukaryotes.

– First step: find repeated sequences and RNA genes

– Note that eukaryotes have 3 main RNA polymerases. RNA polymerase 2 (pol2) transcribes all protein-coding genes, while pol1 and pol3 transcribe various RNA-only genes.

• Most eukaryotic genes are split into exons and introns.

• Only 1 gene per transcript in eukaryotes.

• No ribosome binding sites: translation starts at the first ATG in the mRNA

– Thus, in eukaryotic genomes, searching for the transcription start site (TSS) makes sense.

• Many fewer eukaryotic genomes have been sequenced


Exons

• Exon sequences can often be identified by sequence conservation, at least roughly.

• Dicodon statistics, as used for prokaryotes, are also useful

– Eukaryotic genomes tend to contain many isochores, regions of different GC content, and composition statistics can vary between isochores.

• The initial and terminal exons contain untranslated regions, and thus special methods are needed to detect them.

• Predicting splice junctions is a matter of collecting information about the sequences surrounding each possible GT/AG pair (introns almost always begin with GT and end with AG), then running this information through some combination of decision trees, Markov models, discriminant analysis, or neural networks, in an attempt to massage the data into giving a reliable score.

– In general, sites are more likely to be correct if predicted by multiple methods

– Experimental data from ESTs can be very helpful here.
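The first step of splice junction prediction, listing every possible donor/acceptor pair, can be sketched as below. It relies on the standard rule that introns begin with GT and end with AG. The function name and the `min_len` cutoff are assumptions, and the scoring step (Markov models, neural networks and so on) is deliberately left out.

```python
def candidate_introns(dna, min_len=20):
    """Return (start, end) for every GT...AG span of at least min_len bases.
    start is the index of the G in GT, end is one past the G in AG."""
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]

# Made-up sequence: one 20-base GT...AG span embedded in flanking bases.
seq = "CC" + "GT" + "C" * 16 + "AG" + "TT"
print(candidate_introns(seq))
```

Because GT and AG are so common, the candidate list grows quickly on real sequence, which is why the statistical scoring and EST evidence described above are needed.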


Functional Annotation


Functional Classification I: GO

• The Gene Ontology (GO) consortium (http://www.geneontology.org/) is an attempt to describe gene products with a structured controlled vocabulary, a set of invariant terms that have a known relationship to each other.

• Each GO term is given a number of the form GO:nnnnnnn (7 digits), as well as a term name. For example, GO:0005102 is “receptor binding”.

• There are 3 root terms: biological process, cellular component, and molecular function. A gene product will probably be described by GO terms from each of these “ontologies”. (ontology is a branch of philosophy concerned with the nature of being, and the basic categories of being and their relationships.)

– For instance, cytochrome c is described with the molecular function term “oxidoreductase activity”, the biological process terms “oxidative phosphorylation” and “induction of cell death”, and the cellular component terms “mitochondrial matrix” and “mitochondrial inner membrane”

• The terms are arranged in a hierarchy that is a “directed acyclic graph” and not a tree. This means simply that each term can have more than one parent term, but the direction of parent to child (i.e. less specific to more specific) is always maintained.
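The difference between a tree and a directed acyclic graph shows up when collecting all ancestors of a term. This is a minimal sketch with placeholder term names (they are not real GO identifiers): term "C" has two parents, which a tree would not allow.

```python
# Hypothetical mini-ontology: each term maps to its list of parent terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],   # two parents: this is what makes it a DAG, not a tree
    "D": ["C"],
}

def ancestors(term, parents=PARENTS):
    """All less specific terms reachable by following child-to-parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

print(sorted(ancestors("D")))
```

A gene product annotated with a specific term is implicitly annotated with every ancestor of that term, which is why this traversal matters in practice.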


Functional Classification II: Enzyme Nomenclature

• Enzyme functions: which reactants are converted to which products

– Across many species, the enzymes that perform a specific function are usually evolutionarily related. However, this isn’t necessarily true. There are cases of two entirely different enzymes evolving similar functions.

– Often, two or more gene products in a genome will have the same E.C. number.

• Enzyme functions are given unique numbers by the Enzyme Commission.

– E.C. numbers are four integers separated by dots. The left-most number is the least specific

– For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4", whose components indicate the following groups of enzymes:

   • EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)
   • EC 3.4 are hydrolases that act on peptide bonds
   • EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a polypeptide
   • EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide

• Top level E.C. numbers:

– E.C. 1: oxidoreductases (often dehydrogenases): electron transfer
– E.C. 2: transferases: transfer of functional groups (e.g. phosphate) between molecules.
– E.C. 3: hydrolases: splitting a molecule by adding water to a bond.
– E.C. 4: lyases: non-hydrolytic addition or removal of groups from a molecule
– E.C. 5: isomerases: rearrangements of atoms within a molecule
– E.C. 6: ligases: joining two molecules using energy from ATP
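The nested meaning of an E.C. number can be made explicit by splitting it into its increasingly specific prefixes. This is a small sketch with hypothetical function names; it expects input in the "EC 3.4.11.4" form used in the example above.

```python
# Top-level classes as listed on the slide.
TOP_LEVEL = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}

def ec_lineage(ec):
    """Split 'EC 3.4.11.4' into prefixes, from least to most specific."""
    parts = ec.replace("EC", "").strip().split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def ec_class(ec):
    """Name of the top-level class for an E.C. number."""
    return TOP_LEVEL[int(ec_lineage(ec)[0])]

print(ec_lineage("EC 3.4.11.4"))
print(ec_class("EC 3.4.11.4"))
```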


Functional Prediction

• BLAST searches

• HMM models of specific genes or gene families (Pfam, TIGRfam, FIGfam).

• Sequence motifs and domains. If the gene is not a good match to previously known genes, these provide useful clues.

• Cellular location predictions, especially for transmembrane proteins.

• Genomic neighbors, especially in bacteria, where related functions are often found together in operons and divergons (genes transcribed in opposite directions that use a common control region).

• Biochemical pathway/subsystem information. If an organism has most of the genes needed to perform a function, any missing functions are probably present too.

– Also, experimental data about an organism’s capacities can be used to decide whether the relevant functions are present in the genome.


Functional Prediction II: Membrane Spanning

• Integral membrane proteins contain amino acid sequences that go through the membrane one or several times.

– There are also peripheral membrane proteins that stick to the hydrophilic head groups by ionic and polar interactions

– There are also some that have covalently bound hydrophobic groups, such as myristate, a 14 carbon saturated fatty acid that is attached to the N-terminal amino group.

• There are 2 main protein structures that cross membranes.

– Most are alpha helices, and in proteins that span multiple times, these alpha helices are packed together in a bundle. Length = 15-30 amino acids.

– Less commonly, there are proteins with membrane spanning “beta barrels”, composed of beta sheets wrapped into a cylinder. An example: porins, which form pores that let small hydrophilic molecules diffuse across the bacterial outer membrane.
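Membrane-spanning alpha helices are commonly flagged by averaging hydrophobicity over a sliding window. This is a sketch using the Kyte-Doolittle hydropathy scale; the 19-residue window and the 1.6 cutoff are conventional choices treated here as assumptions, and the approach does not detect beta barrels.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_windows(protein, window=19, threshold=1.6):
    """Return (start, mean hydropathy) for each window at or above threshold."""
    protein = protein.upper()
    hits = []
    for i in range(len(protein) - window + 1):
        mean = sum(KD[aa] for aa in protein[i:i + window]) / window
        if mean >= threshold:
            hits.append((i, round(mean, 2)))
    return hits

# Made-up sequence: a 19-residue leucine stretch flanked by charged residues.
toy = "DDDD" + "L" * 19 + "KKKK"
print(hydrophobic_windows(toy))
```

Overlapping hits would normally be merged into one predicted segment; that step is omitted to keep the sketch short.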


Functional Prediction by Phylogeny

• Key step in genome projects

• More accurate predictions help guide experimental and computational analyses

• Many diverse approaches

• All are improved by “phylogenomic” type analyses that integrate evolutionary reconstructions and understanding of how new functions evolve


Functional Prediction

• Identification of motifs
– Short regions of sequence similarity that are indicative of general activity
– e.g., ATP binding

• Homology/similarity based methods
– Gene sequence is searched against a database of other sequences
– If significantly similar genes are found, their functional information is used

• Problem
– Genes frequently have similarity to hundreds of motifs and multiple genes, not all with the same function


Helicobacter pylori


H. pylori genome - 1997

“The ability of H. pylori to perform mismatch repair is suggested by the presence of methyl transferases, mutS and uvrD. However, orthologues of MutH and MutL were not identified.”


MutL ??

From http://asajj.roswellpark.org/huberman/dna_repair/mmr.html



Phylogenetic Tree of MutS Family

[Figure: phylogenetic tree of MutS family proteins. Tip labels are abbreviated species names, including Aquae, Trepa, Fly, Xenla, Rat, Mouse, Human, Yeast, Neucr, Arath, Borbu, Strpy, Bacsu, Synsp, Ecoli, Neigo, Thema, Theaq, Deira, Chltr, Spombe, Celeg, Metth, Helpy and mSaco.]

Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=9722651&query_hl=2


MutS Subfamilies

[Figure: MutS family tree with clades labeled by subfamily: MSH4, MSH5, MutS2, MutS1, MSH1, MSH3, MSH6, MSH2.]




Overlaying Functions onto Tree

[Figure: MutS family tree with subfamily labels (MSH4, MSH5, MutS2, MutS1, MSH1, MSH3, MSH6, MSH2) and known functions overlaid.]




MutS Subfamilies

• MutS1 Bacterial MMR
• MSH1 Euk - mitochondrial MMR
• MSH2 Euk - all MMR in nucleus
• MSH3 Euk - loop MMR in nucleus
• MSH6 Euk - base:base MMR in nucleus

• MutS2 Bacterial - function unknown
• MSH4 Euk - meiotic crossing-over
• MSH5 Euk - meiotic crossing-over


Functional Prediction Using Tree

[Figure: MutS family tree with each subfamily clade labeled by function:]

• MutS1 - Bacterial Mismatch and Loop Repair
• MSH1 - Mitochondrial Repair
• MSH2 - Eukaryotic Nuclear Mismatch and Loop Repair
• MSH3 - Nuclear Repair of Loops
• MSH6 - Nuclear Repair of Mismatches
• MutS2 - Unknown Functions
• MSH4 - Meiotic Crossing Over
• MSH5 - Meiotic Crossing Over




Table 3. Presence of MutS Homologs in Complete Genome Sequences

Species                                    # of MutS Homologs   Which Subfamilies?   MutL Homologs

Bacteria
Escherichia coli K12                       1                    MutS1                1
Haemophilus influenzae Rd KW20             1                    MutS1                1
Neisseria gonorrhoeae                      1                    MutS1                1
Helicobacter pylori 26695                  1                    MutS2                -
Mycoplasma genitalium G-37                 -                    -                    -
Mycoplasma pneumoniae M129                 -                    -                    -
Bacillus subtilis 168                      2                    MutS1,MutS2          1
Streptococcus pyogenes                     2                    MutS1,MutS2          1
Mycobacterium tuberculosis                 -                    -                    -
Synechocystis sp. PCC6803                  2                    MutS1,MutS2          1
Treponema pallidum Nichols                 1                    MutS1                1
Borrelia burgdorferi B31                   2                    MutS1,MutS2          1
Aquifex aeolicus                           2                    MutS1,MutS2          1
Deinococcus radiodurans R1                 2                    MutS1,MutS2          1

Archaea
Archaeoglobus fulgidus VC-16, DSM4304      -                    -                    -
Methanococcus jannaschii DSM 2661          -                    -                    -
Methanobacterium thermoautotrophicum ΔH    1                    MutS2                -

Eukaryotes
Saccharomyces cerevisiae                   6                    MSH1-6               3+
Homo sapiens                               5                    MSH2-6               3+


Blast Search of H. pylori “MutS”

Sequences producing significant alignments:            Score (bits)   E Value
sp|P73625|MUTS_SYNY3 DNA MISMATCH REPAIR PROTEIN       117            3e-25
sp|P74926|MUTS_THEMA DNA MISMATCH REPAIR PROTEIN       69             1e-10
sp|P44834|MUTS_HAEIN DNA MISMATCH REPAIR PROTEIN       64             3e-09
sp|P10339|MUTS_SALTY DNA MISMATCH REPAIR PROTEIN       62             2e-08
sp|O66652|MUTS_AQUAE DNA MISMATCH REPAIR PROTEIN       57             4e-07
sp|P23909|MUTS_ECOLI DNA MISMATCH REPAIR PROTEIN       57             4e-07

• BLAST search pulls up Syn. sp MutS#2 with a much higher score (and much lower E value) than other MutS homologs

• Based on this TIGR predicted this species had mismatch repair

Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.



High Mutation Rate in H. pylori

Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.



PHYLOGENETIC PREDICTION OF GENE FUNCTION

[Figure: flowchart of the method applied to two worked examples (EXAMPLE A and EXAMPLE B) with genes from Species 1, Species 2 and Species 3, shown alongside the ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN). Gene duplication events are marked on the trees, and some inferences are marked "Ambiguous".]

METHOD

1. CHOOSE GENE(S) OF INTEREST
2. IDENTIFY HOMOLOGS
3. ALIGN SEQUENCES
4. CALCULATE GENE TREE
5. OVERLAY KNOWN FUNCTIONS ONTO TREE
6. INFER LIKELY FUNCTION OF GENE(S) OF INTEREST

Based on Eisen, 1998 Genome Res 8: 163-167.
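The last two steps, overlaying known functions onto the tree and inferring the function of an uncharacterized gene, can be sketched with a toy tree. This is an illustrative heuristic (take the smallest clade around the query that contains characterized genes, and report their function only if they agree), not the published method. The tree, gene labels and function names are all hypothetical.

```python
def leaves(tree):
    """All leaf names under a node; a leaf is a string, a clade is a tuple."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def path_to(tree, name):
    """Nodes from the root down to the named leaf, or None if absent."""
    if isinstance(tree, str):
        return [tree] if tree == name else None
    for child in tree:
        sub = path_to(child, name)
        if sub is not None:
            return [tree] + sub
    return None

def infer_function(tree, known, query):
    """Function shared by characterized genes in the smallest enclosing clade."""
    path = path_to(tree, query)
    if path is None:
        raise ValueError(f"{query} is not in the tree")
    for clade in reversed(path[:-1]):            # smallest clade first
        funcs = {known[l] for l in leaves(clade) if l in known and l != query}
        if len(funcs) == 1:
            return funcs.pop()
        if len(funcs) > 1:
            return "ambiguous"
    return "unknown"

# Two paralogous subfamilies (A and B) produced by a duplication.
gene_tree = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
known = {"1A": "function A", "3A": "function A",
         "1B": "function B", "3B": "function B"}
print(infer_function(gene_tree, known, "2A"))
print(infer_function(gene_tree, known, "2B"))
```

The point of working from the tree is that gene 2A is assigned the function of its own subfamily even if its best database hit happened to fall in the other one.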



Chemosynthetic Symbionts

Eisen et al. 1992. J. Bact. 174: 3416

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC206016/


Carboxydothermus hydrogenoformans

• Isolated from a Russian hotspring
• Thermophile (grows at 80°C)
• Anaerobic
• Grows very efficiently on CO (Carbon Monoxide)
• Produces hydrogen gas
• Low GC Gram positive (Firmicute)
• Genome Determined (Wu et al. 2005 PLoS Genetics 1: e65.)


Homologs of Sporulation Genes

Wu et al. 2005 PLoS Genetics 1: e65.

http://www.ncbi.nlm.nih.gov/entrez/utils/lofref.fcgi?PrId=4656&uid=16311624&db=pubmed&url=http://genetics.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pgen.0010065


Carboxydothermus sporulates




Non-Homology Predictions: Phylogenetic Profiling

• Step 1: Search all genes in organisms of interest against all other genomes

• Ask: Yes or No, is each gene found in each other species?

• Cluster genes by distribution patterns (profiles)
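The three steps above can be sketched once the search results are reduced to presence or absence. In this minimal illustration the gene and genome names are made up, the homology search itself is assumed to have been done already, and genes are grouped only when their profiles are identical (real analyses usually also allow near matches).

```python
from collections import defaultdict

def cluster_by_profile(presence):
    """presence maps gene -> set of genomes with a homolog.
    Returns the genome order and a dict mapping each yes/no profile to its genes."""
    genomes = sorted({g for hits in presence.values() for g in hits})
    clusters = defaultdict(list)
    for gene, hits in presence.items():
        profile = tuple(int(g in hits) for g in genomes)   # 1 = found, 0 = not found
        clusters[profile].append(gene)
    return genomes, dict(clusters)

# Hypothetical search results.
presence = {
    "geneA": {"sp1", "sp2"},
    "geneB": {"sp1", "sp2"},
    "geneC": {"sp3"},
}
genomes, clusters = cluster_by_profile(presence)
print(genomes)
print(clusters)
```

Genes that share a profile, such as geneA and geneB here, become candidates for involvement in the same process even without sequence similarity to each other.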


Sporulation Gene Profile




B. subtilis new sporulation genes


Functional Prediction III: Colocalization

• Operon structure is often maintained over fairly large taxonomic regions.

– Sometimes gene order is altered, and sometimes one or more enzymes are missing.

– But in general, this phenomenon allows recognition or verification that widely diverged enzymes do in fact have the same function.

• This is an operon that contains part of the glycolytic pathway.

– 1: phosphoglycerate mutase
– 2: triosephosphate isomerase
– 3: enolase
– 4: phosphoglycerate kinase
– 5: glyceraldehyde 3-phosphate dehydrogenase
– 6: central glycolytic gene regulator


Metabolic Predictions


Comparative Genomics



Using the Core



insight progress

800 NATURE | VOL 406 | 17 AUGUST 2000 | www.nature.com

means that database searches must be repeated regularly to keep annotation accurate and up to date.

One possible solution to the annotation problem is to bring more of the resources of the scientific community to bear on each genome. No single centre can annotate all the functions of a living organism; experts from many different areas of biology should be encouraged to contribute to the annotation process. One possible model would be for geographically separated experts to deposit annotation to a central repository, which might also take on a curatorial or editorial role. An alternative model is one in which annotation resides in many different locations (as it does today), but in which new electronic links are created that allow scientists to locate rapidly all the information about a gene, genome or function. This latter model scales more easily and avoids the problem of overdependence on a single source.

What have we learned from genome analysis?

Comparison of the results from 24 completed prokaryotic genome sequences, containing more than 50 Mbp of DNA sequence and 54,000 predicted open reading frames (ORFs), has revealed that gene density in the microbes is consistent across many species, with about one gene per kilobase (Table 2). Almost half of the ORFs in each species are of unknown biological function. When the function of this large subset of genes begins to be explained, it is likely that entirely novel biochemical pathways will be identified that might be relevant to medicine and biotechnology. Perhaps even more unexpected is the observation that about a quarter of the ORFs in each species studied so far are unique, with no significant sequence similarity to any other available protein sequence. Although this might at present be an artefact of the small number of microbial species studied by whole-genome analysis, it nevertheless supports the idea that there is tremendous biological diversity between microorganisms. Taken together, these data indicate that much of microbial biology has yet to be understood and suggest that the idea of a ‘model’ organism in the microbial world might not be appropriate, given the vast differences between even related species.

Our molecular picture of evolution for the past 20 years has been dominated by the small-subunit ribosomal RNA phylogenetic tree that proposes three non-overlapping groups of living organisms: the bacteria, the archaea and the eukaryotes8. Although the archaea possess bacterial cell structures, it has been suggested that they share a common ancestor exclusive of bacteria.

Analysis of complete genome sequences is beginning to provide great insight into many questions about the evolution of microbes. One such area has encompassed the occurrence of genetic exchanges between different evolutionary lineages, a phenomenon known as horizontal, or lateral, gene transfer. The occurrence of horizontal gene transfer, such as that involving genes from organellar genomes to the nucleus, or of antibiotic resistance genes between bacterial species, has been well established for many years (see, for example, ref. 9). This phenomenon causes problems for studying the evolution of species because it means that some species are chimaeric, with different histories for different genes. Before the availability of complete genome sequences, studies of horizontal gene transfer had been limited because of the incompleteness of the data sets being analysed. Analyses of complete genome sequences have led to many recent suggestions that the extent of horizontal gene exchange is much greater than was previously realized10–12. For example, an

Table 1 Results of a BLAST search of a newly sequenced M. tuberculosis gene against a comprehensive protein database

Gene ID       Similarity (%)   Length (bp)   Gene name                                                    E-value*
GP:2905647    44.8             1,191         D-Arabinitol kinase (Klebsiella pneumoniae)                  6.2e-15
EGAD:22614    46.2             1,191         Gluconokinase (Bacillus subtilis)                            1.4e-13
EGAD:20418    43.0             1,302         Xylulose kinase (Lactobacillus pentosus)                     4.8e-13
EGAD:105114   43.4             1,320         Carbohydrate kinase, FGGY family (Archaeoglobus fulgidus)    4.7e-12
GP:2895855    42.7             1,263         Xylulokinase (Lactobacillus brevis)                          1.0e-07
EGAD:10899    45.4             1,296         Xylulose kinase (Escherichia coli)                           2.1e-06

*E-value is a statistical measure of the significance of a BLAST search result.

Table 2 Genome features from 24 microbial genome sequencing projects

Columns: Organism; Genome size (Mbp); No. of ORFs (% coding); Unknown function; Unique ORFs

Aeropyrum pernix K1 1.67 1,885 (89%)

A. aeolicus VF5 1.50 1,749 (93%) 663 (44%) 407 (27%)

A. fulgidus 2.18 2,437 (92%) 1,315 (54%) 641 (26%)

B. subtilis 4.20 4,779 (87%) 1,722 (42%) 1,053 (26%)

B. burgdorferi 1.44 1,738 (88%) 1,132 (65%) 682 (39%)

Chlamydia pneumoniae AR39 1.23 1,134 (90%) 543 (48%) 262 (23%)

Chlamydia trachomatis MoPn 1.07 936 (91%) 353 (38%) 77 (8%)

C. trachomatis serovar D 1.04 928 (92%) 290 (32%) 255 (29%)

Deinococcus radiodurans 3.28 3,187 (91%) 1,715 (54%) 1,001 (31%)

E. coli K-12-MG1655 4.60 5,295 (88%) 1,632 (38%) 1,114 (26%)

H. influenzae 1.83 1,738 (88%) 592 (35%) 237 (14%)

H. pylori 26695 1.66 1,589 (91%) 744 (45%) 539 (33%)

Methanobacterium thermoautotrophicum 1.75 2,008 (90%) 1,010 (54%) 496 (27%)

Methanococcus jannaschii 1.66 1,783 (87%) 1,076 (62%) 525 (30%)

M. tuberculosis CSU#93 4.41 4,275 (92%) 1,521 (39%) 606 (15%)

M. genitalium 0.58 483 (91%) 173 (37%) 7 (2%)

M. pneumoniae 0.81 680 (89%) 248 (37%) 67 (10%)

N. meningitidis MC58 2.24 2,155 (83%) 856 (40%) 517 (24%)

Pyrococcus horikoshii OT3 1.74 1,994 (91%) 859 (42%) 453 (22%)

Rickettsia prowazekii Madrid E 1.11 878 (75%) 311 (37%) 209 (25%)

Synechocystis sp. 3.57 4,003 (87%) 2,384 (75%) 1,426 (45%)

T. maritima MSB8 1.86 1,879 (95%) 863 (46%) 373 (26%)

T. pallidum 1.14 1,039 (93%) 461 (44%) 280 (27%)

Vibrio cholerae El Tor N16961 4.03 3,890 (88%) 1,806 (46%) 934 (24%)

50.60 52,462 (89%) 22,358 (43%) 12,161 (23%)
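The "about one gene per kilobase" statement can be checked against the totals row of Table 2. This is a small worked calculation using only numbers from the table; the function name is hypothetical.

```python
def genes_per_kb(num_orfs, genome_size_mbp):
    """Gene density in ORFs per kilobase (1 Mbp = 1,000 kb)."""
    return num_orfs / (genome_size_mbp * 1000)

# Totals row of Table 2: 50.60 Mbp and 52,462 ORFs.
overall = genes_per_kb(52_462, 50.60)
# Smallest genome in the table: M. genitalium, 0.58 Mbp and 483 ORFs.
smallest = genes_per_kb(483, 0.58)
print(f"overall: {overall:.2f} genes/kb, M. genitalium: {smallest:.2f} genes/kb")
```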



After the Genomes

• Better analysis and annotation

• Comparative genomics

• Functional genomics (Experimental analysis of gene function on a genome scale)

• Genome-wide gene expression studies

• Proteomics

• Genome wide genetic experiments
