uc davis eve161 lecture 10 by @phylogenomics
TRANSCRIPT
Slides for UC Davis EVE161 Course Taught by Jonathan Eisen Winter 2014
EVE 161: Microbial Phylogenomics
Lecture #10: Era III: Genome Sequencing
UC Davis, Winter 2014. Instructor: Jonathan Eisen
Where we are going and where we have been
• Previous lecture: 9: rRNA Case Study - Built Environment
• Current Lecture: 10: Genome Sequencing
• Next Lecture: 11: Genome Sequencing II
1st Genome Sequence
Fleischmann et al.
insight progress
NATURE | VOL 406 | 17 AUGUST 2000 | www.nature.com 801
analysis of the genomes of two thermophilic bacterial species, Aquifex aeolicus and Thermotoga maritima, revealed that 20–25% of the genes in these species were more similar to genes from archaea than those from bacteria [13,14]. This led to the suggestion of possible extensive gene exchanges between these species and archaeal lineages. But before one jumps to this conclusion it is important to consider the difficulties in inferring the occurrence of gene transfer. For example, the high percentage of genes with best matches to archaea in A. aeolicus and T. maritima could also be due to a high rate of evolution in the mesophilic bacteria (which would cause thermophilic and archaeal genes to have high levels of similarity despite their not having a common ancestry) or the loss of these genes from mesophilic bacteria [15]. For T. maritima, many lines of additional evidence support the assertion of gene transfer, including the observation that many of the archaeal-like genes occur in clusters in the genome, are in regions of unusual nucleotide composition, and branch in phylogenetic trees most closely to archaeal genes [14]. Most of the lines of evidence leading to assertions of horizontal gene transfer can have other causes. For example, unusual nucleotide composition can also arise from selection [16], and differences in phylogenetic trees can be caused by convergence, inaccurate alignments [17], long-branch attraction [18] or sampling of different species [19]. It is therefore important to assess the evidence carefully and to find multiple types of evidence. This has yet to be done systematically, so we believe that it is too early to assign quantitative values to the extent of gene exchange between species.
Despite the apparent occurrence of extensive gene transfers in the history of microbes, it does seem that there might be a ‘core’ to each evolutionary lineage that retains some phylogenetic signal. The best evidence for this comes from the construction of ‘whole genome trees’ based on the presence and absence of particular homologues or orthologues in different complete genomes [20]. It is important to note that gene-content trees are averages of patterns produced by phylogeny, gene duplication and loss, and horizontal transfer; they are therefore not real phylogenetic trees. Nevertheless, the fact that these trees are very similar to phylogenetic trees of genes such as ribosomal RNA and RecA suggests that although horizontal gene transfer might be extensive, it is somehow constrained by phylogenetic relationships. Other evidence for a ‘core’ of particular lineages comes from the finding of a conserved core of euryarchaeal genomes [21,22] and another finding that some types of gene might be more prone to gene transfer than others [23]. It therefore seems likely that horizontal gene transfer has not completely obliterated the phylogenetic signal in microbial genomes. Careful studies in which the phylogenetic trees of some of these core genes are compared across all genomes need to be done to see whether or not the core has a consistent phylogeny. Initial studies suggest that it does, at least for the major microbial groups [14].
Although our ability to resolve patterns of the relationships among microbes is still limited, analysis of the genomes of closely related species is revealing much about genome evolution [24,25]. For example, a comparison of the genomes of four chlamydial species has revealed the occurrence of frequent tandem gene duplication and gene loss, as well as large chromosomal inversions [25]. Comparisons of closely related species should also reveal much about mutation processes, codon usage and other features that evolve rapidly [16].
Design of new antimicrobial agents and vaccines
One of the expected benefits of genome analysis of pathogenic bacteria is in the area of human health, particularly in the design of more rapid diagnostic reagents and the development of new vaccines and antimicrobial agents. These goals have become more urgent with the continuing spread of antibiotic resistance in important human pathogens. Moreover, results from the whole-genome analysis of human pathogens has suggested that there are mechanisms for generating antigenic variation in proteins expressed on the cell surface that are encoded within the genomes of these organisms. These mechanisms include the following: (1) slipped-strand mispairing within DNA sequence repeats found in 5′-intergenic regions and coding sequences as described for H. influenzae [2], Helicobacter pylori [26] and M. tuberculosis [27], (2) recombination between homologous genes encoding outer-surface proteins as described for Mycoplasma genitalium [28], Mycoplasma pneumoniae [29] and Treponema pallidum [30], and (3) clonal variability in surface-expressed proteins as described for Plasmodium falciparum [31] and possibly Borrelia burgdorferi [32].
[Figure 1 panels: 1. Library construction: (i) Isolate DNA, (ii) Fragment DNA, (iii) Clone DNA. 2. Random sequencing phase: (i) Sequence DNA (15,000 sequences per Mb). 3. Closure phase: (i) Assemble sequences, (ii) Close gaps, (iii) Edit, (iv) Annotation. 4. Complete genome sequence.]
Figure 1 Diagram depicting the steps in a whole-genome shotgun sequencing project.
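The "assemble sequences" step of a shotgun project can be illustrated with a toy greedy overlap assembler. This is only a sketch under strong assumptions (error-free reads, one strand, no repeats); the function names and the example sequence are hypothetical, and real assemblers are far more involved.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge; remaining gaps need "closure"
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


# Toy "genome" fragmented into overlapping reads (hypothetical sequence).
genome = "ATGGCGTACGTTAGCCTAGGATCCA"
reads = [genome[i:i + 10] for i in range(0, 16, 5)]
print(greedy_assemble(reads))
```

If some region were not covered by any read, the function would return more than one contig, which is the situation the "close gaps" step addresses.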
© 2000 Macmillan Magazines Ltd
Complete Genome/Chromosome Progress
From http://genomesonline.org
Why Completeness is Important
• Improves characterization of genome features
• Gene order, replication origins
• Better comparative genomics
• Genome duplications, inversions
• Presence and absence of particular genes can be very important
• Missing sequence might be important (e.g., centromere)
• Allows researchers to focus on biology not sequencing
• Facilitates large scale correlation studies
General Steps in Analysis of Complete Genomes
• Identification/prediction of genes
• Characterization of gene features
• Characterization of genome features
• Prediction of gene function
• Prediction of pathways
• Integration with known biological data
• Comparative genomics
General Steps in Analysis of Complete Genomes
• Structural Annotation
– Identification/prediction of genes
– Characterization of gene features
– Characterization of genome features
• Functional Annotation
– Prediction of gene function
– Prediction of pathways
– Integration with known biological data
• Evolutionary Annotation
– Comparative genomics
Structural Annotation I: Genes in Genomes
• Protein coding genes
– In long open reading frames
– ORFs interrupted by introns in eukaryotes
– Take up most of the genome in prokaryotes, but only a small portion of the eukaryotic genome
• RNA-only genes
– Transfer RNA
– Ribosomal RNA
– snoRNAs (guide ribosomal and transfer RNA maturation)
– Intron splicing
– Guiding mRNAs to the membrane for translation
– Gene regulation (this is a growing list)
Structural Annotation II: Other Features to Find
• Gene control sequences
– Promoters
– Regulatory elements
• Transposable elements, both active and defective
– DNA transposons and retrotransposons
– Many types and sizes
• Other repeated sequences
– Centromeres and telomeres
– Many with unknown (or no) function
• Unique sequences that have no obvious function
How to Find ncRNAs
• The most universal genes, such as tRNA and rRNA, are very conserved and thus easy to detect. Finding them first removes some areas of the genome from further consideration.
• One easy approach to finding common RNA genes is just looking for sequence homology with related species: a BLAST search will find most of them quite easily.
• Functional RNAs are characterized by secondary structure caused by base pairing within the molecule.
• Determining the folding pattern is a matter of testing many possibilities to find the one with the minimum free energy, which is the most stable structure.
• The free energy calculations are in turn based on experiments where short synthetic RNA molecules are melted.
• Related to this is the concept that paired regions (stems) will be conserved across species lines even if the individual bases aren’t conserved. That is, if there is an A-U pairing in one species, the same position might be occupied by a G-C in another species.
• This is an example of compensatory evolution (covariation): a deleterious mutation at one site is cancelled by a compensating mutation at another site.
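The idea that a stem stays paired while its individual bases change can be checked on an alignment. The following is a minimal sketch, assuming pre-aligned RNA sequences of equal length and treating G-U as an allowed pair; the function name and the toy alignment are hypothetical.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def covarying_pair(alignment, i, j):
    """True if columns i and j can pair in every sequence but the bases differ between sequences."""
    observed = [(seq[i], seq[j]) for seq in alignment]
    always_paired = all(pair in PAIRS for pair in observed)
    bases_change = len(set(observed)) > 1
    return always_paired and bases_change


# Three toy "species": column 0 pairs with column 4 (A-U, G-C, C-G).
alignment = ["ACCCU", "GCCCC", "CCCCG"]
print(covarying_pair(alignment, 0, 4))  # stem-like columns
print(covarying_pair(alignment, 1, 3))  # C with C cannot pair
```

A pair of columns that is complementary everywhere but never changes gives no covariation signal, which is why the function also requires the bases to differ.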
RNA Structure
• RNA differs from DNA in having fairly common G-U base pairs. Also, many functional RNAs have unusual modified bases such as pseudouridine and inosine.
• The pseudoknot, pairing between a loop and a sequence outside its stem, is especially difficult to detect: it is computationally intense and breaks the usual rule that RNA base pairing follows a nested pattern.
– But pseudoknots seem to be fairly rare.
• Essentially, RNA folding programs start with all possible short sequences, then build to larger ones, adding the contribution of each structural element.
– There is an element of dynamic programming here as well.
– And, “stochastic context-free grammars”, something I really don’t want to approach right now!
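The "start with short sequences, then build to larger ones" idea can be sketched with a base-pair-maximization recursion (Nussinov-style dynamic programming). This is a deliberate simplification: it counts nested pairs rather than computing free energy, cannot represent pseudoknots, and the minimum loop size of 3 is an assumption.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs, with at least min_loop unpaired bases in each hairpin."""
    n = len(seq)
    if n == 0:
        return 0
    dp = [[0] * n for _ in range(n)]  # dp[i][j] = best score for seq[i..j]
    for span in range(min_loop + 1, n):  # build from short subsequences to long ones
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j is unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


print(max_pairs("GGGAAAUCC"))  # a small hairpin: three stacked pairs around an AAA loop
```

Energy-based folding programs use the same table-filling structure but add the measured contribution of each stack and loop instead of counting pairs.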
Finding tRNAs
• tRNAs have a highly conserved structure, with 3 main stem-and-loop structures that form a cloverleaf structure, and several conserved bases. Finding such sequences is a matter of looking in the DNA for the proper features located the proper distance apart.
• Looking for such sequences is well-suited to a decision tree, a series of steps that the sequence must pass.
• In addition, a score is kept, rating how well the sequence passed each step. This allows a more stringent analysis later on, to eliminate false positives.
Bacteria / Archaeal Protein Coding Genes
• Bacteria use ATG as their main start codon, but GTG and TTG are also fairly common, and a few others are occasionally used.
– Remember that start codons are also used internally: the actual start codon may not be the first one in the ORF.
• The stop codons are the same as in eukaryotes: TGA, TAA, TAG
– Stop codons are (almost) absolute: except for a few cases of programmed frameshifts and the use of TGA for selenocysteine, the stop codon at the end of an ORF is the end of protein translation.
• Genes can overlap by a small amount. Not much, but a few codons of overlap is common enough that you can’t just eliminate overlaps as impossible.
• Cross-species homology works well for many genes. It is very unlikely that non-coding sequence will be conserved.
– But, a significant minority of genes (say 20%) are unique to a given species.
• Translation start signals (ribosome binding sites; Shine-Dalgarno sequences) are often found just upstream from the start codon
– However, some aren’t recognizable
– Genes in operons don’t always have a separate ribosome binding site for each gene
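As a rough illustration of the start and stop codon rules above, here is a minimal forward-strand ORF scan. It is a sketch, not a gene finder: it ignores the reverse strand, ribosome binding sites and homology, the `min_codons` cutoff is an arbitrary assumption, and it reports the first start codon in each open frame even though the true start may be an internal one.

```python
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for ORFs on the forward strand; end is exclusive and includes the stop."""
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon in STARTS:
                start = pos  # first candidate start in this open stretch
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # the stop closes the ORF
    return orfs


print(find_orfs("CCATGAAATTTGGGTAACC"))
print(find_orfs("GTGCCCTGA"))  # GTG used as the start codon
```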
Composition Methods
• The frequency of various codons is different in coding regions as compared to non-coding regions.
– This extends to G-C content, dinucleotide frequencies, and other measures of composition. Dicodons (groups of 6 bases) are often used.
– Well documented experimentally.
• The composition varies between different proteins of course, and it is affected within a species by the amounts of the various tRNAs present
– Horizontally transferred genes can also confuse things: they tend to have compositions that reflect their original species.
– A second group with unusual compositions are highly expressed genes.
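The composition measures named above are simple counts. A minimal sketch, assuming an in-frame DNA string; the function names are hypothetical, and a real method would compare these counts between known coding and non-coding training sets.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def codon_counts(seq):
    """Count in-frame codons (frame 0)."""
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))


def dicodon_counts(seq):
    """Count in-frame dicodons: 6-base windows advanced one codon at a time."""
    return Counter(seq[i:i + 6] for i in range(0, len(seq) - 5, 3))


seq = "ATGATGTAA"
print(gc_content(seq))
print(codon_counts(seq))
print(dicodon_counts(seq))
```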
Eukaryotic Genes Harder to Find
• Some fundamental differences between prokaryotes and eukaryotes:
• There is lots of non-coding DNA in eukaryotes.
– First step: find repeated sequences and RNA genes
– Note that eukaryotes have 3 main RNA polymerases. RNA polymerase 2 (pol2) transcribes all protein-coding genes, while pol1 and pol3 transcribe various RNA-only genes.
• Most eukaryotic genes are split into exons and introns.
• Only 1 gene per transcript in eukaryotes.
• No ribosome binding sites: translation starts at the first ATG in the mRNA
– Thus, in eukaryotic genomes, searching for the transcription start site (TSS) makes sense.
• Many fewer eukaryotic genomes have been sequenced
Exons
• Exon sequences can often be identified by sequence conservation, at least roughly.
• Dicodon statistics, as used for prokaryotes, are also useful
– Eukaryotic genomes tend to contain many isochores, regions of different GC content, and composition statistics can vary between isochores.
• The initial and terminal exons contain untranslated regions, and thus special methods are needed to detect them.
• Predicting splice junctions is a matter of collecting information about the sequences surrounding each possible GT/AG pair, then running this information through some combination of decision trees, Markov models, discriminant analysis, or neural networks, in an attempt to massage the data into giving a reliable score.
– In general, sites are more likely to be correct if predicted by multiple methods
– Experimental data from ESTs can be very helpful here.
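The first step of splice junction prediction, listing every candidate donor/acceptor pair, can be sketched as follows. This assumes the canonical intron boundaries (GT at the 5′ end, AG at the 3′ end) and an arbitrary minimum intron length; it only enumerates candidates and does none of the scoring described above.

```python
def candidate_introns(seq, min_len=6):
    """Return (start, end) spans that begin with GT and end with AG; end is exclusive."""
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]


seq = "CCGTAAAAAGCC"
for start, end in candidate_introns(seq):
    print(start, end, seq[start:end])
```

Even a short sequence usually yields many candidates, which is why the surrounding sequence context has to be scored by a statistical model.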
Functional Annotation
Functional Classification I: GO
• The Gene Ontology (GO) consortium (http://www.geneontology.org/) is an attempt to describe gene products with a structured controlled vocabulary, a set of invariant terms that have a known relationship to each other.
• Each GO term is given a number of the form GO:nnnnnnn (7 digits), as well as a term name. For example, GO:0005102 is “receptor binding”.
• There are 3 root terms: biological process, cellular component, and molecular function. A gene product will probably be described by GO terms from each of these “ontologies”. (ontology is a branch of philosophy concerned with the nature of being, and the basic categories of being and their relationships.)
– For instance, cytochrome c is described with the molecular function term “oxidoreductase activity”, the biological process terms “oxidative phosphorylation” and “induction of cell death”, and the cellular component terms “mitochondrial matrix” and “mitochondrial inner membrane”
• The terms are arranged in a hierarchy that is a “directed acyclic graph” and not a tree. This means simply that each term can have more than one parent term, but the direction of parent to child (i.e. less specific to more specific) is always maintained.
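The difference between a directed acyclic graph and a tree shows up when collecting a term's ancestors: a term with two parents inherits from both branches. A minimal sketch using made-up placeholder terms (these are not real GO terms or relationships):

```python
# Hypothetical toy ontology: each term maps to its list of parent terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],  # two parents: allowed in a DAG, impossible in a tree
    "D": ["C"],
}


def ancestors(term, parents):
    """All less specific terms reachable by following child-to-parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("D", PARENTS)))
```

Because annotation to a specific term implies all of its ancestors, a gene product annotated with "D" here would also count under "A", "B", "C" and "root".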
Functional Classification II: Enzyme Nomenclature
• Enzyme functions: which reactants are converted to which products
– Across many species, the enzymes that perform a specific function are usually evolutionarily related. However, this isn’t necessarily true. There are cases of two entirely different enzymes evolving similar functions.
– Often, two or more gene products in a genome will have the same E.C. number.
• Enzyme functions are given unique numbers by the Enzyme Commission.
– E.C. numbers are four integers separated by dots. The left-most number is the least specific.
– For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4", whose components indicate the following groups of enzymes:
• EC 3 enzymes are hydrolases (enzymes that use water to break up some other molecule)
• EC 3.4 are hydrolases that act on peptide bonds
• EC 3.4.11 are those hydrolases that cleave off the amino-terminal amino acid from a polypeptide
• EC 3.4.11.4 are those that cleave off the amino-terminal end from a tripeptide
• Top level E.C. numbers:
– E.C. 1: oxidoreductases (often dehydrogenases): electron transfer
– E.C. 2: transferases: transfer of functional groups (e.g. phosphate) between molecules.
– E.C. 3: hydrolases: splitting a molecule by adding water to a bond.
– E.C. 4: lyases: non-hydrolytic addition or removal of groups from a molecule
– E.C. 5: isomerases: rearrangements of atoms within a molecule
– E.C. 6: ligases: joining two molecules using energy from ATP
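The hierarchical reading of an E.C. number can be made concrete with a small parser. This is a sketch with hypothetical function names; the top-level table lists only the six classes given above.

```python
import re

TOP_LEVEL = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
}


def ec_lineage(ec):
    """Split an E.C. number into its levels, least specific first."""
    match = re.search(r"\d+(?:\.\d+){0,3}", ec)
    if match is None:
        raise ValueError(f"no E.C. number found in {ec!r}")
    parts = match.group(0).split(".")
    return [".".join(parts[:n]) for n in range(1, len(parts) + 1)]


def ec_class(ec):
    """Name of the top-level class for an E.C. number."""
    return TOP_LEVEL[int(ec_lineage(ec)[0])]


print(ec_lineage("EC 3.4.11.4"))
print(ec_class("EC 3.4.11.4"))
```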
Functional Prediction
• BLAST searches
• HMM models of specific genes or gene families (Pfam, TIGRfam, FIGfam).
• Sequence motifs and domains. If the gene is not a good match to previously known genes, these provide useful clues.
• Cellular location predictions, especially for transmembrane proteins.
• Genomic neighbors, especially in bacteria, where related functions are often found together in operons and divergons (genes transcribed in opposite directions that use a common control region).
• Biochemical pathway/subsystem information. If an organism has most of the genes needed to perform a function, any missing functions are probably present too.
– Also, experimental data about an organism’s capacities can be used to decide whether the relevant functions are present in the genome.
Functional Prediction II: Membrane Spanning
• Integral membrane proteins contain amino acid sequences that go through the membrane one or several times.
– There are also peripheral membrane proteins that stick to the hydrophilic head groups by ionic and polar interactions
– There are also some that have covalently bound hydrophobic groups, such as myristate, a 14 carbon saturated fatty acid that is attached to the N-terminal amino group.
• There are 2 main protein structures that cross membranes.
– Most are alpha helices, and in proteins that span multiple times, these alpha helices are packed together in a coiled-coil. Length = 15-30 amino acids.
– Less commonly, there are proteins with membrane spanning “beta barrels”, composed of beta sheets wrapped into a cylinder. An example: porins, which let small hydrophilic molecules diffuse across the membrane.
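One common way to flag candidate membrane-spanning alpha helices, not described on the slide itself, is a sliding-window hydropathy average. The sketch below uses the Kyte-Doolittle hydropathy scale; the 19-residue window and the 1.6 threshold are conventional choices treated here as assumptions, and the test sequences are artificial.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every window of the given length along the protein."""
    scores = [KD[residue] for residue in protein]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(protein) - window + 1)
    ]


def has_candidate_tm(protein, window=19, threshold=1.6):
    """True if any window is hydrophobic enough to suggest a membrane-spanning helix."""
    return any(score >= threshold for score in hydropathy_profile(protein, window))


print(has_candidate_tm("D" * 10 + "L" * 20 + "D" * 10))  # artificial hydrophobic stretch
print(has_candidate_tm("DEKR" * 10))                     # artificial charged sequence
```

This approach targets alpha-helical spans only; beta-barrel proteins have no long hydrophobic stretch and need different methods.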
Functional Prediction by Phylogeny
• Key step in genome projects
• More accurate predictions help guide experimental and computational analyses
• Many diverse approaches
• All improved both by “phylogenomic” type analyses that integrate evolutionary reconstructions and understanding of how new functions evolve
Functional Prediction
• Identification of motifs
– Short regions of sequence similarity that are indicative of general activity
– e.g., ATP binding
• Homology/similarity based methods
– Gene sequence is searched against a database of other sequences
– If significantly similar genes are found, their functional information is used
• Problem
– Genes frequently have similarity to hundreds of motifs and multiple genes, not all with the same function
Helicobacter pylori
H. pylori genome - 1997
“The ability of H. pylori to perform mismatch repair is suggested by the presence of methyl transferases, mutS and uvrD. However, orthologues of MutH and MutL were not identified.”
MutL ??
From http://asajj.roswellpark.org/huberman/dna_repair/mmr.html
Phylogenetic Tree of MutS Family
[Figure: phylogenetic tree of the MutS family; leaves are labeled with abbreviated species names such as Ecoli, Bacsu, Helpy, Yeast and Human]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
MutS Subfamilies
[Figure: the MutS family tree with subfamilies labeled: MSH4, MSH5, MutS2, MutS1, MSH1, MSH3, MSH6, MSH2]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
Overlaying Functions onto Tree
[Figure: the MutS family tree with subfamilies labeled (MSH4, MSH5, MutS2, MutS1, MSH1, MSH3, MSH6, MSH2) and known functions overlaid]
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
MutS Subfamilies
• MutS1 Bacterial MMR
• MSH1 Euk - mitochondrial MMR
• MSH2 Euk - all MMR in nucleus
• MSH3 Euk - loop MMR in nucleus
• MSH6 Euk - base:base MMR in nucleus
• MutS2 Bacterial - function unknown
• MSH4 Euk - meiotic crossing-over
• MSH5 Euk - meiotic crossing-over
Functional Prediction Using Tree
[Figure: the MutS family tree with a predicted function for each subfamily]
• MSH1 - Mitochondrial Repair
• MSH2 - Eukaryotic Nuclear Mismatch and Loop Repair
• MSH3 - Nuclear Repair Of Loops
• MSH6 - Nuclear Repair Of Mismatches
• MutS1 - Bacterial Mismatch and Loop Repair
• MSH4 - Meiotic Crossing Over
• MSH5 - Meiotic Crossing Over
• MutS2 - Unknown Functions
Based on Eisen, 1998 Nucl Acids Res 26: 4291-4300.
Table 3. Presence of MutS Homologs in Complete Genome Sequences
Species                                     # of MutS Homologs   Which Subfamilies?   MutL Homologs
Bacteria
Escherichia coli K12                        1                    MutS1                1
Haemophilus influenzae Rd KW20              1                    MutS1                1
Neisseria gonorrhoeae                       1                    MutS1                1
Helicobacter pylori 26695                   1                    MutS2                -
Mycoplasma genitalium G-37                  -                    -                    -
Mycoplasma pneumoniae M129                  -                    -                    -
Bacillus subtilis 169                       2                    MutS1,MutS2          1
Streptococcus pyogenes                      2                    MutS1,MutS2          1
Mycobacterium tuberculosis                  -                    -                    -
Synechocystis sp. PCC6803                   2                    MutS1,MutS2          1
Treponema pallidum Nichols                  1                    MutS1                1
Borrelia burgdorferi B31                    2                    MutS1,MutS2          1
Aquifex aeolicus                            2                    MutS1,MutS2          1
Deinococcus radiodurans R1                  2                    MutS1,MutS2          1
Archaea
Archaeoglobus fulgidus VC-16, DSM4304       -                    -                    -
Methanococcus jannaschii DSM 2661           -                    -                    -
Methanobacterium thermoautotrophicum ΔH     1                    MutS2                -
Eukaryotes
Saccharomyces cerevisiae                    6                    MSH1-6               3+
Homo sapiens                                5                    MSH2-6               3+
Blast Search of H. pylori “MutS”
Sequences producing significant alignments:           Score (bits)   E Value
sp|P73625|MUTS_SYNY3 DNA MISMATCH REPAIR PROTEIN      117            3e-25
sp|P74926|MUTS_THEMA DNA MISMATCH REPAIR PROTEIN      69             1e-10
sp|P44834|MUTS_HAEIN DNA MISMATCH REPAIR PROTEIN      64             3e-09
sp|P10339|MUTS_SALTY DNA MISMATCH REPAIR PROTEIN      62             2e-08
sp|O66652|MUTS_AQUAE DNA MISMATCH REPAIR PROTEIN      57             4e-07
sp|P23909|MUTS_ECOLI DNA MISMATCH REPAIR PROTEIN      57             4e-07
• Blast search pulls up Syn. sp MutS#2 with a much higher score (much lower E value) than other MutS homologs
• Based on this, TIGR predicted this species had mismatch repair
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
High Mutation Rate in H. pylori
Based on Eisen et al. 1997 Nature Medicine 3: 1076-1078.
PHYLOGENETIC PREDICTION OF GENE FUNCTION
[Figure: flowchart of the method, with EXAMPLE A and EXAMPLE B worked through alongside the ACTUAL EVOLUTION (ASSUMED TO BE UNKNOWN), including gene duplications; some inferences are marked Ambiguous. Steps:]
• CHOOSE GENE(S) OF INTEREST
• IDENTIFY HOMOLOGS
• ALIGN SEQUENCES
• CALCULATE GENE TREE
• OVERLAY KNOWN FUNCTIONS ONTO TREE
• INFER LIKELY FUNCTION OF GENE(S) OF INTEREST
Based on Eisen, 1998 Genome Res 8: 163-167.
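The last two steps of the method (overlay known functions, then infer) can be sketched on a toy gene tree. This is a simplified illustration, not the published procedure: the tree is a hand-written nested tuple, the leaf names are hypothetical, and the rule used is "take the functions found in the smallest clade that contains the query plus at least one annotated gene".

```python
def leaves(node):
    """All leaf names under a node; a leaf is a string, an internal node is a tuple."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def infer_function(tree, query, known):
    """Functions of annotated genes in the smallest clade containing the query."""
    if isinstance(tree, str):
        return set()
    for child in tree:
        if query in leaves(child):
            nearest = infer_function(child, query, known)
            if nearest:
                return nearest
            return {known[l] for l in leaves(tree) if l in known and l != query}
    raise ValueError(f"{query!r} is not in the tree")


# Toy tree loosely modeled on the MutS1 / MutS2 split (names are illustrative).
tree = (
    ("Ecoli_MutS1", "Bacsu_MutS1"),
    (("Helpy_MutS", "Synsp_MutS2"), "Bacsu_MutS2"),
)
known = {
    "Ecoli_MutS1": "mismatch repair",
    "Bacsu_MutS1": "mismatch repair",
    "Synsp_MutS2": "function unknown",
}
print(infer_function(tree, "Helpy_MutS", known))
```

More than one function in the returned set corresponds to the "Ambiguous" outcome in the figure. In this toy example the query falls in the MutS2-like clade, so the tree does not support transferring "mismatch repair" from the MutS1 genes even though they are homologs.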
Phylogenomics
Chemosynthetic Symbionts
Eisen et al. 1992. J. Bact. 174: 3416
Carboxydothermus hydrogenoformans
• Isolated from a Russian hotspring
• Thermophile (grows at 80°C)
• Anaerobic
• Grows very efficiently on CO (Carbon Monoxide)
• Produces hydrogen gas
• Low GC Gram positive (Firmicute)
• Genome Determined (Wu et al. 2005 PLoS Genetics 1: e65.)
Homologs of Sporulation Genes
Wu et al. 2005 PLoS Genetics 1: e65.
Carboxydothermus sporulates
Wu et al. 2005 PLoS Genetics 1: e65.
Non-Homology Predictions: Phylogenetic Profiling
• Step 1: Search all genes in organisms of interest against all other genomes
• Ask: Yes or No, is each gene found in each other species
• Cluster genes by distribution patterns (profiles)
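The three steps above reduce to building a yes/no vector per gene and grouping genes whose vectors match. A minimal sketch, assuming the homology searches have already been run and summarized as "gene: set of genomes with a hit"; the gene and genome names are hypothetical, and real analyses usually cluster similar profiles rather than only identical ones.

```python
from collections import defaultdict


def cluster_by_profile(presence, genomes):
    """Group genes by their presence/absence profile across the listed genomes."""
    clusters = defaultdict(list)
    for gene, found_in in presence.items():
        profile = tuple(genome in found_in for genome in genomes)
        clusters[profile].append(gene)
    return dict(clusters)


genomes = ["sporeformer_1", "sporeformer_2", "nonsporeformer_1"]
presence = {
    "geneA": {"sporeformer_1", "sporeformer_2"},
    "geneB": {"sporeformer_1", "sporeformer_2"},
    "geneC": {"sporeformer_1", "sporeformer_2", "nonsporeformer_1"},
}
for profile, genes in cluster_by_profile(presence, genomes).items():
    print(profile, genes)
```

Genes that share a profile with known members of a process (here, present only in the toy "sporeformers") become candidates for involvement in that process, even without sequence similarity to a gene of known function.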
Sporulation Gene Profile
Wu et al. 2005 PLoS Genetics 1: e65.
B. subtilis new sporulation genes
Functional Prediction III: Colocalization
• Operon structure is often maintained over fairly large taxonomic regions.
– Sometimes gene order is altered, and sometimes one or more enzymes are missing.
– But in general, this phenomenon allows recognition or verification that widely diverged enzymes do in fact have the same function.
• This is an operon that contains part of the glycolytic pathway.
– 1: phosphoglycerate mutase
– 2: triosephosphate isomerase
– 3: enolase
– 4: phosphoglycerate kinase
– 5: glyceraldehyde 3-phosphate dehydrogenase
– 6: central glycolytic gene regulator
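Conserved colocalization can be approximated by asking which gene pairs are neighbors in every genome, regardless of orientation. A minimal sketch, assuming each genome is given as an ordered list of gene family labels; the labels and gene orders below are invented for illustration.

```python
def adjacent_pairs(order):
    """Unordered pairs of genes that sit next to each other."""
    return {frozenset(pair) for pair in zip(order, order[1:]) if pair[0] != pair[1]}


def conserved_neighbors(genomes):
    """Gene pairs that are adjacent in every genome."""
    pair_sets = [adjacent_pairs(order) for order in genomes.values()]
    return set.intersection(*pair_sets)


genomes = {
    "species_1": ["gap", "pgk", "tpi", "eno"],
    "species_2": ["x", "pgk", "gap", "y", "tpi", "eno"],
}
for pair in conserved_neighbors(genomes):
    print(sorted(pair))
```

In the toy data the gap/pgk and tpi/eno pairs stay together even though the overall order differs, which is the kind of signal used to support a shared functional role.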
Metabolic Predictions
Comparative Genomics
Using the Core
insight progress
800 NATURE | VOL 406 | 17 AUGUST 2000 | www.nature.com
means that database searches must be repeated regularly to keep annotation accurate and up to date.
One possible solution to the annotation problem is to bring more of the resources of the scientific community to bear on each genome. No single centre can annotate all the functions of a living organism; experts from many different areas of biology should be encouraged to contribute to the annotation process. One possible model would be for geographically separated experts to deposit annotation to a central repository, which might also take on a curatorial or editorial role. An alternative model is one in which annotation resides in many different locations (as it does today), but in which new electronic links are created that allow scientists to locate rapidly all the information about a gene, genome or function. This latter model scales more easily and avoids the problem of overdependence on a single source.
What have we learned from genome analysis?
Comparison of the results from 24 completed prokaryotic genome sequences, containing more than 50 Mbp of DNA sequence and 54,000 predicted open reading frames (ORFs), has revealed that gene density in the microbes is consistent across many species, with about one gene per kilobase (Table 2). Almost half of the ORFs in each species are of unknown biological function. When the function of this large subset of genes begins to be explained, it is likely that entirely novel biochemical pathways will be identified that might be relevant to medicine and biotechnology. Perhaps even more unexpected is the observation that about a quarter of the ORFs in each species studied so far are unique, with no significant sequence similarity to any other available protein sequence. Although this might at present be an artefact of the small number of microbial species studied by whole-genome analysis, it nevertheless supports the idea that there is tremendous biological diversity between microorganisms. Taken together, these data indicate that much of microbial biology has yet to be understood and suggest that the idea of a ‘model’ organism in the microbial world might not be appropriate, given the vast differences between even related species.
Our molecular picture of evolution for the past 20 years has been dominated by the small-subunit ribosomal RNA phylogenetic tree that proposes three non-overlapping groups of living organisms: the bacteria, the archaea and the eukaryotes [8]. Although the archaea possess bacterial cell structures, it has been suggested that they share a common ancestor exclusive of bacteria.
Analysis of complete genome sequences is beginning to provide great insight into many questions about the evolution of microbes. One such area has encompassed the occurrence of genetic exchanges between different evolutionary lineages, a phenomenon known as horizontal, or lateral, gene transfer. The occurrence of horizontal gene transfer, such as that involving genes from organellar genomes to the nucleus, or of antibiotic resistance genes between bacterial species, has been well established for many years (see, for example, ref. 9). This phenomenon causes problems for studying the evolution of species because it means that some species are chimaeric, with different histories for different genes. Before the availability of complete genome sequences, studies of horizontal gene transfer had been limited because of the incompleteness of the data sets being analysed. Analyses of complete genome sequences have led to many recent suggestions that the extent of horizontal gene exchange is much greater than was previously realized [10–12]. For example, an
Table 1 Results of a BLAST search of a newly sequenced M. tuberculosis gene against a comprehensive protein database
Gene ID        Similarity (%)   Length (bp)   Gene name (organism)                                         E-value*
GP:2905647     44.8             1,191         D-Arabinitol kinase (Klebsiella pneumoniae)                  6.2e-15
EGAD:22614     46.2             1,191         Gluconokinase (Bacillus subtilis)                            1.4e-13
EGAD:20418     43.0             1,302         Xylulose kinase (Lactobacillus pentosus)                     4.8e-13
EGAD:105114    43.4             1,320         Carbohydrate kinase, FGGY family (Archaeoglobus fulgidus)    4.7e-12
GP:2895855     42.7             1,263         Xylulokinase (Lactobacillus brevis)                          1.0e-07
EGAD:10899     45.4             1,296         Xylulose kinase (Escherichia coli)                           2.1e-06
*E-value is a statistical measure of the significance of a BLAST search result.
Table 2 Genome features from 24 microbial genome sequencing projects
Organism | Genome size (Mbp) | No. of ORFs (% coding) | Unknown function | Unique ORFs
Aeropyrum pernix K1 1.67 1,885 (89%)
A. aeolicus VF5 1.50 1,749 (93%) 663 (44%) 407 (27%)
A. fulgidus 2.18 2,437 (92%) 1,315 (54%) 641 (26%)
B. subtilis 4.20 4,779 (87%) 1,722 (42%) 1,053 (26%)
B. burgdorferi 1.44 1,738 (88%) 1,132 (65%) 682 (39%)
Chlamydia pneumoniae AR39 1.23 1,134 (90%) 543 (48%) 262 (23%)
Chlamydia trachomatis MoPn 1.07 936 (91%) 353 (38%) 77 (8%)
C. trachomatis serovar D 1.04 928 (92%) 290 (32%) 255 (29%)
Deinococcus radiodurans 3.28 3,187 (91%) 1,715 (54%) 1,001 (31%)
E. coli K-12-MG1655 4.60 5,295 (88%) 1,632 (38%) 1,114 (26%)
H. influenzae 1.83 1,738 (88%) 592 (35%) 237 (14%)
H. pylori 26695 1.66 1,589 (91%) 744 (45%) 539 (33%)
Methanobacterium thermoautotrophicum 1.75 2,008 (90%) 1,010 (54%) 496 (27%)
Methanococcus jannaschii 1.66 1,783 (87%) 1,076 (62%) 525 (30%)
M. tuberculosis CSU#93 4.41 4,275 (92%) 1,521 (39%) 606 (15%)
M. genitalium 0.58 483 (91%) 173 (37%) 7 (2%)
M. pneumoniae 0.81 680 (89%) 248 (37%) 67 (10%)
N. meningitidis MC58 2.24 2,155 (83%) 856 (40%) 517 (24%)
Pyrococcus horikoshii OT3 1.74 1,994 (91%) 859 (42%) 453 (22%)
Rickettsia prowazekii Madrid E 1.11 878 (75%) 311 (37%) 209 (25%)
Synechocystis sp. 3.57 4,003 (87%) 2,384 (75%) 1,426 (45%)
T. maritima MSB8 1.86 1,879 (95%) 863 (46%) 373 (26%)
T. pallidum 1.14 1,039 (93%) 461 (44%) 280 (27%)
Vibrio cholerae El Tor N1696 4.03 3,890 (88%) 1,806 (46%) 934 (24%)
Total 50.60 52,462 (89%) 22,358 (43%) 12,161 (23%)
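The "about one gene per kilobase" figure quoted in the article can be checked directly against the Table 2 numbers. A small sketch using the totals row and two organism rows as they appear in the table; the function name is hypothetical.

```python
def genes_per_kb(genome_size_mbp, n_orfs):
    """ORFs per kilobase of genome (1 Mbp = 1,000 kb)."""
    return n_orfs / (genome_size_mbp * 1000)


# (genome size in Mbp, number of ORFs), copied from Table 2.
rows = {
    "M. genitalium": (0.58, 483),
    "H. influenzae": (1.83, 1738),
    "All 24 genomes": (50.60, 52462),
}
for name, (size_mbp, n_orfs) in rows.items():
    print(f"{name}: {genes_per_kb(size_mbp, n_orfs):.2f} ORFs per kb")
```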
After the Genomes
• Better analysis and annotation
• Comparative genomics
• Functional genomics (Experimental analysis of gene function on a genome scale)
• Genome-wide gene expression studies
• Proteomics
• Genome wide genetic experiments