Introduction to Genomics and the Tree of Life
Friday, October 22, 2010 (part 1)
Monday, October 25, 2010 (part 2)
Genomics 260.605.01
J. Pevsner
Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics, 2nd edition by J Pevsner (© 2009 by Wiley-Blackwell).
These images and materials may not be used without permission from the publisher (instructors, email me at [email protected]).
Visit http://www.bioinfbook.org
Copyright notice
We meet 3 times a week, from 10:30 to 11:50 am:
W4013 (lecture/discussion and occasional computer lab)
Announcements: where/when we meet
Textbook: Bioinformatics and Functional Genomics (2nd edition, Wiley-Blackwell, 2009) by J. Pevsner, ISBN 978-0-470-08585-1.
• We’ll cover chapters 13-20 in this course
• For those who don’t want to buy a copy, I will share pdfs of all the chapters with the class
• You can buy a copy at the website www.bioinfbook.org and get a nice discount ($80). It’s $80 at Amazon.com.
• The JHU bookstore may have copies.
• Welch Library may have copies
Book’s website: www.bioinfbook.org
Course website: http://www.bioinfbook.org/genomics.php or visit www.bioinfbook.org/chapter 13
Announcements: book, website
Outline of this course
Introduction to genomics
Viruses
Bacteria and archaea (Egbert Hoiczyk)
Eukaryotes
The eukaryotic chromosome
Fungi; yeast functional genomics (Jef Boeke)
Protozoans (David Sullivan)
Nematodes (Al Scott)
Mosquitoes (George Dimopoulos)
Rodents: mouse and rat
Primates
The human genome (Dave Valle)
Human disease
Outline of today’s lecture
Introduction: 5 perspectives, history of life
Genome-sequencing projects: chronology
Genome analysis: criteria, resequencing, metagenomics
DNA sequencing technologies: Sanger, 454, Solexa
Process of genome sequencing: centers, repositories
Genome annotation: features, prokaryotes, eukaryotes
Five approaches to genomics
As we survey the tree of life, consider these perspectives:
Approach I: cataloguing genomic information
Genome size; number of chromosomes; GC content; isochores; number of genes; repetitive DNA; unique features of each genome
Approach II: cataloguing comparative genomic information
Orthologs and paralogs; COGs; lateral gene transfer
Approach III: function; biological principles; evolution
How genome size is regulated; polyploidization; birth and death of genes; neutral theory of evolution; positive and negative selection; speciation
Approach IV: Human disease relevance
Approach V: Bioinformatics aspects
Algorithms, databases, websites
Page 519
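Some of the Approach I quantities (genome size, GC content) can be computed directly from a sequence. A minimal sketch in Python, assuming the genome is available as a plain string of bases; the function name gc_content and the toy sequence are illustrative assumptions, not taken from the textbook:

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return gc / total


toy = "ATGCGCGTTANNNA"  # hypothetical sequence; N marks ambiguous bases
print(len(toy))          # size in bases, counting ambiguous positions
print(gc_content(toy))   # ambiguous positions are left out of the ratio
```

Ignoring ambiguous bases is one possible convention; other tools may count them in the denominator.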
Two projects for this course
Option [1] Select a genome and describe it in detail.
Option [2] Select a gene and describe it in detail.
For each, follow the five approaches just outlined, and apply the principles that we learn in this course.
Reading: Webb Miller et al. (2004) Comparative genomics
Introduction
Lessons learned from comparative genomics
  What have we learned about genes by comparing genomic sequences?
  What have we learned about regulation?
  About 5% of the human genome is under purifying selection
  Positively regulated regions
  Mechanisms and history of mammalian evolution
  Nonuniformity of neutral evolutionary rates within species
  Nonuniformity of evolution along the branches of phylogeny
Learning more from existing data
  Choice of species
  Choice of tools
Future of comparative genomics
Levels of analysis in genomics
level         topics                  databases
DNA           genes, chromosomes      GenBank
RNA           ESTs, ncRNA             UniGene, GEO
protein       ORFs, composition       UniProt
complexes     binary, multimeric      BIND
pathways                              COGs, KEGG
organelles
organs
individuals   variation and disease   HapMap
species       speciation              TaxBrowser; SGD
genus                                 JAX mouse
phylum                                FishBase
kingdom                               TOL
Definitions of terms
Genomics is the study of genomes (the DNA comprising an organism) using the tools of bioinformatics.
Bioinformatics is the study of proteins, genes, and genomes using computer algorithms and databases.
Systematics is the scientific study of the kinds and diversity of organisms and of any and all relationships among them.
Classification is the ordering of organisms into groups on the basis of their relationships. The relationships may be evolutionary (phylogenetic) or may refer to similarities of phenotype (phenetic).
Taxonomy is the theory and practice of classifying organisms.
Outline of today’s lecture
Introduction: 5 perspectives, history of life: trees
Genome-sequencing projects: chronology
Genome analysis: criteria, resequencing, metagenomics
DNA sequencing technologies: Sanger, 454, Solexa
Process of genome sequencing: centers, repositories
Genome annotation: features, prokaryotes, eukaryotes
Fig. 13.1, Page 521
Pace (2001) described a tree of life based on small subunit rRNA sequences.
This tree shows the main three branches described by Woese and colleagues.
Ernst Haeckel (1834-1919), a supporter of Darwin, published a tree of life (1879) including Monera (formless clumps, later named bacteria).
Introduction: Systematics
Page 520
[Figure: Five kingdom system (Haeckel, 1879). Labels: plants, animals, monera, fungi, protists, protozoa, invertebrates, vertebrates, mammals]
Page 516
Chatton (1937) distinguished prokaryotes (bacteria that lack nuclei) from eukaryotes (having nuclei).
Whittaker (1969) and others described the five-kingdom system: animals, plants, protists, fungi, and monera.
In the 1970s and 1980s, Carl Woese and colleagues described the archaea, thus forming a tree of life with three main branches: archaea, bacteria, eukaryotes.
Page 520
Whittaker RH (1969) New concepts of kingdoms of organisms. Evolutionary relations are better represented by new classifications than by the traditional two kingdoms. Science. 163(3863):150-60.
Whittaker (1969): The two-kingdom system as it might have appeared in the early 1900s
Plantae Animalia
The Copeland four-kingdom system of the 1930s-1950s
[Figure labels: Monera, Metaphyta, Metazoa, Protoctista; axes: Prokaryotic, Eukaryotic, Unicellular, Multicellular]
Whittaker (1969): The five-kingdom system
Plantae Fungi Animalia
Monera
Protista
Levels: prokaryotic (Monera); eukaryotic unicellular; eukaryotic multicellular
Historically, trees were generated primarily using characters provided by morphological data. Molecular sequence data are now commonly used, including sequences (such as small-subunit RNAs) that are highly conserved.
Visit the European Small Subunit Ribosomal RNA database for 20,000 SSU rRNA sequences.
Molecular sequences as basis of trees
Page 523
Pace (2001) described a tree of life based on small subunit rRNA sequences.
This tree shows the main three branches described by Woese and colleagues. It is the best currently accepted model of the tree of life.
Fig. 13.1, Page 521
http://www.zo.utexas.edu/faculty/antisense/Download.html
Tree of life from David Hillis’ lab (based on ~3000 rRNAs)
[Figure labels: animals, plants, fungi, protists, bacteria, archaea; "you are here" marker]
10-10
Ribosomal RNA Database
Ribosomal Database Project
http://rdp.cme.msu.edu/index.jsp
Santos SR, Ochman H (2004) Identification and phylogenetic sorting of bacterial lineages with universally conserved genes and proteins. Environmental Microbiology. 6(7):754-9.
► Download fusA (translation elongation factor 2 [EF-2])
► Obtain DNA in the fasta format
► Align by ClustalW in MEGA
► Create a neighbor-joining tree
Page 524
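The fusA workflow feeds aligned sequences into a distance-based tree method. MEGA does this internally; as a rough illustration of what the input to such a method looks like, here is a sketch that parses aligned FASTA text and computes the p-distance (proportion of differing sites) for each pair. It assumes the sequences are already aligned to equal length with "-" for gaps. The function names and the toy records are hypothetical, and MEGA may apply a different distance model by default.

```python
from itertools import combinations


def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into {record name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


# Hypothetical aligned toy records, not real fusA data.
aligned = """>seqA
ATGGCTAA
>seqB
ATGGCAAA
>seqC
ATG-CTTA
"""

seqs = parse_fasta(aligned)
for n1, n2 in combinations(seqs, 2):
    print(n1, n2, round(p_distance(seqs[n1], seqs[n2]), 3))
```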
10-10
European Small Subunit Ribosomal RNA database (http://www.psb.ugent.be/rRNA/ssu/) 10-10
[Figure: tree with one leaf per bacterial strain's fusA sequence; scale bar 0.05. Highlighted groups: Rickettsia, Treponema, Mycobacterium, Aquifex aeolicus, Yersinia pestis, Clostridium, Mycoplasma, Bacillus anthracis]
Neighbor-joining tree of ~150 fusA (GTPase) DNA sequences
Fig. 15.1, Page 603
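Neighbor joining builds a tree by repeatedly joining the pair of taxa that minimizes the Q criterion. The sketch below shows a single joining step, following the standard formulation of the method. The five-taxon distance matrix is a hypothetical toy example (not fusA data), and the function name nj_pick_pair is illustrative.

```python
def nj_pick_pair(d):
    """One neighbor-joining step on a symmetric distance matrix d (list of lists).

    Returns (i, j, len_i, len_j): the pair of taxa to join and the branch
    lengths from each of them to the new internal node.
    """
    n = len(d)
    if n < 3:
        raise ValueError("need at least three taxa")
    row_sums = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q is small when i and j are close to each other
            # but far from everything else.
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    len_i = 0.5 * d[i][j] + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    len_j = d[i][j] - len_i
    return i, j, len_i, len_j


# Hypothetical toy distances among five taxa (a, b, c, d, e).
toy = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(toy))
```

A full run would replace the joined pair with the new node u, set its distance to every other taxon k to (d(i,k) + d(j,k) - d(i,j)) / 2, and repeat until the tree is resolved.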
Eukaryotes (Baldauf et al. 2000)
Fig. 18.1, Page 730
Outline of today’s lecture
Introduction: 5 perspectives, history of life: time lines
Genome-sequencing projects: chronology
Genome analysis: criteria, resequencing, metagenomics
DNA sequencing technologies: Sanger, 454, Solexa
Process of genome sequencing: centers, repositories
Genome annotation: features, prokaryotes, eukaryotes
History of life on earth
4.55 BYA: formation of earth (violent 100 MY period)
4.4-3.8 BYA: last ocean-evaporating impacts
3.9 BYA: oldest dated rocks
3.8 BYA: sun brightened to 70% of today’s luminosity
Ammonia, methane, or carbon dioxide atmosphere.
Earliest life: RNA, protein
Source: Schopf J.W. (ed.), Life’s Origins (U. Calif. Press, 2002) Page 521
History of life on earth: two major eons
Source: Schopf J.W. (ed.), Life’s Origins (U. Calif. Press, 2002)
Precambrian eon: extends from the formation of the planet to the appearance of fossils of hard-shelled animals 550 MYA
Phanerozoic eon: from the Cambrian explosion to the present
Timeline, 4 to 0 billions of years ago (BYA)
Eons: Hadean, Archean, Proterozoic, Phanerozoic
Events marked: origin of life; earliest fossils; origin of eukaryotes; plant/animal divergence; fungi/animal divergence; insects
Page 522
Timeline, 1000 to 0 millions of years ago (MYA)
Eons: Proterozoic, Phanerozoic
Events marked: deuterostome/protostome divergence; echinoderm/chordate divergence; Cambrian explosion; land plants; insects; Age of Reptiles ends
Page 522
Timeline, 100 to 0 millions of years ago (MYA)
Events marked: mass extinction; dinosaurs extinct; mammalian radiation; human/chimp divergence
Page 522
Timeline, 10 to 0 millions of years ago (MYA)
Events marked: Homo sapiens/chimp divergence; Australopithecus (Lucy); earliest stone tools; emergence of Homo erectus
Page 522
Timeline, 1,000,000 to 0 years ago
Events marked: Homo erectus emerges in Africa; mitochondrial Eve
Page 523
Timeline, 100,000 to 0 years ago
Events marked: emergence of anatomically modern H. sapiens; Neanderthal and Homo erectus disappear
Page 523
Timeline, 10,000 to 0 years ago
Events marked: “Ice Man” from Alps; earliest pyramids; Aristotle
Page 523
Timeline, 1,000 to 0 years ago
Events marked: algebra; Gutenberg; calculus; Darwin, Mendel
Page 523
Page 524
Today’s continents derive from earlier land masses (Laurasia, Gondwana), affecting evolution of species
We will next summarize the major achievements in genome sequencing projects from a chronological perspective.
Chronology of genome sequencing projects
Page 525
1976: first viral genome
Fiers et al. sequence bacteriophage MS2 (3,569 base pairs, Accession NC_001417).
1977: Sanger et al. sequence bacteriophage φX174.
This virus is 5,386 base pairs (encoding 11 genes).
See accession J02482; NC_001422.
Page 527
Fig. 13.5, Page 528
Entrez nucleotide record for bacteriophage φX174 (graphics display)
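The same record can also be requested programmatically. A hedged sketch using the NCBI E-utilities efetch endpoint with the accession quoted above; the helper names efetch_url and fetch_sequence_length are assumptions for illustration, and NCBI's usage guidelines (rate limits, contact parameters) should be checked before any real use:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, rettype: str = "fasta") -> str:
    """Build an E-utilities efetch URL for a nucleotide accession."""
    params = {"db": "nuccore", "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_sequence_length(accession: str) -> int:
    """Download a FASTA record and count its residues (requires network access)."""
    with urlopen(efetch_url(accession)) as response:
        text = response.read().decode("utf-8")
    return sum(len(line.strip()) for line in text.splitlines() if not line.startswith(">"))


print(efetch_url("NC_001422"))
# fetch_sequence_length("NC_001422") is not called here because it needs a
# network connection; its result could be compared with the size quoted above.
```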
1981: Human mitochondrial genome
16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA)
Today (10/10), over 2200 mitochondrial genomes sequenced
1986: Chloroplast genome
156,000 base pairs (most are 120 kb to 200 kb)
Page 527
[Figure labels: mitochondrion; chloroplast; lack mitochondria (?)]
http://www.ncbi.nlm.nih.gov/genomes/ORGANELLES/organelles.html
Entrez Genomes organelle resource at NCBI
10-10
There are ~2500 eukaryotic organelles (10/10)
http://www-lecb.ncifcrf.gov/mitoDat/
MitoDat: resource for organelle genomes
“This database is dedicated to the nuclear genes specifying the enzymes, structural proteins, and other proteins, many still not identified, involved in mitochondrial biogenesis and function. MitoDat highlights predominantly human nuclear-encoded mitochondrial proteins.”
Not updated recently.
10-10
http://www.mitomap.org/
MitoMap: resource for organelle genomes
10-10
It is possible to map mutations in human mitochondrial DNA that are responsible for disease
1995: first genome of a free-living organism, the bacterium Haemophilus influenzae
Page 530
1995: genome of the bacterium Haemophilus influenzae is sequenced
Fig. 13.7, Page 531
How to find information about a genome: NCBI > All databases > Genome > follow link to Bacteria
Overview of bacterial complete genomes (2000) n=30
Overview of bacterial complete genomes (2010) n=3,330
Fig. 12.9, Page 411
Entrez Genome view of H. influenzae (October 2009)
You can find functional annotation through the COGs database (Clusters of Orthologous Groups)
Click the circle to access the genome sequence
Genes are color-coded according to the COGs scheme
1996: first eukaryotic genome
The complete genome sequence of the budding yeast Saccharomyces cerevisiae was reported. We will describe this genome soon.
Also in 1996, TIGR reported the sequence of the first archaeal genome, Methanococcus jannaschii.
Page 532
1996: a yeast genome is sequenced
To learn about a genome of interest, visit NCBI and follow the TaxBrowser > Genome Projects links
Size (in megabases) and number of chromosomes are given here
To place the sequencing of the yeast genome in context, these are the eukaryotes…
Tree of eukaryotes (Baldauf et al. 2000)
Fungi
1997: More bacteria and archaea
Escherichia coli: 4.6 megabases, 4200 proteins (38% of unknown function)
1998: first multicellular organism
Nematode Caenorhabditis elegans: 97 Mb; 19,000 genes.
1999: first human chromosome
Chromosome 22 (49 Mb, 673 genes)
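In the spirit of Approach I, the sizes and gene counts above can be turned into a rough gene density. A small sketch using only the figures quoted on this slide. For E. coli the protein count is treated as a gene count, which is an assumption, and the function name genes_per_mb is illustrative.

```python
# Figures as quoted on the slide: (size in Mb, gene count).
genomes = {
    "Escherichia coli": (4.6, 4200),  # protein count used as gene count (assumption)
    "Caenorhabditis elegans": (97, 19000),
    "Human chromosome 22": (49, 673),
}


def genes_per_mb(size_mb: float, genes: int) -> float:
    """Gene density in genes per megabase."""
    return genes / size_mb


for name, (size_mb, genes) in genomes.items():
    print(f"{name}: {genes_per_mb(size_mb, genes):.1f} genes per Mb")
```

On these numbers the bacterium carries far more genes per megabase than the nematode, which in turn is denser than human chromosome 22.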
Page 532
See the article by Webb Miller et al. (2004), “Comparative genomics” for a discussion of annotation and analysis progress made since 1998
1999: Human chromosome 22 sequenced
49 Mb, 701 genes
2000: Fruit fly Drosophila melanogaster (13,000 genes)
Plant Arabidopsis thaliana
Human chromosome 21
2001: draft sequence of the human genome (public consortium and Celera Genomics)
Page 534
To explore human chromosome 21 at NCBI:
Find MapViewer
Choose human
Click chromosome 21
2000
2001: draft human genome sequence
2002: S. pombe (just 4,800 genes)
2004: “finished” human genome
2007: first individual human genome
2009: 1000 Genomes Project
Outline of Monday’s lecture (Chapter 13)
Introduction: 5 perspectives, history of life: time lines
Genome-sequencing projects: chronology
Genome analysis: criteria, resequencing, metagenomics
DNA sequencing technologies: Sanger, 454, Solexa
Process of genome sequencing: centers, repositories
Genome annotation: features, prokaryotes, eukaryotes