Completed Genomes: Viruses and Bacteria
Monday, October 20, 2003
Introduction to Bioinformatics
ME:440.714
J. Pevsner
Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by J. Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by Wiley.
These images and materials may not be used without permission from the publisher.
Visit http://www.bioinfbook.org
Copyright notice
We are now beginning the last third of the course:
Today: completed genomes (Chapters 12-14)
Wednesday: Fungi. Exam #2 is due at the start of class.
Next Monday: Functional genomics (Jef Boeke)
Next Wednesday: Pathways (Joel Bader)
Monday Nov. 3: Eukaryotic genomes
Wednesday Nov. 5: Human genome
Monday Nov. 10: Human disease
Wednesday Nov. 12: Final exam (in class)
Announcements
Genome projects (Chapter 12)
-- chronological overview
-- major issues and themes
Introduction to viruses (Chapter 13)
-- classification
-- bioinformatics challenges and resources
Introduction to bacteria and archaea (Chapter 14)
-- classification
-- bioinformatics challenges and resources
Outline of today’s lecture
A genome is the collection of DNA that comprises an organism. Today we have assembled the sequence of hundreds of genomes. We will begin by introducing the “tree of life” in an effort to make a comprehensive survey of life forms.
Introduction to genomes
Page 397
Ernst Haeckel (1834-1919), a supporter of Darwin, published a tree of life (1879) including Monera (formless clumps, later named bacteria).
Chatton (1937) distinguished prokaryotes (bacteria that lack nuclei) from eukaryotes (having nuclei).
Whittaker and others described the five-kingdom system: animals, plants, protists, fungi, and monera.
In the 1970s and 1980s, Carl Woese and colleagues described the archaea, thus forming a tree of life with three main branches.
Introduction: Systematics
Page 399
Figure: the five-kingdom system (Haeckel, 1879). Labels: plants, animals, monera, fungi, protists, protozoa, invertebrates, vertebrates, mammals.
Page 396
Fig. 12.1
Page 400
Pace (2001) described a tree of life based on small subunit rRNA sequences.
This tree shows the three main branches described by Woese and colleagues.
Historically, trees were generated primarily using characters provided by morphological data. Molecular sequence data are now commonly used, including sequences (such as small-subunit RNAs) that are highly conserved.
Visit the European Small Subunit Ribosomal RNA database for 20,000 SSU rRNA sequences.
Molecular sequences as basis of trees
Page 401
Genomes that span the tree of life are being sequenced at a rapid rate. There are several web-based resources that document the progress, including:
GNN Genome News Network
http://www.genomenewsnetwork.org/main.shtml
GOLD Genomes Online Database
http://wit.integratedgenomics.com/GOLD/
PEDANT Protein Extraction, Description & Analysis Tool
http://pedant.gsf.de/
Genome sequencing projects
Page 405
There are three main resources for genomes:
EBI European Bioinformatics Institute
http://www.ebi.ac.uk/genomes/
NCBI National Center for Biotechnology Information
http://www.ncbi.nlm.nih.gov
TIGR The Institute for Genomic Research
http://www.tigr.org
Genome sequencing projects
Page 405
We will next summarize the major achievements in genome sequencing projects from a chronological perspective.
Chronology of genome sequencing projects
Page 404
1977: first viral genome
Sanger et al. sequence bacteriophage φX174. This virus is 5386 base pairs (encoding 11 genes). See accession J02482.
1981: human mitochondrial genome
16,500 base pairs (encodes 13 proteins, 2 rRNAs, 22 tRNAs). Today, over 400 mitochondrial genomes have been sequenced.
1986: chloroplast genome
156,000 base pairs (most are 120 kb to 200 kb)
Chronology of genome sequencing projects
Page 406
1995: first genome of a free-living organism, the bacterium Haemophilus influenzae
Chronology of genome sequencing projects
Page 409
Fig. 12.9
Page 411
You can find functional annotation through the COGs database (Clusters of Orthologous Groups).
Fig. 12.10
Page 412
Click the circle to access the genome sequence.
Genes are color-coded according to the COGs scheme.
1996: first eukaryotic genome
The complete genome sequence of the budding yeast Saccharomyces cerevisiae was reported. We will describe this genome on Wednesday.
Also in 1996, TIGR reported the sequence of the first archaeal genome, Methanococcus jannaschii.
Chronology of genome sequencing projects
Page 413
1997: more bacteria and archaea
Escherichia coli: 4.6 megabases, 4200 proteins (38% of unknown function)
1998: first multicellular organism
Nematode Caenorhabditis elegans: 97 Mb; 19,000 genes
1999: first human chromosome
Chromosome 22 (49 Mb, 673 genes)
Chronology of genome sequencing projects
Page 413
2000:
Fruit fly Drosophila melanogaster (13,000 genes)
Plant Arabidopsis thaliana
Human chromosome 21
2001: draft sequence of the human genome (public consortium and Celera Genomics)
Chronology of genome sequencing projects
Page 415
Completed genome projects (current)
Eukaryotes: 10
Anopheles gambiae
Arabidopsis thaliana
Caenorhabditis elegans
Drosophila melanogaster
Encephalitozoon cuniculi
Guillardia theta nucleomorph
Mus musculus
Plasmodium falciparum
Saccharomyces cerevisiae (yeast)
Schizosaccharomyces pombe
In progress (partial):
Danio rerio (zebrafish)
Glycine max (soybean)
Hordeum vulgare (barley)
Leishmania major
Rattus norvegicus
Viruses: 1419
Bacteria: 139
Archaea: 36
Page 417
[1] Selection of genomes for sequencing
[2] Sequence one individual genome, or several?
[3] How big are genomes?
[4] Genome sequencing centers
[5] Sequencing genomes: strategies
[6] When has a genome been fully sequenced?
[7] Repository for genome sequence data
[8] Genome annotation
Overview of genome analysis
Page 418
[1] Selection of genomes for sequencing is based on criteria such as:
• genome size (some plant genomes are much larger than the human genome)
• cost
• relevance to human disease (or other disease)
• relevance to basic biological questions
• relevance to agriculture
Overview of genome analysis
Page 419
Ongoing projects:
Chicken
Chimpanzee
Cow
Dog (recent publication)
Fungi (many)
Honey bee
Sea urchin
Rhesus macaque
Overview of genome analysis
Page 419
[2] Sequence one individual genome, or several?
Try one…
-- Each genome center may study one chromosome from an organism
-- It is necessary to measure polymorphisms (e.g. SNPs) in large populations (November 5)
For viruses, thousands of isolates may be sequenced.
For the human genome, cost is the impediment.
Overview of genome analysis
Page 419
[3] How big are genomes?
Viral genomes: 1 kb to 350 kb (Mimivirus: 800 kb)
Bacterial genomes: 0.5 Mb to 13 Mb
Eukaryotic genomes: 8 Mb to 686,000 Mb
(discussed further on Monday, November 3)
Overview of genome analysis
Page 420
Figure: Genome sizes in nucleotide base pairs, on a logarithmic axis from 10^4 to 10^11, for viruses, plasmids, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals.
The size of the human genome is ~3 × 10^9 bp; almost all of its complexity is in single-copy DNA.
The human genome is thought to contain ~30,000-40,000 genes.
http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt
[4] Twenty genome sequencing centers contributed to the public sequencing of the human genome.
Many of these are listed at the Entrez genomes site. (See Table 17.6, page 625.)
Overview of genome analysis
Page 421
[5] There are two main strategies for sequencing genomes
Whole Genome Shotgun (from the NCBI website)
An approach used to decode an organism's genome by shredding it into smaller fragments of DNA which can be sequenced individually. The sequences of these fragments are then ordered, based on overlaps in the genetic code, and finally reassembled into the complete sequence. The 'whole genome shotgun' (WGS) method is applied to the entire genome all at once, while the 'hierarchical shotgun' method is applied to large, overlapping DNA fragments of known location in the genome.
Overview of genome analysis
Page 421
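The overlap-and-merge idea in the shotgun definition can be illustrated with a toy example. The sketch below is a minimal greedy assembler in Python, assuming error-free reads from a single strand and no repeats. The function names (overlap, greedy_assemble) and the toy reads are hypothetical; they are not taken from the lecture or from any real assembler.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            # No overlaps remain; whatever is left stays as separate contigs.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    toy_reads = ["ATGGCGT", "GCGTACG", "TACGTTA"]  # hypothetical reads
    print(greedy_assemble(toy_reads))
```

Real assemblers must also handle sequencing errors, both strands, and repeats, which is one reason hierarchical approaches restrict the problem to fragments of known location.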
Hierarchical shotgun method
Assemble contigs from various chromosomes, then sequence and assemble them. A contig is a set of overlapping clones or sequences from which a sequence can be obtained. The sequence may be draft or finished.
A contig is thus a chromosome map showing the locations of those regions of a chromosome where contiguous DNA segments overlap. Contig maps are important because they provide the ability to study a complete, and often large segment of the genome by examining a series of overlapping clones which then provide an unbroken succession of information about that region.
Overview of genome analysis
Page 421
[6] When has a genome been fully sequenced?
A typical goal is to obtain five- to ten-fold coverage.
Finished sequence: a clone insert is contiguously sequenced to a high quality standard (error rate of 0.01%). There are usually no gaps in the sequence.
Draft sequence: clone sequences may contain several regions separated by gaps. The true order and orientation of the pieces may not be known.
Overview of genome analysis
Page 422
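Fold coverage is the total number of sequenced bases divided by the genome size. The sketch below assumes that reads land uniformly at random on the genome (a Poisson model, usually attributed to Lander and Waterman), under which the expected unsequenced fraction at coverage c is e^-c. The genome size, read length, and function names are illustrative assumptions, not figures from the lecture.

```python
import math


def fold_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced."""
    return n_reads * read_length / genome_size


def reads_needed(coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target fold coverage."""
    return math.ceil(coverage * genome_size / read_length)


def expected_fraction_unsequenced(coverage: float) -> float:
    """Expected fraction of bases hit by no read, assuming random placement."""
    return math.exp(-coverage)


if __name__ == "__main__":
    genome_size = 1_800_000  # hypothetical bacterial genome, in bp
    read_length = 500        # hypothetical read length, in bp
    for target in (5, 10):
        n = reads_needed(target, read_length, genome_size)
        missed = expected_fraction_unsequenced(target)
        print(f"{target}x: {n} reads, expected unsequenced fraction {missed:.5f}")
```

Under this model, five-fold coverage is expected to leave a little under 1% of bases unsequenced, which is consistent with draft sequence still containing gaps.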
[7] Repository for genome sequence data
Raw data from many genome sequencing projects are stored at the trace archive at NCBI or EBI
(main NCBI page, bottom right)
Overview of genome analysis
Page 425
[8] Genome annotation
Information content in genomic DNA includes:
-- repetitive DNA elements
-- nucleotide composition (GC content)
-- protein-coding genes, other genes
These topics will be discussed in detail on November 3 (eukaryotic genomes)
Overview of genome analysis
Page 425
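Nucleotide composition is the simplest of these features to compute. A minimal sketch, assuming the sequence is given as a plain string and that ambiguous bases such as N should be ignored; the function names and the window sizes are hypothetical choices for illustration.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G, or T")
    gc = sum(base in "GC" for base in counted)
    return 100.0 * gc / len(counted)


def windowed_gc(seq: str, window: int, step: int) -> list[float]:
    """GC content (%) in successive windows, to show variation along a genome."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    toy = "ATATATGCGCGCGGCCATAT"  # hypothetical sequence
    print(round(gc_content(toy), 1))
    print(windowed_gc(toy, window=10, step=5))
```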
Figure: number of species in each GC class plotted against GC content (%), from 20 to 80, for vertebrates, invertebrates, plants, and bacteria.
GC content varies across genomes
Fig. 12.16
Page 428
Viruses are small, infectious, obligate intracellular parasites. They depend on host cells to replicate. Because they lack the resources for independent existence, they exist on the borderline of the definition of life.
The virion (virus particle) consists of a nucleic acid genome surrounded by coat proteins (capsid) that may be enveloped in a host-derived lipid bilayer.
Viral genomes consist of either RNA or DNA. They may be single-stranded, double-stranded, or partially double-stranded. The genomes may be circular, linear, or segmented.
Introduction to viruses
Page 437
Viruses have been classified by several criteria:
-- based on morphology (e.g. by electron microscopy)
-- by type of nucleic acid in the genome
-- by size (rubella is about 2 kb; HIV-1 about 9 kb; poxviruses are several hundred kb). Mimivirus (for Mimicking microbe) has a double-stranded circular genome of 800 kb.
-- based on human disease
Page 438
Introduction to viruses
Fig. 13.2
Page 440
The International Committee on Taxonomy of Viruses (ICTV) offers a website, accessible via NCBI’s Entrez site
http://www.ncbi.nlm.nih.gov/ICTVdb/
Vaccine-preventable viral diseases include:
Hepatitis A
Hepatitis B
Influenza
Measles
Mumps
Poliomyelitis
Rubella
Smallpox
Page 441
Introduction to viruses
Some of the outstanding problems in virology include:
-- Why does a virus such as HIV-1 infect one species (human) selectively?
-- Why do some viruses change their natural host? In 1997 a chicken influenza virus killed six people.
-- Why are some viral strains particularly deadly?
-- What are the mechanisms of viral evasion of the host immune system?
-- Where did viruses originate?
Bioinformatic approaches to viruses
Page 439-441
The unique nature of viruses presents special challenges to studies of their evolution.
• viruses tend not to survive in historical samples
• viral polymerases of RNA genomes typically lack proofreading activity
• viruses undergo an extremely high rate of replication
• many viral genomes are segmented; shuffling may occur
• viruses may be subjected to intense selective pressures (host immune responses, antiviral therapy)
• viruses invade diverse species
• the diversity of viral genomes precludes us from making comprehensive phylogenetic trees of viruses
Diversity and evolution of viruses
Page 441
Herpesviruses are double-stranded DNA viruses that include herpes simplex, cytomegalovirus, and Epstein-Barr.
Phylogenetic analysis suggests three major groups that originated about 180-220 MYA.
Bioinformatic approaches to herpesvirus
Page 442
Consider human herpesvirus 8 (HHV-8). Its genome is about 140,000 base pairs and encodes about 80 proteins.
We can explore this virus at the NCBI website. Try NCBI > Entrez > Genomes > viruses > dsDNA
Bioinformatic approaches to herpesvirus
Page 442
Consider human herpesvirus 8 (HHV-8). Its genome is about 140,000 base pairs and encodes about 80 proteins.
Microarrays have been used to define changes in viral gene expression at different stages of infection (Paulose-Murphy et al., 2001). Conversely, gene expression changes have been measured in human cells following viral infection.
Bioinformatic approaches to herpesvirus
Page 442
Fig. 13.11
Page 450
Paulose-Murphy et al. (2001) described HHV-8 viral genes that are expressed at different times post infection
Human Immunodeficiency Virus (HIV) is the cause of AIDS. At the end of the year 2002, 42 million people were infected. HIV-1 and HIV-2 are primate lentiviruses.
The HIV-1 genome is 9181 bases in length. Note that there are almost 100,000 Entrez nucleotide records for this genome (but only one RefSeq entry).
Phylogenetic analyses suggest that HIV-2 appeared as a cross-species transmission from a simian virus, SIVsm (sooty mangabey). Similarly, HIV-1 appeared from simian immunodeficiency virus of the chimpanzee (SIVcpz).
Bioinformatic approaches to HIV
Page 446
Two major resources are NCBI and the Los Alamos National Laboratory (LANL) databases.
See http://hiv-web.lanl.gov/
LANL offers:
-- an HIV BLAST server
-- a synonymous/non-synonymous analysis program
-- a multiple alignment program
-- a PCA-like tool
-- a geography tool
Bioinformatic approaches to HIV
Page 453
Bacteria and archaea constitute two of the three main branches of life. Together they are the prokaryotes.
We can classify prokaryotes based on six criteria:
[1] morphology
[2] genome size
[3] lifestyle
[4] relevance to human disease
[5] molecular phylogeny (rRNA)
[6] molecular phylogeny (other molecules)
Bacteria and archaea: genome analysis
Page 466
Fig. 14.2
Page 470
M. genitalium has one of the smallest bacterial genome sizes. View its genome at www.tigr.org
We may distinguish six prokaryotic lifestyles:
[1] Extracellular (e.g. E. coli)
[2] Facultatively intracellular (e.g. Mycobacterium tuberculosis)
[3] Extremophilic (e.g. M. jannaschii)
[4] Epicellular (e.g. Mycoplasma pneumoniae)
[5] Obligate intracellular and symbiotic (e.g. B. aphidicola)
[6] Obligate intracellular and parasitic (e.g. Rickettsia)
Bacteria and archaea: lifestyles
Page 472
DNA sequence of both chromosomes of the cholera pathogen Vibrio cholerae
Nature 406, 477-483 (2000)
Four main features of genomic DNA are useful:
[1] Open reading frame length
[2] Consensus for ribosome binding (Shine-Dalgarno)
[3] Pattern of codon usage
[4] Homology of putative gene to other genes
Bacteria and archaea: finding genes
Page 480
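Feature [1], open reading frame length, is straightforward to illustrate. The sketch below scans all six reading frames for stretches that begin with ATG and end at the first in-frame stop codon, keeping those above a length threshold. It is a simplified illustration under stated assumptions: only ATG is treated as a start codon (bacteria also use alternatives such as GTG), the standard stop codons are assumed, and the function names and threshold are hypothetical. Real gene finders combine ORF length with the other three features.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string containing only A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 100) -> list[tuple[str, int, int, str]]:
    """Return (strand, start, end, orf) for each ATG...stop ORF.

    min_codons counts the start codon but not the stop codon.
    Coordinates are 0-based, end-exclusive, on the scanned strand;
    for strand "-" they refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # resume the search after this stop codon
    return orfs


if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"  # hypothetical sequence
    print(find_orfs(toy, min_codons=5))
```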
COGs database: distribution of COGs by number of species
COGs database: distribution of COGs by number of clades...
How can whole genomes be compared?
-- molecular phylogeny
-- You can BLAST (or PSI-BLAST) all the DNA and/or protein in one genome against another
-- TaxPlot and COG for bacterial (and for some eukaryotic) genomes
-- PipMaker, MUMmer and other programs align large stretches of genomic DNA from multiple species
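Programs that align whole genomes typically start from short exact matches that anchor the alignment; MUMmer, for example, is built around maximal unique matches. The sketch below is a much cruder stand-in, not MUMmer's algorithm: it lists fixed-length k-mers that occur exactly once in each of two sequences. The function names, the choice of k, and the toy sequences are hypothetical.

```python
from collections import defaultdict


def kmer_positions(seq: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer in seq to the list of positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def unique_shared_kmers(a: str, b: str, k: int) -> list[tuple[int, int, str]]:
    """Return (pos_in_a, pos_in_b, kmer) for k-mers unique in both sequences."""
    pa = kmer_positions(a.upper(), k)
    pb = kmer_positions(b.upper(), k)
    return sorted(
        (pa[kmer][0], pb[kmer][0], kmer)
        for kmer in pa
        if len(pa[kmer]) == 1 and len(pb.get(kmer, ())) == 1
    )


if __name__ == "__main__":
    genome_a = "TTACGTGG"  # hypothetical sequences
    genome_b = "CCACGTAA"
    print(unique_shared_kmers(genome_a, genome_b, k=4))
```

Sorted by position, such anchors give a rough picture of which regions of two genomes correspond and whether their order is conserved.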