Genome Biology and Biotechnology
2. The genome structures of invertebrates
Prof. M. Zabeau
Department of Plant Systems Biology
Flanders Interuniversity Institute for Biotechnology (VIB)
University of Gent
International course 2005
Sequenced genomes of invertebrates
¤ Nematodes
  – Caenorhabditis elegans (1998)
  – Caenorhabditis briggsae (2003)
¤ Insects
  – Drosophila melanogaster, fruit fly (2000)
  – Drosophila pseudoobscura, fruit fly (2005)
  – Anopheles gambiae, mosquito (2002)
  – Bombyx mori, silkworm (2004)
¤ Tunicates: ancestral vertebrate genome
  – Ciona intestinalis (2002)
Phylogeny of the invertebrates
[Figure: phylogenetic tree with divergence times marked at 550 MY, ~800 MY and >1000 MY]
Genome Sequence of the Nematode C. elegans
¤ Paper presents
  – The first complete genome sequence of a multicellular organism
    • The initial sequence covered 97 Mbp (6 gaps)
    • The complete sequence (June 2003) comprises 100.2 Mbp without gaps
The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
Protein coding Genes
¤ First large-scale genome sequence annotation
  – Gene structure predictions were based on EST and protein similarities
    • Only 40% of the predicted genes had a confirming EST match
¤ The first annotation predicted 19,099 genes
  – An average density of 1 predicted gene per 5 kb
  – 27% of the genome resides in predicted exons
  – Each gene has an average of five introns
  – WormBase: updated and manually curated gene set
    • Currently contains 18,808 genes
Reprinted from: The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
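The density figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes the 97 Mbp initial sequence as the denominator; the function and constant names are illustrative and do not come from the paper.

```python
# Figures quoted on the slide (1998 annotation).
GENOME_BP = 97_000_000   # initial sequence, ~97 Mbp (assumed denominator)
PREDICTED_GENES = 19_099
EXON_FRACTION = 0.27     # share of the genome in predicted exons


def kb_per_gene(genome_bp: int, n_genes: int) -> float:
    """Average genomic span available per predicted gene, in kilobases."""
    return genome_bp / n_genes / 1000


def exon_bp_per_gene(genome_bp: int, n_genes: int, exon_fraction: float) -> float:
    """Average number of exonic base pairs per predicted gene."""
    return genome_bp * exon_fraction / n_genes


density = kb_per_gene(GENOME_BP, PREDICTED_GENES)
exonic = exon_bp_per_gene(GENOME_BP, PREDICTED_GENES, EXON_FRACTION)
print(f"about 1 gene per {density:.1f} kb, about {exonic:.0f} bp of exon per gene")
```

The first number agrees with the "1 predicted gene per 5 kb" stated above; the exonic length per gene is only a derived estimate.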
RNA genes and repetitive sequences
¤ RNA genes
  – rRNA genes: occur in long tandem arrays
  – tRNA genes: 659 tRNA genes occur widely dispersed
  – Noncoding RNA genes: in dispersed multigene families
  – Micro RNA genes (miRNA)
    • ~100 identified to date
¤ Repetitive Sequences
  – Dispersed repeat sequences
    • Most of them are associated with transposons of C. elegans which are probably no longer active in the genome
  – Local repeat sequences
    • Tandem, inverted, or simple sequence repeats
Reprinted from: The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
Chromosome Structure and Organization
¤ The genome structure is remarkably uniform
  – Gene density is fairly constant across the chromosomes
  – No localized centromeres
    • The chromosomes are holocentric, in contrast to most other eukaryotes (including yeast), which have localized centromeres
¤ Differences between the central portion and the arms of the chromosomes
  – The conserved eukaryotic genes are in the central portion
  – Repetitive DNA is more prevalent in the arms
  – Meiotic recombination is much higher on the chromosome arms
  – This suggests that DNA in the arms might be evolving more rapidly than in the central regions
Reprinted from: The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
Distribution of sequence elements on Chromosome I
[Figure: tracks along chromosome I (arm, central part, arm) for predicted genes, EST matches, yeast similarities, inverted repeats, tandem repeats and TTAGGC repeats]
Reprinted from: The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
Conclusions
¤ The complete sequence of the C. elegans genome has
  – provided a basis for the discovery of all the genes of a multicellular eukaryotic organism
    • First inventory of eukaryotic genes
¤ C. elegans is a very effective model organism for
  – eukaryotic gene analysis: widely used for functional genomics
  – human disease gene research
  – nematode pest control research
Reprinted from: The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
The Genome Sequence of Caenorhabditis briggsae: A Platform for Comparative Genomics
¤ Paper presents
  – High-quality draft (>10-fold coverage) sequence of C. briggsae
  – Comparative genome analysis of C. briggsae and C. elegans
    • The two species diverged ~100 million years ago
    • Morphologically indistinguishable
    • Same chromosome number (5 autosomes plus X) and similar genome size (104 and 100 Mb)
  – Comparison of the genomes of related species allows
    • More precise annotation of protein-coding genes
    • Discovery of noncoding genes, regulatory sequences and "unknown" functional elements
Stein et al., PLoS Biol 1: 166-192 (2003)
Colinearity of the C. briggsae and C. elegans Genomes
¤ Alignment of sequences
  – ~80% colinearity
    • inversions and translocations
  – blocks of synteny
    • orthologous genes
Reprinted from: Stein et al., PLoS Biol 1: 166-192 (2003)
Annotation of Protein-Coding Genes
¤ Concordance of gene predictions refines gene models
  – C. elegans gene annotation improvement
    • >6,000 genes (30%) with exon additions, deletions or alterations
    • 1,300 new genes
    • 18,808 protein-coding genes in C. elegans
    • 19,507 protein-coding genes in C. briggsae
[Figure label: Most concordant]
Comparison of Protein-Coding Genes
¤ ~65% are orthologs in C. briggsae / C. elegans
  – gene pairs with a one-to-one correspondence in the two species
    • have a common ancestor
    • have similar gene and coding sequence lengths
    • show ~80% identity at the protein level
¤ ~25% are paralogs in C. briggsae / C. elegans
  – proteins with multiple BLASTP matches in the other species
    • Evolving gene families
¤ ~5% are orphans in C. briggsae / C. elegans
  – proteins that have no BLASTP matches in the other species
    • 807 genes in C. elegans and 1,061 genes in C. briggsae
    • Novel genes or pseudogenes?
Reprinted from: Stein et al., PLoS Biol 1: 166-192 (2003)
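The three classes above can be sketched as a rule on the number of BLASTP matches per protein. This is a minimal illustration, assuming the significant matches have already been collected into dictionaries; the function name, the input layout and the protein identifiers are hypothetical, and the paper's actual ortholog assignment may have used additional criteria.

```python
def classify(proteins_a, hits_a_to_b, hits_b_to_a):
    """Label each protein of species A as 'ortholog', 'paralog' or 'orphan'.

    hits_a_to_b maps a protein of species A to the list of its significant
    matches in species B; hits_b_to_a is the same in the other direction.
    """
    labels = {}
    for protein in proteins_a:
        matches = hits_a_to_b.get(protein, [])
        if not matches:
            # No match in the other species.
            labels[protein] = "orphan"
        elif len(matches) == 1 and hits_b_to_a.get(matches[0], []) == [protein]:
            # One-to-one correspondence in both directions.
            labels[protein] = "ortholog"
        else:
            # Multiple matches in at least one direction: a gene family.
            labels[protein] = "paralog"
    return labels


# Toy input with made-up identifiers.
a_to_b = {"ce1": ["cb1"], "ce2": ["cb2", "cb3"], "ce3": ["cb2"], "ce4": []}
b_to_a = {"cb1": ["ce1"], "cb2": ["ce2", "ce3"], "cb3": ["ce2"]}
labels = classify(["ce1", "ce2", "ce3", "ce4"], a_to_b, b_to_a)
print(labels)
```

Here ce3 has a single match, but that match also hits ce2, so the pair is not one-to-one and ce3 is counted with the gene families.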
Conservation of Operon Structure
¤ C. elegans is unusual among animals in having operons
  – co-transcribed genes that make a polycistronic pre-mRNA
    • subsequently separated into single-gene mRNAs by trans-splicing
  – ~15% of C. elegans genes are encoded in ~1000 operons
    • contain 2–8 genes
  – 96% of the operons are preserved intact in the C. briggsae genome
¤ C. elegans operons comprise
  – co-regulated genes encoding proteins with related functions
  – specific functional classes of genes
    • Transcription
    • RNA splicing
    • Translation
    • RNA degradation
Reprinted from: Stein et al., PLoS Biol 1: 166-192 (2003)
Repetitive sequences
¤ The different genome sizes result from
  – Differences in repeat content
    • 23.3 Mbp of the C. briggsae genome (104 Mbp)
    • 16.5 Mbp of the C. elegans genome (100.3 Mbp)
¤ Repeated DNA families
  – comprise DNA transposons or tandem arrays
  – Not orthologous between the two genomes
    • suggests that most repeat elements in the two genomes postdate the divergence of the two species
  – Accumulation of new repetitive elements is balanced by deletions so that
    • genome sizes remain similar
Reprinted from: Stein et al., PLoS Biol 1: 166-192 (2003)
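As a quick check of the repeat figures above, the following sketch computes the repeat fraction of each genome and compares the difference in repeat content with the difference in genome size. Only the four numbers from the slide are used; the variable names are illustrative.

```python
# (repeat content in Mbp, genome size in Mbp), as quoted on the slide.
genomes = {
    "C. briggsae": (23.3, 104.0),
    "C. elegans": (16.5, 100.3),
}


def repeat_fraction(repeat_mbp: float, genome_mbp: float) -> float:
    """Share of the genome occupied by repeats."""
    return repeat_mbp / genome_mbp


fractions = {name: repeat_fraction(*values) for name, values in genomes.items()}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%} repeats")

# Compare the extra repeat DNA with the extra genome size in C. briggsae.
repeat_difference = genomes["C. briggsae"][0] - genomes["C. elegans"][0]
size_difference = genomes["C. briggsae"][1] - genomes["C. elegans"][1]
print(f"repeat difference: {repeat_difference:.1f} Mbp, "
      f"size difference: {size_difference:.1f} Mbp")
```

With these numbers the extra repeat content in C. briggsae is larger than the difference in total genome size, which is consistent with repeats accounting for the size difference.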
Chromosome Structure and Organization
¤ The centers contain orthologous (1) and essential genes (2)
  – Very long synteny blocks
¤ The arms contain orphan genes (3) and repetitive elements (4)
  – Short synteny blocks
  – The arms of the chromosomes are evolving more rapidly than the centers
[Figure: panels 1 to 4, matching the numbers in the text]
Reprinted from: Stein et al., PLoS Biol 1: 166-192 (2003)
Conclusions
¤ C. briggsae/C. elegans comparison shows that
  – despite large differences at the genomic level, C. briggsae and C. elegans are morphologically almost indistinguishable
  – many protein families are very dynamic
    • ~200 families have expanded or contracted by >2-fold
    • several hundred families are either novel or have diverged extensively
  – the two genomes share only ~50% of the non-coding sequence
¤ Sequencing of additional species is necessary to
  – identify candidate cis-regulatory elements based on sequence conservation
    • the noise level in a two-way comparison is too high
Reprinted from: Stein et al., PLoS Biol 1: 166-192 (2003)
The Genome Sequence of Drosophila melanogaster
¤ Draft sequence (2000)
  – Whole-genome shotgun sequencing
    • Sequence contained 128 physical gaps and 1630 sequence gaps
  – Some regions were of poor sequence quality
  – Demonstrated that whole-genome shotgun sequencing can be used for large eukaryotic genomes
    • Adams et al., Science, 287, 2185 (2000)
¤ Finished sequence (2002)
  – BAC clone sequencing and gap filling
  – Sequence contains 7 physical gaps and 37 sequence gaps
  – Very accurate sequence: error rate of < 1 in 100,000
    • Celniker et al., Genome Biol. 3: research0079.1–0079.14 (2002)
The Drosophila Genome
¤ The (female) Drosophila genome is ~176 Mb in size
  – Euchromatic part: 117 Mb, completely sequenced
  – Heterochromatic part: partly (~20 Mb) sequenced (unassembled)
    • Female: estimated at ~59 Mb
    • Male: the 40 Mb Y chromosome is completely heterochromatic
Reprinted from: Adams et al., Science, 287, 2185 (2000)
Euchromatin and Heterochromatin
¤ Euchromatin
  – Gene-rich portion of the genome
  – Condenses during mitosis and de-condenses thereafter
  – Portion of the genome that can be cloned stably in BACs
¤ Heterochromatin
  – Consists mainly of simple sequence repeats (satellite DNAs), transposable elements, and tandem arrays of rRNA genes
  – Remains condensed after mitosis
  – Gene-poor portion of the genome
  – Contains elements required for centromere function
¤ Euchromatin - heterochromatin transition
  – is gradual at the molecular level
Reprinted from: Adams et al., Science, 287, 2185 (2000)
Euchromatic Genome Sequence
[Figure: euchromatic genome sequence; labels: transposons, centromere]
Reprinted from: Celniker et al., Genome Biol. 3: research0079.1–0079.14 (2002)
Gene Content of the Drosophila Genome
¤ Annotation of the draft genome sequence
  – Predicted 13,601 genes
    • >10,000 genes (>75%) supported by EST and protein matches
    • This annotation was incomplete
  – Large number of sequence gaps and sequencing errors
¤ Annotation of the finished genome sequence
  – Predicted a similar number of genes: 13,676
    • Majority (85%) of the gene models revised
  – Improved by a collection of 250,000 ESTs and full-length cDNAs
  – Found only 17 pseudogenes (far fewer than in C. elegans)
  – Heterochromatic part may contain ~500 genes
    • The 20 Mb sequenced contains ~300 protein-coding genes
  – Reannotation reveals many complex gene models
    • genes that do not fit the simple 5’UTR – exons – 3’UTR model
Reprinted from: Adams et al., Science, 287, 2185 (2000)
Complex Gene models
¤ Alternative splicing or alternative polyadenylation
  – At least ~20% of genes have >1 predicted transcript
    • 65% encode two or more protein products
    • 35% differ in the UTRs; most have different 5’UTRs: alternative promoters
Reprinted from: Misra et al., Genome Biology, 3: research0083.1-0083.22 (2002)
Complex Gene models
¤ Dicistronic genes: 2 non-overlapping coding regions on one mRNA
  – 31 dicistronic gene pairs were found; this is probably an underestimate
Reprinted from: Misra et al., Genome Biology, 3: research0083.1-0083.22 (2002)
Complex Gene models
¤ Overlapping genes
  – overlap of mRNAs on opposite strands: 15% of the genes
¤ Nested genes
  – genes included within introns of other genes: 15% of the genes
Reprinted from: Misra et al., Genome Biology, 3: research0083.1-0083.22 (2002)
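The two definitions above translate directly into interval tests. The sketch below is an illustration only: the `Gene` record, the closed-interval coordinates and the example genes are hypothetical, and the overlap test works on whole transcript spans rather than on spliced mRNAs.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    name: str
    start: int   # first base of the transcript (closed interval)
    end: int     # last base of the transcript
    strand: str  # "+" or "-"
    introns: list = field(default_factory=list)  # list of (start, end) pairs


def overlaps_on_opposite_strands(a: Gene, b: Gene) -> bool:
    """True if the two transcript spans overlap and lie on opposite strands."""
    return a.strand != b.strand and a.start <= b.end and b.start <= a.end


def is_nested(inner: Gene, outer: Gene) -> bool:
    """True if `inner` lies entirely within one intron of `outer`."""
    return any(s <= inner.start and inner.end <= e for s, e in outer.introns)


# Made-up coordinates for illustration.
host = Gene("host", 100, 5000, "+", introns=[(1000, 3000)])
guest = Gene("guest", 1200, 2500, "-")
neighbour = Gene("neighbour", 4800, 6000, "-")

print(is_nested(guest, host))
print(overlaps_on_opposite_strands(host, neighbour))
print(is_nested(neighbour, host))
```

A nested gene on the opposite strand also satisfies the span-based overlap test, so a real annotation pipeline would need to compare exon coordinates to keep the two categories apart.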
Conclusions
¤ The Drosophila genome sequence reveals
  – genes and proteins common to all multicellular organisms
    • proteins involved in transcription control and metabolism are very similar to their human counterparts
¤ Drosophila provides an experimental platform for
  – the study of human disease genes involved in
    • DNA replication and repair
    • Metabolism of drugs and toxins
Reprinted from: Adams et al., Science, 287, 2185 (2000)
Comparative genome sequencing of Drosophila pseudoobscura: Chromosomal, gene, and cis-element evolution
¤ Paper presents
  – High-quality draft genome sequence of a second Drosophila species, Drosophila pseudoobscura
  – Comparison with the genome sequence of D. melanogaster
    • Evolutionary distance is well suited to study
      – Conserved and diverged genes
      – Conserved regulatory elements
      – Mechanisms of genome rearrangement
Richards et al., Genome Res. 15: 1-18 (2005)
The D. pseudoobscura genome
¤ The euchromatic part is estimated at 131 Mb
  – ~17% larger than that of D. melanogaster
  – the additional sequence is
    • primarily found in the intergenic regions
    • only partly caused by expansion of repeated DNA
¤ The two species show a very high gene synteny
  – Synteny blocks were identified
    • on the basis of conservation of protein order
    • ~10,500 of ~14,000 genes are true orthologs
  – All synteny blocks are short and extremely mixed
    • extensive genome rearrangement in the two Drosophila lineages
Reprinted from: Richards et al., Genome Res. 15: 1-18 (2005)
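Identifying synteny blocks from conserved protein order can be illustrated with a very small sketch. It assumes each gene of one species has already been assigned the position of its ortholog in the other species, and it treats a block as a run of genes whose ortholog positions are adjacent, in either orientation. The function name and input encoding are hypothetical; the method in the paper is more elaborate.

```python
def synteny_blocks(b_positions):
    """Split the gene order of species A into runs that are also adjacent in B.

    b_positions[i] is the position in species B of the ortholog of the i-th
    gene of species A. A run continues while positions change by +1 (same
    orientation) or by -1 (inverted segment), consistently within the run.
    """
    blocks = []
    if not b_positions:
        return blocks
    current = [b_positions[0]]
    step = None  # orientation of the current run, unknown until two genes are seen
    for prev, pos in zip(b_positions, b_positions[1:]):
        delta = pos - prev
        if delta in (1, -1) and step in (None, delta):
            current.append(pos)
            step = delta
        else:
            # Adjacency is broken: close the block and start a new one.
            blocks.append(current)
            current = [pos]
            step = None
    blocks.append(current)
    return blocks


# Toy gene order: a colinear segment, an inverted segment, a moved segment.
blocks = synteny_blocks([0, 1, 2, 7, 6, 5, 3, 4])
print(blocks)  # [[0, 1, 2], [7, 6, 5], [3, 4]]
```

Many short blocks in mixed order, as described above, would show up here as a long list of small runs with alternating orientation.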
The synteny between D. pseudoobscura and D. melanogaster
¤ The great majority of syntenic blocks are found
  – on the same chromosome arms in the two species
  – Chromosomal rearrangements in the two species
    • Almost exclusively paracentric inversions
Reprinted from: Richards et al., Genome Res. 15: 1-18 (2005)
Intraspecific inversion breakpoints
¤ Repetitive sequences at the inversion breakpoints
  – Frequently comprise a breakpoint motif
  – Only found in D. pseudoobscura
[Figure label: breakpoint motifs]
Reprinted from: Richards et al., Genome Res. 15: 1-18 (2005)
Conservation of gene segments
¤ Sequence conservation in noncoding regions
  – Is insufficient for the identification of regulatory sequences
  – Multiple genome sequence alignments will be needed
Reprinted from: Richards et al., Genome Res. 15: 1-18 (2005)
The Genome Sequence of the Malaria Mosquito Anopheles gambiae
¤ The papers present
  – Draft genome sequence of the PEST strain of A. gambiae
  – A comparison of the genomes and proteomes of Anopheles and Drosophila
    • Two very different Diptera that diverged ~250 MY ago
Sequence: Holt et al., Science, 298: 129-149 (2002)
Comparison: Zdobnov et al., Science, 298, 149 (2002)
Reprinted from: Holt et al., Science, 298: 129-149 (2002)
The Mosquito Genome Sequence
¤ The draft genome spans 278 Mb
  – Covers the entire genome including the heterochromatic DNA
  – Mosquitoes have larger genomes than Drosophila
    • estimates from 250 to 500 Mb
    • Transposable elements constitute ~16% of the genome
  – Drosophila experienced a recent genome size reduction
¤ The predicted number of genes is ~14,000
  – Very similar to Drosophila
¤ The comparison of the Anopheles and Drosophila genomes and proteomes reveals
  – considerable similarities and numerous differences
  – these reflect selection and adaptation to different ecologies and life strategies
Reprinted from: Zdobnov et al., Science, 298, 149 (2002)
Similarity at the protein level
¤ Identified 4 protein classes
  – True orthologs: ~45% (~6,000)
    • Exhibit 1:1 relationship
    • Genes with conserved function
  – Paralogs: ~12%
    • Duplicated genes
  – Homologs: ~25%
    • Unclear relationship
  – Orphans: 11% to 18%
    • New genes
    • Rapidly evolving genes
The core of conserved proteins
¤ Dynamics of Gene Structure in a span of 250 MY
  – Exon lengths and intron frequencies are similar
  – Introns in Drosophila have half the length of those in Anopheles
    • systematic reduction of noncoding regions in Drosophila
  – Only 50% of the introns are perfectly conserved
    • one intron gain or loss per gene per 125 MY
  – Intron sequences diverge rapidly
    • sequence similarity in <2% of the equivalent introns
Reprinted from: Zdobnov et al., Science, 298, 149 (2002)
Family expansions and reductions
¤ Increases and decreases in protein families
  – Related to adaptations to life strategies and environment
¤ Expansions or reductions are
  – Uneven: a single gene in one species has many paralogs in the other
  – More frequent in Anopheles
  – Examples:
    • Cuticular proteins
    • Innate immunity genes
      – FBN-like (fibrinogen) proteins massively expanded in Anopheles
Reprinted from: Zdobnov et al., Science, 298, 149 (2002)
Genome Rearrangements
¤ Microsynteny
  – 34% of the orthologs map to ~1000 microsynteny blocks
    • 2-3 genes per block (cf. fish-human)
¤ Macrosynteny
  – Both species have five major chromosomal arms
  – Clear 1:1 homologies between the chromosomal arms
    • Inversions much more frequent than translocations
The Draft Genome of Ciona intestinalis: Insights into chordate and vertebrate origins
¤ Paper presents
  – Draft genome sequence of Ciona intestinalis, an ancestral chordate
  – Chordates appear in the fossil record at the Cambrian explosion
    • ~550 million years ago
Dehal et al., Science, 298, 2157-2167 (2002)
[Figure: phylogenetic tree showing the position of the tunicates, with the 550 MY divergence marked]
Reprinted from: Dehal et al., Science, 298, 2157-2167 (2002)
Ciona intestinalis
¤ Sessile, hermaphroditic marine invertebrates
¤ Adults are simple filter feeders
  – Encased in a fibrous tunic
[Figure: adult and juvenile; the juvenile shows the internal structures: ds, digestive system; es, endostyle; ht, heart; os, neuronal complex; pg, pharyngeal gill]
Reprinted from: Dehal et al., Science, 298, 2157-2167 (2002)
Gene content and global comparisons
¤ Predicted ~16,000 gene models
  – 75% of the predicted genes are supported by EST evidence
  – Genes are compact and densely packed: one gene per 7.5 kb
¤ Global comparisons
  – 60% of the genes have a detectable fly or worm homolog
  – 20% of the genes have no clear homolog
    • tunicate-specific genes
  – 17% of the genes have a vertebrate homolog but no detectable fly or worm homolog
    • Many are single-copy genes for the vertebrate gene families
      – signalling and regulatory processes in development
  – The gene content is a reasonable approximation of the ancestral chordate
Future Perspectives
¤ Invertebrate genomes are being sequenced at a rapid pace
  – Worms: 10 species of medical and agricultural importance
    • Schistosoma, Ancylostoma, Ascaris, Globodera, Meloidogyne
  – Insects: ~20 species of primarily agricultural importance
    • Mosquitoes, honey bee, Lepidoptera and >10 Drosophila species
  – Protozoa: several species of medical importance
    • Trypanosoma, Theileria, Plasmodium, Leishmania, …
  – Broad range of species
    • Sponge, sea urchin, Daphnia, Hydra, snail, lamprey, …
¤ Source: GOLD™ Genomes OnLine Database
  – http://www.genomesonline.org/
Recommended reading
¤ The nematode genome sequence
    • The C. elegans Sequencing Consortium, Science, 282, 2012 (1998)
¤ The Drosophila genome sequence
    • Adams et al., Science, 287, 2185 (2000)
Further reading
¤ Nematode genomes
  – C. briggsae:
    • Stein et al., PLoS Biol 1: 166-192 (2003)
¤ Insect genomes
  – Finished Drosophila genome sequence:
    • Celniker et al., Genome Biol. 3: research0079.1–0079.14 (2002)
  – Annotation of the Drosophila genome:
    • Misra et al., Genome Biology, 3: research0083.1-0083.22 (2002)
  – Draft Drosophila pseudoobscura genome sequence:
    • Richards et al., Genome Res. 15: 1-18 (2005)
  – Draft mosquito genome sequence:
    • Holt et al., Science, 298: 129-149 (2002)
    • Zdobnov et al., Science, 298, 149 (2002)
¤ Ciona genome
    • Dehal et al., Science, 298, 2157-2167 (2002)