Genome Biology and Biotechnology

5. The genome structures of plants

Prof. M. Zabeau
Department of Plant Systems Biology
Flanders Interuniversity Institute for Biotechnology (VIB)
University of Gent

International course 2005

Sequenced genomes of invertebrates and plants

¤ Completed plant genomes
  – Arabidopsis thaliana
  – Oryza sativa (rice)
    • Draft genome sequences
    • Finished chromosomes

¤ Genome sequencing in progress
  – Poplar (draft sequence completed)
  – Medicago (in progress)
  – Tomato (in progress)
  – Maize (started)

Phylogeny of the flowering plants

[Figure: phylogeny of the flowering plants. Labels: Monocots, Dicots, ~250 MY]

Analysis of the genome sequence of the flowering plant Arabidopsis thaliana

¤ Plants and animals evolved independently from unicellular eukaryotes and represent contrasting life forms
  – The worm and fly genomes revealed the common genetic basis of developmental and physiological processes in multicellular organisms
  – The genome sequence of a plant provides a glimpse of the genetic basis of differences between plants and other eukaryotes
  – The Arabidopsis sequence is among the most accurately sequenced genomes (error rate < 1:100,000)

The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

The Arabidopsis Genome Sequence

¤ The complete genome size is estimated at ~125 Mb
  – The total length of the sequenced regions is 115.409 Mb
  – The unsequenced centromeres and rRNA repeat regions (chr. 2 & 4) are estimated at 10 Mb

¤ General features such as gene density and repeat distribution are very consistent across the five chromosomes

Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Representation of the Arabidopsis Chromosomes

[Figure: the five Arabidopsis chromosomes, showing telomeres, centromeres and the rDNA repeats, with density tracks for protein genes, ESTs, transposons, mitochondrial/chloroplast insertions and RNA genes]

Chr. 1: 29.1 Mb
Chr. 2: 19.6 Mb
Chr. 3: 23.2 Mb
Chr. 4: 17.5 Mb
Chr. 5: 26.0 Mb
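As a quick consistency check, the five chromosome lengths from the figure can be summed and compared with the sequenced total (about 115.4 Mb) and the ~125 Mb genome estimate quoted earlier. The sketch below uses only numbers from the slides; the variable names are illustrative.

```python
# Sequenced length per chromosome in Mb, as labelled in the chromosome figure.
CHROMOSOME_MB = {1: 29.1, 2: 19.6, 3: 23.2, 4: 17.5, 5: 26.0}

# Unsequenced centromeres and rRNA repeat regions, estimated at 10 Mb in the slides.
UNSEQUENCED_MB = 10.0

sequenced_mb = sum(CHROMOSOME_MB.values())
estimated_genome_mb = sequenced_mb + UNSEQUENCED_MB

print(f"sequenced: {sequenced_mb:.1f} Mb")
print(f"estimated genome size: {estimated_genome_mb:.1f} Mb")
```

The rounded chromosome lengths add up to the sequenced total, and adding the unsequenced regions lands close to the ~125 Mb estimate.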

Representation of Arabidopsis Chromosome 1

[Figure: feature map of chromosome 1, with the pericentromeric region marked]

Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Coding Gene Content

¤ AGI annotation predicted 25,498 genes
  – Non-homogeneous annotation: performed by different groups

¤ Re-annotation estimates 28,000 to 29,000 genes
  – Larger than C. elegans (19,099) and D. melanogaster (13,601)
  – Larger gene set results from numerous gene duplications

¤ MIPS classification of Arabidopsis proteins in 12 functional categories (cf. yeast)
  – ~70% classified according to sequence similarity to proteins of known function in all organisms
    • 9% experimentally characterized
  – ~30% could not be assigned to functional categories
    • Representing 10,000 “unknown genes”

Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Functional Analysis of Arabidopsis Genes

Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Comparison of Functional Categories

¤ Comparison of Arabidopsis genes with those of the other complete genomes reveals:
  – High conservation of eukaryotic gene function
    • >50% of the genes involved in protein synthesis have counterparts in the other eukaryotic genomes
  – Independent evolution of many plant gene families
    • Transcription factors: only 8–23% of Arabidopsis proteins involved in transcription have related genes in other eukaryotic genomes
  – Acquisition of bacterial genes
    • From the cyanobacterial ancestor of the plastid: on the order of 1,000 genes have been transferred over time from the organelle to the nuclear genome
      – Genes with high similarity to Synechocystis

Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

RNA Gene Content

¤ rRNA genes
  – Nucleolar organizers (NORs) on chromosomes 2 and 4 contain
    • 350–400 repeats of 10 kb encoding the 18S, 5.8S and 25S rRNA genes, comprising 3.5–4.0 Mb

¤ 5S rRNA genes
  – Tandem arrays in the centromeric regions of chr. 3, 4 and 5

¤ tRNA genes: dispersed organization
  – 589 cytoplasmic tRNAs, 27 organelle-derived tRNAs and 13 pseudogenes

¤ Spliceosomal RNAs, small nucleolar RNAs (snoRNAs)
  – Several copies occur dispersed on all chromosomes

Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Genome Duplication in Arabidopsis

¤ The Arabidopsis genome exhibits traces of extensive duplications
  – >75% of the Arabidopsis genes are duplicated
    • The fact that most genes are duplicated explains the higher gene number than in other organisms
  – Segmental duplications
    • Segmental duplications were first described in yeast
    • Identified 24 large duplicated segments of > 100 kb
      – These duplicated regions encompass 58% of the genome
  – Tandem gene arrays
    • Tandem arrays of genes are common in all genomes
    • 1,528 tandem arrays containing 4,140 individual genes
      – 17% of all genes of Arabidopsis are arranged in tandem arrays

Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Genome Organization and Duplication

¤ First analysis of segmental duplications
  – Detection of collinear clusters of genes using TBLASTX
    • This approach detects the “obvious” duplications
  – The proportion of homologous genes in each duplicated segment varies widely
    • Extensive loss or gain of genes after the segmental duplication occurred
  – Sequence conservation/divergence of the duplicated genes varies greatly
    • Duplications vary in age
      – suggesting several different large-scale duplication events
    • Duplications occurred between 75 and 200 million years ago
      – Earliest duplication coincides with the radiation of the flowering land plants
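The idea of a "collinear cluster" can be illustrated with a minimal sketch. It assumes the homology search has already been done and that each hit is reduced to a pair of gene ranks (position of the gene along segment A, position of its homolog along segment B). The function name `collinear_runs` and the parameters `max_gap` and `min_size` are hypothetical and are not taken from the cited study, which may have used a different procedure.

```python
from typing import List, Tuple

Pair = Tuple[int, int]  # (gene rank on segment A, gene rank on segment B)


def collinear_runs(pairs: List[Pair], max_gap: int = 3, min_size: int = 3) -> List[List[Pair]]:
    """Greedily group homologous gene pairs into collinear runs.

    A run is extended while both ranks increase by at most `max_gap`,
    so a few lost or inserted genes are tolerated. Only runs in the
    same orientation are detected; inverted segments are ignored.
    """
    runs: List[List[Pair]] = []
    current: List[Pair] = []
    for i, j in sorted(pairs):
        if current:
            prev_i, prev_j = current[-1]
            if 0 < i - prev_i <= max_gap and 0 < j - prev_j <= max_gap:
                current.append((i, j))
                continue
            # The run is broken: keep it only if it is long enough.
            if len(current) >= min_size:
                runs.append(current)
        current = [(i, j)]
    if len(current) >= min_size:
        runs.append(current)
    return runs


# Made-up example: two collinear clusters and one isolated hit.
example_pairs = [(1, 10), (2, 11), (4, 13), (5, 14), (20, 3), (30, 50), (31, 51), (32, 52)]
example_runs = collinear_runs(example_pairs)
```

A greedy scan like this only finds the "obvious" duplications mentioned above. Segments that have lost most of their homologs or have been rearranged fall below `min_size` and go undetected.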

Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Overall View of the Duplicated Regions

Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

Implications of Genomic Duplications

¤ What does the duplication in the Arabidopsis genome tell us about the evolution of the species?
  – Polyploidy occurs widely in plants but not in animals
    • The hypothesis is that Arabidopsis had one or more tetraploid ancestors
  – The majority of the Arabidopsis genome is represented in duplicated segments
    • Suggests that the duplicated segments arose from whole genome duplications
  – The long period of time (75 to 200 My) provided ample opportunity for
    • the divergence of the functions of the duplicated genes
  – Duplicated genes often have redundant functions
    • Majority of insertion mutants in Arabidopsis have no obvious phenotypic effect

Reprinted from: The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

The Origin of Genomic Duplications

¤ First detailed analysis of the duplications:
  – Vision et al., Science 290: 2114 (2000)
  – Identified 103 duplicated segments with >= 7 matching ORFs
    • 81% of the Arabidopsis genes fall within at least one block
  – The ages of the duplicated blocks were estimated from the average extent of amino acid substitution
  – The number of duplication events was estimated from the distribution of the estimated block ages
    • A single polyploidization event will produce a unimodal distribution of ages with homogeneity among blocks
    • Independent duplication events will produce a multimodal distribution
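The unimodal versus multimodal argument can be made concrete with a crude histogram check. This is only a sketch: the block ages below are made up for illustration, the helper name `count_modes` and the 20-unit bin width are assumptions, and this is not the statistical procedure used in the cited study.

```python
from typing import List


def count_modes(ages: List[float], bin_width: float = 20.0) -> int:
    """Count local maxima in a fixed-width histogram of block ages."""
    if not ages:
        return 0
    lowest = min(ages)
    n_bins = int((max(ages) - lowest) // bin_width) + 1
    counts = [0] * n_bins
    for age in ages:
        counts[int((age - lowest) // bin_width)] += 1
    # Pad with empty bins so the first and last bins can be peaks too.
    padded = [0] + counts + [0]
    modes = 0
    for k in range(1, len(padded) - 1):
        # A peak is higher than its left neighbour and not lower than its right one.
        if padded[k] > padded[k - 1] and padded[k] >= padded[k + 1]:
            modes += 1
    return modes


# Made-up ages (arbitrary units): one tight cluster versus two separate clusters.
single_event = [95, 98, 100, 101, 102, 105]
two_events = [98, 100, 102, 198, 200, 202]

modes_single = count_modes(single_event)
modes_two = count_modes(two_events)
```

With real data the answer depends strongly on the bin width and on the noise in the age estimates, which is one reason the number of inferred events differs between studies.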

Reprinted from:

Age Classes of Duplicated Blocks

¤ Distribution of divergence suggests 4 duplication events
  – Classes C through F yield age estimates of 100, 140, 170, and 200 Mya
    • Age class C, the most recent, comprises 50% of the duplicated segments
    • Age class F predates the divergence of monocots and dicots, 180 to 220 Mya

Reprinted from: Vision et al, Science 290: 2114 (2000)

The Origin of Genomic Duplications

¤ Recent study of the Arabidopsis genome duplications
  – Simillion et al., PNAS 99: 13627 (2002)
  – More refined algorithms detect degenerated block duplications
    • Degeneration results from extensive gene loss and subsequent reshuffling of gene order
    • Algorithms detect hidden duplications missed in earlier studies
  – Study revealed a much larger number of duplications
    • 304 nonhidden duplications and 53 hidden duplications
      – Comprising 82% of all genes in Arabidopsis
      – >70% of the genes are lost from the duplicated segments

Nonhidden and Hidden Duplications

[Figure: examples of a nonhidden and a hidden duplication]

Reprinted from: Simillion et al., PNAS 99: 13627 (2002)

Multiplication levels of the Duplications

¤ Chromosomal segments exhibit multiple duplications
  – Multiplication numbers vary from 5 to 8

Reprinted from: Simillion et al., PNAS 99: 13627 (2002)

Conclusions

¤ High multiplication levels
  – Suggest multiple rounds of whole genome duplication
  – Observed many duplications with multiplication levels of 5–8
    • Indicating a maximum of three rounds of duplications

¤ Dating based on silent substitutions
  – Accurate for the youngest duplication
    • dated 75 million years ago
  – Less reliable for the two older age classes
    • dated 163 and 221 million years ago

¤ Results suggest three whole genome duplication or polyploidization events
  – The oldest one may have occurred before the monocot/dicot split
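The link between multiplication levels of 5 to 8 and three rounds of duplication is simple doubling arithmetic: each whole genome duplication can at most double the number of copies of a segment, so n rounds give at most 2^n copies. A small sketch (the function name `rounds_needed` is illustrative):

```python
def rounds_needed(multiplication_level: int) -> int:
    """Smallest number of doublings that can produce this many copies of a segment."""
    if multiplication_level < 1:
        raise ValueError("multiplication level must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for positive integers, without float rounding.
    return (multiplication_level - 1).bit_length()


# Multiplication levels reported in the slides: 5 to 8.
rounds_for_levels = {level: rounds_needed(level) for level in range(5, 9)}
```

Levels of 5 to 8 all need three doublings, and none of them needs a fourth, which fits the conclusion of three rounds. Gene loss after each round explains why fewer than 8 copies are usually seen.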

Reprinted from: Simillion et al., PNAS 99: 13627 (2002)

The grass genomes

¤ Grasses are the primary food source
  – Wheat, rice, maize, barley, sorghum…

¤ Grass genomes vary widely in size

Species    Genome size (Mb)   Ploidy
Rice                    430   diploid
Sorghum                 735   diploid
Maize                 2,360   allotetraploid
Barley                4,900   diploid
Wheat                17,000   hexaploid
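To make the size range in the table easier to read, the genome sizes can be expressed as multiples of the rice genome. The sketch below uses the table values only; the names are illustrative.

```python
# Genome sizes in Mb, as listed in the table above.
GENOME_SIZE_MB = {
    "rice": 430,
    "sorghum": 735,
    "maize": 2360,
    "barley": 4900,
    "wheat": 17000,
}


def fold_vs_rice(sizes: dict) -> dict:
    """Express each genome size as a multiple of the rice genome, to one decimal."""
    rice = sizes["rice"]
    return {species: round(mb / rice, 1) for species, mb in sizes.items()}


fold = fold_vs_rice(GENOME_SIZE_MB)
for species, ratio in fold.items():
    print(f"{species}: {ratio}x rice")
```

Note that the ratio depends on which rice estimate is used: with the 430 Mb figure from this table maize comes out at about 5.5 times rice, while the 389 Mb estimate from the finished rice sequence gives roughly 6 times.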

Reprinted from: Moore et al., Curr. Biol. 5: 737−739 (1995)

Macro synteny of the grass genomes

The rice genome sequence

¤ Draft genome sequences (2002): whole genome shotgun sequences
  – Oryza sativa L. ssp. japonica (Syngenta)
    • fragmented sequence covers 78% in > 42,000 contigs
      – Goff et al., Science 296: 92 (2002)
  – Oryza sativa L. ssp. indica (Beijing Genomics Institute)
    • very fragmented sequence covers 69% in > 110,000 contigs
      – Yu et al., Science 296: 79 (2002)

¤ Finished genome sequence (2005): map-based genome sequence
  – Oryza sativa L. ssp. japonica (The International Rice Genome Sequencing Project)
    • finished quality sequence that covers 95% of the 389 Mb genome
      – including virtually all of the euchromatin and two centromeres
      – International Rice Genome Sequencing Project, Nature 436: 793-800 (2005)

Reprinted from: International Rice Genome Sequencing Project, Nature 436: 793-800 (2005)

Maps of the twelve rice chromosomes

¤ The size of the rice genome was estimated at 389 Mb
  – the sequence covers 95% of the genome and 98.9% of the euchromatin

[Figure: maps of the twelve rice chromosomes, with the centromeres marked]

Maps of the Centromeric Region of Rice Chr 8

¤ Centromeres contain
  – highly repetitive 155−165 bp CentO satellite DNA
  – centromere-specific retrotransposons

[Figure: map of the centromeric region of chromosome 8, showing the 155-bp CentO satellite DNA, transposons and BACs]

Reprinted from: Wu, J., et al. Plant Cell 2004;16:967-976

Annotation Map of the Centromeric Region of Chr 8

[Figure: annotation map of the centromeric region of chromosome 8, showing genes and transposons]

Reprinted from: Wu, J., et al. Plant Cell 2004;16:967-976

Reprinted from: The Rice Chromosome 10 Sequencing Consortium, Science. 300: 1566-1569 (2003)

Distribution of features on rice chromosome 10

Protein coding genes

¤ Predicted 37,544 protein-coding genes
  – density of one gene per 9.9 kb
  – 22,840 (61%) genes are supported by ESTs or full-length cDNAs
  – 4,500 additional genes match entries in the Swiss-Prot database
  – ~10,000 are predicted ab initio

¤ Rice–Arabidopsis homologies
  – 90% of the predicted Arabidopsis proteins have a rice protein homologue
  – 71% of the predicted rice proteins have an Arabidopsis protein homologue
    • Unique rice genes match unknown or hypothetical proteins
    • Interesting differences between the genome content of these two groups of angiosperms remain to be discovered

Reprinted from: International Rice Genome Sequencing Project, Nature 436: 793-800 (2005)
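The gene statistics on this slide can be cross-checked with a few lines of arithmetic. The sketch uses only figures from the slides; the variable names are illustrative.

```python
N_GENES = 37_544       # predicted protein-coding genes
KB_PER_GENE = 9.9      # reported density: one gene per 9.9 kb
EST_SUPPORTED = 22_840
SWISSPROT_MATCHED = 4_500
AB_INITIO = 10_000     # approximate, given as ~10,000 in the slides

# Sequence length implied by the gene count and density, in Mb.
implied_span_mb = N_GENES * KB_PER_GENE / 1000

# Share of genes with EST or full-length cDNA support.
est_fraction = EST_SUPPORTED / N_GENES

# The three evidence classes should add up to roughly the total gene count.
evidence_total = EST_SUPPORTED + SWISSPROT_MATCHED + AB_INITIO

print(f"implied span: {implied_span_mb:.1f} Mb")
print(f"EST-supported: {est_fraction:.0%}")
print(f"sum of evidence classes: {evidence_total} of {N_GENES}")
```

The implied span of about 372 Mb is close to 95% of the 389 Mb genome (about 370 Mb), so the gene density is consistent with the sequenced fraction reported for the finished genome.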

Reprinted from: Goff et al., Science 296: 92 (2002)

Classification of the predicted rice genes

¤ Functional classification
  – The number of genes in the functional classes is very similar to Arabidopsis

Other gene features

¤ Tandem gene families
  – 29% of the genes are arranged in tandem repeats
    • Compared to 17% of genes in Arabidopsis

¤ Non-coding RNA genes
  – rDNA repeats are located in the nucleolar organizer on chr 9
  – A total of 763 tRNA genes
  – Identified 158 microRNAs (miRNAs)
    • MicroRNAs regulate gene expression by interacting with the target messenger RNAs

¤ Organellar insertions in the nuclear genome
  – 421−453 chloroplast insertions
  – 909−1,191 mitochondrial insertions
    • several successive transfer events have occurred

Reprinted from: International Rice Genome Sequencing Project, Nature 436: 793-800 (2005)

Transposable elements

¤ Transposon content is at least 35%
  – More divergent elements were identified using profile HMMs
  – Much larger than in Arabidopsis

Reprinted from: International Rice Genome Sequencing Project, Nature 436: 793-800 (2005)

Intraspecific sequence polymorphism

¤ Comparison of orthologous sequences of ssp. indica and ssp. japonica
  – Aligned 308 Mb (79%) of the genome
  – Identified 80,127 polymorphic sites

Reprinted from: International Rice Genome Sequencing Project, Nature 436: 793-800 (2005)

Reprinted from: Paterson et al., PNAS 101: 9903-9908 (2004)

Gene duplication in rice

[Figure: duplicated segments in the rice genome]

Genome duplication in rice

¤ Extensive gene duplication
  – 9 duplicated blocks account for 62% of the rice genes
    • blocks have retained 16% to 25% of the duplicate copies
  – retention of duplicated gene copies is greater than predicted
    • suggests that gene loss is not random

¤ Phylogenetic dating of the genome duplication
  – Ks values suggest a single duplication event
    • except the chromosome 11-12 duplication, which was more recent
  – The Ks peak for the rice duplicates corresponds to 70 MY
  – The time of divergence of the cereals is estimated at 50 MYA
  – a polyploidization event occurred 70 MY ago
    • before the divergence of the major cereals
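Dating a duplication from Ks rests on a molecular clock: the two copies diverge independently, so the age is T = Ks / (2r), where r is the synonymous substitution rate per site per year. The slides give neither the Ks value of the peak nor the rate, so both numbers in the sketch below are assumptions chosen only to illustrate the calculation.

```python
def age_from_ks(ks: float, rate_per_site_per_year: float) -> float:
    """Age in years of a duplicate pair from its synonymous divergence Ks.

    The factor 2 reflects that both copies accumulate substitutions
    independently since the duplication.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return ks / (2.0 * rate_per_site_per_year)


# ASSUMPTIONS for illustration only (not stated in the slides):
ASSUMED_RATE = 6.5e-9   # synonymous substitutions per site per year
ASSUMED_KS_PEAK = 0.91  # Ks value chosen so that the example lands near 70 MY

age_my = age_from_ks(ASSUMED_KS_PEAK, ASSUMED_RATE) / 1e6
print(f"estimated age: {age_my:.0f} million years")
```

Because the estimate scales inversely with the assumed rate, a different clock calibration shifts all dates proportionally, which is one reason older duplications are dated less reliably.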

Reprinted from: Paterson et al., PNAS 101: 9903-9908 (2004)

Genomic Duplications in Angiosperm Evolution

[Figure: genome duplication events placed on the angiosperm phylogeny, with monocots and dicots labelled]

Reprinted from: Paterson et al., PNAS 101: 9903-9908 (2004)

Comparison of rice and grass genomes

¤ Synteny between rice and Arabidopsis
  – Limited to relatively short segments comprising few genes
    • Successive rounds of genome duplication in the two lineages (Arabidopsis 2; rice 1) have blurred the ancestral synteny

¤ Macro synteny of the grass genomes is confirmed at the sequence level
  – 98% of the genes found in the different grasses have a rice homolog
    • Rice is a model system for the larger cereal genomes

Micro synteny of the grass genomes

¤ Collinear arrangement of genes is interrupted by
  – Intergenic retrotransposon blocks

Reprinted from: Ramakrishna et al., Genetics, 162, 1389 (2002)

The maize genome

¤ Large (2,365 Mb) and complex genome
  – Unusually high repetitive DNA content (>80%)

¤ Stepwise sequencing approach designed to meet the challenge
  – Sequencing the gene-rich fraction
    • Enrichment of Gene-Coding Sequences by Genome Filtration
      – Whitelaw et al., Science 301: 2118-2120 (2003)
  – High resolution physical map of 300,000 BAC clones
    • BAC end sequencing: completed
      – Sequence composition and genome organization of maize: Messing et al., PNAS 101: 14349-14354 (2004)
    • BAC skim sequencing: in progress
      – Low pass sequencing of minimal tiling path BACs

¤ Expect the complete genome sequence by 2007
  – Martienssen et al., Curr. Opin. Plant Biol. 7: 102–107 (2004)

Structure of the Maize genome

¤ The maize genome is 6 times larger than that of rice
  – ~60% of the genome comprises highly repetitive sequences
    • >90% are LTR-retrotransposons inserted in the last 3 to 6 MY
  – 10–100 kb tracts of nested insertions separate genic regions

Reprinted from: SanMiguel et al., Nat Genet. 20: 43 (1998)

Duplicated genes in maize

¤ A conservative estimate predicts 59,000 genes
  – A very large fraction of duplicated genes

¤ Two interesting aspects of the gene organization
  – Although the genome was duplicated 5-10 My ago
    • the tetraploidization was followed by a heavy loss of duplicate genes
      – <50% of the duplicates are retained (cf. yeast)
  – Tandem gene amplification is unusually high
    • ~1/3 of the genes consist of tandemly arrayed gene families

¤ The maize genome illustrates the exceptional dynamics of genome evolution in plants

Reprinted from: Messing et al., PNAS 101: 14349-14354 (2004)

Reprinted from: Messing et al., PNAS 101: 14349-14354 (2004)

Origin of rice, maize and sorghum

[Figure: origin of rice, maize and sorghum, with the genome duplication marked]

Enrichment of Gene-Coding Sequences in Maize by Genome Filtration

¤ Paper presents
  – Two methodologies that enrich for genic sequences for sequencing complex genomes
    • Methylation filtering
    • High C0t selection
  – Combination of the two techniques resulted in a six-fold reduction in the effective genome size
  – Powerful technologies for sequencing repeat-rich genomes
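What a six-fold reduction means in practice can be seen by applying it to the maize genome size quoted earlier (2,365 Mb) and comparing the result with the 389 Mb rice genome. This is plain arithmetic on the slide figures; the variable names are illustrative.

```python
MAIZE_GENOME_MB = 2365   # maize genome size quoted in the slides
REDUCTION_FACTOR = 6     # six-fold reduction from combined filtering
RICE_GENOME_MB = 389     # finished rice genome, for comparison

effective_maize_mb = MAIZE_GENOME_MB / REDUCTION_FACTOR
ratio_to_rice = effective_maize_mb / RICE_GENOME_MB

print(f"effective maize genome: {effective_maize_mb:.0f} Mb")
print(f"relative to rice: {ratio_to_rice:.2f}x")
```

After filtration the amount of maize sequence to be covered is roughly the size of the whole rice genome, which is what makes the gene-rich fraction tractable.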

Whitelaw et al., Science 301: 2118-2120 (2003)

Reprinted from: Whitelaw et al., Science 301: 2118-2120 (2003)

Enrichment of Gene-Coding Sequences

¤ Methylation filtering
  – hypermethylated sequences are excluded with the use of bacterial restriction systems that cleave methylated sequences
    • In plants, two methylases methylate C residues in CG and CNG
    • Methylation is restricted primarily to repeated DNA sequences

¤ High C0t (HC) selection
  – allows separation of DNA fractions into low-copy (High C0t) or high-copy (Low C0t) sequences
    • The repetitive DNA renatures first
    • The double-stranded DNA can be separated from lower copy number, unrenatured DNA
  – The low-copy number fraction is enriched in genes
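Why repeats renature first follows from ideal second-order reassociation kinetics, where the fraction of DNA still single-stranded is 1 / (1 + k · C0t) and the rate constant k grows with the copy number of the sequence. The sketch below is a textbook-style illustration, not the protocol of the cited paper; the rate constant and the copy number are hypothetical values.

```python
def fraction_single_stranded(c0t: float, k: float) -> float:
    """Fraction of a DNA component still single-stranded at a given C0t value.

    Ideal second-order kinetics: C / C0 = 1 / (1 + k * C0t).
    """
    return 1.0 / (1.0 + k * c0t)


# HYPOTHETICAL values for illustration only.
K_SINGLE_COPY = 1e-3   # rate constant of a single-copy sequence
REPEAT_COPIES = 1000   # copy number of a repeat family
K_REPEAT = K_SINGLE_COPY * REPEAT_COPIES  # rate scales with copy number

C0T = 100.0
repeat_ss = fraction_single_stranded(C0T, K_REPEAT)
single_copy_ss = fraction_single_stranded(C0T, K_SINGLE_COPY)

print(f"repeats still single-stranded: {repeat_ss:.3f}")
print(f"single-copy DNA still single-stranded: {single_copy_ss:.3f}")
```

At this C0t value almost all of the repeat family has become double-stranded while most single-copy DNA has not, so removing the double-stranded fraction leaves a sample enriched for low-copy, gene-rich sequence.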

Sorghum Genome Sequencing by Methylation Filtration

¤ Paper presents
  – Sequence from the hypomethylated portion of the sorghum genome obtained by applying methylation filtration (MF)
    • 96% of the genes have been sequence tagged, with an average coverage of 65% across their length
  – MF preferentially captures exons and introns, promoters, microRNAs, and simple sequence repeats
  – MF minimizes interspersed repeats
  – MF provides a robust view of the functional parts of the genome

Bedell et al., PLoS Biol. 3: e13 (2005)

Reprinted from: Bedell et al., PLoS Biol. 3: e13 (2005)

Genome Reduction in Sorghum

Plant and animal genome evolution

¤ Animal genomes
  – Marked conservation of synteny over long evolutionary times
    • Evolution proceeds mainly through expansion/contraction of gene families through tandem duplication
  – Total number of genes remains more or less constant
    • Increased gene diversity through enhanced alternative splicing
    • Balanced gene birth and death

¤ Plant genomes
  – Genomes evolve at a more rapid pace, driven by successive rounds of whole genome duplication
    • Duplication events followed by massive gene losses, with retention of substantial fractions (~30%) of the duplicated genes
      – With subsequent neo-functionalization of duplicated genes (?)
  – Marked tendency towards increased number of genes
    • Alternative splicing is much less common

Gene Content versus Genome Size

[Figure: number of genes (0 to 60,000) plotted against genome size in million base pairs (log scale, 10 to 10,000) for yeast, fungi, C. elegans, fly, Arabidopsis, rice, maize, fish and human]

Future Perspectives

¤ Different plant genome projects ongoing or planned (currently totaling ~30)
  – Grasses: 5 species
    • Maize, barley, sorghum, oat and grass
  – Flowering plants: > 10 species
    • Tomato, potato, coffee, cotton, soybean, clover, lotus, grapevine,…
  – Trees: 4 species
    • Poplar, eucalyptus, pine and banana
  – Algae and mosses: ~10 different species

¤ Source: GOLD™ Genomes OnLine Database
  – http://www.genomesonline.org/

Recommended reading

¤ The Arabidopsis genome sequence
  • The Arabidopsis Genome Initiative, Nature 408: 796 (2000)

¤ The map-based sequence of the rice genome
  • International Rice Genome Sequencing Project, Nature 436: 793-800 (2005)

¤ The maize genome sequence
  • Martienssen et al., Curr. Opin. Plant Biol. 7: 102–107 (2004)

Further reading

¤ Arabidopsis genome papers
  – Genome duplications
    • Vision et al., Science 290: 2114 (2000)
    • Simillion et al., PNAS 99: 13627 (2002)

¤ Rice genome papers
  – Draft genome sequence
    • Japonica: Goff et al., Science 296: 92 (2002)
    • Indica: Yu et al., Science 296: 79 (2002)
  – Genome duplications
    • Paterson et al., PNAS 101: 9903-9908 (2004)

¤ Grass genome papers
  – Synteny in the grasses
    • Moore et al., Curr. Biol. 5: 737−739 (1995)
  – Maize genome sequencing
    • Messing et al., PNAS 101: 14349-14354 (2004)
    • Whitelaw et al., Science 301: 2118-2120 (2003)
  – Sorghum genome sequencing
    • Bedell et al., PLoS Biol. 3: e13 (2005)