automatic reconstruction of 42 ancestral metazoan … · automatic reconstruction of 42 ancestral...

1
AUTOMATIC RECONSTRUCTION OF 42 ANCESTRAL METAZOAN AUTOMATIC RECONSTRUCTION OF 42 ANCESTRAL METAZOAN GENOMES BY COMPARATIVE GENOMICS. GENOMES BY COMPARATIVE GENOMICS. Matthieu Muffato 1 , Alexandra Louis 1 , Hugues Roest Crollius 1 1 École Normale Supérieure, CNRS UMR 8541, 46 rue d’Ulm, 75005 Paris, France. MATERIAL MATERIAL MATERIAL MATERIAL PURPOSE PURPOSE Zebrafish Medaka Stickleback Fugu Tetraodon Percomorpha Otocephala Tetraodontidae Vertebrates 0 100 50 200 150 300 250 400 350 450 Millions years Actinopterigii Teleostei Euteleostei Shark Chondrichthyes Xenopus Chicken Platypus Opossum Elephant Tenrec Armadillo Hedgehog Cow Dog Rhesus Human Chimpanzee Rodentia Carnivora Boreoeutheria Xenarthra Metatheria Prototheria Theria Mammalia Amniota Rabbit Rat Mouse Murinae Lagomorpha Catarrhini Insectivora Eutheria Sarcopterigii Tetrapoda ? ? ? ? ? ? ? ? Our current understanding of biological processes is limited to contemporary processes occurring in living organisms. Yet Biology is a historical science: all current biological processes are the result of complex evolutionary events. Large scale genome sequencing makes it possible, through comparative genomics, to gain knowledge on ancestral biological genome organization. Here we present a general method to reconstruct ancestral genomes (gene organization) which uses as raw material the genome sequences and gene annotations of extant species, and interprets the data using parsimony and a phylogenetic classification of species to infer the most-plausible ancestral state at any given branch point. Applied to 44 vertebrate genomes, the method produces a new framework to study genome evolution in a broad chronological perspective. Pairwise Pairwise 1. Pairwise comparisons 1. Pairwise comparisons Amniote ancestor Bird/Reptile ancestor Mammal ancestor Conserved gene pairs are defined through the comparisons of every chromosomes of every pair of species. For every two modern species, we define ancestral synteny blocks (same exact gene order and orientations) that exist since their last common ancestor and remained intact through intermediate ancestors. 5 5 4 4 5 1 6 3 2 2 5 3 3 3 2 3 1 1 3 1 2 3 This procedure ensures that none of the deleted links would have connected two extremities of two synteny blocks. The strategy results in maximally connected, non-overlapping segments, or «integrated» synteny blocks. 3 5 5 4 4 5 5 3 2 3 3 3 6 3 3 For each ancestor, we use a weighted «synteny» graph that includes all the inferred ancestral gene pairs from all comparisons. Cycles and genes with multiple links are treated by iteratively selecting links that are consistent with those already selected, starting from the strongest link. 2. Integration 2. Integration In this phylogenetic tree that recapitulates the evolution of a gene since the vertebrate ancestor, the ancestral mammalian genome possessed a single copy of this gene whereas the ancestral fish genome possessed two copies. The Ensembl database provides the genome sequence and the gene annotation for 45 vertebrates species, and also phylogenetic trees for all the genes. At each node of a tree, we can define (i) that the ancestor at that node possessed a copy of the gene (ii) the orthology and paralogy relationship of that ancestral gene and other genes in the same tree. We enriched the dataset with three non-vertebrate species: Amphioxus (Branchiostoma floridae), Oikopleura (Oikopleura dioica) and the sea anemone (Nematostella vectensis) with the following procedure: Compute best reciprocal blast comparisons (BRH) against a set of representative species (human, zebrafish, …) Insert a gene from a new species only when the BRH are consistent both internally and with the Ensembl gene trees. Create a duplication node when the new gene acts as an outgroup to two existing trees. Ancestor Age (My) AncGenes Blocks Genes in blocks (in %) Nb Intervals (in %) Median size N50 size Max size Mean size Chordata 550 10404 149 317 3,05 168 1,62 2 2 6 2,13 Euteleostomi 420 17781 3009 10522 59,18 7513 42,30 3 4 28 3,50 Tetrapoda 359 17910 2506 12855 71,78 10349 57,85 3 7 51 5,13 Amniota 326 18689 1287 14906 79,76 13619 72,95 5 24 178 11,58 Theria 166 19338 948 16249 84,03 15301 79,21 8 37 198 17,14 Boreoeutheria 95 21322 428 17840 83,67 17412 81,74 10 124 345 41,68 Euarchontoglires 90 20989 403 17891 85,24 17488 83,40 12 130 292 44,39 Catarrhini 31 21292 544 18359 86,22 17815 83,75 13 87 387 33,75 Homo/Pan group 5 21075 628 18829 89,34 18201 86,45 12 83 406 29,98 Rodentia 80 19266 604 17585 91,27 16981 88,23 14 62 417 29,11 Laurasiatheria 88 20475 455 17939 87,61 17484 85,48 15 103 287 39,43 Afrotheria 94 18089 3480 11986 66,26 8506 47,08 3 4 28 3,44 Neognathae 105 14971 761 13041 87,11 12280 82,13 8 38 291 17,14 Tetraodontidae 65 18451 1693 16162 87,59 14469 78,50 5 18 103 9,55 RESULTS RESULTS RESULTS RESULTS For every pair of genomes (modern or ancestral), any non- conserved gene pair potentially reveals a breakpoint. In real data, to eliminate false positives, we have to: Eschew genes that are singletons in one genome (no neighbor has been identified) Eschew genes that underwent a duplication Select breakpoints where the initial state (resp. final) is conserved at a few older (resp. more recent) nodes 774 breakpoints were identified in the mammalian lineage and are analyzed in Berthelot et al. work. Amniotes ancestor Mammals ancestor 4. Breakpoint extraction 4. Breakpoint extraction The correct identification of duplication nodes is a notorious problem in phylogeny reconstruction. Information on synteny can greatly assist in correctly identifying duplication nodes, and was used here in a systematic procedure. Most ancestral nodes show an optimal blocks length for a given duplication node threshold. Applying these thresholds to all Ensembl trees increases the length of synteny blocks by approximately 25%. 0 0,05 0,1 0,15 0,2 0,25 0,3 0,35 0,4 0,45 0,5 0,55 0,6 0,65 0,7 0,75 0,8 0,85 0,9 0,95 1 0 5 10 15 20 25 30 35 40 45 Evolution of mean block length for each ancestor and for each cutoff. cutoff Block length Erroneous paralogue annotation Conserved gene order between orthologues In Ensembl gene trees, a duplication confidence score indicates how well a duplication node is supported. We computed optimal cut-off values for retaining duplication nodes based on maximal synteny block length. Nodes below the cut-off were displaced towards terminal branches or until a duplication node with a score higher than the cut-off is encountered. 3. Duplications adjustment 3. Duplications adjustment Genomicus (http://www.dyogen.ens.fr/genomicus/) is a synteny browser that has been created to study the conservation of gene order among extant and reconstructed species. The figure shows an inversion that occurred next to YWHAE, just before the Homininae ancestor. GENOMICUS GENOMICUS GENOMICUS GENOMICUS STATISTICS STATISTICS STATISTICS STATISTICS The table shows a number of statistics and measures collected on the gene sets and synteny blocks of some ancestors. One can observe the impact of the age of the ancestor, or the quality of the sequence of the descending extant species. (lengths are measured in number of genes) METHODS METHODS METHODS METHODS Performances Performances In (a), human intergenic spaces are classified into bins according to their length and their G+C content. The color code shows the dominant transcriptional orientations in a given bin: green indicates divergent genes, red is for convergent genes and blue for co-oriented genes. A cluster of short (< 1 kb) and G+C rich (> 60%) intergenic spaces, composed mainly of divergent genes (green) appears clearly. Divergent genes Divergent genes In (b), only intergenic spaces between divergent genes are plotted. Here, the color code reflects the age of the intergenic space, i.e. the most ancient ancestral genome that already possessed this interval (lighter orange means older). Selected pairs Whole genome Number 848 19316 Age 310 My 220 My Size 250 bp 90 kb G+C 65,5 % 46,4 % % that contains CpG (CpG obs/exp > 0.375) 98,6 % 59,8 % in uni. pairs 84,2 % in div. pairs > 1kb % of seq. with CNEs 27,3 % 5,5% % of seq. with Repeats 5,8 % 46,6 % (a) 1 kb 1 Mb 30% 60% Intergenic space length (log10) G+C% (b) Intergenic space length (log10) 1 kb 1 Mb 30% 60% G+C% The cluster of short and GC- rich intervals flanked by divergent genes is much more ancient than the rest of the divergent intervals of the human genome. The table shows a summary of the properties of intervals that belong to the above cluster. HomoPan H omininae H ominidae Catarrhini Murinae Carnivora Haplorrhini Cetartiodactyla Tetraodontidae S ciurognathi Rodentia Glires P rimates Laurasiatheria Euarchontoglires Afrotheria B oreoeutheria Atlantogenata Eutheria Neognathae Metatheria Theria Mammalia P ercomorpha Sauria Clupeocephala A mniota Tetrapoda E u teleostomi 50% 60% 70% 80% 90% 100% Specificity Sensitivity Relaxed spec. Relaxed sens. First, we performed simulations without duplications nor gene loss, in order to benchmark our method against others that cannot deal with such events: MGR (1) , MGRA (2) and inferCars (3) . Each simulation deals with 21 vertebrate species, and varies in number of genes (100 to 10000), number of chromosomes (5 to 20), rearrangement rates (from 0.2x to 3x, 1x is defined in (4) ). The table shows the results for the Boreoeutheria ancestor, according to the simulated genome size. The interpretation of this benchmark is limited because programs are either unable to produce a result due to the size of the genomes or a high rearrangement rate (N/A in the table), or they output nearly perfect reconstructions. In a second set of simulations, where duplications events and gene losses were allowed, we tested only our method since others cannot handle asymmetric gene contents between nodes. In a significant number of cases, co-localization of gene losses and duplications lead to false predictions. In those cases, predicted gene pairs were not neighbors, but still relatively close to each other (generally less than 4 genes away). We thus also computed relaxed sensitivity and specificity, counting these gene pairs as correct ones. Number of genes 100 500 1000 5000 10000 MGR Perfect Perfect Perfect (rate ≤ 1x) N/A (other rates) N/A (all rates) N/A (all rates) MGRA Perfect Perfect Perfect (rate ≤ 2x) N/A (other rates) Perfect (rate ≤ 0.5x) N/A (other rates) Perfect (at 0.2x) N/A (other rates) inferCars Perfect Perfect Perfect N/A (all rates) N/A (all rates) Our method Perfect Perfect 99.99% correct (all rates) 99.99% correct (all rates) 99.99% correct (all rates) Ancestors ordered according to their age (1) Bourque et al., Genome Res 14:507-516 (3) Ma et al., Genome Res 16:1557-1565 (2) Alekseyev et al., Genome Res 19:943-957 (4) Zhao et al., Genome Res 19:934-942

Upload: others

Post on 29-Mar-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: AUTOMATIC RECONSTRUCTION OF 42 ANCESTRAL METAZOAN … · AUTOMATIC RECONSTRUCTION OF 42 ANCESTRAL METAZOAN GENOMES BY COMPARATIVE GENOMICS. Matthieu Muffato1, Alexandra Louis1, Hugues

AUTOMATIC RECONSTRUCTION OF 42 ANCESTRAL METAZOAN AUTOMATIC RECONSTRUCTION OF 42 ANCESTRAL METAZOAN GENOMES BY COMPARATIVE GENOMICS.GENOMES BY COMPARATIVE GENOMICS.

Matthieu Muffato1, Alexandra Louis1, Hugues Roest Crollius1

1École Normale Supérieure, CNRS UMR 8541, 46 rue d’Ulm, 75005 Paris, France.

MATERIALMATERIALMATERIALMATERIAL

PURPOSEPURPOSE

Zebrafish

Medaka

Stickleback

Fugu

Tetraodon

Percomorpha

Otocephala

Tetraodontidae

Vertebrates

0100 50200 150300 250400 350450Millions years

Actinopterigii Teleostei Euteleostei

SharkChondrichthyes

Xenopus

Chicken

Platypus

Opossum

ElephantTenrec

Armadillo

HedgehogCowDogRhesusHumanChimpanzee

Rodentia

Carnivora

Boreoeutheria

Xenarthra

Metatheria

Prototheria

Theria

Mammalia

Amniota

RabbitRatMouse

Murinae

Lagomorpha

Catarrhini

Insectivora Eutheria

SarcopterigiiTetrapoda

?

??

?

?

?

?

?

Our current understanding of biological processes is limited to contemporary processes occurring in living organisms. Yet Biology is a historical science: all current biological processes are the result of complex evolutionary events. Large scale genome sequencing makes it possible, through comparative genomics, to gain knowledge on ancestral biological genome organization.

Here we present a general method to reconstruct ancestral genomes (gene organization) which uses as raw material the genome sequences and gene annotations of extant species, and interprets the data using parsimony and a phylogenetic classification of species to infer the most-plausible ancestral state at any given branch point.

Applied to 44 vertebrate genomes, the method produces a new framework to study genome evolution in a broad chronological perspective.

PairwisePairwise1. Pairwise comparisons1. Pairwise comparisonsAmniote ancestor

Bird/Reptile ancestorMammal ancestor

Conserved gene pairs are defined through the comparisons of every chromosomes of every pair of species. For every two modern species, we define ancestral synteny blocks (same exact gene order and orientations) that exist since their last common ancestor and remained intact through intermediate ancestors.

5 54 4

51 6

3 2

2

5

3

3

3 23

1

1

3

1 2

3

This procedure ensures that none of the deleted links would have connected two extremities of two synteny blocks. The strategy results in maximally connected, non-overlapping segments, or «integrated» synteny blocks.

35 54 4

5

5

3 23

33

6

3 3

For each ancestor, we use a weighted «synteny» graph that includes all the inferred ancestral gene pairs from all comparisons.

Cycles and genes with multiple links are treated by iteratively selecting links that are consistent with those already selected, starting from the strongest link.

2. Integration2. Integration

In this phylogenetic tree that recapitulates the evolution of a gene since the vertebrate ancestor, the ancestral mammalian genome possessed a single copy of this gene whereas the ancestral fish genome possessed two copies.

The Ensembl database provides the genome sequence and the gene annotation for 45 vertebrates species, and also phylogenetic trees for all the genes. At each node of a tree, we can define (i) that the ancestor at that node possessed a copy of the gene (ii) the orthology and paralogy relationship of that ancestral gene and other genes in the same tree.

We enriched the dataset with three non-vertebrate species: Amphioxus (Branchiostoma floridae), Oikopleura (Oikopleura dioica) and the sea anemone (Nematostella vectensis) with the following procedure:

● Compute best reciprocal blast comparisons (BRH) against a set of representative species (human, zebrafish, …)

● Insert a gene from a new species only when the BRH are consistent both internally and with the Ensembl gene trees.

● Create a duplication node when the new gene acts as an outgroup to two existing trees.

Ancestor Age (My) AncGenes Blocks Genes in blocks (in %) Nb Intervals (in %) Median size N50 size Max

sizeMean size

Chordata 550 10404 149 317 3,05 168 1,62 2 2 6 2,13

Euteleostomi 420 17781 3009 10522 59,18 7513 42,30 3 4 28 3,50

Tetrapoda 359 17910 2506 12855 71,78 10349 57,85 3 7 51 5,13

Amniota 326 18689 1287 14906 79,76 13619 72,95 5 24 178 11,58

Theria 166 19338 948 16249 84,03 15301 79,21 8 37 198 17,14

Boreoeutheria 95 21322 428 17840 83,67 17412 81,74 10 124 345 41,68

Euarchontoglires 90 20989 403 17891 85,24 17488 83,40 12 130 292 44,39

Catarrhini 31 21292 544 18359 86,22 17815 83,75 13 87 387 33,75

Homo/Pan group 5 21075 628 18829 89,34 18201 86,45 12 83 406 29,98

Rodentia 80 19266 604 17585 91,27 16981 88,23 14 62 417 29,11

Laurasiatheria 88 20475 455 17939 87,61 17484 85,48 15 103 287 39,43

Afrotheria 94 18089 3480 11986 66,26 8506 47,08 3 4 28 3,44

Neognathae 105 14971 761 13041 87,11 12280 82,13 8 38 291 17,14

Tetraodontidae 65 18451 1693 16162 87,59 14469 78,50 5 18 103 9,55

RESULTSRESULTSRESULTSRESULTS

For every pair of genomes (modern or ancestral), any non-conserved gene pair potentially reveals a breakpoint.

In real data, to eliminate false positives, we have to:● Eschew genes that are singletons in one genome (no neighbor has

been identified)● Eschew genes that underwent a duplication● Select breakpoints where the initial state (resp. final) is conserved

at a few older (resp. more recent) nodes

774 breakpoints were identified in the mammalian lineage and are analyzed in Berthelot et al. work.

Amniotes ancestor

Mammals ancestor

4. Breakpoint extraction4. Breakpoint extraction

The correct identification of duplication nodes is a notorious problem in phylogeny reconstruction. Information on synteny can greatly assist in correctly identifying duplication nodes, and was used here in a systematic procedure.

Most ancestral nodes show an optimal blocks length for a given duplication node threshold. Applying these thresholds to all Ensembl trees increases the length of synteny blocks by approximately 25%.

0 0,05 0,1 0,15 0,2 0,25 0,3 0,35 0,4 0,45 0,5 0,55 0,6 0,65 0,7 0,75 0,8 0,85 0,9 0,95 10

5

10

15

20

25

30

35

40

45

Evolution of mean block length for each ancestor and for each cutoff.

cutoff

Blo

ck le

ngth

Erroneous paralogue annotation

Conserved gene order between orthologues

In Ensembl gene trees, a duplication confidence score indicates how well a duplication node is supported. We computed optimal cut-off values for retaining duplication nodes based on maximal synteny block length. Nodes below the cut-off were displaced towards terminal branches or until a duplication node with a score higher than the cut-off is encountered.

3. Duplications adjustment3. Duplications adjustment

Genomicus (http://www.dyogen.ens.fr/genomicus/) is a synteny browser that has been created to study the conservation of gene order among extant and reconstructed species. The figure shows an inversion that occurred next to YWHAE, just before the Homininae ancestor.

GENOMICUSGENOMICUSGENOMICUSGENOMICUS STATISTICSSTATISTICSSTATISTICSSTATISTICSThe table shows a number of statistics and measures collected on the gene sets and synteny blocks of some ancestors. One can observe the impact of the age of the ancestor, or the quality of the sequence of the descending extant species. (lengths are measured in number of genes)

METHODSMETHODSMETHODSMETHODS

Performances Performances

In (a), human intergenic spaces are classified into bins according to their length and their G+C content. The color code shows the dominant transcriptional orientations in a given bin: green indicates divergent genes, red is for convergent genes and blue for co-oriented genes. A cluster of short (< 1 kb) and G+C rich (> 60%) intergenic spaces, composed mainly of divergent genes (green) appears clearly.

Divergent genesDivergent genes

In (b), only intergenic spaces between divergent genes are plotted. Here, the color code reflects the age of the intergenic space, i.e. the most ancient ancestral genome that already possessed this interval (lighter orange means older).

Selected pairs Whole genomeNumber 848 19316

Age 310 My 220 MySize 250 bp 90 kbG+C 65,5 % 46,4 %

% that contains CpG(CpG obs/exp > 0.375)

98,6 % 59,8 % in uni. pairs84,2 % in div. pairs > 1kb

% of seq. with CNEs 27,3 % 5,5%% of seq. with Repeats 5,8 % 46,6 %

(a)

1 kb 1 Mb

30%

60%

Intergenic space length (log10)

G+C

%

(b)

Intergenic space length (log10)

1 kb 1 Mb

30%

60%

G+C

%

The cluster of short and GC-rich intervals flanked by divergent genes is much more ancient than the rest of the divergent intervals of the human genome.

The table shows a summary of the properties of intervals that belong to the above cluster.

Hom

oPan

Hom

inina

eHo

mini

dae

Cata

rrhini

Mur

inae

Carn

ivora

Haplo

rrhini

Ceta

rtiod

actyl

aTe

traod

ontid

aeSc

iurog

nathi

Rode

ntia

Glire

sPr

imat

esLa

uras

iathe

ria

Euar

chon

togli

res

Afro

theria

Bore

oeuth

eria

Atlan

toge

nata

Euthe

riaNe

ogna

thae

Met

athe

riaTh

eria

Mam

mali

aPe

rcom

orph

aSa

uria

Clup

eoce

phala

Amnio

taTe

trapo

daEu

teleo

stom

i

50%

60%

70%

80%

90%

100%

SpecificitySensitivityRelaxed spec.Relaxed sens.

First, we performed simulations without duplications nor gene loss, in order to benchmark our method against others that cannot deal with such events: MGR(1), MGRA(2) and inferCars(3). Each simulation deals with 21 vertebrate species, and varies in number of genes (100 to 10000), number of chromosomes (5 to 20), rearrangement rates (from 0.2x to 3x, 1x is defined in (4)). The table shows the results for the Boreoeutheria ancestor, according to the simulated genome size. The interpretation of this benchmark is limited because programs are either unable to produce a result due to the size of the genomes or a high rearrangement rate (N/A in the table), or they output nearly perfect reconstructions.

In a second set of simulations, where duplications events and gene losses were allowed, we tested only our method since others cannot handle asymmetric gene contents between nodes. In a significant number of cases, co-localization of gene losses and duplications lead to false predictions. In those cases, predicted gene pairs were not neighbors, but still relatively close to each other (generally less than 4 genes away). We thus also computed relaxed sensitivity and specificity, counting these gene pairs as correct ones.

Number of genes 100 500 1000 5000 10000

MGR Perfect Perfect Perfect (rate ≤ 1x)N/A (other rates) N/A (all rates) N/A (all rates)

MGRA Perfect Perfect Perfect (rate ≤ 2x)N/A (other rates)

Perfect (rate ≤ 0.5x)N/A (other rates)

Perfect (at 0.2x)N/A (other rates)

inferCars Perfect Perfect Perfect N/A (all rates) N/A (all rates)

Our method Perfect Perfect 99.99% correct(all rates)

99.99% correct(all rates)

99.99% correct(all rates)

Ancestors ordered according to their age

(1) Bourque et al., Genome Res 14:507-516 (3) Ma et al., Genome Res 16:1557-1565(2) Alekseyev et al., Genome Res 19:943-957 (4) Zhao et al., Genome Res 19:934-942