
Elephant shark sequence reveals unique insights into the evolutionary history of vertebrate genes: A comparative analysis of the protocadherin cluster

Wei-Ping Yu†‡, Vikneswari Rajasegaran†, Kenneth Yew†, Wai-lin Loh†, Boon-Hui Tay§, Chris T. Amemiya¶, Sydney Brenner‡§, and Byrappa Venkatesh‡§

†Gene Regulation Laboratory, National Neuroscience Institute, 11 Jalan Tan Tock Seng, Singapore 308433; §Institute of Molecular and Cell Biology, Agency for Science, Technology, and Research, Biopolis, Singapore 138673; and ¶Benaroya Research Institute at Virginia Mason, Seattle, WA 98101

Contributed by Sydney Brenner, January 14, 2008 (sent for review November 7, 2007)

Cartilaginous fishes are the oldest living phylogenetic group of jawed vertebrates. Here, we demonstrate the value of cartilaginous fish sequences in reconstructing the evolutionary history of vertebrate genomes by sequencing the protocadherin cluster in the relatively small genome (910 Mb) of the elephant shark (Callorhinchus milii). Human and coelacanth contain a single protocadherin cluster with 53 and 49 genes, respectively, that are organized in three subclusters, Pcdhα, Pcdhβ, and Pcdhγ, whereas the duplicated protocadherin clusters in fugu and zebrafish contain >77 and 107 genes, respectively, that are organized in Pcdhα and Pcdhγ subclusters. By contrast, the elephant shark contains a single protocadherin cluster with 47 genes organized in four subclusters (Pcdhδ, Pcdhε, Pcdhμ, and Pcdhν). By comparison with elephant shark sequences, we discovered a Pcdhδ subcluster in teleost fishes, coelacanth, Xenopus, and chicken. Our results suggest that the protocadherin cluster in the ancestral jawed vertebrate contained more subclusters than modern vertebrates, and the evolution of the protocadherin cluster is characterized by lineage-specific differential loss of entire subclusters of genes. In contrast to teleost fish and mammalian protocadherin genes that have undergone gene conversion events, elephant shark protocadherin genes have experienced very little gene conversion. The syntenic block of genes in the elephant shark protocadherin locus is well conserved in human but disrupted in fugu. Thus, the elephant shark genome appears to be less prone to rearrangements compared with teleost fish genomes. The small and "stable" genome of the elephant shark is a valuable reference for understanding the evolution of vertebrate genomes.

ancestral jawed vertebrate | clustered protocadherins | cartilaginous fish | Callorhinchus milii | gene conversion

Reconstructing the evolutionary history of vertebrate genomes, and in particular the human genome, is a major goal of vertebrate comparative genomics. Efforts are underway to reconstruct the evolutionary history of the human genome at the nucleotide level and at the level of large-scale chromosomal rearrangements (1) with the objective of predicting ancestral genomes with high accuracy. The accuracy of the reconstruction of ancestral genomes is greatly influenced by the choice and number of genomes compared. Although a comparison of genomes from all major vertebrate groups is desirable, it is also important to compare genomes from the most distant branches of the phylogenetic tree. Here, we show how the genome sequence from the oldest living phylogenetic group of jawed vertebrates, the cartilaginous fishes, can dramatically change our inferences of the state of the ancestral vertebrate genome.

The protocadherin gene cluster is one of the most evolutionarily dynamic loci in vertebrates. It is composed of tandem arrays of paralogous genes that are highly susceptible to lineage-specific gene losses, tandem duplications, and gene conversion (2–7). These clustered genes are distinct from nonclustered protocadherin genes by possessing a large coding exon that codes for the entire extracellular domain, the transmembrane domain, and a small portion of the cytoplasmic domain, whereas the extracellular domain of the nonclustered protocadherins is usually encoded by multiple exons (8). Protocadherin cluster proteins are cadherin-like cell adhesion molecules that are highly enriched on synaptic membranes (9). These proteins are believed to be a vertebrate innovation that accompanied the emergence of the neural tube and the elaborate central nervous system and have been hypothesized to provide a molecular code for specifying the enormous diversity in neuronal connectivity (3, 9, 10). No clustered protocadherin gene has been identified in invertebrate genomes (11–13).

The protocadherin gene cluster has been characterized in several bony vertebrates (Osteichthyes), including human, coelacanth, and teleost fishes. Human and coelacanth protocadherin cluster genes are organized in three tandem subclusters, designated as Pcdhα, Pcdhβ, and Pcdhγ (3, 4). Each subcluster contains multiple "variable" exons (~2.4 kb each) that are transcribed from independent promoters. Each variable exon codes for an extracellular domain comprising six repeats of a calcium-binding ectodomain (EC1–6), a transmembrane domain, and a part of the cytoplasmic domain. In addition to variable exons, Pcdhα and -γ subclusters contain three constant exons at the 3′ end of the respective subcluster, to which each variable exon is independently spliced. The constant region exons code for the major part of the cytoplasmic domain that is shared by genes in each subcluster. Human and coelacanth protocadherin clusters contain a total of 53 and 49 genes, respectively. Because of the "fish-specific" whole-genome duplication, teleost fishes such as fugu and zebrafish contain two unlinked protocadherin clusters, designated Pcdh1 and Pcdh2. Although the fugu Pcdh1 cluster is highly degenerate, Pcdh2 clusters in both fishes have undergone lineage-specific repeated tandem duplications and gene losses, giving rise to ≥107 genes in zebrafish (2, 5, 7) and ≥77 genes in fugu (6). However, these teleost protocadherin genes are orthologous to only Pcdhα and -γ subcluster genes in mammals and coelacanth. Another interesting feature of teleost protocadherin cluster genes is that they have experienced extensive regional gene conversion events resulting in nucleotide identity of ≥99% in the 3′ region of variable exons among paralogs (5, 6). Based on the organization, gene content, and phylogenetic relationships of protocadherin clusters in bony vertebrates, the ancestral vertebrate protocadherin cluster is predicted to contain either two (Pcdhα and -γ) or three subclusters (Pcdhα, -β, and -γ) that have subsequently undergone lineage-specific gene losses and gene duplications, including the whole-locus duplication in the teleost fish lineage (6).

Author contributions: W.-P.Y., S.B., and B.V. designed research; W.-P.Y., V.R., K.Y., W.-l.L., and B.-H.T. performed research; C.T.A. contributed new reagents/analytic tools; W.-P.Y., S.B., and B.V. analyzed data; and W.-P.Y. and B.V. wrote the paper.

The authors declare no conflict of interest.

Data deposition: The sequences reported in this paper have been deposited in the GenBank database (accession nos. EF693945, EU267079, and EU267080).

‡To whom correspondence may be addressed. E-mail: weiping_[email protected], [email protected], or [email protected].

This article contains supporting information online at www.pnas.org/cgi/content/full/0800398105/DC1.

© 2008 by The National Academy of Sciences of the USA

www.pnas.org/cgi/doi/10.1073/pnas.0800398105 PNAS | March 11, 2008 | vol. 105 | no. 10 | 3819–3824

The living jawed vertebrates (gnathostomes) fall into two major monophyletic groups, the Chondrichthyes (cartilaginous fishes that include elasmobranchs and holocephalians) and the Osteichthyes (bony vertebrates, which include ray-finned fishes, coelacanths, lungfishes, and tetrapods) that shared a common ancestor ~450 million years ago. The protocadherin cluster has been characterized in several bony vertebrates. Here, we report the sequencing and characterization of the protocadherin cluster from a holocephalian cartilaginous fish, the elephant shark (Callorhinchus milii). The elephant shark possesses a relatively small genome (~910 Mb) among the cartilaginous fishes and is therefore attractive for genome sequencing and comparative analysis (14). Our study shows that the elephant shark protocadherin cluster contains 47 variable exons arrayed in four subclusters. Interestingly, none of the four subclusters show a direct orthology relationship to the α, β, or γ subcluster in bony vertebrates. Thus, the protocadherin cluster in the common ancestor of jawed vertebrates appears to have contained more subclusters of genes than modern vertebrates.

Results

Elephant Shark Protocadherin Cluster Genes Organized into Four Subclusters. Screening of an elephant shark BAC library identified 10 positive clones that belong to a single locus. Three representative overlapping BACs were sequenced to obtain ~409 kb of contiguous sequence spanning the protocadherin cluster. Based on sequence homology and GENSCAN prediction (see Methods), a total of 47 variable exons were identified in this cluster (Fig. 1). In addition, we identified several nonprotocadherin genes flanking the protocadherin cluster [supporting information (SI) Appendix, SI Fig. 5], indicating that the elephant shark protocadherin cluster sequence is complete. The overall gene content of the elephant shark protocadherin cluster (47 variable exons arrayed in four subclusters) is comparable to that in coelacanth (49 variable exons in three subclusters) (4) and human (53 variable exons in three subclusters) (3, 15). However, its gene content is considerably lower than that in fugu (≥77 variable exons) (6) and zebrafish (≥107 variable exons) (2, 4, 5, 7).

The mammalian and coelacanth protocadherin clusters consist of three subclusters, the Pcdhα, -β, and -γ. The teleost clusters, however, lack the β subcluster. To determine the subcluster organization of the elephant shark protocadherin cluster, we first searched for constant exons by homology search and GENSCAN prediction and identified three sets of constant exons. We then determined the splicing pattern of the variable exons with the constant exons by RT-PCR and 3′ RACE. These analyses indicated that the genes are organized into four subclusters. The first subcluster consists of a single variable exon and three constant exons, whereas the second subcluster is comprised of only variable exons (21 in all), each of which is transcribed independently as single exon genes similar to the coelacanth and mammalian β subcluster genes. The third and fourth subclusters consist of 8 and 17 variable exons, respectively, and three constant exons each. In addition, the fourth subcluster contains an alternative constant exon between the second and third constant exons. The splice sites of the constant exons of the three subclusters are shown in SI Appendix, SI Fig. 6. We designated the four subclusters as the Pcdhδ, Pcdhε, Pcdhμ, and Pcdhν subclusters, respectively (Fig. 1).
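The subcluster layout described above can be summarized as a small data structure, which makes the exon counts easy to check. This is only an illustrative sketch: the subcluster names follow the labels in Fig. 1, the counts are taken from the text, and the dictionary layout and function names are assumptions introduced here, not part of the original analysis.

```python
# Exon counts per elephant shark subcluster, as stated in the text.
# The field names ("variable", "constant", "alternative_constant") are
# hypothetical labels chosen for this sketch.
SUBCLUSTERS = {
    "Pcdhδ": {"variable": 1, "constant": 3, "alternative_constant": 0},
    "Pcdhε": {"variable": 21, "constant": 0, "alternative_constant": 0},
    "Pcdhμ": {"variable": 8, "constant": 3, "alternative_constant": 0},
    "Pcdhν": {"variable": 17, "constant": 3, "alternative_constant": 1},
}


def total_variable_exons(subclusters):
    """Sum the variable exons over all subclusters."""
    return sum(entry["variable"] for entry in subclusters.values())


def subclusters_with_constant_exons(subclusters):
    """Names of subclusters that carry their own set of constant exons."""
    return [name for name, entry in subclusters.items() if entry["constant"] > 0]


print(total_variable_exons(SUBCLUSTERS))
print(subclusters_with_constant_exons(SUBCLUSTERS))
```

The total matches the 47 variable exons reported for the cluster, and three subclusters carry constant exons, in line with the three sets of constant exons identified.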

To determine the evolutionary relationships between individual protocadherin (Pcdh) genes in the elephant shark, we aligned the amino acid sequences of all variable exons using ClustalX and generated a phylogenetic tree using the Neighbor-joining method. Sequences of only the first three ectodomains (EC1–3) were used, because the C-terminal ectodomains (EC4–6) have been shown to be susceptible to extensive gene conversion in teleost fishes and human, resulting in homogenization of sequences among paralogs (5, 6). The phylogenetic tree shows that most of the elephant shark Pcdh genes segregate into "paralog groups," each consisting of paralogs that belong to the same subcluster (CmPcdhε2–21; CmPcdhμ1–8; CmPcdhν2–3, and CmPcdhν4–17) (SI Appendix, SI Fig. 7). The only exceptions are the CmPcdhδ1, CmPcdhε1, and CmPcdhν1 genes from different subclusters that form a monophyletic group. Such a relationship between genes from different subclusters indicates they are "intersubcluster paralogs" that shared a common ancestor in the primordial protocadherin cluster. Similar intersubcluster paralogs have been identified in the human protocadherin cluster, in which the Pcdhα genes, c1 and c2, were found to be more closely related to Pcdhγ genes, c4 and c5, than to other genes in the α subcluster (3). The presence of such paralogs indicates that these ancient genes have been highly conserved as single paralogous genes despite the recurrent expansion and loss of genes in their vicinity. The evolutionary conservation of such ancient paralogs between subclusters suggests they may play a fundamental role in protocadherin functions. In support of this hypothesis, it has been shown that, whereas the expression of all other protocadherin genes in the mammalian α subcluster is monoallelic and cell-selective, the c1 and c2 genes are expressed biallelically and in every neuron (16, 17). It is thus possible that the highly conserved intersubcluster paralogs CmPcdhδ1, CmPcdhε1, and CmPcdhν1 in the elephant shark carry out a fundamental role similar to that of "c" genes in mammals. The absence of such a paralog in the Pcdhμ subcluster is likely due to a subcluster-specific gene degeneration.
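The reason for restricting the tree to EC1–3 can be illustrated with a toy distance calculation. The sketch below is not the ClustalX or Neighbor-joining procedure used in the study; it only computes pairwise p-distances (an assumed, simple distance measure) over a chosen range of alignment columns. The sequences are invented, with the last four columns standing in for a homogenized C-terminal region.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment, end_column):
    """Pairwise p-distances using only alignment columns [0, end_column)."""
    trimmed = {name: seq[:end_column] for name, seq in alignment.items()}
    return {
        (m, n): p_distance(trimmed[m], trimmed[n])
        for m, n in combinations(sorted(trimmed), 2)
    }


# Invented toy alignment: columns 0-7 play the role of EC1-3,
# columns 8-11 play the role of a homogenized EC4-6 region.
toy = {
    "geneA": "MKTLLVAGQQQQ",
    "geneB": "MKTLIVSGQQQQ",
    "geneC": "MRSLIVSGQQQQ",
}

full_length = distance_matrix(toy, 12)
n_terminal = distance_matrix(toy, 8)
print(full_length[("geneA", "geneB")], n_terminal[("geneA", "geneB")])
```

Including the identical C-terminal columns shrinks every pairwise distance, which is why homogenized regions can blur the signal that separates paralogs.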

Fig. 1. Genomic organization of the elephant shark protocadherin cluster. Constant exons at the end of each subcluster are shown in red color. Variable exons in the same paralog subgroup or intersubcluster subgroup are shown in the same color. Only a single pseudogene (ψ1) was identified. νCX2b is an alternatively spliced "constant" exon. The positions and identification numbers of BAC clones sequenced are shown below.
[Figure labels: Pcdhδ (δ1, δCX1–3); Pcdhε (ε1–ε21); Pcdhμ (μ1–μ8, μCX1–3); Pcdhν (ν1–ν17, νCX1, νCX2, νCX2b, νCX3); pseudogene ψ1; BAC clones 49C8, 176N9, and 166L19.]

Elephant Shark Protocadherin Subclusters Are Not Orthologous to Those in Mammals, Coelacanth, or Teleost Fishes. To examine the phylogenetic relationships of elephant shark subclusters with those of human, coelacanth, and teleost fishes, we aligned the amino acid sequences of the constant region exons of the CmPcdhδ, -μ, and -ν genes from elephant shark, and Pcdhα and -γ genes from human, mouse, coelacanth, fugu, and zebrafish and constructed a phylogenetic tree using the Neighbor-joining method. We also included sequences of the constant region of Pcdhδ from fugu, zebrafish, coelacanth, Xenopus tropicalis, and chicken that we have identified in the present study (described below). The phylogenetic tree (Fig. 2) indicates that the Pcdhδ sequences from elephant shark, the two teleost fishes, coelacanth, and Xenopus constitute a monophyletic group distinct from other sequences, indicating this is a separate subcluster from the α and γ subclusters. This phylogenetic tree also shows that elephant shark Pcdhμ and -ν subclusters may not be direct orthologs of Pcdhα and -γ subclusters in bony vertebrates (Fig. 2). These elephant shark subclusters are either the orthologs of the α and γ subclusters that have undergone an unusually accelerated rate of evolution involving extensive nucleotide substitutions or simply distinct subclusters that lack orthologs in bony vertebrates. Given there is no evidence that elephant shark genes have been evolving at a rapid rate, and that, in particular, the neighboring Pcdhδ gene shows a normal evolutionary rate, it is more probable that the Pcdhμ and -ν subclusters are distinct protocadherin subclusters that lack orthologs in bony vertebrates.

Because the elephant shark Pcdhε subcluster lacks the constant region, its phylogenetic relationship with other protocadherin subclusters can be inferred only by analyzing the variable exons. To this end, we generated a phylogenetic tree using the alignment of the N-terminal ectodomain sequences (EC1–3) of all variable exons from elephant shark, fugu, and coelacanth. This phylogenetic tree of variable exons suggests that the Pcdhε subcluster is not related to the bony vertebrate β subcluster that also lacks a constant region but is an ancient subcluster that has undergone repeated expansion within the elephant shark lineage (SI Appendix, SI Fig. 8). In addition, this analysis showed that the first genes in the fugu Pcdh1 (FrPcdh1α1) and Pcdh2 clusters (FrPcdh2α1) and the coelacanth protocadherin cluster (LmPcdhα1) are more closely related to elephant shark CmPcdhδ1, -ε1, and -ν1 genes than to other α genes in their respective subclusters (SI Appendix, SI Fig. 8). This suggests that the fugu Pcdh1α1 and Pcdh2α1 genes and coelacanth Pcdhα1 gene may not be α genes but belong to a different subcluster.

A Pcdhδ Subcluster in Teleost Fishes, Coelacanth, Xenopus, and Chicken. To verify this, we searched for constant exons downstream of these variable exons by GENSCAN prediction and TBLASTN by using elephant shark Pcdhδ constant region sequence as a query. We uncovered three new constant exons each in the fugu Pcdh1 and Pcdh2 clusters and in the coelacanth protocadherin cluster. Because the zebrafish Pcdh1α1 gene is orthologous to the fugu Pcdh1α1 gene (6), we extended our study to the zebrafish Pcdh1 locus and identified a similar set of constant exons immediately downstream of the zebrafish Pcdh1α1 gene. RT-PCR analyses with RNA from fugu and zebrafish brain (coelacanth was not included due to unavailability of RNA) confirmed that the first variable exons in the fugu Pcdh1 and Pcdh2 clusters and in the zebrafish Pcdh1 cluster are spliced to the respective newly identified three constant exons and not to the constant region exons in their respective α subclusters.

The absence of an ortholog for the coelacanth Pcdhα1 gene (redesignated as Pcdhδ1) in the human and mouse protocadherin clusters (2, 3, 18, 19) indicates that the Pcdhδ subcluster is lost in mammals. To determine more precisely when this subcluster was lost, we searched the X. tropicalis and chicken genome assemblies by TBLASTN using the elephant shark Pcdhδ constant region sequence as a query. This search led to the identification of a single-variable exon plus three constant exon subclusters at the 5′ end of protocadherin clusters in Xenopus (scaffold_177, length 2.1 Mb) and chicken (chromosome 13). Phylogenetic analysis of the amino acid sequences of the variable exon (SI Appendix, SI Fig. 8) and the constant region exons (Fig. 2) of this cluster from Xenopus and chicken together with their corresponding sequences in fugu, zebrafish, coelacanth, and elephant shark indicated that these genes are indeed orthologs of the elephant shark Pcdhδ subcluster. Thus, we conclude that Pcdhδ is an ancient subcluster that has been conserved in teleost fishes, coelacanth, Xenopus, and chicken protocadherin clusters but lost in the human and mouse protocadherin clusters.

Fig. 2. Phylogenetic analysis of protocadherin constant region exons. The tree was generated from alignment of protein sequences of the protocadherin constant regions by Neighbor-joining method based on sequence distance matrix. The tree is unrooted. Bootstrap values (%) of only the major nodes are shown. Cm, Callorhinchus milii; Hs, Homo sapiens; Mm, Mus musculus; Lm, Latimeria menadoensis; Dr, Danio rerio; Fr, Fugu rubripes; Xt, X. tropicalis; Gg, Gallus gallus.

Table 1. Synonymous substitutions per codon in the ectodomain region of elephant shark and fugu Pcdh paralog subgroups

Subgroup          dS(EC1–6)†   dS(EChigh)‡   dS(EClow)§   dS(EChigh)/dS(EClow)
CmPcdhε2–21       0.211        0.281         0.157        1.8
CmPcdhμ1–8        0.647        1.073         0.534        2.0
CmPcdhν4–17       0.210        0.338         0.146        2.3
FrPcdh2α3–7       0.104        0.517         0.00515      100.4
FrPcdh2α9–25      0.228        1.022         0.00522      195.8
FrPcdh2α26–36     0.327        1.331         0.00104      1,280
FrPcdh2γ1–17      0.067        0.240         0.00302      79.5
FrPcdh2γ19–31     0.227        1.308         0.00555      235.7
FrPcdh2γ33–36     0.297        8.667         0.00310      2,796

†Average dS for each branch in the gene tree of individual subgroups was calculated based on the alignment of paralogs in the subgroup.
‡Average dS per branch calculated based on alignment of the most-divergent ectodomain in each subgroup.
§Average dS per branch calculated based on alignment of the least-divergent ectodomain in each subgroup.

Elephant Shark Protocadherin Cluster Genes Have Experienced Very Little Gene Conversion. The tandem arrays of paralogous protocadherin genes in fugu and zebrafish have experienced repeated gene conversion events, resulting in extensive regional sequence homogenization among paralogs in the same subcluster (5, 6). Gene conversion in protocadherin clusters is usually restricted to the C-terminal ectodomains from EC4 to EC6 and the cytoplasmic fragment encoded by the variable exons, whereas the EC2 and EC3 ectodomain regions seldom undergo gene conversion, and thus are hypothesized to provide the majority of diversifying signals for protocadherin cluster genes. Protocadherin cluster genes in human have also undergone gene conversions although to a lesser extent compared with teleost fishes, but those in coelacanth have experienced few gene conversions (4). To investigate whether the elephant shark protocadherin cluster genes have undergone gene conversion, we estimated the total number of synonymous substitutions per codon (dS) for each of the elephant shark protocadherin paralog subgroups (CmPcdhε2–21, CmPcdhμ1–8, CmPcdhν4–17). The synonymous substitution rate reflects the frequency of gene conversion events, because purifying selection for protein function does not act on synonymous sites. Because gene conversion in protocadherin clusters is usually restricted to the C-terminal ectodomains, we assessed the synonymous substitution rates separately for each ectodomain. The overall ratio of synonymous substitution rates between the most- and the least-divergent ectodomains among different paralog subgroups of the elephant shark protocadherin genes ranges from 1.8 to 2.3 (Table 1). This indicates a uniform rate of synonymous-site substitutions among different ectodomains in the elephant shark paralogs. In contrast, the ratio between the most- and the least-divergent ectodomains among different subgroups in fugu ranges from 79 to 2,796 (Table 1). Such a high ratio indicates that different domains have experienced different rates of substitution, most likely due to gene conversion in some ectodomains. These results suggest that none of the ectodomains in the elephant shark protocadherin cluster genes have undergone gene conversion and sequence homogenization. Thus, like the coelacanth genome, the elephant shark genome appears to be less susceptible to recombination-mediated rearrangements.
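The ratios quoted from Table 1 can be recomputed directly from the tabulated dS values for the most- and least-divergent ectodomains. The sketch below does only that arithmetic; the rows are labeled by species rather than by subgroup name, and the cutoff of 10 used to flag a subgroup is an arbitrary value chosen here for illustration, not a threshold from the study.

```python
# (species, dS of most-divergent ectodomain, dS of least-divergent ectodomain),
# copied from Table 1 in row order.
TABLE1 = [
    ("elephant shark", 0.281, 0.157),
    ("elephant shark", 1.073, 0.534),
    ("elephant shark", 0.338, 0.146),
    ("fugu", 0.517, 0.00515),
    ("fugu", 1.022, 0.00522),
    ("fugu", 1.331, 0.00104),
    ("fugu", 0.240, 0.00302),
    ("fugu", 1.308, 0.00555),
    ("fugu", 8.667, 0.00310),
]

# Hypothetical cutoff for this sketch only: ratios far above 1 suggest that
# some ectodomains were homogenized while others kept diverging.
ILLUSTRATIVE_CUTOFF = 10.0


def ratio_range(rows, species):
    """Return (min, max) of dS_high / dS_low for one species."""
    ratios = [high / low for name, high, low in rows if name == species]
    return min(ratios), max(ratios)


for species in ("elephant shark", "fugu"):
    low, high = ratio_range(TABLE1, species)
    flagged = low > ILLUSTRATIVE_CUTOFF
    print(f"{species}: {low:.1f} to {high:.1f}, uneven rates: {flagged}")
```

With these inputs the elephant shark ratios stay close to 2, whereas every fugu subgroup exceeds the illustrative cutoff by a wide margin, which is the contrast the text draws.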

Evolution of Core Promoter Elements of Protocadherin Cluster Genes. Previous studies have identified a 15-bp core sequence element conserved in the promoter sequences of all mammalian and coelacanth protocadherin genes and in zebrafish Pcdh1γ and Pcdh2γ genes but divergent in zebrafish Pcdh1α and Pcdh2α genes (5, 18, 20). This core element includes a "CGCT" motif implicated in promoter function (20). We searched the promoter regions (480 bp upstream of the start codon) of the elephant shark genes using the MEME algorithm to look for conserved motifs in genes among different subclusters. Our results show that the elephant shark Pcdhε and Pcdhμ subcluster genes share a highly conserved 27-bp promoter element (Fig. 3 A and B). The presence of this element in genes from two different subclusters and the high level of conservation indicate that this is an ancient promoter element that is under purifying selection. The promoter sequences of the elephant shark Pcdhν genes, however, are divergent from this element and also from each other, except for a short CAAT-box-like motif in the 3′ region (Fig. 3C). These genes, located at the 3′ end of the elephant shark protocadherin cluster, are likely to be regulated differently from those of Pcdhε and Pcdhμ genes.
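The idea of taking a fixed window upstream of the start codon and looking for a short motif can be sketched as follows. This is not MEME, which discovers motifs de novo; the sketch only scans for exact matches to the "CGCT" motif on the forward strand. The sequence, the start-codon coordinate, and the function names are invented for illustration.

```python
def upstream_region(sequence, start_codon_index, length=480):
    """Return up to `length` bases immediately upstream of the start codon."""
    begin = max(0, start_codon_index - length)
    return sequence[begin:start_codon_index]


def motif_positions(region, motif="CGCT"):
    """0-based positions of every exact (possibly overlapping) motif match."""
    positions = []
    index = region.find(motif)
    while index != -1:
        positions.append(index)
        index = region.find(motif, index + 1)
    return positions


# Invented toy locus: 500 filler bases, a short stretch containing CGCT,
# then an ATG start codon at index 508.
toy_locus = "A" * 500 + "TTCGCTAA" + "ATG" + "GGG"
start = toy_locus.index("ATG")

promoter = upstream_region(toy_locus, start)
print(len(promoter), motif_positions(promoter))
```

A real analysis would also consider the reverse strand and degenerate positions, which is what a motif-discovery tool provides.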

The first 15 bp of the 27-bp element in the elephant shark Pcdhε and Pcdhμ genes are well conserved in the Pcdhα, -β, and -γ genes in human and coelacanth (Fig. 3 D–H) and in Pcdhγ genes in fugu (Fig. 3J). The 3′ region of this element, however, is divergent in bony vertebrates, except for the CAAT-box-like motif that is partially conserved in the Pcdhα and Pcdhγ genes in human and coelacanth (Fig. 3 D–G). The substitutions accumulated in the 3′ region of this promoter element in bony vertebrates may have resulted in the acquisition of new expression patterns and contributed to adaptive diversity of protocadherin cluster genes. The promoter elements of fugu Pcdhα subcluster genes are quite divergent from the promoter elements in all other protocadherin clusters. Even the "CGCT" motif is rather weakly conserved in the promoters of these genes (Fig. 3I). Thus, these fugu genes appear to have undergone further adaptive radiation in the teleost lineage.

Fig. 3. Comparison of the promoter elements of protocadherin genes. The core promoter elements of elephant shark Pcdhε (A), Pcdhμ (B), and Pcdhν (C); human Pcdhα (D), Pcdhβ (H), and Pcdhγ (F); coelacanth Pcdhα (E) and Pcdhγ (G); and fugu Pcdhα (I) and Pcdhγ (J) were identified by MEME (http://meme.sdsc.edu/meme/intro.html) and displayed as WebLogo plots.

Syntenic Block in the Elephant Shark Protocadherin Locus Is Conserved in Mammals but Rearranged in Fugu. The elephant shark protocadherin locus contains a block of syntenic genes (Hars, Dnd1, and Zmat2, protocadherin cluster, and Diaph1 genes) spanning ~409 kb that is conserved in the human protocadherin locus. However, the human locus is considerably larger (~950 kb) than the elephant shark locus and contains an extra gene, HARS2, which is a duplicated copy of HARS (SI Appendix, SI Fig. 5). The content and order of genes in the orthologous mouse locus are identical to that in the human (data not shown). In contrast to the well conserved synteny between the elephant shark and mammalian protocadherin loci, the regions flanking the duplicate protocadherin clusters of fugu have experienced extensive rearrangements (SI Appendix, SI Fig. 5). Comparisons of ~1.4× whole-genome shotgun sequences of elephant shark derived from paired end sequences of fosmid clones with the human and zebrafish genome assemblies have previously indicated that the level of conserved synteny between the elephant shark and human genomes is higher than that between the elephant shark and zebrafish genomes (21). The highly conserved synteny of genes at the protocadherin locus in the elephant shark and mammals and its disruption in fugu provides further support to this observation and is consistent with the notion that teleost fish genomes have experienced a higher rate of chromosomal rearrangements relative to elephant shark and mammals. Reconstruction of ancestral vertebrate karyotypes based on comparisons of teleost fish genomes (medaka, zebrafish, and Tetraodon) and human genome has suggested that several major chromosomal rearrangements occurred in the fish lineage within a short period of ~50 million years after the “fish-specific” whole-genome duplication (22). The whole-genome duplication may have triggered an accelerated rate of rearrangements by facilitating recombination between paralogous chromosomal segments.
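One simple way to express "conserved gene order despite an inserted gene" is a subsequence test, sketched below. The elephant shark order follows the text. The position of HARS2 in the human-like list and the entire rearranged list are placeholders chosen for illustration, because the text does not give them. The sketch also ignores gene orientation and inverted blocks.

```python
def is_order_conserved(reference, other):
    """True if every gene in `reference` occurs in `other` in the same relative order.

    Extra genes in `other` (for example a lineage-specific duplicate) are tolerated.
    Gene symbols are compared case-insensitively.
    """
    remaining = iter(g.upper() for g in other)
    # `x in iterator` consumes the iterator up to the first match, so each
    # reference gene must be found after the previous one.
    return all(g.upper() in remaining for g in reference)

shark = ["Hars", "Dnd1", "Zmat2", "Pcdh cluster", "Diaph1"]
# HARS2 position is an assumption; the text only says the human locus has this extra gene.
human_like = ["HARS", "HARS2", "DND1", "ZMAT2", "PCDH CLUSTER", "DIAPH1"]
# Purely hypothetical shuffled order standing in for a rearranged locus.
rearranged_toy = ["ZMAT2", "DIAPH1", "HARS", "PCDH CLUSTER", "DND1"]

print(is_order_conserved(shark, human_like))      # order preserved despite the extra gene
print(is_order_conserved(shark, rearranged_toy))  # order broken
```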

Discussion

The living jawed vertebrates comprise two major lineages, the cartilaginous fishes and bony vertebrates. In this study, we have characterized the protocadherin cluster from a cartilaginous fish, the elephant shark. The protocadherin cluster in elephant shark contains 47 genes organized into four subclusters, the Pcdhδ, -ε, -μ, and -ν. Surprisingly, phylogenetic analyses indicated that these subclusters are not direct orthologs of the Pcdhα, -β, or -γ subclusters previously identified in bony vertebrates. We have shown that Pcdhδ is an ancient subcluster that is conserved in teleost fishes, coelacanth, Xenopus, and chicken but lost in mammals. Thus, our study suggests that the protocadherin cluster in the ancestral jawed vertebrate contained more subclusters than previously proposed based on characterization of the protocadherin cluster in bony vertebrates. We propose that the protocadherin cluster in the last common ancestor of jawed vertebrates contained seven subclusters (Pcdhα, -β, -γ, -δ, -ε, -μ, and -ν) with multiple variable exons (Fig. 4). After the divergence of the cartilaginous fish and bony vertebrate lineages, this ancestral cluster has experienced differential loss of entire subclusters of genes in different vertebrate lineages. Although the Pcdhα, -β, and -γ subclusters were lost in the elephant shark lineage, the common ancestor of bony vertebrates lost the Pcdhε, -μ, and -ν subclusters. Of the remaining four subclusters (Pcdhα, -β, -γ, and -δ) in bony vertebrates, the Pcdhβ subcluster was lost in the ray-finned fish lineage, most likely before the “fish-specific” whole-genome duplication event. Another subcluster, Pcdhδ, was lost more recently in the mammalian lineage (Fig. 4). Besides the entire subclusters, the constant region exons of protocadherin genes have also been targeted for lineage-specific losses. The constant exons of the Pcdh� and -β subclusters were lost independently in the elephant shark lineage and in a common ancestor of lobe-finned fishes, respectively (Fig. 4). Thus, the characterization of the protocadherin cluster in the elephant shark has demonstrated the importance of cartilaginous fish genome sequence in reconstructing the evolutionary history of vertebrate genomes and in inferring the organization of the ancestral jawed vertebrate genome. Because they are the most ancient phylogenetic group of jawed vertebrates, cartilaginous fishes not only provide a better insight into the now-extinct ancestral jawed vertebrate genome but also serve as an outgroup that helps in identifying lineage-specific adaptive changes in different lineages of bony vertebrates.
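The differential-loss model proposed above can be restated as set subtraction along a small lineage tree. The sketch below is only a bookkeeping aid under that model. The subcluster names are spelled out in ASCII, and the lineage labels and data structure are assumptions made for this example.

```python
# Seven subclusters proposed for the last common ancestor of jawed vertebrates.
ANCESTRAL = frozenset({"alpha", "beta", "gamma", "delta", "epsilon", "mu", "nu"})

# lineage -> (parent lineage, or None for the root; subclusters lost on that branch)
BRANCHES = {
    "jawed vertebrate ancestor": (None, frozenset()),
    "elephant shark": ("jawed vertebrate ancestor", frozenset({"alpha", "beta", "gamma"})),
    "bony vertebrate ancestor": ("jawed vertebrate ancestor", frozenset({"epsilon", "mu", "nu"})),
    "ray-finned fishes": ("bony vertebrate ancestor", frozenset({"beta"})),
    "coelacanth": ("bony vertebrate ancestor", frozenset()),
    "mammals": ("bony vertebrate ancestor", frozenset({"delta"})),
}

def retained(lineage: str) -> frozenset:
    """Subclusters still present in `lineage` after all losses on the path from the root."""
    parent, lost = BRANCHES[lineage]
    inherited = ANCESTRAL if parent is None else retained(parent)
    return inherited - lost

for name in BRANCHES:
    print(f"{name}: {sorted(retained(name))}")
```

Under these losses the model gives four subclusters for the elephant shark and three for mammals, matching the cluster organizations reported in the text.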

The elephant shark genome was proposed as a model cartilaginous fish genome because of its relatively small size (14). Subsequently, comparative analysis of ~1.4× coverage elephant shark sequences with the whole-genome assemblies of human, fugu, and zebrafish suggested that noncoding sequences in elephant shark are evolving more slowly than in teleost fishes and that the elephant shark genome has experienced fewer chromosomal rearrangements compared with teleost fish genomes (21, 23). The characterization of the protocadherin locus in elephant shark has provided further support to the hypothesis that the elephant shark genome is more stable compared with teleost fish genomes. The conserved synteny of genes at the protocadherin locus in elephant shark and human, the lower turnover of protocadherin genes in the elephant shark protocadherin cluster, and a lack of evidence for gene conversion between paralogous protocadherin genes in elephant shark suggest that the elephant shark genome is less prone to recombination-mediated rearrangements and tandem gene duplications compared with teleost fish genomes. Although we did not measure the relative substitution rates of the promoter sequences of protocadherin genes, the relatively higher levels of conservation of the 15-bp core promoter element in the elephant shark compared with that in fugu and coelacanth indicate that the regulatory region of elephant shark protocadherin genes is evolving more slowly than that in teleosts and coelacanth. The relatively small and slowly evolving genome of the elephant shark, therefore, is an important “reference” for understanding the evolution of vertebrate genomes.

[Figure 4: schematic of protocadherin clusters. Panels: Ancestor Pcdh (Pcdhδ, Pcdhε, Pcdhα, Pcdhβ, Pcdhγ, Pcdhμ, Pcdhν); Elephant shark Pcdh; Coelacanth Pcdh; Human Pcdh; Fugu Pcdh1; Fugu Pcdh2.]

Fig. 4. A model to explain the evolutionary history of the protocadherin cluster in jawed vertebrates. Variable exons are indicated by boxes and constant exons by red bars. Intersubcluster paralogs or orthologs between species are shown in the same color. White boxes represent protocadherin genes or paralog subgroups with no apparent intersubcluster paralogs or orthologs. Encircled “D” denotes whole-genome duplication in the ray-finned fish lineage.

Yu et al. PNAS | March 11, 2008 | vol. 105 | no. 10 | 3823

Methods

Sequencing and Assembly of BACs. To identify probes for the elephant shark protocadherin cluster, we first performed a TBLASTN search of the whole-genome shotgun sequences of elephant shark (GenBank accession nos. CW854842–CW882785) using the amino acid sequences of human protocadherins as queries. We identified two sequences (GenBank accession nos. CW868635 and CW882727) that showed high similarity to human protocadherin. The shotgun clones of these sequences were used to probe the IMCB_Eshark BAC library (cloned in pCCBAC-EcoRI; unpublished data). In total, 10 positive clones were identified, from among which three representative overlapping clones (49C8, 166L19, and 176N9) were sequenced completely by the standard shotgun sequencing method and assembled using SeqMan (Lasergene). This sequence has been submitted to GenBank (accession no. EF693954).

Sequence Annotation. The variable and constant exons of elephant shark protocadherin genes and exons of the nonprotocadherin genes flanking the protocadherin cluster were identified and annotated based on homology (TBLASTN and BLASTX) (www.ncbi.nlm.nih.gov) and GENSCAN prediction (24). Human orthologs of the elephant shark nonprotocadherin genes were identified by BLAT search of the human genome at University of California, Santa Cruz (UCSC), Genome Browser (http://genome.ucsc.edu). The genomic sequences of human and mouse protocadherin clusters were retrieved from the UCSC Genome Browser, whereas the coelacanth protocadherin cluster sequence was assembled from BAC sequences in GenBank (accession nos. AC150238, AC150284, and AC150308–AC150310) (4). The genomic sequences of fugu protocadherin clusters were also obtained from GenBank (accession nos. DQ986917–DQ986918) (6). Protocadherin sequences of X. tropicalis and chicken were identified by BLAT search of their genome assemblies (Ver. 4.1 and 2.1, respectively; http://genome.ucsc.edu) using the protein sequence of the elephant shark Pcdhδ constant region as query. Gaps in the intronic regions of Xenopus and chicken sequences were filled by PCR amplification of the genomic DNA. The complete genomic sequences of the Xenopus and chicken Pcdhδ subclusters have been submitted to GenBank (accession nos. EU267079 and EU267080).

RT-PCR and 3′RACE Analyses. Total RNA was extracted from the elephant shark, fugu, and zebrafish tissues by the TRIzol method (Invitrogen) and reverse-transcribed using the SMART rapid amplification of cDNA ends (RACE) cDNA Amplification kit (Clontech). The PCR was performed by an initial denaturing step of 95°C for 2 min, followed by 35 cycles of 95°C for 30 sec, 55°C for 30 sec, and 72°C for 1–3 min. 3′RACE analysis was performed according to the manufacturer’s instructions (Clontech). The RACE products were sequenced completely.
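For planning purposes, the cycling program above can be written down as data and its total hold time computed for the shortest and longest extension step. This is a small arithmetic sketch. Ramp times between temperatures and any final extension or hold are not stated in the text and are ignored here, and the constant and function names are assumptions.

```python
# Cycling parameters as stated in the text (seconds).
INITIAL_DENATURATION_S = 2 * 60   # 95°C for 2 min
CYCLES = 35
DENATURE_S = 30                   # 95°C
ANNEAL_S = 30                     # 55°C

def total_hold_seconds(extension_s: int) -> int:
    """Sum of programmed hold times for a given 72°C extension time (no ramping)."""
    per_cycle = DENATURE_S + ANNEAL_S + extension_s
    return INITIAL_DENATURATION_S + CYCLES * per_cycle

for extension_min in (1, 3):      # the text gives a 1–3 min extension range
    minutes = total_hold_seconds(extension_min * 60) / 60
    print(f"{extension_min} min extension: {minutes:.0f} min of programmed holds")
```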

Phylogenetic Analyses. The amino acid sequences of EC1–3 domains of protocadherin cluster genes from various species were aligned by ClustalX (25). Phylogenetic trees were constructed by the Neighbor-joining method based on sequence distance matrix and displayed using NJplot (www-igbmc.u-strasbg.fr/BioInfo). The robustness of the tree was determined by bootstrap analysis of 1,000 replicate sample sequences.
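The two ingredients of this analysis that are easy to show in a few lines are the pairwise distance matrix and the bootstrap resampling of alignment columns. The sketch below assumes an uncorrected p-distance, which the text does not specify, and omits the tree-building step itself. The sequences are toy strings and the function names are illustrative.

```python
import random

def p_distance(a: str, b: str) -> float:
    """Fraction of differing residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns shared by the two sequences")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(alignment: dict) -> dict:
    """Pairwise p-distances keyed by (name, name)."""
    names = list(alignment)
    return {(p, q): p_distance(alignment[p], alignment[q]) for p in names for q in names}

def bootstrap_replicate(alignment: dict, rng: random.Random) -> dict:
    """Resample alignment columns with replacement, keeping rows in register."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}

# Toy aligned sequences (hypothetical, not protocadherin EC1-3 domains).
toy = {"geneA": "MKTLLV", "geneB": "MKTLIV", "geneC": "MRT-LV"}
rng = random.Random(0)
replicates = [bootstrap_replicate(toy, rng) for _ in range(1000)]
print(distance_matrix(toy)[("geneA", "geneC")])
```

In a full analysis each replicate would be turned into a distance matrix and a Neighbor-joining tree, and the support for a clade would be the fraction of replicate trees that contain it.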

Synonymous Substitution Analyses. The synonymous substitution rates were estimated using the CODEML program in PAML package with default parameters (26). The nucleotide sequence alignments were generated by RevTrans program (27), using amino acid sequence alignment (generated by ClustalX) as templates. The synonymous substitution rates were calculated as average dS for each branch in the gene tree of individual subgroups.
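The idea of using a protein alignment as a template for a codon alignment can be sketched as follows: each aligned residue is replaced by its codon and each gap by three gap characters. This is an illustration of the general technique, not the RevTrans implementation. It does not check that the codons actually translate to the aligned residues, and the function name is an assumption.

```python
def thread_codons(aligned_protein: str, cds: str) -> str:
    """Map an ungapped coding sequence onto one gapped row of a protein alignment."""
    n_residues = len(aligned_protein.replace("-", ""))
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for the aligned protein")
    # Consume codons in order, one per non-gap residue.
    codons = iter(cds[i:i + 3] for i in range(0, 3 * n_residues, 3))
    return "".join("---" if residue == "-" else next(codons) for residue in aligned_protein)

# Toy example: a three-residue protein aligned with one internal gap.
print(thread_codons("MK-L", "ATGAAACTG"))
```

The resulting codon alignment keeps the reading frame intact, which is what a codon-based model such as the one in CODEML requires as input.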

ACKNOWLEDGMENTS. The research work in the W.-P.Y. laboratory is supported by the Biomedical Research Council (BMRC); the National Medical Research Council (NMRC); and the SingHealth Foundation Funds, Singapore; and the work in B.V.’s laboratory is supported by the Agency for Science, Technology and Research (A*STAR), Singapore. B.V. is adjunct staff of the Department of Pediatrics, Yong Loo Lin School of Medicine, National University of Singapore.

1. Ma J, et al. (2006) Reconstructing contiguous regions of an ancestral genome. Genome Res 16:1557–1565.

2. Wu Q (2005) Comparative genomics and diversifying selection of the clustered vertebrate protocadherin genes. Genetics 169:2179–2188.

3. Wu Q, Maniatis T (1999) A striking organization of a large family of human neural cadherin-like cell adhesion genes. Cell 97:779–790.

4. Noonan JP, et al. (2004) Coelacanth genome sequence reveals the evolutionary history of vertebrate genes. Genome Res 14:2397–2405.

5. Noonan JP, Grimwood J, Schmutz J, Dickson M, Myers RM (2004) Gene conversion and the evolution of protocadherin gene cluster diversity. Genome Res 14:354–366.

6. Yu WP, Yew K, Rajasegaran V, Venkatesh B (2007) Sequencing and comparative analysis of fugu protocadherin clusters reveal diversity of protocadherin genes among teleosts. BMC Evol Biol 7:49.

7. Tada MN, et al. (2004) Genomic organization and transcripts of the zebrafish Protocadherin genes. Gene 340:197–211.

8. Morishita H, Yagi T (2007) Protocadherin family: diversity, structure, and function. Curr Opin Cell Biol 19:584–592.

9. Kohmura N, et al. (1998) Diversity revealed by a novel family of cadherins expressed in neurons at a synaptic complex. Neuron 20:1137–1151.

10. Shapiro L, Colman DR (1999) The diversity of cadherins and implications for a synaptic adhesive code in the CNS. Neuron 23:427–430.

11. Hill E, Broadbent ID, Chothia C, Pettitt J (2001) Cadherin superfamily proteins in Caenorhabditis elegans and Drosophila melanogaster. J Mol Biol 305:1011–1024.

12. Sasakura Y, et al. (2003) A genomewide survey of developmentally relevant genes in Ciona intestinalis. X. Genes for cell junctions and extracellular matrix. Dev Genes Evol 213:303–313.

13. Whittaker CA, et al. (2006) The echinoderm adhesome. Dev Biol 300:252–266.

14. Venkatesh B, Tay A, Dandona N, Patil JG, Brenner S (2005) A compact cartilaginous fish model genome. Curr Biol 15:R82–R83.

15. Vanhalst K, Kools P, Vanden Eynde E, van Roy F (2001) The human and murine protocadherin-beta one-exon gene families show high evolutionary conservation, despite the difference in gene number. FEBS Lett 495:120–125.

16. Ribich S, Tasic B, Maniatis T (2006) Identification of long-range regulatory elements in the protocadherin-alpha gene cluster. Proc Natl Acad Sci USA 103:19719–19724.

17. Kaneko R, et al. (2006) Allelic gene regulation of Pcdh-alpha and Pcdh-gamma clusters involving both monoallelic and biallelic expression in single Purkinje cells. J Biol Chem 281:30551–30560.

18. Wu Q, et al. (2001) Comparative DNA sequence analysis of mouse and human protocadherin gene clusters. Genome Res 11:389–404.

19. Zou C, Huang W, Ying G, Wu Q (2007) Sequence analysis and expression mapping of the rat clustered protocadherin gene repertoires. Neuroscience 144:579–603.

20. Tasic B, et al. (2002) Promoter choice determines splice site selection in protocadherin alpha and gamma pre-mRNA splicing. Mol Cell 10:21–33.

21. Venkatesh B, et al. (2007) Survey sequencing and comparative analysis of the elephant shark (Callorhinchus milii) genome. PLoS Biol 5:e101.

22. Kasahara M, et al. (2007) The medaka draft genome and insights into vertebrate genome evolution. Nature 447:714–719.

23. Venkatesh B, et al. (2006) Ancient noncoding elements conserved in the human genome. Science 314:1892.

24. Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268:78–94.

25. Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 25:4876–4882.

26. Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13:555–556.

27. Wernersson R, Pedersen AG (2003) RevTrans: Multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res 31:3537–3539.
