supporting information appendix - pnas · s1 genome sequencing and de novo assembly s1.1 plant...



Page 1: Supporting Information Appendix - PNAS · S1 Genome sequencing and de novo assembly S1.1 Plant materials for sequencing The tea plant, Camellia sinensis (L.) O. Kuntze, belongs to

www.pnas.org/cgi/doi/10.1073/pnas.1719622115

Supporting Information Appendix

Draft genome sequence of Camellia sinensis var. sinensis provides insights into the evolution of the tea genome and tea quality

Chaoling Wei^{a,1}, Hua Yang^{a,1}, Songbo Wang^{b,1}, Jian Zhao^{a,1}, Chun Liu^{b,1}, Liping Gao^{a,1}, Enhua Xia^{a}, Ying Lu^{c}, Yuling Tai^{a}, Guangbiao She^{a}, Jun Sun^{a}, Haisheng Cao^{a}, Wei Tong^{a}, Qiang Gao^{b}, Yeyun Li^{a}, Weiwei Deng^{a}, Xiaolan Jiang^{a}, Wenzhao Wang^{a}, Qi Chen^{a}, Shihua Zhang^{a}, Haijing Li^{a}, Junlan Wu^{a}, Ping Wang^{a}, Penghui Li^{a}, Chengying Shi^{a}, Fengya Zheng^{b}, Jianbo Jian^{b}, Bei Huang^{a}, Dai Shan^{b}, Mingming Shi^{b}, Congbing Fang^{a}, Yi Yue^{a}, Fangdong Li^{a}, Daxiang Li^{a}, Shu Wei^{a}, Bin Han^{d}, Changjun Jiang^{a}, Ye Yin^{b}, Tao Xia^{a}, Zhengzhu Zhang^{a}, Jeffrey L. Bennetzen^{a,e,2}, Shancen Zhao^{b,2}, Xiaochun Wan^{a,2}

^{a} State Key Laboratory of Tea Plant Biology and Utilization, Anhui Agricultural University, Hefei, 230036, China.

^{b} BGI Genomics, BGI–Shenzhen, Shenzhen 518083, China.

^{c} College of Fisheries and Life Science, Shanghai Ocean University, Shanghai 201306, China.

^{d} National Center for Gene Research, Shanghai Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200032, China.

^{e} Department of Genetics, University of Georgia, Athens, GA 30602, USA

^{1} C.W., H.Y., S.W., J.Z., C.L., L.G. contributed equally to this work.

^{2} To whom correspondence may be addressed. Email: [email protected], [email protected] or [email protected].

Contributed by Jeffrey L. Bennetzen


Table of contents

S1 Genome sequencing and de novo assembly
S1.1 Plant materials for sequencing
S1.2 Estimation of the genome size
S1.3 Whole genome sequencing
S1.4 De novo assembly process
S1.5 Comparison of the heterozygosity
S1.6 Evaluation of genome quality
S1.7 Assessment of chloroplast insertions

S2 Genome annotation and characterization
S2.1 Annotation of transposable elements
S2.2 Prediction of protein-encoding genes
S2.3 Functional annotation
S2.4 Annotation of transcription factors
S2.5 Scaffold anchoring by genetic map
S2.6 Annotation of non-coding RNA genes
S2.7 Characterization of gene structures

S3 Comparison of CSS with CSA genomes
S3.1 Genomic synteny between CSS and CSA genome assemblies
S3.2 Characterization of collinear orthologous genes between CSS and CSA

S4 Genome evolution and expansion
S4.1 Orthologous gene clusters
S4.2 Phylogenetic inference
S4.3 Tea-specific gene families
S4.4 Expansion and contraction of gene families
S4.5 Genome expansion of tea genome

S5 Evolution of catechin biosynthesis
S5.1 Catechin biosynthesis pathway
S5.2 Extraction and HPLC analysis of catechins and PAs
S5.3 Identification of genes involved in the catechin biosynthesis
S5.4 Evolution of genes involved in the catechin biosynthesis
S5.5 Expression of catechin biosynthetic genes
S5.6 Correlation analyses of gene expression patterns

S6 Multilevel regulation of catechin biosynthesis
S6.1 Correlation of gene expression and catechin contents
S6.2 Regulatory network for catechin biosynthesis
S6.3 Transporters related to flavonoids

S7 Identification of genes encoding theanine synthetase
S7.1 Theanine biosynthesis pathway
S7.2 Theanine biosynthetic genes
S7.3 Evolution of genes for theanine synthetase
S7.4 Detection of theanine in different tissues
S7.5 Comparison of protein sequences
S7.6 Verification of in vivo theanine synthesis function of CsTSI

References

Supporting Figures


S1 Genome sequencing and de novo assembly

S1.1 Plant materials for sequencing

The tea plant, Camellia sinensis (L.) O. Kuntze, belongs to section Thea of the genus Camellia in the family Theaceae. Camellia sinensis includes two main varieties: C. sinensis var. sinensis (CSS) and C. sinensis var. assamica (CSA) (1). The tea plant is highly heterozygous because of genetic barriers such as inbreeding depression and the related self-incompatibility (1). To select plant material suitable for genome sequencing, we previously used RAD-Seq to identify heterozygous SNPs in 18 cultivated and wild tea accessions from six Camellia species and varieties, and used those SNPs to estimate the heterozygosity of each accession (2). On average, heterozygosity was slightly higher in cultivated varieties than in wild tea plants, which could result from frequent hybridization during recent breeding. The cultivated variety “Shuchazao” of Camellia sinensis (L.) O. Kuntze showed relatively low heterozygosity and was selected for sequencing (Dataset S1). Shuchazao is a diploid plant with 15 pairs of chromosomes (2n = 30). It was produced by individual plant selection from local natural populations, followed by cutting propagation, by the “916 Tea Plantation of Shucheng” project in Anhui province, China, and was certified as a national variety (accession number: GS2002008) by the National Crop Variety Approval Committee in 2002. The variety is characterized by medium-sized leaves, early sprouting and high resistance to cold and drought. It combines excellent tea quality with high yield and has become very popular in China, where it is currently grown in six provinces on a total planting area of approximately 20,000 hectares. DNA for sequencing was derived from tender shoots of an individual plant clone. The leaves from the sequenced tea plants were disinfected with 70% ethyl alcohol and rinsed with distilled water, then immediately frozen in liquid nitrogen and stored at -80°C prior to DNA extraction.

S1.2 Estimation of the genome size

To estimate genome size, tender leaves were collected from the sequenced tea plant and analyzed by flow cytometry. A total of 35 samples were analyzed using soybean (Glycine max cv. Polanka, 2C DNA value = 2.5 pg) as the genome size standard. Over 5,000 nuclei per sample were collected and detected with a CyFlow® Space flow cytometer (Partec, Germany) equipped with a UV-LED source (emission at 365 nm) and a blue solid-state laser (λ = 455 nm). The data were analyzed using the software Flomax 2.8, and the coefficient of variation was under 5%. In the flow cytometry histogram, the left peak, at 52.64, corresponds to the 2C DNA of Glycine max, and the right peak, at 132.52, corresponds to the 2C DNA of CSS. Relative to soybean (2.5 pg), the nuclear DNA amount of the tea plant was estimated as 6.09 ± 0.20 pg, which indicates that the genome size of “Shuchazao” is 2.98 ± 0.10 Gb (1 pg DNA = 0.978 × 10^9 bp) (3) (SI Appendix, Fig. S1a).
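The conversion from a 2C DNA content in picograms to a haploid genome size can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the peak-ratio formula is the usual internal-standard calculation, assumed here rather than stated in the text. The peak positions quoted above describe one histogram, so they need not reproduce the 35-sample mean of 6.09 pg exactly.

```python
# Conversion factor quoted in the text: 1 pg DNA = 0.978 x 10^9 bp.
BP_PER_PG = 0.978e9


def two_c_content_pg(sample_peak, standard_peak, standard_2c_pg):
    """2C DNA content of a sample from the ratio of fluorescence peak positions.

    Assumed internal-standard formula (not spelled out in the text).
    """
    return sample_peak / standard_peak * standard_2c_pg


def haploid_genome_size_gb(two_c_pg):
    """Haploid (1C) genome size in Gb from a 2C DNA content in pg."""
    return (two_c_pg / 2) * BP_PER_PG / 1e9


# Reported mean and spread over 35 samples: 6.09 +/- 0.20 pg per 2C nucleus.
size_gb = haploid_genome_size_gb(6.09)
error_gb = haploid_genome_size_gb(0.20)
print(f"{size_gb:.2f} +/- {error_gb:.2f} Gb")  # 2.98 +/- 0.10 Gb
```

Halving the 2C value before converting is what turns the 6.09 pg measurement into the 2.98 Gb haploid estimate.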

To further estimate the tea genome size, we also performed k-mer analysis with next-generation sequencing short reads (Supporting Information S1.3). A k-mer is an oligonucleotide of k bp in length. The k-mer frequencies derived from the sequencing reads follow a Poisson distribution in a given data set. For a given k, the genome size G can be inferred from the total number of k-mers (K_num) divided by the k-mer depth at the main peak of the frequency distribution (peak_depth): G = K_num / peak_depth. With k set to 17, the 350 Gb of sequencing reads from short-insert size libraries generated a total of 305,017,456,592 k-mers, and the peak_depth was about 103. From these statistics, we estimated the Shuchazao genome size to be about 2.96 Gb, which is consistent with the flow cytometry estimate and with the size previously reported for C. sinensis var. assamica (4) (SI Appendix, Fig. S1b).
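The estimate G = K_num / peak_depth can be reproduced with a few lines of Python. This is a toy sketch: `count_kmers` and `genome_size_from_kmers` are hypothetical names, and counting k-mers for a real data set of this size would require a dedicated k-mer counter rather than an in-memory `Counter`.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def genome_size_from_kmers(k_num, peak_depth):
    """G = K_num / peak_depth, as defined in the text."""
    return k_num / peak_depth


# Values reported in the text for k = 17.
size_bp = genome_size_from_kmers(305_017_456_592, 103)
print(f"{size_bp / 1e9:.2f} Gb")  # 2.96 Gb
```

In practice K_num is the sum of all counts returned by the k-mer counter, and peak_depth is read off the k-mer frequency histogram.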

S1.3 Whole genome sequencing

Tender shoots were harvested from the Shuchazao plant for genomic DNA extraction. Genomic DNA was extracted using the standard cetyltrimethyl ammonium bromide (CTAB) procedure (5). A total of 10 paired-end (PE) libraries were constructed using paired-end kits (Illumina, San Diego, CA, USA) with average insert sizes of approximately 170 bp, 250 bp, 500 bp and 800 bp. In addition, 10 mate-pair (MP) libraries were prepared with mate-pair kits (Illumina, San Diego, CA, USA) with average insert sizes of 2 kb, 5 kb, 10 kb, 20 kb and 40 kb. Library preparation, sequencing and base calling followed the standard Illumina protocols on the Illumina HiSeq 2500 platform. In total, we generated 2124 Gb (about 699-fold depth) of raw sequencing data, including 1261 Gb of short-insert data (<1 kb) and 826 Gb of large-insert data (>=2 kb). To retain only high-quality reads, the following filters were applied: (1) removing reads from short insert-size libraries (<1 kb) in which N constituted more than 2% of bases, and reads from long insert-size libraries (>=2 kb) in which more than 5% of bases were poly(A); (2) removing low-quality reads, defined for short insert-size libraries (<1 kb) as reads in which 40% or more of bases had a quality score <=7, and for long insert-size libraries (>=2 kb) as reads in which 30% or more did; (3) removing reads with more than 10 bp aligned to the adapter sequence (allowing up to 3 bp of mismatch); (4) removing reads from short insert-size libraries when the two reads of a pair overlapped by 10 bp or more, allowing for a 10% mismatch; (5) removing PCR duplicate reads; (6) removing possible contaminating reads of known bacterial or viral origin. Finally, a total of 1325 Gb (436-fold depth) of high-quality reads were used for subsequent assembly (Dataset S2).
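Filters (1) and (2) above can be sketched as a per-read predicate. This is an illustration under stated assumptions, not the pipeline actually used: the function names are hypothetical, qualities are assumed to be given as a list of integer Phred scores, reads are assumed non-empty, and the poly(A), adapter, overlap, duplicate and contamination filters are not modelled.

```python
def n_fraction(seq):
    """Fraction of bases called as N."""
    return seq.upper().count("N") / len(seq)


def low_quality_fraction(quals, max_low_q=7):
    """Fraction of bases whose quality score is <= max_low_q."""
    return sum(q <= max_low_q for q in quals) / len(quals)


def keep_read(seq, quals, short_insert=True):
    """Apply the N-content rule (short-insert only) and the low-quality rule.

    Thresholds follow filters (1) and (2) in the text:
    - short-insert reads are dropped when N exceeds 2% of bases;
    - reads are dropped when 40% (short-insert) or 30% (long-insert)
      or more of their bases have quality <= 7.
    """
    if short_insert and n_fraction(seq) > 0.02:
        return False
    low_q_cutoff = 0.40 if short_insert else 0.30
    return low_quality_fraction(quals) < low_q_cutoff


read = "ACGT" * 25                     # a 100 bp toy read
print(keep_read(read, [30] * 100))     # True: no N, all bases high quality
```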

The 10 kb and 20 kb libraries for PacBio sequencing were constructed using the SMRTbell Template Prep Kit 1.0 (Pacific Biosciences, http://www.pacb.com/), according to the manufacturer’s instructions. The SMRTbell template sequencing primer with DNA polymerase was applied to the SMRT Cell for the sequencing reaction. The P4 DNA polymerase with C2 chemistry (P4-C2; a total of 40 cells for the 10 kb library and 92 cells for the 20 kb library) was used for sequencing on the PacBio RS II sequencer (Pacific Biosciences). To obtain high-quality subreads, we first filtered out reads with size < 2 kb and RQ value < 0.8. A total of 33.2 Gb (~11-fold coverage) of subread bases with a mean read length of 5314 bp was generated from the 10 kb library, and 92.2 Gb (30-fold) with a mean read length of 8692 bp from the 20 kb library. PBcR (the PacBio Corrected Reads pipeline) was used to correct these reads. Finally, a total of 86.4 Gb of high-quality, error-corrected subread bases were obtained (Dataset S3).

S1.4 De novo assembly process

Prior to assembly, we built a 17-mer frequency table for the short-insert size data (<1 kb) and removed reads whose k-mer frequency was lower than 10. For the data from the 170 bp and 250 bp libraries, we merged the paired-end reads into linked reads using their overlapping sequences. After processing, a total of 553 Gb of clean data were generated and loaded into RAM (Random Access Memory). To construct contigs, all reads used were split into 119-mers (118 bp overlap with a 1 bp overhang), from which a de Bruijn graph was constructed. Contigs were assembled with SOAPdenovo (v2.04) (6, 7) and Platanus (Version 1.24). We also removed low-coverage links and simplified the graph to resolve the k-mer paths into contigs. Following that, all clean reads (short-insert size and large-insert size) were aligned to the assembled contigs, and all paired-end and mate-pair information was used to connect contigs into scaffolds step by step using Krskgf and Gapclose (8). At least 4 supporting paired-end reads were required to form a connection. Reads from the 10 kb and 20 kb PacBio libraries were then used to fill gaps with PBJelly (9). The final assembly comprises 3.14 Gb of scaffolds containing 2.89 Gb of contigs. The contig N50 and scaffold N50 sizes are 67.07 kb and 1.39 Mb, respectively (Table 1 in main text).
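For readers unfamiliar with the N50 statistic quoted above, the following sketch shows how it is computed from a list of contig or scaffold lengths. The function name and the toy lengths are illustrative only; the definition used is the standard one (the length L such that sequences of length >= L contain at least half of the assembled bases).

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: total = 54, half = 27; 10 + 9 + 8 = 27, so N50 = 8.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```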

The insert-size distribution of the PE and MP libraries is an important factor for accurate scaffolding. To evaluate the insert sizes of the PE (<1 kb) and MP (>=2 kb) data, we mapped PE and MP reads to the initial contigs and to the subsequently assembled scaffolds using SOAPdenovo software. The calculated insert distances for all libraries were normally distributed, indicating that the high-quality PE and MP data were unbiased for assembling contigs and scaffolds.

S1.5 Comparison of the heterozygosity

Based on the occurrence of heterozygous SNPs, we evaluated the heterozygosity in the genome assembly and compared it with those in the bamboo, coffee and kiwifruit genomes. A total of 553 Gb of clean data (<1 kb, approximately 184-fold depth) were mapped to the assembly with SOAP2, and SNPs were then called using SOAPsnp (v1.05) with the following thresholds: 1) lowest base quality value at each SNP position >= 20; 2) maximum sequencing depth at each SNP locus <= 300; 3) minimum sequencing depth for each allele >= 10; 4) minimum distance between two adjacent SNPs >= 5 bp; 5) only bi-allelic SNPs retained at each SNP locus; 6) ratio of the depths of the two alleles at each SNP locus between 3:17 and 17:3. The SNPs of bamboo (10), coffee (11) and kiwifruit (12) were identified with the same process using publicly available sequencing reads. The estimated heterozygous SNP rate of the tea genome was ~4.5 polymorphisms per kilobase (~1% of SNP density), which was similar to that of kiwifruit (~4.2 per kilobase) (12) and orchid (~4.0 per kilobase) (13), but much higher than that of the dihaploid coffee (~0.1 polymorphisms per kilobase) (11) and bamboo (~1.0 per kilobase) (10).
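The six thresholds listed above can be sketched as a filter over candidate SNP calls. This is a hedged illustration rather than the SOAPsnp post-processing actually used: the `SnpCall` record and function names are hypothetical, all calls are assumed to lie on one scaffold, and rule 4 is interpreted (as an assumption) as discarding every SNP that has a neighbour closer than 5 bp.

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    position: int          # coordinate on the scaffold
    base_quality: int      # lowest base quality at the site
    total_depth: int       # sequencing depth at the site
    allele_depths: tuple   # depth of each observed allele


def passes_site_filters(call):
    """Apply thresholds 1), 2), 3), 5) and 6) from the text to one site."""
    if call.base_quality < 20 or call.total_depth > 300:
        return False
    if len(call.allele_depths) != 2:       # bi-allelic sites only
        return False
    a, b = call.allele_depths
    if min(a, b) < 10:                     # each allele needs depth >= 10
        return False
    # Allele depth ratio a:b must lie between 3:17 and 17:3 (inclusive).
    return 3 * b <= 17 * a and 3 * a <= 17 * b


def filter_snps(calls, min_distance=5):
    """Apply site filters, then drop SNPs with a neighbour closer than min_distance."""
    kept = sorted((c for c in calls if passes_site_filters(c)),
                  key=lambda c: c.position)
    result = []
    for i, call in enumerate(kept):
        near_prev = i > 0 and call.position - kept[i - 1].position < min_distance
        near_next = (i + 1 < len(kept)
                     and kept[i + 1].position - call.position < min_distance)
        if not (near_prev or near_next):
            result.append(call)
    return result
```

The depth-ratio test is written with integer cross-multiplication so that boundary cases such as 15:85 (exactly 3:17) are not affected by floating-point rounding.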

S1.6 Evaluation of genome quality

To investigate the quality of the genome assembly, we constructed a tea BAC library using a modified version of the approach described by Luo et al. (14). The BAC library is composed of 161,280 clones and represents a genomic coverage of ~6-fold (based on a genome size of 2.98 Gb). Nuclear DNA was isolated from petals of the sequenced tea plant and partially digested with the HindIII enzyme. The resultant DNA fragments were cloned into the pindgo536-s vector (15), yielding an average insert size of 113 kb. A total of 18 BACs were randomly selected and analyzed by standard Sanger sequencing methods. Plasmids of the BAC shotgun subclones were isolated using the AxyPrep Easy-96 plasmid kit (AP-E96-P-24G AXYGEN). Both ends of all subclones were sequenced on an ABI3730 DNA analyzer using the BigDye Terminator Cycle sequencing kit V3.1 (Applied Biosystems, Life Technologies). Raw sequencing data were collected for base calling using the PHRED programs. Conversion of phd files to fasta files, masking of vector sequences and assembly of the BAC sequences were performed with the Phd2fasta, Crossmatch and PHRAP programs (16), respectively. PCR, followed by Sanger sequencing of the PCR fragments, was used to close gaps in the BAC assemblies. The BAC sequences were then aligned to the genome assembly using MUMmer and BLASTN (E-value < 1e-5). Only alignments with sequence identity ≥0.97 were retained. Overall, 98.3% of the 2.08 Mb of randomly selected BAC sequences were mapped to our assembly with >95% sequence identity, which is much higher than the rate (84.59%) in CSA (SI Appendix, Fig. S2, Datasets S4-S5). No rearrangements caused by scaffolding were observed, apart from some small insertions or deletions. The BAC evaluation thus provided independent support for the validity of the assembly.

A total of 2,304 C. sinensis DNA sequences retrieved from NCBI (http://www.ncbi.nlm.nih.gov/dbEST/) were also aligned to the CSS and CSA genome assemblies with thresholds of coverage >= 90% and identity >= 90%. The mapping rate for CSS was 81.0%, which was higher than that for CSA (73.5%, Dataset S5). To evaluate the gene quality of our annotation, 26,046 Sanger-derived C. sinensis ESTs (>200 bp) were downloaded from NCBI and mapped to the assembly using BLAT (17) with a threshold of E-value < 1e-5 and identity >0.99. Of the 26,046 ESTs, 90.1% (23,672) were aligned to CSS scaffolds with an accuracy of >90%, while only 78.5% of ESTs were covered by CSA scaffolds. This confirmed a higher level of gene coverage in the CSS genome than in the CSA genome (Dataset S6). We also assessed the assembly using the plant Benchmarking Universal Single-Copy Orthologs (BUSCO) database (18). About 91.40% and 85.20% of the BUSCO gene sets were found in the CSS and CSA genome assemblies, respectively (Dataset S7). When mapped to the annotated gene sets, CSA missed ~5.2% of the BUSCO gene sets and 9.6% of its BUSCO genes were fragmented, whereas CSS missed only 2.0% of the BUSCO gene sets and had 2.6% fragmented BUSCO genes. These results indicate that the CSS genome assembly is relatively complete.
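The ~6-fold coverage quoted for the BAC library follows directly from the clone count, the average insert size and the genome size. A minimal check, with a hypothetical function name:

```python
def library_coverage(n_clones, mean_insert_bp, genome_size_bp):
    """Expected genome coverage of a clone library (fold)."""
    return n_clones * mean_insert_bp / genome_size_bp


# Values from the text: 161,280 clones, 113 kb mean insert, 2.98 Gb genome.
coverage = library_coverage(161_280, 113_000, 2.98e9)
print(f"{coverage:.1f}-fold")  # 6.1-fold
```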

S1.7 Assessment of chloroplast insertions

To detect insertions of chloroplast DNA into the CSS nuclear genome, the previously reported CSS chloroplast sequence (1, 19) was aligned against the assembled scaffolds using BLAST with thresholds of E-value < 1e-5 and identity > 0.9. Matching blocks longer than 1 kb in the scaffolds were identified as chloroplast insertions. A total of 79 large (1-12 kb) sequences homologous to the tea chloroplast genome were identified in our assembly, amounting to 177.5 kb inserted into 40 scaffolds. The longest chloroplast DNA insertion was 12.04 kb, and the total chloroplast DNA accounts for ~1.10% of the tea nuclear genome. Our results indicate that frequent organellar-to-nuclear genome transfers have occurred.
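The thresholds described above (E-value < 1e-5, identity > 0.9, block length > 1 kb) can be applied to BLAST tabular output as sketched below. This assumes the default 12-column tabular format (`-outfmt 6` in BLAST+), which the text does not specify; the function names and the sequence identifiers in the example are hypothetical.

```python
def parse_blast_tabular(lines):
    """Yield one dict per hit from BLAST tabular output (default 12 columns)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        yield {
            "query": fields[0],
            "subject": fields[1],
            "identity": float(fields[2]) / 100,  # pident is a percentage
            "length": int(fields[3]),            # alignment length in bp
            "evalue": float(fields[10]),
        }


def chloroplast_insertions(hits, max_evalue=1e-5, min_identity=0.9,
                           min_length=1000):
    """Keep hits passing the E-value, identity and length thresholds."""
    return [h for h in hits
            if h["evalue"] < max_evalue
            and h["identity"] > min_identity
            and h["length"] > min_length]


example = [
    "cp\tscaf1\t95.0\t1500\t75\t0\t1\t1500\t200\t1699\t0.0\t2500",
    "cp\tscaf2\t99.0\t800\t8\t0\t1\t800\t10\t809\t0.0\t1400",      # too short
    "cp\tscaf3\t85.0\t2000\t300\t0\t1\t2000\t5\t2004\t1e-50\t900",  # low identity
]
kept = chloroplast_insertions(parse_blast_tabular(example))
print([h["subject"] for h in kept])  # ['scaf1']
```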

S2 Genome annotation and characterization

S2.1 Annotation of transposable elements

To identify transposable elements (TEs) in the assembly, we combined de novo and homology-based methods. The de novo prediction programs RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html) and LTR-FINDER (20) were first used to search for repetitive sequences in the assembled scaffolds. The repeat sequences obtained were used to build a non-redundant repeat library, with which repetitive sequences in the tea plant genome were identified using RepeatMasker (http://www.repeatmasker.org). The homology-based prediction was conducted by comparing the assembly to the Repbase database (Repbase-18.04) using RepeatMasker and RepeatProteinMask (version 3.3.0).

The TEs identified by the two methods were then integrated, and redundant TEs were removed. All the TEs were classified into families with an identity cutoff of >=50%. Additionally, tandem repeats in the tea plant genome were identified using Tandem Repeats Finder (version 4.04). Repeat annotation identified a total of ~1.86 Gb of TEs, or nearly 64% of the 2.89 Gb non-gapped tea genome assembly (Dataset S8). This percentage is similar to that evaluated from Sanger-sequenced BACs (56.81%). The ~64% TE content of the tea genome is lower than the ~69% predicted for the cocoa genome, but higher than the ~36% observed in kiwifruit (12) and the ~50% in coffee (11). Among the TEs, long terminal repeat (LTR) elements were the dominant type, accounting for ~58.6% of the assembly (more than 90.8% of all TEs). The Gypsy-type and Copia-type LTR retrotransposons account for ~45.85% and ~8.24% of the tea genome, respectively (Dataset S9).

SSRs (simple sequence repeats) are frequently used as markers for genetic analysis. To detect SSRs, we analyzed the genome sequence of CSS using a program written in Perl. Briefly, a DNA sequence N_1 N_2 N_3 … N_k … N_(i-1) N_i was treated as a string. To detect a tandem repeat of unit size n (1-6) at position N_k, we compared the sequence N_k … N_(k+n-1) with the subsequent sequences starting at positions N_(k+n), N_(k+2n), N_(k+3n), N_(k+4n), …, and extended the repeat when at least a minimum number of units [(N)15, (NN)8, (NNN)6, (NNNN)5, (NNNNN)4, and (NNNNNN)4 for mono-, di-, tri-, tetra-, penta-, and hexa-nucleotide motifs, respectively] were repeated in tandem. Monomers were already identified as repeats in the repeat annotation and were not considered here. While scanning for di-, tri-, tetra-, penta-, and hexa-nucleotide repeats, we did not consider motifs consisting of runs of a single nucleotide. Similarly, for tetra-nucleotide repeats, combinations representing perfect di-nucleotide repeats were ignored, and for hexa-nucleotide repeats, combinations representing perfect tri-nucleotide repeats were ignored. The results were validated by manually checking a randomly selected set of scaffold sequences. A polyA repeat was treated as the same as a polyT repeat on the complementary strand. Similarly, (AC)n was treated as equivalent to (CA)n, (TG)n, and (GT)n, and (AGC)n as equivalent to (CGA)n, (TCG)n, and (GCT)n in different reading frames or on the complementary strand. Using this pipeline, we detected 59,765 SSRs in the CSS assembly, which can serve as resources for marker-assisted breeding. The repeat units of the SSRs ranged from dimers to hexamers, and dimers were the most abundant type (68.63% of all SSRs). Among the dimer repeats, CT/TC/AG/GA repeats were the most abundant. Among trimers, TAA/TTA/ATT/AAT repeats predominated, and among tetramers, AAAT/TTTA/ATTT/TAAA repeats were the most abundant. No motif clearly predominated among the pentamers or hexamers (Dataset S10).
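The scanning procedure described above can be sketched in Python (the original program was written in Perl and is not reproduced here). The names `MIN_UNITS`, `is_compound` and `find_ssrs` are hypothetical. As a slight generalisation of the text, `is_compound` rejects any motif that is itself a repetition of a shorter unit, which covers single-nucleotide runs, tetramers that are perfect dinucleotide repeats and hexamers that are perfect trinucleotide repeats. Grouping of equivalent motifs across strands is not modelled.

```python
# Minimum number of tandem units per motif length (di- to hexa-nucleotide),
# as listed in the text; mononucleotide repeats are not scanned.
MIN_UNITS = {2: 8, 3: 6, 4: 5, 5: 4, 6: 4}


def is_compound(motif):
    """True if the motif is a repetition of a shorter unit (e.g. AA, ACAC)."""
    n = len(motif)
    return any(n % d == 0 and motif == motif[:d] * (n // d)
               for d in range(1, n))


def find_ssrs(seq):
    """Return sorted (start, motif, units) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for n, min_units in MIN_UNITS.items():
        i = 0
        while i + n * min_units <= len(seq):
            motif = seq[i:i + n]
            if set(motif) <= set("ACGT") and not is_compound(motif):
                # Compare the motif with the units starting at i+n, i+2n, ...
                units = 1
                while seq[i + units * n:i + (units + 1) * n] == motif:
                    units += 1
                if units >= min_units:
                    hits.append((i, motif, units))
                    i += units * n      # resume after the repeat
                    continue
            i += 1
    return sorted(hits)


print(find_ssrs("GG" + "AC" * 8 + "TT"))  # [(2, 'AC', 8)]
```

Scanning position by position, rather than with a single greedy pattern match, keeps an SSR that begins inside a rejected single-nucleotide run from being skipped.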

S2.2 Prediction of protein-encoding genes

Our strategy for predicting non-redundant protein-encoding gene models combined ab initio gene prediction, homolog searching and EST/unigene-based prediction on the repeat-masked genome. To aid the gene prediction, we generated an average of 11.8 Gb of clean RNA-seq data from each of eight primary tissues of Camellia sinensis cv. Shuchazao (Dataset S11). The eight tissues were from Shuchazao grown in the De Chang fabrication base in Anhui, China: apical buds (AB), young leaves (YL), mature leaves (ML), old leaves (OL), immature stems (ST), flowers (FL), young fruits (FR), and tender roots (RT). Samples were snap-frozen and stored at -80°C until processing. Total RNA from each tissue was extracted separately using a modified CTAB method (21, 22). RNA integrity was measured using gel electrophoresis and the Agilent 2100 Bioanalyzer, with a minimum integrity number value of 8. RNA containing polyA tails (mostly mRNA) was isolated from 20 µg of the total RNA pool using Dynal oligo (dT)25 beads (Invitrogen) according to the manufacturer's protocol. Following purification, the polyA RNA was fragmented into fragments of approximately 200 nt using RNA Fragmentation Reagents (Ambion). Using these short fragments as templates, double-stranded cDNA was synthesized using random primers (Invitrogen), Superscript II reverse transcriptase (Invitrogen), RNase H (Invitrogen) and DNA polymerase I (Invitrogen). End repair of the double-stranded cDNA fragments was subsequently performed using Klenow polymerase, T4 DNA polymerase and T4 polynucleotide kinase (NEB, Britain), and Illumina adapters (containing primer sites for sequencing and flowcell surface annealing) were ligated to the short fragments using T4 DNA ligase (Invitrogen, USA). The products were enriched for cDNA fragments of 180-220 bp using the Qiaquick Gel Extraction Kit (Qiagen) and amplified by PCR to prepare the sequencing library. The Agilent 2100 Bioanalyzer was then used to assess the quantity and quality of the cDNA. Finally, eight cDNA libraries were sequenced on an Illumina HiSeq™ 2000 platform according to the manufacturer's instructions. The fluorescent images underwent base calling and quality value calculation by the standard Illumina data processing pipeline, from which paired-end reads averaging 90 bp were obtained (Dataset S11).

For the de novo gene predictions, AUGUSTUS (v2.5.5) (23) was used to identify candidate protein-encoding genes in the masked CSS genome with self-trained model parameters, yielding 78,513 gene models. For the homology-based predictions, we used the homologous proteins proposed for the genomes of kiwifruit (Actinidia chinensis, ftp://bioinfo.bti.cornell.edu/pub/kiwifruit/), coffee (Coffea canephora, http://coffee-genome.org/), poplar (Populus trichocarpa, JGI phytozome Version 9.0) and grape (Vitis vinifera, genoscope 2011-4 www.genoscope.cns.fr/externe/Download/Projets/Projet_ML/data/12X/assembly/). They were mapped to the masked assembly using TBLASTN with an E-value threshold of 1e-5. Subsequently, homologous genome sequences were aligned against the matching proteins to define gene models using GeneWise (v2.2.0) (24). With this method, the preliminary homology-based gene models in the tea genome comprised 59,739 from kiwifruit, 42,217 from coffee, 65,800 from poplar, and 40,491 from grape (Dataset S14). For the EST-based method, 26,046 Sanger-derived Camellia sinensis ESTs (> 200 bp) in GenBank were mapped to the tea genome using BLAT (identity ≥ 90%, coverage ≥ 90%), and the overlaps among the spliced alignments were then filtered and linked using PASA software. We obtained 26,318 gene models supported by tea ESTs. The RNA-seq data were assembled into potential transcript isoforms using StringTie (25) for each of the eight tissues, and then merged into a final transcriptome of 61,681 unigenes.


To further improve the accuracy of gene model prediction, PacBio sequencing was performed to construct a transcriptome from a mixture of the tissues mentioned above. The Iso-Seq library was prepared according to the Isoform Sequencing (Iso-Seq) protocol using the Clontech SMARTer PCR cDNA Synthesis Kit and the BluePippin Size Selection System protocol. A total of four libraries were constructed with insert sizes of 0~1 Kb, 1~2 Kb, 2~3 Kb and 3~6 Kb. Following the Iso-Seq protocol, we obtained 361,947 reads, of which 210,682 were full-length reads of insert (ROI) containing the 5′-primer, the 3′-primer and the poly(A) tail (Dataset S12). The ROIs of each library were clustered into library isoforms, yielding 24,711, 45,159, 42,963 and 18,990 isoforms for the <1 Kb, 1~2 Kb, 2~3 Kb and 3~6 Kb libraries, respectively. These isoforms were then assembled into a final merged set of 80,217 transcripts with an average length of 1,781 bases (Dataset S13).

The elementary gene set was generated after removing weakly supported gene models according to the following filtering thresholds: (1) those with two or more lines of supporting evidence from de novo prediction, homology, EST or RNA-Seq data but an overlap <=20%; (2) those without RNA-Seq data evidence and a start codon; (3) those predicted by de novo gene models with an overlap >=20% but without RNA-Seq data evidence; (4) those with supporting evidence from de novo prediction but an RPKM <5 and coverage <50; and (5) those with a CDS length smaller than 300 bp. The predicted gene models were integrated to yield a unique gene set using the MAKER annotation pipeline (26), generating a total of 33,932 high-confidence protein-encoding genes in the CSS genome (Dataset S15).

S2.3 Functional annotation

To assign gene functions to the protein-encoding gene models, we compared them to the SwissProt and TrEMBL protein databases (UniProt release 2011-01) using BLASTP with an E-value threshold of ≤1e-5. Domain-based comparisons (including Pfam, SMART, PRINTS, PROSITE, and ProDom) were performed with InterProScan (version 4.7, ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/) to identify conserved domains/families in the protein-coding genes. We selected and assigned the highest-scoring category to each tea plant gene. Functional annotation by Gene Ontology (GO; http://www.geneontology.org) was carried out based on the corresponding InterPro entries using Blast2GO (27) with an E-value cutoff of 1e-5. Metabolic pathway annotations were performed by sequence comparisons with the KEGG proteins (Release 76) using BLASTP (E-value threshold: 1e-5). Of the total gene models, 31,392 had significant similarities in functional protein databases; SwissProt, InterPro, GO and KEGG assigned possible functions to 28,269, 27,251, 15,896 and 26,379 CSS gene models, respectively (Dataset S16).

S2.4 Annotation of transcription factors

Transcription factors (TFs) are key regulators of gene expression. Transcription factors were identified and classified into different gene families with iTAK (http://bioinfo.bti.cornell.edu/cgi-bin/itak/index.cgi). We identified 2,486 TF genes (7.32% of all the protein-coding genes) in the tea genome, a proportion similar to that in Arabidopsis but higher than that in coffee, cocoa or grape (Figure 1A in main text, Dataset S17). Among the 86 TF families, the most highly represented were ERF (170 genes), MYB (154 genes), bHLH (134 genes), C2H2 (128 genes) and NAC (110 genes) (Dataset S17).

S2.5 Scaffold anchoring by genetic map

To construct CSS pseudochromosomes, we used the previously published high-density genetic map (28) to anchor the assembled scaffolds to linkage groups. The genetic map was developed with an F1 mapping population (148 F1 individuals). The map spans 3,965 cM across 15 linkage groups and is composed of 6,042 single nucleotide polymorphism (SNP) markers, with an average inter-locus distance of 1.0 cM. The anchoring of the assembled scaffolds onto the 15 linkage groups (LGs) using the 6,042 SNP markers was performed in 5 steps: (1) we mapped all the SLAF-Seq reads containing SNP markers against the CSS assembled sequences, using the BLAST tool (version 2.2.23) with one mismatch allowed; (2) based on the BLAST results, we preferentially selected a total of 5,218 SNP markers that matched identical scaffolds in both directions; (3) from the SNP markers with only one direction matched with scaffolds, we retained the markers with a one-to-one matching relationship, which generated a total of 664 markers; (4) for the SNP markers matching multiple scaffolds, a total of 787 markers were screened according to SNP alleles located in the corresponding scaffolds; (5) the syntenic relationships identified between the tea genome and those of coffee and/or grape (Supporting Information S4) allowed us to identify an additional 222 SNP markers from the remaining markers. Using the selected 5,218 SNP markers, a total of 7,420 scaffolds were anchored to the 15 tea plant pseudochromosomes, comprising 74.39% (2.34 Gb) of the tea genome assembly. The anchored scaffolds carried 25,520 genes, accounting for 75.21% of all predicted protein-coding genes (Dataset S18).

S2.6 Annotation of non-coding RNA genes

Small RNA fragments (16-30 nt) were isolated from 200 μl of the total RNA pool for leaves or roots of Shuchazao (sample ID: ZLW01 and ZLW02) using a 15% denaturing polyacrylamide gel. After purification, the small RNAs were ligated sequentially to a 5′ RNA adaptor and a 3′ RNA adaptor by T4 RNA ligase, reverse-transcribed to cDNA, and then amplified by PCR. Finally, the purified and validated sRNA-derived cDNA libraries were sequenced on the Illumina HiSeq 2500 system provided by LC Sciences (Houston, Texas, USA). Raw sequences were processed to remove adaptors, and we obtained 118 Mb and 143 Mb of clean data for the two samples, respectively. Small RNA sequences of 16 to 25 nt were retained and mapped back to the tea genome assembly to identify microRNA genes. We aligned the microRNAs against miRBase (version 21) by Blastn, and the hits with 100% coverage and mismatch ≤2 were retained as known tea microRNAs. With the same threshold, the sequencing reads were aligned to known tea pre-microRNAs and microRNAs for confirmation. The remaining reads without matching hits were further mapped to the tea genome assembly. The small RNAs that could be mapped to the antisense strand of exons in the assembly, introns or intergenic regions but could not be annotated as other identified ncRNAs were selected to predict novel microRNAs using MIREAP software (http://sourceforge.net/projects/mireap/) with default parameters. Conserved microRNAs were identified by comparing the predicted microRNA genes with plant miRNA sequences in miRBase (http://www.mirbase.org/, version 21), and their secondary structures were predicted using mFold (29). We used psRobot (version 1.2) and TargetFinder (version 1.5) to predict microRNA target genes. Finally, small RNA sequencing of leaf and root identified 355 microRNAs, of which 189 are novel. A total of 3,283 target genes were predicted for these microRNAs (Dataset S19).

The tRNA genes in CSS were identified using tRNAscan-SE (version 1.23) (30) with eukaryote parameters. The rRNA genes were identified by searching the genome assembly against the rRNA sequences (Rfam database, release 11.0 (31)) of Arabidopsis, grape and poplar using Blastn with an identity cutoff of >= 90% and a coverage of 80% or more. The C/D box snoRNAs were predicted by Snoscan (32). The snRNAs and H/ACA box snoRNAs were identified by mapping the genome sequences to the Rfam database using INFERNAL (v1.0) (33) with default parameters. Through these efforts, a total of 597 transfer RNA, 2,838 ribosomal RNA and 832 small nuclear RNA genes were annotated in the whole genome assembly (Dataset S20).

S2.7 Characterization of gene structures

To characterize the gene structures in the tea genome, we compared the parameters of gene models across several fully sequenced plant genomes. Predicted protein-encoding gene models in the tea genome have an average length of 4,053 bp, with 3.3 exons per gene and an average exon length of 259 bp. The average intron size is 1,408 bp, which is larger than in other published eudicots but smaller than in the basal angiosperm Amborella trichopoda (34) (Dataset S15).

S3 Comparison of CSS with CSA genomes

S3.1 Genomic synteny between CSS and CSA genome assemblies

To compare the CSS and CSA genomes, we first identified the genomic synteny between them using the MCScanX package (35) with default parameters. Briefly, annotation files of the CSA genome, including protein sequences, coding sequences (CDS) and gene annotations in general feature format (GFF3), were downloaded from the CSA genome database (http://www.plantkingdomgdb.com/tea_tree/) (36). The downloaded CSA protein sequences were then aligned against the CSS proteome using Blastp with the parameters “-e 1e-10 -b 5 -v 5 -m 8”. Finally, the resulting alignment file, together with the GFF3 annotation files of CSA and CSS, was fed to MCScanX to detect collinear blocks. Five genes were required to call a collinear block. Consequently, a total of 121 syntenic blocks containing 1,543 orthologous collinear genes were identified. Notably, the number of blocks detected between CSA and CSS is underestimated, primarily because the assemblies are fragmented, particularly that of CSA. In CSA, a total of 22,452 genes are located on scaffolds with fewer than 10 genes, and approximately 12,213 genes are positioned on scaffolds with fewer than five genes. Scaffolds carrying so few genes limit the large-scale detection of collinearity.
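The fragmentation statistic quoted above (genes sitting on scaffolds that carry fewer than 10 or fewer than five genes) can be tallied from a gene-to-scaffold table. The following is only a minimal sketch: the function name and the toy gene and scaffold identifiers are hypothetical, and in practice the mapping would be read from the GFF3 annotation.

```python
from collections import Counter


def genes_on_small_scaffolds(gene_to_scaffold, max_genes):
    """Count genes located on scaffolds that carry fewer than `max_genes` genes."""
    genes_per_scaffold = Counter(gene_to_scaffold.values())
    return sum(n for n in genes_per_scaffold.values() if n < max_genes)


# Hypothetical toy annotation: gene ID -> scaffold ID (not real CSA data).
toy_annotation = {
    "g1": "s1", "g2": "s1", "g3": "s1",
    "g4": "s2",
    "g5": "s3", "g6": "s3",
}

# Scaffold s1 has 3 genes, s2 has 1 and s3 has 2.
print(genes_on_small_scaffolds(toy_annotation, 3))
```

A block of five collinear genes cannot be called on a scaffold with fewer than five genes, which is why this count bounds the detectable synteny.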

S3.2 Characterization of collinear orthologous genes between CSS and CSA

To investigate the evolutionary divergence between the CSA and CSS genomes, we calculated the distribution of synonymous substitution rates (Ks) for the collinear orthologous gene pairs identified in section S3.1 using the Perl script “add_ka_and_ks_to_collinearity.pl” implemented in the MCScanX package. The distribution of Ks values was plotted using the “barplot” function in R. The Ks distribution of these collinear orthologous gene pairs peaked at around 0.005 to 0.02. Ks values were converted to divergence times according to the formula T = Ks/(2r), where T is the divergence time and r is the molecular clock rate. Based on a substitution rate (r) of 6.5 × 10^-9 mutations per site per year (37) for eudicots, we estimated that CSA and CSS diverged from their common ancestor ~0.38-1.54 mya. We further calculated the sequence similarity of the collinear orthologous gene pairs between CSS and CSA using the “percentage_identity” function implemented in BioPerl. The average sequence similarities of orthologous genes at the DNA and protein levels were 92.35% (median 97.76%) and 93.94% (median 98.41%), respectively.
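As a quick arithmetic check of the conversion from Ks to divergence time, the sketch below applies T = Ks/(2r) to the two ends of the reported Ks peak (0.005 and 0.02) with the stated rate of 6.5 × 10^-9 substitutions per site per year. The function name is illustrative only and is not taken from the original analysis scripts.

```python
def divergence_time_mya(ks, rate_per_site_per_year=6.5e-9):
    """Convert a Ks value to divergence time in million years: T = Ks / (2r)."""
    years = ks / (2.0 * rate_per_site_per_year)
    return years / 1e6


low = divergence_time_mya(0.005)
high = divergence_time_mya(0.02)
print(f"{low:.2f}-{high:.2f} mya")  # prints "0.38-1.54 mya"
```

The factor of 2 reflects that substitutions accumulate independently along both lineages after the split.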

S4 Genome evolution and expansion

S4.1 Orthologous gene clusters

To identify orthologous genes, the gene sets of 12 representative plant genomes, including CSS, grape (38) (Vitis vinifera, genoscope), poplar (39) (Populus trichocarpa, JGI phytozome version 9.0), coffee (11) (Coffea canephora, http://coffee-genome.org/), cocoa (40) (Theobroma cacao, CIRAD version 0.9), African oil palm (41) (Elaeis guineensis, version 1.0), peach (42) (Prunus persica, JGI phytozome version 7.0), Medicago (43) (Medicago truncatula, Version Mt3.5v4), kiwifruit (12) (Actinidia chinensis, ftp://bioinfo.bti.cornell.edu/pub/kiwifruit/), Amborella (34) (Amborella trichopoda, version 1.0), Arabidopsis (44) (Arabidopsis thaliana, TAIR 10.0) and CSA (36) (Camellia sinensis var. assamica), were retrieved from their genomic websites. In data preprocessing, we removed pseudogenes or the TE-derived genes that had BLASTN hits (E-value<1e-5, identity>50%, coverage >80%, and length<=150bp) in Repbase. The longest ORF was selected to represent each gene, and the translated putative protein sequences longer than 50 amino acids were retained. For genes that contained ambiguous bases, the affected codons were replaced with ‘NNN’ and the corresponding amino acids with ‘X’. Then, an all-against-all alignment was performed to compare protein sequences with a database containing the full protein dataset for all these plant species using BLASTP with a maximum acceptable E-value of 1×10^-5. Based on the alignment results, we applied OrthoMCL (version 1.4) (45) to cluster genes and to construct gene families with the default inflation parameter. From this analysis, 27,610 genes in 15,224 gene families were identified in the CSS genome. Of these, 320 single-copy orthologous gene families were shared by the tea genome and the 10 other investigated plant species (SI Appendix, Fig. S3).
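The preprocessing of ambiguous codons and the protein length filter described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the standard genetic code is assumed, any codon containing a character other than A, C, G or T is treated as ambiguous, and the trailing stop symbol is not counted toward the 50-amino-acid cutoff.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def mask_and_translate(cds):
    """Return (masked CDS, protein); ambiguous codons become 'NNN' and 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    masked, protein = [], []
    for start in range(0, usable, 3):
        codon = cds[start:start + 3]
        if codon in CODON_TABLE:
            masked.append(codon)
            protein.append(CODON_TABLE[codon])
        else:
            masked.append("NNN")
            protein.append("X")
    return "".join(masked), "".join(protein)


def keep_protein(protein, min_length=50):
    """Keep proteins longer than `min_length` residues (stop symbol excluded)."""
    return len(protein.rstrip("*")) > min_length


print(mask_and_translate("ATGGCNTGGTAA"))  # ('ATGNNNTGGTAA', 'MXW*')
```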

S4.2 Phylogenetic inference

OrthoMCL clustered a total of 320 single-copy gene families in CSS, CSA and the 10 fully sequenced plant genomes mentioned above. These families were used to construct a phylogenetic tree. As the basal angiosperm, Amborella trichopoda was chosen as the root in our analysis. Protein sequences of the single-copy gene families were first aligned using MUSCLE (http://www.ebi.ac.uk/Tools/msa/muscle/). The coding sequences of the genes were extracted based on the alignment results and concatenated to generate a supergene for each species. A phylogenetic tree was constructed from the supergene sequences by MrBayes (46), run for 1,000,000 generations (1 sample per 100 generations) with the best substitution model (GTR+gamma+I) determined by Modeltest (47). Using Amborella trichopoda as the out-group, two independent runs supported the same topology. The tree demonstrated that tea plant belongs to the Theales order in subclass Dilleniidae, because tea plant is more closely related to kiwifruit than to the other sequenced plants (SI Appendix, Fig. S7).

To estimate the divergence times of C. sinensis from the other ten plant species, the 320 single-copy orthologous genes were analyzed with MrBayes and the MCMCTree program (48) in the Phylogenetic Analysis by Maximum Likelihood (PAML) software package (49). MCMC runs were conducted for 1,000,000 generations (1 sample/100 generations) and the first 250 samples were discarded as burn-in. Calibration times were obtained from the TimeTree database (http://www.timetree.org/, Dataset S21). The divergence of the tea plant and kiwifruit ancestors was estimated at ~79 million years ago, whereas the tea and coffee lineages shared a more remote ancestor ~108 million years ago. This indicated that Theales and Rubiales arose quite soon after the divergence of Asteridae and Rosidae (SI Appendix, Fig. S7). Although the phylogenetic tree suggests that CSS represents a remote lineage in the Asteridae, this may be explained by the lower substitution rate detected in the CSS lineage (~0.2 substitutions per site per mya) when compared to the kiwifruit or coffee lineages.

S4.3 Tea-specific gene families

To investigate the gene families shared by tea and the 10 representative species, as well as the tea plant-specific gene families, we used the results from OrthoMCL. In total, 11,128 gene families shared by the 11 species were identified (SI Appendix, Fig. S3). More gene families were shared between the tea plant and kiwifruit than between the tea plant and the other studied species. This is as expected, given that kiwifruit is the closest relative of tea plant in the group (SI Appendix, Fig. S7). Notably, a total of 1,064 genes in 429 unique gene families were supported as tea-specific (SI Appendix, Fig. S3). Of these, 808 genes were annotated against the KEGG database, and 615 domains were characterized using InterProScan (Pfam database). Using a custom Perl script, we found that 35 KEGG gene families and 42 Pfam families among the tea-specific families differed significantly by Fisher’s exact test (p-value<0.01, FDR<0.01; Datasets S23-S24). The 10 most frequently occurring KEGG gene families included interleukin-1 receptor-associated kinases, disease resistance proteins, leucine-rich repeat proteins, glutathione S-transferases, stromal membrane-associated proteins, F-box proteins, 26S proteasome non-ATPase regulatory subunits and (+)-neomenthol dehydrogenases. Among these candidate tea-specific domains, numerous genes were predicted to encode serine-threonine/tyrosine-protein kinases, ankyrin repeat-containing domains, cytochrome P450s, NB-ARCs, F-box domains, zinc finger (GRF-type) domains, PGG domains, pentatricopeptide repeats, aminotransferase-like proteins, sulfotransferase domains and S-locus glycoproteins (Dataset S23). Interestingly, cytochrome P450 domains are predicted to contribute to extensive modifications of various secondary compounds (50). Aminotransferase-like proteins are related to amino acid metabolism and nitrogen metabolism. Leucine-rich repeat proteins and NB-ARCs are often associated with disease resistance and stress responses (51).
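The enrichment test named above can be illustrated with a small, dependency-free sketch. It is not the custom Perl script used in the study: the 2x2 table layout (genes inside or outside a family, in the tea-specific set versus the background) is an assumption, and because the source does not state which FDR procedure was applied, the Benjamini-Hochberg adjustment is used here purely as a common choice.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def weight(x):
        # Unnormalised hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at most as probable as the observed one.
    extreme = sum(weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed)
    return extreme / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical counts: 8 of 10 tea-specific genes vs 1 of 6 background genes in a family.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(round(p_value, 5))
```

Families would then be reported when both the raw p-value and the adjusted value fall below the chosen cutoffs (0.01 in the text above).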

S4.4 Expansion and contraction of gene families

To investigate the expansion or contraction of gene clusters, the gene families generated by OrthoMCL and the phylogenetic tree structure of the 11 species mentioned above were subjected to a computational analysis of changes in gene family size using CAFE (version 2.1) (52). By this approach, we identified a total of 1,810 and 1,001 CSS gene families that had undergone expansion and contraction, respectively, of which 2,704 and 2,890 genes were annotated by KEGG and InterProScan (Pfam), respectively (Fig. S4). Among the expanded gene families, 52 and 53 domains had significant differences in their percentage in tea plant (FDR <0.05; Datasets S23-S24). Gene families that specifically expanded in tea were also further investigated, indicating specific expansions of gene families involved in certain metabolic and catalytic activities in the tea plant. Notably, the genes encoding the S-locus glycoproteins and the S-locus receptor kinases, which have been previously reported to control the pollen-stigma interaction of self-incompatibility (53), markedly expanded in the tea genome. MYB domains have been shown to interact with promoters of phenylpropanoid and flavonoid pathway genes (54). Volatile compounds derived from oxidation of lipids and carotenoids or from the terpenoid and shikimate pathways are crucial for tea aroma. The gene families involved in the biosynthesis of terpenes, such as 1-deoxy-D-xylulose-5-phosphate synthase, Deoxyxylulose-5-phosphate synthase, (-)-germacrene D synthase, alpha-farnesene synthase, isoprene synthase, valencene/7-epi-alpha-selinene synthase, (3S)-linalool synthase, (R)-limonene synthase, geranyllinalool synthase, and (-)-alpha-terpineol synthase, are abundant in the tea genome (Datasets S23-S24). Interestingly, the patterns of functional enrichment were similar in the tea-specific gene families and the expanded gene families, indicating that these genes recently and independently originated in the tea plant lineage.

S4.5 Genome expansion of tea genome

To investigate genome expansion in CSS, whole-genome duplication (WGD) events were analyzed. Orthologous genes between the genomes of tea and two other species (grape and cocoa) were identified by an all-versus-all BlastP search (E-value less than 1E-5). Then, homologous syntenic blocks were identified across the studied plant genomes following the method used for moso bamboo. Syntenic genes were first detected by paired alignments of orthologous genes using Blastp with an E-value less than 1E-20. Then, syntenic gene blocks in tea scaffolds were determined with two thresholds: (1) number of genes in one syntenic block ≥ 3; (2) number of non-syntenic tea genes between two adjacent syntenic genes < 5. The identification and subsequent manual check of the syntenic blocks and the breakpoints between blocks were performed with our in-house Perl script (Datasets S21 and S25). The total length of syntenic blocks was estimated to cover ~50% of the tea genome in comparison with each of coffee, grape or cocoa. Finally, the 4DTv (distance-transversion rate at 4-fold degenerate sites) values (55) of the identified homologous blocks in tea-vs-tea, tea-vs-grape and tea-vs-cocoa were calculated with the HKY substitution model (SI Appendix, Fig. S8).
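The two block thresholds above (at least 3 genes per block, fewer than 5 non-syntenic tea genes between adjacent syntenic genes) can be sketched as a simple chaining pass along one scaffold. This is a simplified illustration rather than the in-house Perl script: the function name and input format are hypothetical, and the order of partner genes in the other genome is ignored.

```python
def syntenic_blocks(syntenic_positions, min_genes=3, max_gap=5):
    """Chain syntenic genes on one scaffold into blocks.

    `syntenic_positions` holds the ordinal positions (gene order along the
    scaffold) of tea genes that have a syntenic partner. Two adjacent syntenic
    genes stay in the same block when fewer than `max_gap` non-syntenic genes
    lie between them; blocks with fewer than `min_genes` genes are dropped.
    """
    blocks, current = [], []
    for pos in sorted(syntenic_positions):
        if current and pos - current[-1] - 1 >= max_gap:
            blocks.append(current)
            current = []
        current.append(pos)
    if current:
        blocks.append(current)
    return [block for block in blocks if len(block) >= min_genes]


# Toy scaffold: positions of genes with a syntenic partner.
print(syntenic_blocks([0, 1, 3, 9, 10, 20, 21, 22, 27]))
# [[0, 1, 3], [20, 21, 22, 27]]
```

In the toy input, genes 9 and 10 form a candidate block of only two genes and are therefore discarded, while gene 27 joins the last block because only four non-syntenic genes separate it from gene 22.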

S5 Evolution of catechin biosynthesis

S5.1 Catechin biosynthesis pathway

Phenolic compounds, an important class of plant secondary metabolites, account for 18% to 36% of the dry weight of fresh tea leaves and tender stems. The primary phenolics in tea are flavan-3-ols, flavonols, anthocyanins, flavones and proanthocyanidins (PAs, also called condensed tannins). In fresh leaves, catechins (a subgroup of flavan-3-ols), comprising up to 80% of phenolic compounds, include non-galloylated catechins [(+)-catechin (C), (-)-epicatechin (EC), (+)-gallocatechin (GC), and (-)-epigallocatechin (EGC)] and galloylated catechins [(-)-epicatechin-3-gallate (ECG) and (-)-epigallocatechin-3-gallate (EGCG)] (56). Among them, galloylated catechins account for up to 80% of total catechins. Catechins are important components of tea flavor and have been shown to provide benefits for human health (57). They also play a crucial role in plant defense against insect herbivores, microorganisms and competing plants (58). Catechins, like other flavonoids, are synthesized via the phenylpropanoid-flavonoid pathway in plants (59) (Figure 3A in main text). In the phenylpropanoid pathway, phenylalanine is converted to 4-coumaroyl-CoA by the enzymes phenylalanine ammonia lyase (PAL, EC 4.3.1.24), cinnamate 4-hydroxylase (C4H, EC 1.14.13.11) and 4-coumarate CoA ligase (4CL, EC 6.2.1.12). PAL initiates the phenylpropanoid pathway and catalyzes the deamination of L-phenylalanine to produce trans-cinnamic acid (60, 61), which is then converted to p-coumaric acid by an oxidative reaction catalyzed by C4H, a cytochrome P450-dependent monooxygenase of the CYP73 family (61). Then, 4-coumaroyl-CoA is synthesized by 4CL using 4-coumaric acid and coenzyme A (CoA) as substrates (62, 63). Following the phenylpropanoid pathway, metabolic fluxes go toward different pathways, such as the flavonoid pathway via chalcone synthase (CHS, EC 2.3.1.74), lignin biosynthesis via shikimate/quinate hydroxycinnamoyltransferase (HCT, EC 2.3.1.133), or hydroxycinnamate biosynthesis.
In the flavonoid biosynthetic pathway, chalcone synthase (CHS, EC 2.3.1.74) is the first key enzyme of the pathway and catalyzes the condensation of three acetate residues from malonyl-coenzyme A (CoA) with 4-coumaroyl-CoA to form chalcone (64, 65). Subsequently, chalcone isomerase (CHI, EC 5.5.1.6) catalyzes the stereo-specific cyclization of chalcones into naringenin (5, 7, 4'-trihydroxyflavanone) (66, 67). Naringenin can be converted by flavonoid 3'-hydroxylase (F3'H, EC 1.14.13.21, CYP75B subfamily) (68) or flavonoid 3', 5'-hydroxylase (F3'5'H, EC 1.14.13.88, CYP75A subfamily) (69, 70) into eriodictyol (5,7,3',4'-tetrahydroxyflavanone) or dihydrotricetin (5,7,3',4',5'-pentahydroxyflavanone), respectively. Naringenin, eriodictyol and dihydrotricetin are flavanones, from which flavonoid B-ring 4′-hydroxylated, 3′,4′-dihydroxylated and 3′,4′,5′-trihydroxylated compounds can be derived, respectively. Among them, B-ring 3′,4′-dihydroxylated compounds are the precursors of 3′,4′-dihydroxylated catechins (C, EC and ECG), whereas 3′,4′,5′-trihydroxylated flavanones determine the formation of 3′,4′,5′-trihydroxylated catechins (GC, EGC and EGCG) (69). Through catalysis by flavanone 3-hydroxylase (F3H, EC 1.14.11.9) (71, 72), flavanones can be converted to dihydroflavonols. Following that, DFR, LAR, ANS and ANR, the four enzymes acting downstream in the flavonoid pathway, are directly responsible for non-galloylated catechin biosynthesis. Dihydroflavonol 4-reductase (DFR, EC 1.1.1.219) is a key enzyme responsible for the reduction of dihydroflavonols to leucoanthocyanidins (flavan-3,4-diols). DFR controls dihydroflavonol precursor flux into the biosynthetic pathways of anthocyanins, catechins and PAs, in competition with a parallel branch pathway from dihydroflavonols to flavonols controlled by flavonol synthase (FLS, EC 1.14.11.23). Due to their crucial role, DFRs have been studied extensively in several plant species. One DFR gene has been reported in barley (Hordeum vulgare) (73), Arabidopsis (66), grape (Vitis vinifera) (74), tomato (Lycopersicon esculentum) (75), snapdragon (Antirrhinum majus) (76), and rice (Oryza sativa) (77). However, DFR is present as a small gene family in Petunia hybrida (78), Japanese morning glory (Ipomoea nil) or common morning glory (Ipomoea purpurea) (79), Gerbera hybrida (80), Medicago truncatula (81) and poplar (82). So far only one DFR has been functionally characterized from tea plant (83). Flavan-3,4-diols can be converted into catechins through two divergent branch pathways that lead to 2,3-trans-flavan-3-ols [such as (+)-C and (+)-GC] or 2,3-cis-flavan-3-ols [such as (-)-EC and (-)-EGC]. Leucoanthocyanidin reductase (LAR, EC 1.17.1.3) is responsible for the reduction of leucoanthocyanidins into (+)-C or (+)-GC, whereas (-)-EC and (-)-EGC are generated from leucoanthocyanidin through a two-step reaction catalyzed by anthocyanidin synthase/leucoanthocyanidin dioxygenase (ANS/LDOX, EC 1.14.11.19) and anthocyanidin reductase (ANR, EC 1.3.1.77). To date, three CsLAR genes have been identified and shown to promote the biosynthesis of catechin monomers and inhibit their polymerization (84, 85). In grape (Vitis vinifera L. cv. Shiraz), the two reported LAR orthologues have different expression patterns in skin and seeds (86).
With the completion of the poplar genome sequence, three putative LAR genes were identified (PtLAR1, PtLAR2 and PtLAR3) (87). In phylogenetic analysis, the PtLAR1/PtLAR2 and PtLAR3 proteins fell into two distinct lineages. PtLAR1 (88) and PtLAR3 (89) have been functionally characterized. One of the two putative LAR homologues in the cocoa genome has been functionally validated in transgenic tobacco and Arabidopsis (90). However, no homologous LAR gene has been characterized in Arabidopsis. Although LAR and ANS use a common substrate, ANS, which belongs to the 2-oxoglutarate-dependent dioxygenase (2-ODD) superfamily, catalyzes the biosynthesis of anthocyanidins from leucoanthocyanidins. Genes encoding the single anthocyanidin synthase in Medicago (91), grape (74), and Arabidopsis (At4g22880) (92, 93) have been characterized. Two copies of ANS predicted in the poplar genome were reported (87). ANR catalyzes the reduction of anthocyanidins into EC or EGC. ANR activity was first demonstrated in Arabidopsis thaliana and Medicago truncatula (94). In addition, only a single ANR gene was verified in grape (86) and cocoa (90). Two possible ANR sequences were identified in the Populus genome (87), of which PtrANR1 has been functionally characterized (88). Two ANR genes have been experimentally validated in tea plant (84). Galloylated catechins are the major flavonoids in tender tea shoots and are synthesized from non-galloylated catechins (EC and EGC) via 1-O-glucose ester-dependent two-step reactions catalyzed by galloyl-1-O-β-D-glucosyltransferase (UGGT) and epicatechin:1-O-galloyl-β-D-glucose O-galloyltransferase (ECGT) (95). Genes encoding UGGT, the enzyme catalyzing the first committed step of gallotannin biosynthesis, have been previously reported in Quercus robur (UGT84A13) (96) and grape (3 VvgGTs) (97). Functional characterization of these UGGTs suggests that they exhibit UDP-glucose:gallic acid glucosyltransferase activity and catalyze the formation of galloyl-1-O-β-D-glucose. In tea plant, CsUGT84A22, a member of the UGT84A subfamily, has been validated to encode UGGT and to be responsible for the production of galloyl-1-O-β-D-glucoside from gallic acid and β-D-glucose (98). For the second step, the enzyme ECGT in tea plant has been demonstrated to be an acyltransferase that catalyzes the galloylation of EC and EGC using galloyl-1-O-β-D-glucoside as a galloyl donor to form ECG and EGCG (95). Several previously described genes encoding acyltransferases in Arabidopsis (99-101), Solanum pennellii (102), Brassica napus (103) and Avena strigosa (104) that play key roles in secondary metabolism have been characterized as members of subfamily 1A of serine carboxypeptidase-like (SCPL) proteins.

S5.2 Extraction and HPLC analysis of catechins and PAs

Catechins were extracted from the eight tissues of tea cultivar Shuchazao used for transcriptome sequencing according to the method described by Tai et al (105). Briefly, 0.1 gram of freeze-dried tissue was ground in liquid nitrogen with a mortar and pestle and extracted with 3 mL 80% methanol in an ultrasonic sonicator for 10 min at 4°C. After centrifugation at 6,000 rpm for 10 min, the residues were re-extracted twice as described above. The supernatants were combined and diluted with 80% methanol to a volume of 10 mL. The obtained supernatants were filtered through a 0.22 μm organic membrane before high-performance liquid chromatography (HPLC) analysis. The C, GC, EC, EGC, ECG and EGCG contents in the extracts were measured using a Waters 2695 HPLC system equipped with a 2489 ultraviolet (UV)-visible detector. A reverse-phase C18 column (Phenomenex 250 mm×4.6 mm, 5 micron) was used and the samples were eluted at 25 °C at a flow-rate of 1 mL min−1. The detection wavelength was set to 278 nm. The mobile phase consisted of 0.17% (v/v) acetic acid (A) in water, 100% acetonitrile (B), and the gradient elution was as follows: B 6% from 0 to 4 min, to 14% at 16 min, to 15% at 22 min, to 18% at 32 min, to 29% at 37 min, to 45% at 45 min, to 45% at 50 min, to 6% at 51 min and to 6% at 60 min. Then, 10 μL of the filtrate was injected into the HPLC system for analysis. The filtered sample (10 μL) was injected into the HPLC system for analysis. The samples from 8 different tissues of CSS cv. Shuchazao mentioned above in the RNA-Seq experiments were analyzed in triplicate. Standard compounds for C, GC, EC,

EGC, ECG and EGCG were purchased from Shanghai Winherb Medical Technology, Ltd., China. Total catechins were measured by the vanillin-HCl staining method as described by Liu et al. (106). The extraction and detection of PAs were performed according to the method described by Pang et al. (107). Briefly, 0.5 g of ground sample was extracted with 5 mL of 70% acetone/0.5% acetic acid by vortexing and then sonicated at room temperature for 1 h. After centrifugation at 2,500 g for 10 min, the residues were re-extracted twice as above. The pooled supernatants were then extracted three times with chloroform and three times with hexane, and the supernatants (containing soluble PAs) and residues (containing non-soluble PAs) from each sample were freeze-dried separately. The dried soluble PAs were suspended in extraction solution to a concentration of 3 g/mL per sample. Total soluble PA content was determined spectrophotometrically at 640 nm after reaction with DMACA staining reagent (0.2% wt/vol DMACA in methanol-3N HCl), with (-)-catechin as the standard. To analyze the components of the soluble PAs, normal-phase HPLC analysis was performed using HPLC coupled to postcolumn derivatization with DMACA reagent (2). For quantification of insoluble PAs, 1 mL of butanol-HCl reagent was added to the dried residues and the mixtures were sonicated at room temperature for 1 h, followed by centrifugation at 2,500 g for 10 min. The absorbance of the supernatants was measured at 550 nm; the samples were then boiled for 1 h and cooled to room temperature, and the absorbance at 550 nm was recorded again, with the first value being subtracted from the second. Absorbance values were converted into PA equivalents using a standard curve of procyanidin B1 (Sigma, USA).
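The insoluble PA calculation described above (net absorbance at 550 nm, converted through a procyanidin B1 standard curve) can be sketched in Python as follows. This is only an illustration: the standard amounts and absorbances are invented placeholders, and the straight-line fit is an assumption, since the text does not report the curve itself.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def insoluble_pa_equivalents(a550_before, a550_after, slope, intercept):
    """Convert the net A550 (after boiling minus before) to PA equivalents."""
    net_absorbance = a550_after - a550_before
    return (net_absorbance - intercept) / slope


# Hypothetical procyanidin B1 standards (amount in mg, absorbance at 550 nm).
# These numbers are placeholders, not values from the study.
standard_mg = [0.0, 0.1, 0.2, 0.4]
standard_a550 = [0.0, 0.2, 0.4, 0.8]

slope, intercept = fit_line(standard_mg, standard_a550)

# Hypothetical sample: A550 read before and after the 1 h boiling step.
pa_mg = insoluble_pa_equivalents(0.05, 0.65, slope, intercept)
```

Subtracting the pre-boiling reading removes background absorbance that is already present before the butanol-HCl reaction develops colour.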

Quantitative analysis detected all six catechins in each of the 8 tissues except roots. In roots, a small quantity of EC and trace amounts of C were detected, but GC, EGC, ECG and EGCG were not detected. These results are similar to those described by Jiang et al. (108). In the other tissues of tea plants, galloylated catechins, mainly ECG and EGCG, were the predominant catechins, with maximum contents of 48.3 and 122.5 mg·g⁻¹ dry weight in buds, respectively. The galloylated catechins account for ~75-84% of total catechins in buds, young leaves and flowers. The cis-flavan-3-ols (EC, EGC, ECG and EGCG) were more abundant than the trans-flavan-3-ols (C and GC) in all tissues. The amounts of total catechins were higher in buds (212 mg·g⁻¹ dry weight) and young leaves (160 mg·g⁻¹ dry weight) than in other tissues (Figure 3B in main text, Dataset S30). Using the DMACA staining method, different distributions of soluble PAs (including monomeric catechins and soluble polymeric catechins) and non-soluble PAs (non-soluble polymeric catechins) were detected. In contrast to the abundant accumulation of galloylated catechins in young buds and leaves, considerably more non-soluble PAs accumulate in tea flowers, fruits and roots (Dataset S30).

S5.3 Identification of genes involved in the catechin biosynthesis

To investigate the genes involved in catechin biosynthesis, previously described homologous genes from Arabidopsis, grape, poplar, medicago, petunia, cocoa and other dicots were used as query sequences. These genes encode 11 PAL proteins (4 AtPAL (60), 1 VvPAL (74), 5 PtPAL (87) and 1 NtPAL (109) proteins), 4 C4H proteins (1 CsC4H (61), 1 AtC4H (110, 111) and 2 PtC4H (112) proteins), 11 4CL proteins (1 Cs4CL (63), 4 At4CL (113), 4 Pt4CL (114, 115) and 2 Nt4CL (116) proteins), 12 CHS proteins (3 CsCHS (64), 1 AtCHS (117), 1 VvCHS (74), 1 McCHS (118), 5 MdCHS (119, 120) and 1 PtCHS (65) proteins), 14 CHI proteins (2 AtCHI (117, 121), 1 VvCHI (74), 1 PsCHI (122), 1 PhCHI (123), 1 PvCHI (124), 1 CisCHI (125), 1 SmCHI (126), 1 PcCHI (127), 2 LjCHIs (67), 1 IpCHI (128), 1 MsCHI (129) and 1 GmCHI (130) proteins), 7 F3'H proteins (1 AtF3'H (131), 1 VvF3'H (70), 1 IpF3'H (132), 1 PhF3'H (133) and 3 MdF3'Hs (68) proteins), 8 F3'5'H proteins (1 CsF3'5'H (69), 1 VvF3'5'H (70), 1 SlF3'5'H (134), 1 VmF3'5'H (135), 1 CrF3'5'H (136), 1 PhF3'5'H (137), 1 EgF3'5'H (138) and 1 GmF3'5'H (139) proteins), 8 F3H proteins (1 AtF3H (71), 1 CasF3H (72), 1 VvF3H (74), 1 HvF3H (140), 1 GmF3H (130), 1 GsF3H (141), 1 CisF3H (125) and 1 MsF3H (142) proteins), 7 DFR proteins (1 CsDFR (83, 143), 1 AtDFR (66), 2 MdDFRs (81), 2 PtDFR (82) and 1 VvDFR (74) proteins), 11 FLS proteins (1 CsFLS (144), 2 AtFLS (145, 146), 5 VvFLS (147), 1 PhFLS (148) and 1 CuFLS (149) proteins), 4 ANS proteins (1 AtANS (93), 1 VvANS (74), 1 TcANS (90) and 1 MtANS (91) proteins), 7 LAR proteins (1 CsLAR (84), 2 VvLAR (86), 2 PtLAR (88, 89), 1 TcLAR (90) and 1 MtLAR (91) proteins), 7 ANR proteins (2 CsANR (84), 1 AtANR (94), 1 MtANR (150), 1 VvANR (86), 1 PtANR (88) and 1 TcANR (90) proteins), 6 UGT proteins (1 CsUGT84A22 (98), 1 QrUGT84A13 (96), 3 VvgGT and 1 VIRSgt (97) proteins) and 11 SCPL proteins [1 AtSMT (99), 1 AtSCT (100), 1 AtSAT (101), 2 AtSST (101), 1 SpGAC (102), 1 BnSCT (103), 1 AsSAD7 (104) and 5 CsSCPL proteins (http://pcsb.ahau.edu.cn:8080/CSS/)].

We applied a multi-step approach to identify and annotate genes involved in the catechin biosynthetic pathway. In the first step, the query protein sequences were aligned with all tea gene models by blastp with an E-value cutoff of 1E-10. Aligned hits with at least 50% coverage of the seed protein sequence and >50% protein sequence identity were selected as homologs of catechin biosynthetic genes. In the second step, we compared the query protein sequences with the tea plant assembly using tblastn with an E-value threshold of 1E-20. The tblastn results were subjected to gene structure prediction for each copy using GeneWise (v2.2.0) (24), and gene models that aligned against the seed proteins with coverage >50% and identity >50% were retained. The aligned hits from the two steps were manually checked and integrated as the candidate genes involved in catechin biosynthesis. In the third step, we further aligned the gene candidates from step 2 against genes annotated in the phenylpropanoid and flavonoid pathways in the KEGG database by blastp with default parameters. Only the blast hits with identical functional annotations and >50% identity were considered to be homologs of the catechin biosynthetic genes in the tea

genome. The same method was also used to identify the corresponding genes in Arabidopsis, grape, poplar, coffee and cocoa. The versions of these plant genomes were TAIR 10.0 (ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release) for Arabidopsis, Genoscope 2011-4 (www.genoscope.cns.fr/externe/Download/Projets/Projet_ML/data/12X/assembly/) for grape, JGI Phytozome version 9.0 for poplar, http://coffee-genome.org/ for coffee and CIRAD version 0.9 for cocoa. All of the identified catechin biosynthetic genes from the tea genome and the other 5 plant genomes were used in the next analysis. In the tea genome assembly, we identified small gene families for PAL (5 copies), C4H (4 copies) and 4CL (4 copies). The copy numbers of these genes in tea are similar to those in the other five plant genomes, except for the more numerous PAL genes in grape. In the flavonoid biosynthesis pathway, 9 CHS, 3 CHI, 1 F3'H, 4 F3'5'H, 2 F3H, 6 DFR, 3 ANS, 3 LAR, and 2 ANR genes were identified in the tea genome. For the genes encoding the UGGT and ECGT enzymes, which are related to the galloylation of EC and EGC, we identified 1 UGT84A and 22 SCPL genes in the tea genome assembly. In comparison with the related genes identified in the other five plant genomes, the CHS, DFR, ANS, LAR, ANR and SCPL genes were all greatly expanded in the tea genome, often representing the highest copy number of these genes among the 6 sequenced plant genomes (Dataset S27).
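The coverage and identity filter applied to the blastp hits in the first step of the search above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `Hit` record, its field names and the example identifiers are hypothetical, and coverage is assumed to mean the aligned fraction of the seed (query) protein.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment between a seed protein and a tea gene model (hypothetical record)."""
    query: str
    subject: str
    identity_pct: float  # percent identity over the aligned region
    q_start: int         # 1-based start of the alignment on the seed protein
    q_end: int           # 1-based end of the alignment on the seed protein
    query_len: int       # full length of the seed protein
    evalue: float


def query_coverage(hit):
    """Fraction of the seed protein covered by the alignment."""
    return (hit.q_end - hit.q_start + 1) / hit.query_len


def is_homolog(hit, max_evalue=1e-10, min_coverage=0.5, min_identity=50.0):
    """Apply the thresholds stated in the text: E-value, >=50% coverage, >50% identity."""
    return (
        hit.evalue <= max_evalue
        and query_coverage(hit) >= min_coverage
        and hit.identity_pct > min_identity
    )


# Hypothetical hits; the identifiers are placeholders, not real gene IDs.
hits = [
    Hit("seed_CHS", "model_1", 82.0, 1, 380, 390, 1e-150),  # passes all filters
    Hit("seed_CHS", "model_2", 45.0, 1, 380, 390, 1e-40),   # identity too low
    Hit("seed_CHS", "model_3", 90.0, 1, 120, 390, 1e-30),   # coverage too low
    Hit("seed_CHS", "model_4", 75.0, 1, 300, 390, 1e-5),    # E-value too high
]

homologs = [h.subject for h in hits if is_homolog(h)]
```

The same predicate could be reused for the tblastn step by passing `max_evalue=1e-20`.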

S5.4 Evolution of genes involved in the catechin biosynthesis

To elucidate the evolution of the catechin biosynthetic pathway in tea, we constructed a phylogenetic tree and evaluated the divergence times of gene duplication for each structural gene in the pathway. Amino acid sequence alignments of the identified catechin biosynthetic genes from tea plant, Arabidopsis, grape, poplar, coffee and cocoa were first performed using ClustalW, and then the alignment data were used to construct rooted phylogenetic trees using MEGA 7 (151) with the maximum likelihood method and 1000 bootstrap replicates (SI Appendix, Fig. S15). To detect tandem duplications that may have occurred in each gene family in the pathway, we investigated the locations of all identified genes in the assembly. The timing of divergence of duplicates in each catechin biosynthesis gene was estimated based on the rooted phylogenetic tree with Arabidopsis genes as outgroups. Paralogous pairs of each gene were selected and their synonymous substitution rates (Ks) were calculated using KaKs_Calculator 2.0 (152) with the MA model. Each Ks value was then converted to a divergence time according to T = Ks/2r, where r is the substitution rate of 6.5 × 10⁻⁹ mutations per site per year for eudicots (Dataset S26). The most recent duplications of catechin biosynthetic genes from PAL, C4H, 4CL, CHI, CHS, F3′5′H, FLS, to DFR, ANS, ANR, LAR, SCPL occurred 70-110 million years ago, around the time at which the ancient WGD was observed (SI Appendix, Fig. S6).
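As a worked illustration of the conversion T = Ks/2r used above, the following Python sketch applies the stated rate of 6.5 × 10⁻⁹ substitutions per site per year. The function names are hypothetical, and the Ks values in the example are chosen for illustration rather than taken from Dataset S26.

```python
EUDICOT_RATE = 6.5e-9  # substitutions per synonymous site per year (value from the text)


def divergence_time_years(ks, rate=EUDICOT_RATE):
    """T = Ks / (2r): both duplicated copies accumulate substitutions, hence the factor 2."""
    if ks < 0:
        raise ValueError("Ks must be non-negative")
    return ks / (2.0 * rate)


def divergence_time_mya(ks, rate=EUDICOT_RATE):
    """Same conversion, expressed in millions of years ago."""
    return divergence_time_years(ks, rate) / 1e6


def ks_from_time_mya(t_mya, rate=EUDICOT_RATE):
    """Inverse conversion: the Ks expected for a duplication of a given age."""
    return 2.0 * rate * t_mya * 1e6


# Illustrative values only: a pair with Ks = 1.3 dates to about 100 million years ago.
example_age = divergence_time_mya(1.3)

# The 70-110 million year window quoted in the text corresponds to this Ks range.
ks_window = (ks_from_time_mya(70), ks_from_time_mya(110))
```

Under this rate, the 70-110 million year window corresponds to Ks values of roughly 0.91 to 1.43.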

Notably, a large tandem duplication was detected in the CHS gene family in the tea genome (Dataset S31). Tandem duplications were also observed for this gene family in grape and soybean (153), species that contain abundant flavonoids and isoflavonoids, respectively. Galloylated catechins are the major products of catechin modification in tender tea shoots. Unlike tea, grape contains only small amounts of galloylated catechin monomers, with the majority present in condensed PAs, and no galloylated catechins have been reported in Arabidopsis or rice. In the phylogenetic tree, the SCPL genes from tea cluster into a large clade with several SCPL genes from grape. Notably, 15 of the 22 identified tea SCPL1A genes were generated by several tandem duplications that occurred 2-26 million years ago, suggesting that they are specific to the tea lineage (Dataset S26). Based on Ks values, one pair of DFR genes and two pairs of CHS genes diverged very recently, and tandem duplications account for most of the evolution of the DFR, CHS and SCPL genes (Dataset S26).

S5.5 Expression of catechin biosynthetic genes

Using the reference genome assembly of the tea plant, the expression of all identified catechin biosynthetic genes was analyzed in the transcriptomes of different tissues of Camellia sinensis cv. Shuchazao and Camellia sinensis cv. Longjing 43 (154). We used 8 tissues of tea cultivar Shuchazao and 13 tissues of tea cultivar Longjing 43. Across the two cultivars, the tissues comprise apical buds (AB), lateral buds at early stage (ELB), lateral buds (LB), young leaves (YL), 1st leaves (L1), 2nd leaves (L2), one leaf and a bud (BL), two leaves and a bud (BTL), mature leaves (ML), old leaves (OL), young stems (ST), flowers (FL), young fruits (FR), seeds (SD), and tender roots (RT). The expression patterns of six key genes (DFR, ANS, LAR, ANR, UGT84A, SCPL1A) involved in catechin biosynthesis were validated in the same 8 tissues of Shuchazao by qRT-PCR using primers designed with Primer 6 software (Dataset S32).

Total RNA was extracted separately from the eight tissues of tea plants using the modified CTAB method (21). RNA integrity was assessed by gel electrophoresis and spectrophotometry (Nanodrop). Single-stranded cDNAs for qRT-PCR analysis were synthesized from the RNAs using a Prime-Script™ 1st Strand cDNA Synthesis Kit (TaKaRa, Dalian, China). PCR amplification was performed at an annealing temperature of 60 °C with an IQ5 real-time PCR detection system (Bio-Rad) according to the manufacturer’s instructions. The glyceraldehyde-3-phosphate dehydrogenase (GAPDH) gene was used as an internal reference gene, and relative expression was calculated using the 2^(−ΔΔCT) method (155). All qRT-PCR analyses were performed with three biological and three technical replicates. In the R platform (version 3.2.2), we evaluated the Pearson correlation coefficient (PCC) values and the p-values between each gene’s RNA levels from qRT-PCR analysis and from transcriptome (RNA-seq) analysis. The authors appreciate that the RNA level is only a surrogate for measuring gene

expression, because different rates of RNA turnover can also lead to very different final RNA abundances. However, because we did not measure RNA turnover in this study, we are using RNA levels as a predictor of gene expression.
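The 2^(−ΔΔCT) calculation used for the qRT-PCR data above, with GAPDH as the reference gene, can be sketched as follows. The Ct values are hypothetical, and the choice of a calibrator tissue is an assumption, since the text does not state which sample served as the baseline. The method itself assumes roughly equal, near-100% amplification efficiency for the target and reference genes.

```python
def delta_ct(ct_target, ct_reference):
    """Normalize the target gene Ct to the reference gene (GAPDH in the text)."""
    return ct_target - ct_reference


def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_calibrator, ct_ref_calibrator):
    """Fold change of a sample relative to a calibrator by the 2^(-ddCt) method."""
    ddct = (delta_ct(ct_target_sample, ct_ref_sample)
            - delta_ct(ct_target_calibrator, ct_ref_calibrator))
    return 2.0 ** (-ddct)


def mean(values):
    """Average technical replicates before computing delta Ct."""
    return sum(values) / len(values)


# Hypothetical Ct values (three technical replicates each); not data from the study.
bud_target = mean([22.1, 21.9, 22.0])    # target gene in buds
bud_gapdh = mean([18.0, 18.1, 17.9])     # GAPDH in buds
root_target = mean([25.0, 25.1, 24.9])   # target gene in roots (assumed calibrator)
root_gapdh = mean([18.0, 17.9, 18.1])    # GAPDH in roots

fold_change = relative_expression(bud_target, bud_gapdh, root_target, root_gapdh)
```

With these placeholder values the target gene is about 8-fold more abundant in buds than in the calibrator tissue, because ΔΔCT is −3.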

By transcriptomic analysis, different RNA levels were detected for the different copies of each gene involved in catechin biosynthesis (Dataset S28). In the phenylpropanoid-flavonoid biosynthetic pathway, except for F3′5′H, at least one copy of each gene provides high levels of RNA in almost all tissues of both tea cultivars, while the other copies are differentially expressed in various tissues. Expression of all F3′5′H copies was barely detected in roots but was high in buds and young leaves. This result is in accordance with the extremely low contents of 3′,4′,5′-trihydroxyl flavonoids in roots and their highest contents in buds and young leaves (69). After this analysis of RNA-seq data, six key catechin synthesis genes were chosen for validation of RNA levels by qRT-PCR. Using the PCC method, we detected significantly high correlations (PCC of 0.79-0.99 and p-value < 0.05) between the expression levels of the selected genes from qRT-PCR analysis and those from transcriptome analysis (SI Appendix, Fig. S13). This confirmation reinforces our conclusion that the multiple copies of these genes show very distinctive expression patterns, characteristic of subfunctionalization occurring after gene duplication.

S5.6 Correlation analyses of gene expression patterns

RNA levels for DFR, LAR, ANS, ANR and SCPL genes were examined for correlated patterns by the PCC method in the R platform (version 3.2.2). We screened for correlations among the gene candidates with a PCC value cutoff of |0.7| and a p-value < 0.05. Among 13 DFR genes, a positive correlation was found between the expression of the constitutively expressed DFR gene and that of three LAR, two ANR and two ANS genes, indicating substantial consumption of the initial substrates provided by DFR toward flavan-3-ols. In the two successive steps, the expression of one ANR gene is highly coordinated with that of two ANS genes. In addition, 20 of the 22 SCPL1A genes are positively correlated with two ANR, one ANS and two LAR genes, suggesting that a large proportion of metabolic flux is driven onward into the galloylation of flavan-3-ols (Dataset S33). The results revealed coordinated expression for at least one, and sometimes several, copies of the genes that encode the enzymes needed for each step in catechin biosynthesis. These data are also in agreement with the phytochemical observation that the esterification of flavan-3-ols with galloyl donors is the dominant modification pathway after the biosynthesis of flavan-3-ols in the tea plant. Generally speaking, tandem and non-tandem gene duplications occurred in some key structural genes of the branch nodes leading to catechin biosynthesis in the tea genome. The majority of gene duplication events emerged after the divergence of tea and kiwifruit,

and were specific to the tea lineage, indicating they independently expanded in the tea genome. Highly coordinated co-expression of the key structural genes and differential expression of different gene duplicates due to their subfunctionalization endow tea plants with a robust activity for the biosynthesis of catechins and galloylated catechins.
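The correlation screen used in this section (|PCC| of at least 0.7 and p < 0.05) can be sketched in plain Python. The authors worked in R; this is an illustrative stand-in. The expression vectors are hypothetical, and the significance check uses the two-sided 5% critical value of Student's t for 6 degrees of freedom (2.447), which applies only to profiles measured in 8 tissues.

```python
import math

T_CRIT_DF6 = 2.447  # two-sided 5% critical value of Student's t with 6 degrees of freedom


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)); infinite when |r| reaches 1."""
    denom = 1.0 - r * r
    if denom <= 0.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / denom)


def passes_screen(xs, ys, min_abs_r=0.7, t_crit=T_CRIT_DF6):
    """True when |r| >= 0.7 and the correlation is significant at p < 0.05 (n = 8)."""
    r = pearson_r(xs, ys)
    return abs(r) >= min_abs_r and abs(t_statistic(r, len(xs))) > t_crit


# Hypothetical expression values across 8 tissues (not data from the study).
dfr = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
lar = [2.1, 3.9, 6.2, 8.0, 9.7, 12.1, 14.2, 15.8]   # tracks dfr closely
other = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # unrelated pattern
```

With only 8 tissues, a correlation sitting exactly at the 0.7 cutoff gives t of about 2.40, just below the critical value, so in this sketch the p-value condition is slightly stricter than the PCC cutoff alone.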

S6 Multilevel regulation of catechin biosynthesis

S6.1 Correlation of gene expression and catechin contents

The quantity of each catechin and the RNA levels of the structural genes involved in catechin biosynthesis were compared for the 8 tissues of tea cultivar Shuchazao. By the PCC method in the R platform (version 3.2.2), significant positive correlations with a PCC value cutoff of |0.7| and a p-value < 0.05 were detected between the accumulation of catechins and the RNA levels of 74 catechin biosynthetic genes (Dataset S29). In total, 34 of them showed strong positive correlations with the contents of the galloylated cis-flavan-3-ols ECG and EGCG, and most of them show extremely high expression levels in buds and young leaves. In comparison with many other flavonoid-producing plants, the most interesting question is why tea plants produce a large amount of monocatechins, especially galloylated catechins, amounting to ~80% of the total catechins, instead of the PAs that are abundant in many other plants. We observed that 14 of the 16 SCPL genes positively correlated with the production of galloylated cis-flavan-3-ols are highly expressed in buds and young leaves (Pearson’s correlation test, P < 0.05, Figure 3B and Dataset S29). The results demonstrated that the tissue-specific expression patterns of catechin biosynthetic genes play important roles in the differential distributions of catechins in various tissues.

S6.2 Regulatory network for catechin biosynthesis

Gene expression patterns for all identified TFs and the catechin biosynthetic genes in the 8 tissues from tea cultivar Shuchazao and the 13 tissues from tea cultivar Longjing 43 (154) mentioned above were used to construct the co-expression network using WGCNA (version 1.47) (156) within the R platform (version 3.2.2). Genes with no detected expression in any tissue were removed in advance. We set the soft threshold to 19 based on the scale-free topology criterion employed by Zhang and Horvath (157). An adjacency matrix was developed on the basis of the squared Euclidean distance values, and the topological overlap matrix was calculated for unsigned network detection with the Pearson method. Then, we selected the gene-TF pairs with co-expression coefficients greater than 0.1 (p-value ≤ 1e-6). Finally, the network connections were visualized using Cytoscape (158).
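To illustrate what a soft threshold of 19 does, the sketch below builds an unsigned adjacency from Pearson correlations as a_ij = |cor(x_i, x_j)|^β, the standard unsigned WGCNA definition. This is a pure-Python stand-in for the R package, it does not reproduce the squared Euclidean distance formulation mentioned above, and the expression profiles are hypothetical.

```python
import math

SOFT_POWER = 19  # soft-threshold power reported in the text


def pearson_r(xs, ys):
    """Pearson correlation; profiles must not be constant (unexpressed genes are removed first)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def unsigned_adjacency(profiles, power=SOFT_POWER):
    """Return {(gene_a, gene_b): |r|^power} for every ordered pair of genes."""
    names = sorted(profiles)
    adjacency = {}
    for a in names:
        for b in names:
            if a == b:
                adjacency[(a, b)] = 1.0
            else:
                r = pearson_r(profiles[a], profiles[b])
                # Clamp to 1.0 to guard against floating-point overshoot.
                adjacency[(a, b)] = min(abs(r), 1.0) ** power
    return adjacency


# Hypothetical expression profiles across five tissues (placeholders, not study data).
profiles = {
    "gene_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_B": [2.0, 4.1, 5.9, 8.2, 9.8],  # nearly proportional to gene_A
    "tf_C": [5.0, 1.0, 4.0, 2.0, 3.0],    # weakly anti-correlated with gene_A
}
adjacency = unsigned_adjacency(profiles)
```

Raising |r| to the 19th power keeps near-perfect correlations close to 1 while pushing moderate ones toward zero (0.9 becomes about 0.135, and 0.3 becomes about 1e-10), which is why a high power yields a sparse, approximately scale-free network.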

In the tea plant, the expression levels of important genes involved in catechin

biosynthesis, including CHS, DFR, LAR, ANS, and SCPL, were highly co-associated with the expression patterns of 1,089 transcription factors (TFs). The correlated TFs were in 56 different TF classes, such as MYB, basic helix-loop-helix (bHLH), MADS, zinc-finger (such as C2H2 and C3H), WRKY, NAC, and ERF (SI Appendix, Fig. S10, Dataset S34). CHS, encoding the first key enzyme controlling the flavonoid pathway, was co-expressed with numerous TFs. Co-expression analyses of these TFs and the downstream structural genes indicated that more TFs were involved in the regulation of DFR and ANS genes than in the regulation of FLS and LAR genes. The greater number and diversity of TFs co-associated with the DFR-ANS pathway than with the FLS pathway is in agreement with the observation that the tea plant generates greater flavan-3-ol components from the DFR-ANS pathway. Interestingly, numerous TFs appear to be involved in the regulation of the SCPL genes, indicating that the extensive galloylation of flavan-3-ol components is mediated by the differential regulation of multi-copy SCPLs.

MYB and bHLH TFs are among the largest families of plant TFs co-expressed with these structural genes. MYB-bHLH-WD40 repeat (MBW) ternary complexes have been demonstrated to be components of the regulatory machinery for both up- and down-regulation of the structural genes involved in phenylpropanoid and flavonoid biosynthesis (159-162). Homologues of the MYB, bHLH and WD40 TFs in MBW complexes are present in the tea plant genome. Co-expression analysis suggests that several MYB TFs participate in fine-tuned transcriptional regulation of the flavan-3-ol biosynthesis pathway by either positive or negative regulatory feedback. Several R2R3-MYB activators homologous to AtTT2 (163), VvMYBPA1 (164), VvMYBPAR (165) and AtMYB5 (166) are correlated with phenylpropanoid and flavonoid biosynthesis in different tissues, according to their expression patterns (Datasets S35-S36). We also found some R2R3-MYB TFs homologous to the repressors AtMYBL2 (167), VvMYBC2-L1 (168), and MtMYB2 (169) that show expression correlations with structural genes involved in catechin biosynthesis. These repressors negatively regulate flavonoid biosynthesis in other plant species by competitively binding to MBW in place of MYB activators, thereby inactivating MBW’s function in up-regulation of target structural genes (160, 161, 170). The tissue-specific expression of MYB activator or repressor genes, representing differential regulatory signals in different tissues, influences the relative proportions of anthocyanins or flavan-3-ols in other plant species (161, 168, 169), and our results suggest that this is also true in the tea plant. Some members of the MADS, zinc-finger, WRKY, NAC, and ERF gene families are known to be regulated by and involved in biotic and abiotic stress responses. Some of these genes in the tea genome also show highly co-expressed patterns with flavonoid pathway genes.
More WRKY genes, a family known to be especially sensitive to environmental cues, have tight co-expression relations with DFR genes than with FLS genes in the tea genome, suggesting that flavan-3-ol biosynthesis pathways play more crucial roles in tea plant defense against pathogens or general responses to various

environmental cues than the flavonol branch pathway.

S6.3 Transporters related to flavonoids

Most flavan-3-ols and their derivatives have some degree of cytotoxic activity that can inhibit plant growth and development, especially at high concentrations. Since tea plants can accumulate these flavan-3-ols to ~20% of dry weight, their transport to specific subcellular compartments (such as the vacuole), cells, or tissues for storage should be essential to minimize toxicity. Multidrug and toxin extrusion (MATE) transporters, such as AtTT12 and MtMATE1, are reported to mediate vacuolar sequestration of flavan-3-ols and derivatives in several plants, although the transport mechanisms remain elusive (171-173). Glutathione S-transferase-like (GST) proteins were previously found to be essential for transport of anthocyanins and proanthocyanidins to vacuoles. AtTT19 (174), VvGST4 (175), and PhAN9 (176), which encode homologous GST proteins in Arabidopsis, grape, and petunia, respectively, have been characterized. However, little is known about subcellular, intercellular, or long-distance transport of flavan-3-ols in the tea plant.

In the tea genome, two putative MATE genes are highly homologous to AtTT12 and MtMATE1. At least two GST genes are predicted in the tea genome assembly. Although their expression levels are very low in tissues accumulating flavonoids (Dataset S37), some transporters from the solute carrier family (>196 genes), MATE (31 genes), multidrug facilitator superfamily (MFS, 58 genes), oligopeptide transporter, and ATP-binding cassette (ABC, 147 genes) transporter gene families have experienced expansion in the tea lineage (Datasets S23-S24). Based on co-expression analysis, a close association of anthocyanin and catechin accumulation with the expression patterns of ABC, MFS and MATE transporter genes was discovered. The proteins encoded by these genes have been previously characterized as transporters for various plant secondary metabolites (172, 177). Hence, the expansion of those gene families provides numerous transporters both for substrate exchange between different subcellular compartments during biosynthesis and for product storage after catechins, alkaloids (e.g., caffeine), theanine, terpenoids, or other secondary products are synthesized in tea plant tissues.

S7 Identification of genes encoding theanine synthetase

S7.1 Theanine biosynthesis pathway

Theanine (γ-glutamylethylamide), a unique non-protein amino acid, occurs in the tea plant (178) at much higher levels than those detected in ~20 other species or varieties in Theaceae (179). It is also present in several plants belonging to the order Ericales (180) and in the edible bay bolete mushroom Xerocomus badius (181). It is the major free amino acid in tea, accounting for more than 50% of total free amino acids and constituting 1-2% of the dry weight of tea leaf. Theanine contributes to the

umami and sweet taste and unique flavor of tea infusion, so it is one of the main components responsible for the quality of green tea (182). In addition, theanine has various reported health benefits, including mental relaxation, improved concentration and learning ability, prevention of certain cancers and cardiovascular disease, promotion of weight loss and enhancement of the immune system (178, 183, 184).

The tea plant is highly ammonia-tolerant and preferentially uses ammonia nitrogen rather than nitrate nitrogen. Theanine biosynthesis in the tea plant, which is tightly linked to nitrogen assimilation and metabolism, proceeds from glutamate and ethylamine in a reaction catalyzed by theanine synthetase (TS). Tracing the radioactivity of products formed from 1-14C-ethylamine showed that this synthesis occurs in all organs of tea seedlings, but that roots are the major site of theanine biosynthesis in adult tea trees (179, 180, 185). However, in other plants without theanine, glutamine is considered the primary product of nitrogen assimilation from all inorganic nitrogen sources and the central metabolite in nitrogen metabolism. Glutamine synthetase (GS, glutamate–ammonia ligase) is usually found to be a crucial enzyme in nitrogen assimilation, catalyzing the ATP-dependent condensation of ammonium with glutamate to form glutamine (186, 187). Glutamate is therefore a co-substrate shared by the TS and GS enzymes. Two enzymes, glutamate synthase (GOGAT) and glutamate dehydrogenase (GDH), are capable of producing glutamate. GOGAT catalyzes the transfer of the amido group of glutamine to the α-keto position of 2-oxoglutarate to generate two molecules of glutamate, which is associated with nitrogen recycling (188). GDH is responsible both for the assimilation of ammonia onto 2-oxoglutarate to form glutamate and for the deamination of glutamate into 2-oxoglutarate and ammonia (189). The other substrate for theanine, ethylamine, is possibly derived from the decarboxylation of alanine by alanine decarboxylase (AIDA) (190, 191). Newly synthesized theanine is translocated into the tender shoots through the xylem, where it either accumulates or is broken down into glutamate and ethylamine by theanine hydrolase (ThYD) (180) (Figure 4A).

S7.2 Theanine biosynthetic genes

We used the protein sequences of TS, GS, GOGAT and GDH from Camellia and some other plants, retrieved from GenBank, to identify the related genes in the tea genome. AIDA and ThYD were excluded because neither CsAIDA and CsThYD nor orthologs from other species were available in the public databases. The genes involved in theanine biosynthesis were identified using the method described for catechin and caffeine biosynthesis. The seed protein sequences were aligned with the tea plant assembly by tblastn (E-value threshold of 1E-20 and coverage cutoff of 50%), and the corresponding gene model of each gene was subsequently predicted with GeneWise (v2.2.0) (24). In addition, the seed protein sequences were aligned with the total tea

plant gene models by blastp (E-value cutoff of 1E-10). All hits were manually checked to integrate the complete models for each gene, and only the gene models aligned against the seed proteins with coverage > 50% and identity > 50% were retained. Of these, the genes with the same functional annotations as the seed proteins were kept as the candidate genes. In total, 2 GOGAT genes and 4 GDH genes were identified in the tea plant. In addition, 5 candidate genes were identified for the reported sequences of TS and GS, which show high similarity (SI Appendix, Fig. S11).

S7.3 Evolution of genes for theanine synthetase

The amino acid sequences of the tea candidate genes involved in theanine biosynthesis were used for phylogenetic analysis. After an initial alignment of the candidate genes against each other using ClustalW, a maximum likelihood tree was constructed using MEGA 7 (151) with 100 bootstrap replicates. Based on sequence similarity at the amino acid level, the 5 identified candidate genes are homologous to the reported sequences of TS and GS from GenBank in the phylogenetic tree (Figure 4B in main text). GS genes are universally distributed in both prokaryotes and eukaryotes, and the GS protein superfamily can be divided into three distinct types, GSI, GSII, and GSIII, based on sequence similarity, molecular size and the number of subunits in the holoenzyme. To date, the GSIII type has only been reported in a limited number of prokaryotes (192, 193), and no GSIII has been found in plants. The GSI type has been found in both prokaryotes and eukaryotes (194, 195), but so far the functions of plant GSI genes have barely been characterized, except for recent reports of the possible roles of AtNodGS in biotic stress signaling (196) and of MtGSI-like genes in nitrogen signaling (197). The GSII type is the typical eukaryotic type, although some members of the GSII gene family are found in eubacterial lineages (198). Plant GSII enzymes, which comprise multiple GSII isoenzymes located in the cytosol or chloroplast, have been extensively investigated in many eudicot (199-203) and monocot plants (204-208). In most angiosperms, multiple cytosolic GSII isoenzymes (GSII-1) are encoded by a small family of nuclear genes that exhibit tissue-specific and development-dependent expression patterns (204, 206, 209), whereas essentially a single nuclear gene encodes the plastidic GSII isoenzyme (GSII-2), which is primarily located in the chloroplast.
According to the GS phylogenetic tree, the 5 identified putative candidates are split into two distinct groups with less than 20% similarity in amino acid sequence: only TEA015198.1 (named CsTSI) grouped in the GSI clade, and the other 4 genes clustered in the GSII clade (named CsGSII-1.1, CsGSII-1.2, CsGSII-1.3 and CsGSII-2). CsGSII-1.1, CsGSII-1.2 and CsGSII-1.3 in the GSII clade are highly homologous to cytosolic GSII genes, and CsGSII-2 is highly homologous to the plastidic GSII isoenzyme (85~99.6% identity at the amino acid level). The co-existence of two different types of GS in the tea plant supports the paralogous evolution of GSI and GSII genes via a gene duplication event, similar to that observed in Medicago, Arabidopsis, rice, maize and Sorghum (195). The GS phylogenetic tree also confirmed that the evolution of GS

Page 31: Supporting Information Appendix - PNAS · S1 Genome sequencing and de novo assembly S1.1 Plant materials for sequencing The tea plant, Camellia sinensis (L.) O. Kuntze, belongs to

gene family gene preceded the divergence of tea plant with other species.
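The similarity and identity percentages quoted above are computed from pairwise alignments. As a minimal sketch of one common convention (identical residues divided by the number of alignment columns in which neither sequence has a gap), the calculation could look like the following. The function name and the two toy sequences are hypothetical illustrations, not tea GS sequences, and alignment tools differ in how they choose the denominator, so the values reported here may have been obtained under a different definition.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns (one possible convention)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical toy alignment (already aligned, '-' marks a gap).
seq_x = "MSLLTDLIN-LDLSE"
seq_y = "MSLLSDLINNLNLSE"
identity = percent_identity(seq_x, seq_y)
print(f"{identity:.1f}% identity")
```

In practice the aligned sequences would come from the ClustalW output rather than being typed in by hand.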

S7.4 Detection of theanine in different tissues

Theanine was extracted and detected as described by Tai et al. (105). Briefly, 0.15 g of freeze-dried tea leaves was ground in liquid nitrogen with a mortar and pestle, 5 mL of deionized water was added, and the sample was incubated for 20 min in a water bath at 100 °C. After centrifugation at 6,000 rpm for 10 min, the residue was re-extracted once as described above. The supernatants were combined, diluted with water to a volume of 10 mL, and filtered through a 0.22 μm membrane before HPLC analysis. Theanine was detected using a Waters 600E series HPLC system equipped with a quaternary pump and a 2489 ultraviolet (UV)-visible detector. A reverse-phase C18 column (Phenomenex, 250 mm × 4.6 mm, 5 μm) was used at a flow rate of 1.0 mL/min. The column oven temperature was set to 25 °C, and the detection wavelength was set to 199 nm. The mobile phase consisted of 0.05% (v/v) trichloroacetic acid in water (A) and 50% acetonitrile (B), and the gradient elution was as follows: 0% (v/v) to 100% at 40 min, to 100% at 45 min and to 0% at 60 min. For each analysis, 5 μL of the filtrate was injected into the HPLC system. The theanine standard was purchased from Shanghai Winherb Medical Technology, Ltd.
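To express an HPLC reading as content per gram of dry weight, the extract concentration has to be scaled by the final extract volume and the sample mass. The sketch below shows that arithmetic using the 10 mL final volume and 0.15 g sample mass given above. It assumes that the concentration is read from an external standard curve and refers to the 10 mL diluted extract with no further dilution; the calibration procedure is not described here, and the function name and the example reading are hypothetical.

```python
def theanine_mg_per_g(conc_mg_per_ml: float,
                      extract_volume_ml: float = 10.0,
                      sample_mass_g: float = 0.15) -> float:
    """Convert an extract concentration (mg/mL) to content in mg per g dry weight.

    Defaults follow the extraction described in the text: 0.15 g of
    freeze-dried leaf brought to a final volume of 10 mL.
    """
    if extract_volume_ml <= 0 or sample_mass_g <= 0:
        raise ValueError("volume and mass must be positive")
    if conc_mg_per_ml < 0:
        raise ValueError("concentration cannot be negative")
    return conc_mg_per_ml * extract_volume_ml / sample_mass_g


# Hypothetical reading from a standard curve (not a measured value).
example_conc = 0.30  # mg/mL
content = theanine_mg_per_g(example_conc)
print(f"{content:.2f} mg/g dry weight")
```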

Using HPLC, we measured theanine content in the same 8 tea plant tissues used for transcriptome sequencing. The highest theanine content (43.45 mg·g⁻¹ dry weight) was detected in the root, and buds and tender first leaves contained more theanine than older leaves and other tissues (SI Appendix, Fig. S12). The expression levels of CsTSI and the 4 CsGSII genes were calculated in the same 8 tissues from their transcriptome data. In particular, the expression level of CsTSI was far higher in roots than in the other 7 tissues (Dataset S38) and was significantly correlated with theanine content across the 8 tissues (correlation coefficient of 0.98; p-value of 1.27E-5; Dataset S39). Notably, CsTSI is homologous to the PtGS gene (GenBank accession number: BAE44186) of the bacterium Pseudomonas taetrolens, which has been verified to encode an enzyme that can be used for theanine production by coupled fermentation with energy transfer (210, 211). We therefore propose that CsTSI could be a constitutive gene contributing to the biosynthesis of theanine in the tea plant. Furthermore, CsGSII-1.1, which was highly expressed in young buds, is also strongly correlated with theanine content across tissues (Pearson's correlation test, P < 0.05, Dataset S39).
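The expression-to-metabolite correlation described above is a Pearson correlation across tissues. The following is a minimal sketch of that calculation, not the analysis script used for this study: the two lists are hypothetical placeholder values for 8 tissues, the function names are illustrative, and only the coefficient and its t statistic (tested with n - 2 degrees of freedom) are computed. A p-value would additionally require the t distribution, for example through a statistics library.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two sequences of equal length >= 3")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero (n - 2 degrees of freedom)."""
    if abs(r) >= 1.0:
        raise ValueError("t statistic is undefined for |r| >= 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical placeholder values for 8 tissues (NOT the measured data).
expression = [2.1, 3.4, 1.2, 0.9, 1.5, 0.7, 1.1, 48.0]   # e.g. expression level
theanine = [12.0, 15.5, 6.3, 4.8, 7.9, 3.5, 5.2, 40.0]   # e.g. mg/g dry weight

r = pearson_r(expression, theanine)
print(f"r = {r:.3f}, t = {t_statistic(r, len(expression)):.2f}")
```

With only 8 tissues, a single extreme tissue such as the root can dominate the coefficient, so the per-tissue values are worth inspecting alongside r.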

In addition, we treated a large number of tea cutting seedlings by adding an aqueous solution of ethylamine hydrochloride (EA) to their roots. The leaves of the treated seedlings were collected at 1, 3, 6, 9, 12 and 18 days after EA treatment for expression analysis of CsTSI and the 4 CsGSII genes using qRT-PCR and for quantitative analysis of theanine using HPLC (Dataset S40, SI Appendix, Fig. S12). Expression levels of GS/TS, GOGAT, and GDH genes also increased with EA treatment.

S7.5 Comparison of protein sequences

So far, the crystal structures of two types of GS have been studied: the Zea mays type I GS, ZmGSIa, and the Medicago truncatula type II GS, MtGSIIa (197, 212). These studies revealed key amino acid residues for substrate binding (glutamate and ammonia) and for cofactor binding (ATP-binding and metal-coordination domains) (197, 212). The CsTSI protein is considerably larger (more than 830 aa) than typical GS proteins, which usually have 300-400 aa. Most GSs have only a C-terminal GLN-SYN domain; CsTSI, however, also has an additional amidohydrolase domain at its N-terminus. Surprisingly, this domain is homologous to Medicago truncatula nodulin MtN6 and soybean GmN6L, which play a role in plant-rhizobial interaction (197). Similar large GSs have been identified in several plants, including AtNodGS from Arabidopsis, HvNodGS from Hordeum vulgare, and MtNodGSs from Medicago truncatula (213), and in the fungus Aspergillus nidulans (FluG) (196). Alignment of ZmGSIa and MtGSIIa with CsTSI, CsGSII-1.1, and Pseudomonas taetrolens PtGS (210) indicates that CsTSI differs markedly in the amino acid residues involved in ammonia binding, such as D and E, suggesting that CsTSI may preferentially bind ethylamine over ammonia (SI Appendix, Fig. S11). In structure modeling, CsTSI is closer to PtGS and to GMAS (214), which synthesizes theanine, than to other GS proteins. This is also in line with its expression pattern, which is highly correlated with the pattern of theanine accumulation. Further molecular and enzymatic characterization of CsTSI and other theanine synthetase candidates, such as CsGSII-1.1, may provide a more in-depth understanding of theanine synthesis.

S7.6 Verification of in vivo theanine synthesis function of CsTSI

The open reading frame (ORF) of CsTSI was cloned from cDNA prepared from tea root tissues. After confirmation by sequencing, the CsTSI ORF was cloned into the pDONR221 vector and then recombined into the pB2GW7 binary vector using the Gateway cloning system (Invitrogen, Life Science Technology, USA). The resulting pB2GW7 construct harboring 35S::CsTSI was transformed into Agrobacterium tumefaciens GV3101 by electroporation. Positive A. tumefaciens GV3101 clones were selected for Arabidopsis thaliana (Col-0) transformation using the standard floral dip method. At least ten independent transformants were screened and selected using BASTA. Homozygous T3 transgenic lines were verified by RT-PCR and used for theanine biosynthesis experiments. qRT-PCR was conducted with standard protocols for RNA extraction, cDNA preparation, PCR with a pair of CsTSI-specific primers, and calculation.
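The calculation step of the qRT-PCR protocol is not specified further here. One widely used convention is the comparative Ct (2^-ΔΔCt) method, in which the target gene is normalized to a reference gene and then to a calibrator sample. The sketch below illustrates that convention only as an assumption about what a standard calculation might look like: the function names and all Ct values are hypothetical, and the formula assumes roughly 100% amplification efficiency for both primer pairs.

```python
def relative_expression(ct_target: float, ct_reference: float) -> float:
    """Target abundance relative to the reference gene: 2^-(Ct_target - Ct_reference)."""
    return 2.0 ** -(ct_target - ct_reference)


def fold_change(ct_target_sample: float, ct_ref_sample: float,
                ct_target_calibrator: float, ct_ref_calibrator: float) -> float:
    """Comparative Ct (2^-ddCt) fold change of a sample relative to a calibrator."""
    delta_sample = ct_target_sample - ct_ref_sample
    delta_calibrator = ct_target_calibrator - ct_ref_calibrator
    return 2.0 ** -(delta_sample - delta_calibrator)


# Hypothetical Ct values (not measured data).
fc = fold_change(ct_target_sample=22.0, ct_ref_sample=18.0,
                 ct_target_calibrator=25.0, ct_ref_calibrator=18.0)
print(f"fold change = {fc:.1f}")
```

Each one-cycle decrease in the normalized Ct corresponds to a doubling of the estimated transcript level under this assumption.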

Seeds of CsTSI-OE lines and the wild type (Col-0) were surface-sterilized and germinated on half-strength MS agar plates supplemented with or without 10 mM ethylamine chloride (EA). The plates were incubated in a growth chamber at 22 °C under a 16/8 h light/dark photoperiod for germination and seedling development. After 22 days of growth, four-leaf Arabidopsis seedlings were harvested for extraction and measurement of theanine according to the method described above. At least three independent transgenic lines were used in each experiment, and three independent experiments were conducted for the in vivo assay of the theanine synthesis activity of CsTSI.

For qRT-PCR, the AtACTIN gene was amplified with the forward primer AtACTIN-F (5'-AATGGAACTGGAATGGTCAAGGC-3') and the reverse primer AtACTIN-R (5'-TGCCAGATCTTCTCCATGTCATCCCA-3'), and the CsTSI gene was amplified with the forward primer CsTSI-1qRTF (5'-GTTGATGTTTCTGGGCAGCA-3') and the reverse primer CsTSI-1qRTR (5'-CTCACCCACACCAGTCAGAT-3').
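As a quick sanity check when transcribing primers such as these, their length and GC content can be computed directly from the sequences. The sketch below does this for the four primers listed above. The dictionary and function names are illustrative, and the check says nothing about primer specificity or melting temperature.

```python
# Primer sequences as listed in the text (5' to 3').
PRIMERS = {
    "AtACTIN-F": "AATGGAACTGGAATGGTCAAGGC",
    "AtACTIN-R": "TGCCAGATCTTCTCCATGTCATCCCA",
    "CsTSI-1qRTF": "GTTGATGTTTCTGGGCAGCA",
    "CsTSI-1qRTR": "CTCACCCACACCAGTCAGAT",
}


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")
    return sum(base in "GC" for base in seq) / len(seq)


for name, seq in PRIMERS.items():
    print(f"{name}: {len(seq)} nt, GC {100 * gc_fraction(seq):.1f}%")
```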

References

1. Huang H, Shi C, Liu Y, Mao SY, & Gao LZ (2014) Thirteen Camellia chloroplast genome sequences determined by high-throughput sequencing: genome structure and phylogenetic relationships. BMC evolutionary biology 14:151.

2. Yang H, et al. (2016) Genetic divergence between Camellia sinensis and its wild relatives revealed via genome-wide SNPs from RAD Sequencing. PLoS One 11(3):e0151424.

3. Dolezel J, Greilhuber J, & Suda J (2007) Estimation of nuclear DNA content in plants using flow cytometry. Nature protocols 2(9):2233-2244.

4. Huang H, Tong Y, Zhang QJ, & Gao LZ (2013) Genome size variation among and within Camellia species by using flow cytometric analysis. PLoS One 8(5):e64981.

5. Murray MG & Thompson WF (1980) Rapid isolation of high molecular weight plant DNA. Nucleic acids research 8(19):4321-4325.

6. Li R, et al. (2010) De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20(2):265-272.

7. Luo R, et al. (2012) SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1(1):18.

8. Li R, et al. (2009) SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25(15):1966-1967.

9. English AC, et al. (2012) Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS One 7(11):e47768.

10. Peng Z, et al. (2013) The draft genome of the fast-growing non-timber forest species moso bamboo (Phyllostachys heterocycla). Nat. Genet. 45(4):456-461, 461e451-452.

11. Denoeud F, et al. (2014) The coffee genome provides insight into the convergent evolution of caffeine biosynthesis. Science 345(6201):1181-1184.

12. Huang S, et al. (2013) Draft genome of the kiwifruit Actinidia chinensis. Nature communications 4:2640.

13. Cai J, et al. (2015) The genome sequence of the orchid Phalaenopsis equestris. Nat Genet 47(1):65-72.

14. Luo M & Wing RA (2003) An improved method for plant BAC library construction. Methods Mol. Biol. 236:3-20.

15. Shi X, Zeng H, Xue Y, & Luo M (2011) A pair of new BAC and BIBAC vectors that facilitate BAC/BIBAC library construction and intact large genomic DNA insert exchange. Plant methods 7:33.

16. Ewing B & Green P (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome research 8(3):186-194.

17. Kent WJ (2002) BLAT--the BLAST-like alignment tool. Genome Res. 12(4):656-664.

18. Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, & Zdobnov EM (2015) BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31(19):3210-3212.

19. Ye XQ, et al. (2014) Entire chloroplast genome sequence of tea (Camellia sinensis cv. Longjing43): a molecular phylogenetic analysis. Journal of Zhejiang University (Agric. & Life Sci.) 40(4):404-412.

20. Xu Z & Wang H (2007) LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic acids research 35(Web Server issue):W265-268.

21. Shi CY, Wan XC, Jiang CJ, & Sun J (2007) Method for high-quality total RNA isolation from tea plants [Camellia sinensis (L.) O. Kuntze]. Journal of Anhui Agricultural University 34(3):360-363.

22. Shi CY, et al. (2011) Deep sequencing of the Camellia sinensis transcriptome revealed candidate genes for major metabolic pathways of tea-specific compounds. BMC genomics 12:131.

23. Stanke M, Steinkamp R, Waack S, & Morgenstern B (2004) AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic acids research 32(Web Server issue):W309-312.

24. Birney E, Clamp M, & Durbin R (2004) GeneWise and Genomewise. Genome research 14(5):988-995.

25. Pertea M, et al. (2015) StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33(3):290-295.

26. Holt C & Yandell M (2011) MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12:491.

27. Gotz S, et al. (2008) High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic acids research 36(10):3420-3435.

28. Ma JQ, et al. (2015) Large-scale SNP discovery and genotyping for constructing a high-density genetic map of tea plant using specific-locus amplified fragment sequencing (SLAF-seq). PLoS One 10(6):e0128798.

29. Zuker M (2003) Mfold web server for nucleic acid folding and hybridization prediction. Nucleic acids research 31(13):3406-3415.

30. Lowe TM & Eddy SR (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic acids research 25(5):955-964.

31. Burge SW, et al. (2013) Rfam 11.0: 10 years of RNA families. Nucleic acids research 41(Database issue):D226-232.

32. Lowe TM & Eddy SR (1999) A computational screen for methylation guide snoRNAs in yeast. Science 283(5405):1168-1171.

33. Nawrocki EP, Kolbe DL, & Eddy SR (2009) Infernal 1.0: inference of RNA alignments. Bioinformatics 25(10):1335-1337.

34. Amborella Genome Project (2013) The Amborella genome and the evolution of flowering plants. Science 342(6165):1241089.

35. Wang Y, et al. (2012) MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40(7):e49.

36. Xia EH, et al. (2017) The tea tree genome provides insights into tea flavor and independent evolution of caffeine biosynthesis. Mol Plant 10(6):866-877.

37. Gaut BS, Morton BR, McCaig BC, & Clegg MT (1996) Substitution rate comparisons between grasses and palms: synonymous rate differences at the nuclear gene Adh parallel rate differences at the plastid gene rbcL. Proc. Natl. Acad. Sci. U. S. A. 93(19):10274-10279.

38. Jaillon O, et al. (2007) The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449(7161):463-467.

39. Tuskan GA, et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313(5793):1596-1604.

40. Argout X, et al. (2011) The genome of Theobroma cacao. Nature genetics 43(2):101-108.

41. Singh R, et al. (2013) Oil palm genome sequence reveals divergence of interfertile species in Old and New worlds. Nature 500(7462):335-339.

42. International Peach Genome Initiative, et al. (2013) The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat Genet 45(5):487-494.

43. Young ND, et al. (2011) The Medicago genome provides insight into the evolution of rhizobial symbioses. Nature 480(7378):520-524.

44. Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408(6814):796-815.

45. Li L, Stoeckert CJ, Jr., & Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13(9):2178-2189.

46. Ronquist F, et al. (2012) MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst. Biol. 61(3):539-542.

47. Posada D & Crandall KA (1998) MODELTEST: testing the model of DNA substitution. Bioinformatics 14(9):817-818.

48. Arvestad L, Berglund AC, Lagergren J, & Sennblad B (2003) Bayesian gene/species tree reconciliation and orthology analysis using MCMC. Bioinformatics 19 Suppl 1:i7-15.

49. Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24(8):1586-1591.

50. Singer AC, Crowley DE, & Thompson IP (2003) Secondary plant metabolites in phytoremediation and biotransformation. Trends in biotechnology 21(3):123-130.

51. Martin GB, Bogdanove AJ, & Sessa G (2003) Understanding the functions of plant disease resistance proteins. Annual review of plant biology 54:23-61.

52. De Bie T, Cristianini N, Demuth JP, & Hahn MW (2006) CAFE: a computational tool for the study of gene family evolution. Bioinformatics 22(10):1269-1271.

53. Nasrallah J, Kao T-H, Chen C-H, Goldberg M, & Nasrallah M (1987) Amino-acid sequence of glycoproteins encoded by three alleles of the S locus of Brassica oleracea. Nature 326:617-619.

54. Hichri I, et al. (2011) Recent advances in the transcriptional regulation of the flavonoid biosynthetic pathway. (1460-2431 (Electronic)).

55. Huang S, et al. (2009) The genome of the cucumber, Cucumis sativus L. Nature Genetics 41(12):1275-1281.

56. Cabrera C, Artacho R, & Gimenez R (2006) Beneficial effects of green tea--a review. Journal of the American College of Nutrition 25(2):79-99.

57. Khan N & Mukhtar H (2007) Tea polyphenols for health promotion. Life sciences 81(7):519-533.

58. War AR, et al. (2012) Mechanisms of plant defense against insect herbivores. Plant signaling & behavior 7(10):1306-1320.

59. Koes R, Verweij W, & Quattrocchio F (2005) Flavonoids: a colorful model for the regulation and evolution of biochemical pathways. Trends in plant science 10(5):236-242.

60. Huang J, et al. (2010) Functional analysis of the Arabidopsis PAL gene family in plant growth, development, and response to environmental stress. Plant physiology 153(4):1526-1538.

61. Singh K, Kumar S, Rani A, Gulati A, & Ahuja PS (2009) Phenylalanine ammonia-lyase (PAL) and cinnamate 4-hydroxylase (C4H) and catechins (flavan-3-ols) accumulation in tea. Functional & integrative genomics 9(1):125-134.

62. Hamberger B & Hahlbrock K (2004) The 4-coumarate:CoA ligase gene family in Arabidopsis thaliana comprises one rare, sinapate-activating and three commonly occurring isoenzymes. Proceedings of the National Academy of Sciences of the United States of America 101(7):2209-2214.

63. Rani A, Singh K, Sood P, Kumar S, & Ahuja PS (2009) p-Coumarate:CoA ligase as a key gene in the yield of catechins in tea [Camellia sinensis (L.) O. Kuntze]. Functional & integrative genomics 9(2):271-275.

64. Takeuchi A, Matsumoto S, & Hayatsu M (1994) Chalcone synthase from Camellia sinensis: isolation of the cDNAs and the organ-specific and sugar-responsive expression of the genes. Plant & cell physiology 35(7):1011-1018.

65. Sun Y, et al. (2011) Isolation and promoter analysis of a chalcone synthase gene PtrCHS4 from Populus trichocarpa. Plant cell reports 30(9):1661-1671.

66. Shirley BW, Hanley S, & Goodman HM (1992) Effects of ionizing radiation on a plant genome: analysis of two Arabidopsis transparent testa mutations. The Plant cell 4(3):333-347.

67. Shimada N, et al. (2003) A cluster of genes encodes the two types of chalcone isomerase involved in the biosynthesis of general flavonoids and legume-specific 5-deoxy(iso)flavonoids in Lotus japonicus. Plant physiology 131(3):941-951.

68. Han Y, et al. (2010) Ectopic expression of apple F3'H genes contributes to anthocyanin accumulation in the Arabidopsis tt7 mutant grown under nitrogen stress. Plant physiology 153(2):806-820.

69. Wang YS, et al. (2014) Functional analysis of flavonoid 3',5'-hydroxylase from tea plant (Camellia sinensis): critical role in the accumulation of catechins. BMC plant biology 14:347.

70. Bogs J, Ebadi A, McDavid D, & Robinson SP (2006) Identification of the flavonoid hydroxylases from grapevine and their regulation during fruit development. Plant physiology 140(1):279-291.

71. Owens DK, Crosby KC, Runac J, Howard BA, & Winkel BS (2008) Biochemical and genetic characterization of Arabidopsis flavanone 3beta-hydroxylase. Plant physiology and biochemistry : PPB / Societe francaise de physiologie vegetale 46(10):833-843.

72. Mahajan M & Yadav SK (2014) Overexpression of a tea flavanone 3-hydroxylase gene confers tolerance to salt stress and Alternaria solani in transgenic tobacco. Plant molecular biology 85(6):551-573.

73. Kristiansen KN & Rohde W (1991) Structure of the Hordeum vulgare gene encoding dihydroflavonol-4-reductase and molecular analysis of ant18 mutants blocked in flavonoid synthesis. Molecular & general genetics : MGG 230(1-2):49-59.

74. Sparvoli F, Martin C, Scienza A, Gavazzi G, & Tonelli C (1994) Cloning and molecular analysis of structural genes involved in flavonoid and stilbene biosynthesis in grape (Vitis vinifera L.). Plant molecular biology 24(5):743-755.

75. Bongue-Bartelsman M, O'Neill SD, Tong Y, & Yoder JI (1994) Characterization of the gene encoding dihydroflavonol 4-reductase in tomato. Gene 138(1-2):153-157.

76. Holton TA & Cornish EC (1995) Genetics and Biochemistry of Anthocyanin Biosynthesis. The Plant cell 7(7):1071-1083.

77. Chen M, SanMiguel P, & Bennetzen JL (1998) Sequence organization and conservation in sh2/a1-homologous regions of sorghum and rice. Genetics 148(1):435-443.

78. Beld M, Martin C, Huits H, Stuitje AR, & Gerats AG (1991) Flavonoid synthesis in Petunia hybrida: partial characterization of dihydroflavonol-4-reductase genes. (0167-4412 (Print)).

79. Inagaki Y, et al. (1999) Genomic organization of the genes encoding dihydroflavonol 4-reductase for flower pigmentation in the Japanese and common morning glories. Gene 226(2):181-188.

80. Helariutta Y, Elomaa P, Kotilainen M, Seppanen P, & Teeri TH (1993) Cloning of cDNA coding for dihydroflavonol-4-reductase (DFR) and characterization of dfr expression in the corollas of Gerbera hybrida var. Regina (Compositae). Plant molecular biology 22(2):183-193.

81. Xie DY, Jackson LA, Cooper JD, Ferreira D, & Paiva NL (2004) Molecular and biochemical analysis of two cDNA clones encoding dihydroflavonol-4-reductase from Medicago truncatula. Plant physiology 134(3):979-994.

82. Huang Y, et al. (2012) Molecular cloning and characterization of two genes encoding dihydroflavonol-4-reductase from Populus trichocarpa. PLoS One 7(2):e30364.

83. Kumar V, Nadda G, Kumar S, & Yadav SK (2013) Transgenic tobacco overexpressing tea cDNA encoding dihydroflavonol 4-reductase and anthocyanidin reductase induces early flowering and provides biotic stress tolerance. PLoS One 8(6):e65535.

84. Pang Y, et al. (2013) Functional characterization of proanthocyanidin pathway enzymes from tea and their application for metabolic engineering. Plant physiology 161(3):1103-1116.

85. Wang P, et al. (2018) Evolutionary and functional characterization of leucoanthocyanidin reductases from Camellia sinensis. Planta 247(1):1-16.

86. Bogs J, et al. (2005) Proanthocyanidin synthesis and expression of genes encoding leucoanthocyanidin reductase and anthocyanidin reductase in developing grape berries and grapevine leaves. Plant physiology 139(2):652-663.

87. Tsai CJ, Harding SA, Tschaplinski TJ, Lindroth RL, & Yuan Y (2006) Genome-wide analysis of the structural genes regulating defense phenylpropanoid metabolism in Populus. The New phytologist 172(1):47-62.

88. Wang L, et al. (2013) Isolation and characterization of cDNAs encoding leucoanthocyanidin reductase and anthocyanidin reductase from Populus trichocarpa. PLoS One 8(5):e64664.

89. Yuan L, et al. (2012) Molecular cloning and characterization of PtrLAR3, a gene encoding leucoanthocyanidin reductase from Populus trichocarpa, and its constitutive expression enhances fungal resistance in transgenic plants. Journal of experimental botany 63(7):2513-2524.

90. Liu Y, Shi Z, Maximova S, Payne MJ, & Guiltinan MJ (2013) Proanthocyanidin synthesis in Theobroma cacao: genes encoding anthocyanidin synthase, anthocyanidin reductase, and leucoanthocyanidin reductase. BMC plant biology 13:202.

91. Pang Y, Peel GJ, Wright E, Wang Z, & Dixon RA (2007) Early steps in proanthocyanidin biosynthesis in the model legume Medicago truncatula. Plant physiology 145(3):601-615.

92. Appelhagen I, et al. (2011) Leucoanthocyanidin Dioxygenase in Arabidopsis thaliana: characterization of mutant alleles and regulation by MYB-BHLH-TTG1 transcription factor complexes. Gene 484(1-2):61-68.

93. Abrahams S, et al. (2003) The Arabidopsis TDS4 gene encodes leucoanthocyanidin dioxygenase (LDOX) and is essential for proanthocyanidin synthesis and vacuole development. The Plant journal : for cell and molecular biology 35(5):624-636.

94. Xie DY, Sharma SB, Paiva NL, Ferreira D, & Dixon RA (2003) Role of anthocyanidin reductase, encoded by BANYULS in plant flavonoid biosynthesis. Science 299(5605):396-399.

95. Liu Y, et al. (2012) Purification and characterization of a novel galloyltransferase involved in catechin galloylation in the tea plant (Camellia sinensis). The Journal of biological chemistry 287(53):44406-44417.

96. Mittasch J, Bottcher C, Frolova N, Bonn M, & Milkowski C (2014) Identification of UGT84A13 as a candidate enzyme for the first committed step of gallotannin biosynthesis in pedunculate oak (Quercus robur). Phytochemistry 99:44-51.

97. Khater F, et al. (2012) Identification and functional characterization of cDNAs coding for hydroxybenzoate/hydroxycinnamate glucosyltransferases co-expressed with genes related to proanthocyanidin biosynthesis. Journal of experimental botany 63(3):1201-1214.

98. Cui L, et al. (2016) Identification of UDP-glycosyltransferases involved in the biosynthesis of astringent taste compounds in tea (Camellia sinensis). Journal of experimental botany 67(8):2285-2297.

99. Lehfeldt C, et al. (2000) Cloning of the SNG1 gene of Arabidopsis reveals a role for a serine carboxypeptidase-like protein as an acyltransferase in secondary metabolism. The Plant cell 12(8):1295-1306.

100. Shirley AM, McMichael CM, & Chapple C (2001) The sng2 mutant of Arabidopsis is defective in the gene encoding the serine carboxypeptidase-like protein sinapoylglucose:choline sinapoyltransferase. The Plant journal : for cell and molecular biology 28(1):83-94.

101. Fraser CM, et al. (2007) Related Arabidopsis serine carboxypeptidase-like sinapoylglucose acyltransferases display distinct but overlapping substrate specificities. Plant physiology 144(4):1986-1999.

102. Li AX, Eannetta N, Ghangas GS, & Steffens JC (1999) Glucose polyester biosynthesis. Purification and characterization of a glucose acyltransferase. Plant physiology 121(2):453-460.

103. Milkowski C, Baumert A, Schmidt D, Nehlin L, & Strack D (2004) Molecular regulation of sinapate ester metabolism in Brassica napus: expression of genes, properties of the encoded proteins and correlation of enzyme activities with metabolite accumulation. The Plant journal : for cell and molecular biology 38(1):80-92.

104. Mugford ST, et al. (2009) A serine carboxypeptidase-like acyltransferase is required for synthesis of antimicrobial compounds and disease resistance in oats. The Plant cell 21(8):2473-2484.

105. Tai Y, et al. (2015) Transcriptomic and phytochemical analysis of the biosynthesis of characteristic constituents in tea (Camellia sinensis) compared with oil tea (Camellia oleifera). BMC plant biology 15:190.

106. Liu Y, Gao L, Xia T, & Zhao L (2009) Investigation of the site-specific accumulation of catechins in the tea plant (Camellia sinensis (L.) O. Kuntze) via vanillin-HCl staining. Journal of agricultural and food chemistry 57(21):10371-10376.

107. Pang Y, Peel GJ, Sharma SB, Tang Y, & Dixon RA (2008) A transcript profiling approach reveals an epicatechin-specific glucosyltransferase expressed in the seed coat of Medicago truncatula. Proceedings of the National Academy of Sciences of the United States of America 105(37):14210-14215.

108. Jiang X, et al. (2013) Tissue-specific, development-dependent phenolic compounds accumulation profile and gene expression pattern in tea plant [Camellia sinensis]. PLoS One 8(4):e62315.

109. Pellegrini L, Rohfritsch O, Fritig B, & Legrand M (1994) Phenylalanine ammonia-lyase in tobacco. Molecular cloning and gene expression during the hypersensitive reaction to tobacco mosaic virus and the response to a fungal elicitor. Plant physiology 106(3):877-886.

110. Mizutani M, Ohta D, & Sato R (1997) Isolation of a cDNA and a genomic clone encoding cinnamate 4-hydroxylase from Arabidopsis and its expression manner in planta. Plant physiology 113(3):755-763.

111. Schilmiller AL, et al. (2009) Mutations in the cinnamate 4-hydroxylase gene impact metabolism, growth and development in Arabidopsis. The Plant journal : for cell and molecular biology 60(5):771-782.

112. Chen HC, et al. (2011) Membrane protein complexes catalyze both 4- and 3-hydroxylation of cinnamic acid derivatives in monolignol biosynthesis. Proceedings of the National Academy of Sciences of the United States of America 108(52):21253-21258.

113. Ehlting J, et al. (1999) Three 4-coumarate:coenzyme A ligases in Arabidopsis thaliana represent two evolutionarily divergent classes in angiosperms. The Plant journal : for cell and molecular biology 19(1):9-20.

114. Hu WJ, et al. (1998) Compartmentalized expression of two structurally and functionally distinct 4-coumarate:CoA ligase genes in aspen (Populus tremuloides). Proceedings of the National Academy of Sciences of the United States of America 95(9):5407-5412.

115. Chen HC, et al. (2013) Monolignol pathway 4-coumaric acid:coenzyme A ligases in Populus trichocarpa: novel specificity, metabolic regulation, and simulation of coenzyme A ligation fluxes. Plant physiology 161(3):1501-1516.

116. Lee D & Douglas CJ (1996) Two divergent members of a tobacco 4-coumarate:coenzyme A ligase (4CL) gene family. cDNA structure, gene inheritance and expression, and properties of recombinant proteins. Plant physiology 112(1):193-205.

117. Shirley BW, et al. (1995) Analysis of Arabidopsis mutants deficient in flavonoid biosynthesis. The Plant journal : for cell and molecular biology 8(5):659-671.

118. Tai D, Tian J, Zhang J, Song T, & Yao Y (2014) A Malus crabapple chalcone synthase gene, McCHS, regulates red petal color and flavonoid biosynthesis. PLoS One 9(10):e110570.

119. Hellens RP, et al. (2005) Transient expression vectors for functional genomics, quantification of promoter activity and RNA silencing in plants. Plant methods 1:13.

120. Gosch C, Halbwirth H, Kuhn J, Miosic S, & Stich K (2009) Biosynthesis of phloridzin in apple (Malus x domestica Borkh). Plant science : an international journal of experimental plant biology 176:223-231.

121. Tohge T, et al. (2005) Functional genomics by integrated analysis of metabolome and transcriptome of Arabidopsis plants over-expressing an MYB transcription factor. The Plant journal : for cell and molecular biology 42(2):218-235.

122. Wood AJ & Davies E (1994) A cDNA encoding chalcone isomerase from aged pea epicotyls. Plant physiology 104(4):1465-1466.

123. van Tunen AJ, et al. (1988) Cloning of the two chalcone flavanone isomerase genes from Petunia hybrida: coordinate, light-regulated and differential expression of flavonoid genes. The EMBO journal 7(5):1257-1263.

124. Blyden ER, Doerner PW, Lamb CJ, & Dixon RA (1991) Sequence analysis of a chalcone isomerase cDNA of Phaseolus vulgaris L. Plant molecular biology 16(1):167-169.

125. Moriguchi T, Kita M, Tomono Y, Endo-Inagaki T, & Omura M (2001) Gene expression in flavonoid biosynthesis: Correlation with flavonoid accumulation in developing citrus fruit. Physiol Plantarum 111:66-74.

126. Li FX, et al. (2006) Overexpression of the Saussurea medusa chalcone isomerase gene in S. involucrata hairy root cultures enhances their biosynthesis of apigenin. Phytochemistry 67(6):553-560.

127. Fischer TC, et al. (2007) Flavonoid genes of pear (Pyrus communis). Trees (Berl. West) 21(5):521-529.

128. Lu Y, et al. (2009) Environmental regulation of floral anthocyanin synthesis in Ipomoea purpurea. Molecular ecology 18(18):3857-3871.

Page 41: Supporting Information Appendix - PNAS · S1 Genome sequencing and de novo assembly S1.1 Plant materials for sequencing The tea plant, Camellia sinensis (L.) O. Kuntze, belongs to

129. McKhann HI & Hirsch AM (1994) Isolation of chalcone synthase and chalcone isomerase

cDNAs from alfalfa (Medicago sativa L.): highest transcript levels occur in young roots and

root tips. Plant molecular biology 24(5):767-777.

130. Ralston L, Subramanian S, Matsuno M, & Yu O (2005) Partial reconstruction of flavonoid and

isoflavonoid biosynthesis in yeast using soybean type I and type II chalcone isomerases. Plant

physiology 137(4):1375-1388.

131. Schoenbohm C, Martens S, Eder C, Forkmann G, & Weisshaar B (2000) Identification of the

Arabidopsis thaliana flavonoid 3'-hydroxylase gene and functional expression of the encoded

P450 enzyme. Biological chemistry 381(8):749-753.

132. Hoshino A, et al. (2003) Spontaneous mutations of the flavonoid 3'-hydroxylase gene

conferring reddish flowers in the three morning glory species. Plant & cell physiology

44(10):990-1001.

133. Brugliera F, Barri-Rewell G, Holton TA, & Mason JG (1999) Isolation and characterization of

a flavonoid 3'-hydroxylase cDNA clone corresponding to the Ht1 locus of Petunia hybrida.

The Plant journal : for cell and molecular biology 19(4):441-451.

134. Olsen KM, et al. (2010) Identification and characterisation of CYP75A31, a new flavonoid

3'5'-hydroxylase, isolated from Solanum lycopersicum. BMC plant biology 10:21.

135. Mori S, Kobayashi H, Hoshi Y, Kondo M, & Nakano M (2004) Heterologous expression of

the flavonoid 3',5'-hydroxylase gene of Vinca major alters flower color in transgenic Petunia

hybrida. Plant cell reports 22(6):415-421.

136. Kaltenbach M, Schroder G, Schmelzer E, Lutz V, & Schroder J (1999) Flavonoid hydroxylase

from Catharanthus roseus: cDNA, heterologous expression, enzyme properties and cell-type

specific expression in plants. The Plant journal : for cell and molecular biology

19(2):183-193.

137. Holton TA, et al. (1993) Cloning and expression of cytochrome P450 genes controlling flower

colour. Nature 366(6452):276-279.

138. Shimada Y, et al. (1999) Expression of chimeric P450 genes encoding flavonoid-3',

5'-hydroxylase in transgenic tobacco and petunia plants(1). FEBS letters 461(3):241-245.

139. Takahashi R, Dubouzet JG, Matsumura H, Yasuda K, & Iwashina T (2010) A new allele of

flower color gene W1 encoding flavonoid 3'5'-hydroxylase is responsible for light purple

flowers in wild soybean Glycine soja. BMC plant biology 10:155.

140. Meldgaard M (1992) Expression of chalcone synthase, dihydroflavonol reductase, and

flavanone-3-hydroxylase in mutants of barley deficient in anthocyanin and proanthocyanidin

biosynthesis. TAG. Theoretical and applied genetics. Theoretische und angewandte Genetik

83(6-7):695-706.

141. Cheng H, Wang J, Chu S, Yan HL, & Yu D (2013) Diversifying selection on flavanone

3-hydroxylase and isoflavone synthase genes in cultivated soybean and its wild progenitors.

PloS one 8(1):e54154.

142. Charrier B, Coronado C, Kondorosi A, & Ratet P (1995) Molecular characterization and

expression of alfalfa (Medicago sativa L.) flavanone-3-hydroxylase and

dihydroflavonol-4-reductase encoding genes. Plant molecular biology 29(4):773-786.

143. Kashmir Singh SK, Sudesh Kumar Yadav, Paramvir Singh Ahuja (2009) Characterization of

Page 42: Supporting Information Appendix - PNAS · S1 Genome sequencing and de novo assembly S1.1 Plant materials for sequencing The tea plant, Camellia sinensis (L.) O. Kuntze, belongs to

dihydroflavonol 4-reductase cDNA in tea [Camellia sinensis (L.) O. Kuntze]. Plant

Biotechnology Reports 3(1):95-101.

144. Lin GZ, et al. (2007) Expression and purification of His-tagged flavonol synthase of Camellia

sinensis from Escherichia coli. Protein expression and purification 55(2):287-292.

145. Owens DK, et al. (2008) Functional analysis of a predicted flavonol synthase gene family in

Arabidopsis. Plant physiology 147(3):1046-1061.

146. Preuss A, et al. (2009) Arabidopsis thaliana expresses a second functional flavonol synthase.

FEBS letters 583(12):1981-1986.

147. Fujita A, Goto-Yamamoto N, Aramaki I, & Hashizume K (2006) Organ-specific transcription

of putative flavonol synthase genes of grapevine and effects of plant hormones and shading on

flavonol biosynthesis in grape berry skins. Bioscience, biotechnology, and biochemistry

70(3):632-638.

148. Holton TA, Brugliera F, & Tanaka Y (1993) Cloning and expression of flavonol synthase from

Petunia hybrida. The Plant journal : for cell and molecular biology 4(6):1003-1010.

149. Wellmann F, et al. (2002) Functional expression and mutational analysis of flavonol synthase

from Citrus unshiu. European journal of biochemistry / FEBS 269(16):4134-4142.

150. Xie DY, Sharma SB, & Dixon RA (2004) Anthocyanidin reductases from Medicago

truncatula and Arabidopsis thaliana. Archives of biochemistry and biophysics 422(1):91-102.

151. Tamura K, Stecher G, Peterson D, Filipski A, & Kumar S (2013) MEGA6: molecular

evolutionary genetics analysis version 6.0. Molecular biology and evolution

30(12):2725-2729.

152. Wang D, Zhang Y, Zhang Z, Zhu J, & Yu J (2010) KaKs_Calculator 2.0: a toolkit

incorporating gamma-series methods and sliding window strategies. Genomics, proteomics &

bioinformatics 8(1):77-80.

153. Todd JJ & Vodkin LO (1996) Duplications that suppress and deletions that restore expression

from a chalcone synthase multigene family. The Plant cell 8(4):687-699.

154. Li CF, et al. (2015) Global transcriptome and gene regulation network for secondary

metabolite biosynthesis of tea plant (Camellia sinensis). BMC genomics 16:560.

155. Livak KJ & Schmittgen TD (2001) Analysis of relative gene expression data using real-time

quantitative PCR and the 2(-Delta Delta C(T)) Method. Methods 25(4):402-408.

156. Langfelder P & Horvath S (2008) WGCNA: an R package for weighted correlation network

analysis. BMC bioinformatics 9:559.

157. Zhang B & Horvath S (2005) A general framework for weighted gene co-expression network

analysis. Statistical applications in genetics and molecular biology 4:Article17.

158. Shannon P, et al. (2003) Cytoscape: a software environment for integrated models of

biomolecular interaction networks. Genome Res. 13(11):2498-2504.

159. Lepiniec L, et al. (2006) Genetics and biochemistry of seed flavonoids. Annual review of plant

biology 57:405-430.

160. Xu W, et al. (2014) Complexity and robustness of the flavonoid transcriptional regulatory

network revealed by comprehensive analyses of MYB-bHLH-WDR complexes and their

targets in Arabidopsis seed. (1469-8137 (Electronic)).

161. Albert NW, et al. (2014) A conserved network of transcriptional activators and repressors

Page 43: Supporting Information Appendix - PNAS · S1 Genome sequencing and de novo assembly S1.1 Plant materials for sequencing The tea plant, Camellia sinensis (L.) O. Kuntze, belongs to

regulates anthocyanin pigmentation in eudicots. (1532-298X (Electronic)).

162. Li P, et al. (2016) Regulation of anthocyanin and proanthocyanidin biosynthesis by Medicago

truncatula bHLH transcription factor MtTT8. The New phytologist 210(3):905-921.

163. Nesi N, Jond C, Debeaujon I, Caboche M, & Lepiniec L (2001) The Arabidopsis TT2 gene

encodes an R2R3 MYB domain protein that acts as a key determinant for proanthocyanidin

accumulation in developing seed. The Plant cell 13(9):2099-2114.

164. Bogs J, Jaffe FW, Takos AM, Walker AR, & Robinson SP (2007) The grapevine transcription

factor VvMYBPA1 regulates proanthocyanidin synthesis during fruit development. Plant

physiology 143(3):1347-1361.

165. Koyama K, et al. (2014) Functional characterization of a new grapevine MYB transcription

factor and regulation of proanthocyanidin biosynthesis in grapes. Journal of experimental

botany 65(15):4433-4449.

166. Xu W, et al. (2014) Complexity and robustness of the flavonoid transcriptional regulatory

network revealed by comprehensive analyses of MYB-bHLH-WDR complexes and their

targets in Arabidopsis seed. The New phytologist 202(1):132-144.

167. Matsui K, Umemura Y, & Ohme-Takagi M (2008) AtMYBL2, a protein with a single MYB

domain, acts as a negative regulator of anthocyanin biosynthesis in Arabidopsis. The Plant

journal : for cell and molecular biology 55(6):954-967.

168. Cavallini E, et al. (2015) The phenylpropanoid pathway is controlled at different branches by

a set of R2R3-MYB C2 repressors in grapevine. Plant physiology 167(4):1448-1470.

169. Jun JH, Liu C, Xiao X, & Dixon RA (2015) The transcriptional repressor myb2 regulates both

spatial and temporal patterns of proanthocyandin and anthocyanin pigmentation in Medicago

truncatula. The Plant cell 27(10):2860-2879.

170. Xu W, et al. (2013) Regulation of flavonoid biosynthesis involves an unexpected complex

transcriptional regulation of TT8 expression, in Arabidopsis. The New phytologist

198(1):59-70.

171. Marinova K, et al. (2007) The Arabidopsis MATE transporter TT12 acts as a vacuolar

flavonoid/H+ -antiporter active in proanthocyanidin-accumulating cells of the seed coat. The

Plant cell 19(6):2023-2038.

172. Zhao J (2015) Flavonoid transport mechanisms: how to go, and with whom. Trends in plant

science 20(9):576-585.

173. Zhao J & Dixon RA (2009) MATE transporters facilitate vacuolar uptake of epicatechin

3'-O-glucoside for proanthocyanidin biosynthesis in Medicago truncatula and Arabidopsis.

The Plant cell 21(8):2323-2340.

174. Kitamura S, Shikazono N, & Tanaka A (2004) TRANSPARENT TESTA 19 is involved in the

accumulation of both anthocyanins and proanthocyanidins in Arabidopsis. The Plant journal :

for cell and molecular biology 37(1):104-114.

175. Conn S, Curtin C, Bezier A, Franco C, & Zhang W (2008) Purification, molecular cloning,

and characterization of glutathione S-transferases (GSTs) from pigmented Vitis vinifera L. cell

suspension cultures as putative anthocyanin transport proteins. Journal of experimental botany

59(13):3621-3634.

176. Kitamura S, Akita Y, Ishizaka H, Narumi I, & Tanaka A (2012) Molecular characterization of

Page 44: Supporting Information Appendix - PNAS · S1 Genome sequencing and de novo assembly S1.1 Plant materials for sequencing The tea plant, Camellia sinensis (L.) O. Kuntze, belongs to

an anthocyanin-related glutathione S-transferase gene in cyclamen. Journal of plant

physiology 169(6):636-642.

177. Shitan N (2016) Secondary metabolites in plants: transport and self-tolerance mechanisms.

Bioscience, biotechnology, and biochemistry 80(7):1283-1293.

178. Nobre AC, Rao A, & Owen GN (2008) L-theanine, a natural constituent in tea, and its effect

on mental state. Asia Pacific journal of clinical nutrition 17 Suppl 1:167-168.

179. Deng WW, Ogita S, & Ashihara H (2010) Distribution and biosynthesis of theanine in

Theaceae plants. Plant physiology and biochemistry : PPB / Societe francaise de physiologie

vegetale 48(1):70-72.

180. Ashihara H (2015) Occurrence, biosynthesis and metabolism of theanine

(gamma-glutamyl-L-ethylamide) in plants: a comprehensive review. Natural product

communications 10(5):803-810.

181. Casimir J, Jadot J, & Renard M (1960) Separation and characterization of

N-ethyl-gamma-glutamine from Xerocomus badius. Biochimica et biophysica acta

39:462-468.

182. Narukawa M, Morita K, & Hayashi Y (2008) L-theanine elicits an umami taste with inosine

5′-monophosphate. Bioscience, biotechnology, and biochemistry 72(11):3015-3017.

183. Vuong QV, Bowyer MC, & Roach PD (2011) L-Theanine: properties, synthesis and isolation

from tea. Journal of the science of food and agriculture 91(11):1931-1939.

184. Mu W, Zhang T, & Jiang B (2015) An overview of biological production of L-theanine.

Biotechnology advances 33(3-4):335-342.

185. Deng WW, Ogita S, & Ashihara H (2009) Ethylamine content and theanine biosynthesis in

different organs of Camellia sinensis seedlings. Zeitschrift fur Naturforschung. C, Journal of

biosciences 64(5-6):387-390.

186. Bernard SM & Habash DZ (2009) The importance of cytosolic glutamine synthetase in

nitrogen assimilation and recycling. The New phytologist 182(3):608-620.

187. Lea PJ & Miflin BJ (2010) Nitrogen assimilation and its relevance to crop improvement. Annu.

Plant Rev. 42:1-40.

188. Gregerson RG, Miller SS, Twary SN, Gantt JS, & Vance CP (1993) Molecular characterization

of NADH-dependent glutamate synthase from alfalfa nodules. The Plant cell 5(2):215-226.

189. Melo-Oliveira R, Oliveira IC, & Coruzzi GM (1996) Arabidopsis mutant analysis and gene

regulation define a nonredundant role for glutamate dehydrogenase in nitrogen assimilation.

Proceedings of the National Academy of Sciences of the United States of America

93(10):4718-4723.

190. Takeo T (1974) L-Alanine as a precursor of ethylamine in Camellia sinensis. Phytochemistry

13(8):1401-1406.

191. Takeo T (1978) L-Alanine decarboxylase in Camellia sinensis. Phytochemistry 17:313-314.

192. Kinoshita S, et al. (2009) The occurrence of eukaryotic type III glutamine synthetase in the

marine diatom Chaetoceros compressum. Marine genomics 2(2):103-111.

193. van Rooyen JM, Abratt VR, & Sewell BT (2006) Three-dimensional structure of a type III

glutamine synthetase by single-particle reconstruction. Journal of molecular biology

361(4):796-810.

Page 45: Supporting Information Appendix - PNAS · S1 Genome sequencing and de novo assembly S1.1 Plant materials for sequencing The tea plant, Camellia sinensis (L.) O. Kuntze, belongs to

194. Lee BN & Adams TH (1994) The Aspergillus nidulans fluG gene is required for production of

an extracellular developmental signal and is related to prokaryotic glutamine synthetase I.

Genes & development.

195. Mathis R, Gamas P, Meyer Y, & Cullimore JV (2000) The presence of GSI-like genes in

higher plants: support for the paralogous evolution of GSI and GSII genes. Journal of

molecular evolution 50(2):116-122.

196. Doskocilova A, et al. (2011) A nodulin/glutamine synthetase-like fusion protein is implicated

in the regulation of root morphogenesis and in signalling triggered by flagellin. Planta

234(3):459-476.

197. Silva LS, Seabra AR, Leitao JN, & Carvalho HG (2015) Possible role of glutamine synthetase

of the prokaryotic type (GSI-like) in nitrogen signaling in Medicago truncatula. Plant Sci

240:98-108.

198. Ghoshroy S, Binder M, Tartar A, & Robertson DL (2010) Molecular evolution of glutamine

synthetase II: Phylogenetic evidence of a non-endosymbiotic gene transfer event early in plant

evolution. BMC evolutionary biology 10:198.

199. Gebhardt C, Oliver JE, Forde BG, Saarelainen R, & Miflin BJ (1986) Primary structure and

differential expression of glutamine synthetase genes in nodules, roots and leaves of

Phaseolus vulgaris. The EMBO journal 5(7):1429-1435.

200. Guan M, Moller IS, & Schjoerring JK (2015) Two cytosolic glutamine synthetase isoforms

play specific roles for seed germination and seed yield structure in Arabidopsis. Journal of

experimental botany 66(1):203-212.

201. Miao GH, Hirel B, Marsolier MC, Ridge RW, & Verma DP (1991) Ammonia-regulated

expression of a soybean gene encoding cytosolic glutamine synthetase in transgenic Lotus

corniculatus. The Plant cell 3(1):11-22.

202. Stanford AC, Larsen K, Barker DG, & Cullimore JV (1993) Differential expression within the

glutamine synthetase gene family of the model legume Medicago truncatula. Plant physiology

103(1):73-81.

203. Yadav SK (2009) Computational structural analysis and kinetic studies of a cytosolic

glutamine synthetase from Camellia sinensis (L.) O. Kuntze. The protein journal

28(9-10):428-434.

204. Martin A, et al. (2006) Two cytosolic glutamine synthetase isoforms of maize are specifically

involved in the control of grain production. The Plant cell 18(11):3252-3274.

205. Swarbreck SM, Defoin-Platel M, Hindle M, Saqi M, & Habash DZ (2011) New perspectives

on glutamine synthetase in grasses. Journal of experimental botany 62(4):1511-1522.

206. Bernard SM, et al. (2008) Gene expression, cellular localisation and function of glutamine

synthetase isozymes in wheat (Triticum aestivum L.). Plant molecular biology 67(1-2):89-105.

207. Yamaya T & Kusano M (2014) Evidence supporting distinct functions of three cytosolic

glutamine synthetases and two NADH-glutamate synthases in rice. Journal of experimental

botany 65(19):5519-5525.

208. Goodall AJ, Kumar P, & Tobin AK (2013) Identification and expression analyses of cytosolic

glutamine synthetase genes in barley (Hordeum vulgare L.). Plant & cell physiology

54(4):492-505.

Page 46: Supporting Information Appendix - PNAS · S1 Genome sequencing and de novo assembly S1.1 Plant materials for sequencing The tea plant, Camellia sinensis (L.) O. Kuntze, belongs to

209. Carvalho H, et al. (2000) Differential expression of the two cytosolic glutamine synthetase

genes in various organs of Medicago truncatula. Plant science : an international journal of

experimental plant biology 159(2):301-312.

210. Yamamoto S, Wakayama M, & Tachiki T (2006) Cloning and expression of Pseudomonas

taetrolens Y-30 gene encoding glutamine synthetase: an enzyme available for theanine

production by coupled fermentation with energy transfer. Bioscience, biotechnology, and

biochemistry 70(2):500-507.

211. Yamamoto S, Wakayama M, & Tachiki T (2007) Characterization of theanine-forming enzyme

from Methylovorus mays no. 9 in respect to utilization of theanine production. Bioscience,

biotechnology, and biochemistry 71(2):545-552.

212. Unno H, et al. (2006) Atomic structure of plant glutamine synthetase: a key enzyme for plant

productivity. The Journal of biological chemistry 281(39):29287-29296.

213. Torreira E, et al. (2014) The structures of cytosolic and plastid-located glutamine synthetases

from Medicago truncatula reveal a common and dynamic architecture. Acta crystallographica.

Section D, Biological crystallography 70(Pt 4):981-993.

214. Yamamoto S, Wakayama M, & Tachiki T (2008) Cloning and expression of Methylovorus

mays No. 9 gene encoding gamma-glutamylmethylamide synthetase: an enzyme usable in

theanine formation by coupling with the alcoholic fermentation system of baker's yeast.

Bioscience, biotechnology, and biochemistry 72(1):101-109.

Page 47: Supporting Information Appendix - PNAS · S1 Genome sequencing and de novo assembly S1.1 Plant materials for sequencing The tea plant, Camellia sinensis (L.) O. Kuntze, belongs to

Supporting Figures  

[Fig. S1 image. Panel A: flow cytometry histogram; x-axis: FL-4 DAPI, y-axis: Counts. Panel B: 17-mer depth distribution; x-axis: Depth (X), y-axis: Percentage (%).]

Fig. S1. Evaluation of the genome size of Camellia sinensis cv. Shuchazao by flow cytometry and 17-mer analysis. (A) Determination of 2C DNA contents and genome sizes of C. sinensis cv. Shuchazao and Glycine max samples by flow cytometry. The term ‘C-value’ refers to the DNA content of an unreplicated haploid chromosome complement. The left peak in the flow cytometry histogram is the 2C DNA peak of Glycine max at 52.64, and the right peak is the 2C DNA peak of Shuchazao at 132.52. The measurement was repeated in multiple experiments. Relative to soybean (2.5 pg), the genome size of Shuchazao was estimated to be 2.98±0.10 Gb, corresponding to a 2C DNA content of 6.09±0.20 pg. (B) Distribution of 17-mer depth in high-quality reads. Approximately 350 Gb of sequencing reads from short-insert libraries were selected and split into 17 bp sequences (17-mers), and the frequency of these 17-mers was plotted against their depth. The X-axis represents the sequencing depth and the Y-axis represents the frequency of 17-mers at a given sequencing depth. Genome size was estimated from this distribution (see Supplementary Note S1). The distribution is bimodal because of heterozygosity.
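The arithmetic behind both panels can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the factor of 0.978 Gb per pg is the commonly used conversion, the k-mer formula (genome size = total number of k-mers / main peak depth) is the usual estimate and is assumed here rather than taken from Supplementary Note S1, and the function names and the k-mer totals in the usage lines are hypothetical.

```python
# Sketch of the genome-size arithmetic; names and k-mer numbers are hypothetical.

GB_PER_PG = 0.978  # commonly used conversion: 1 pg of DNA is about 0.978 Gb


def two_c_from_peaks(sample_peak, standard_peak, standard_2c_pg):
    """Scale the standard's 2C DNA content by the ratio of fluorescence peaks."""
    return sample_peak / standard_peak * standard_2c_pg


def haploid_size_gb(two_c_pg):
    """Convert a 2C DNA content (pg) to a haploid (1C) genome size in Gb."""
    return two_c_pg / 2 * GB_PER_PG


def kmer_genome_size(total_kmers, peak_depth):
    """Assumed estimate: total number of k-mers divided by the main peak depth."""
    return total_kmers / peak_depth


if __name__ == "__main__":
    # Mean 2C content reported in the caption (6.09 pg).
    print(round(haploid_size_gb(6.09), 2), "Gb")
    # Single pair of peak positions quoted in the caption, soybean 2C = 2.5 pg.
    print(round(two_c_from_peaks(132.52, 52.64, 2.5), 2), "pg")
    # Hypothetical k-mer total and peak depth, for illustration only.
    print(kmer_genome_size(3.0e11, 100) / 1e9, "Gb")
```

With the caption's values, a 2C content of 6.09 pg converts to about 2.98 Gb, which matches the reported estimate. The single pair of peak positions quoted in the caption gives a somewhat higher 2C value (about 6.3 pg), presumably because the reported figure is a mean over multiple experiments.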

[Fig. S2 image (three pages). Each row aligns one BAC against a scaffold, with coordinates in Kb. Panel titles: CSS assembly, CSA assembly. BAC IDs and the scaffold IDs shown with them: Csi020o15 (Scaffold934, xpSc0053957); Csi037A20 (Scaffold372, Sc0000128); Csi037L05 (Scaffold4870, Sc0001570); Csi037o01 (xpSc0054496, Scaffold1414, Sc0000996); Csi044B12 (Scaffold1073, Sc0000003); Csi037o08 (Scaffold501); Csi092H17 (Scaffold12502, Sc0003205); Csi106D7 (Sc0000083, Scaffold1403); Csi205A01 (Sc0000793, Scaffold411); Csi205B03 (Sc0000918, Scaffold546); Csi205E16 (Scaffold1909, Sc0000288); Csi205D12 (Sc0001185, Scaffold596); Csi205J21 (Sc0000726, Scaffold3673); Csi205k12 (Scaffold754, Sc0001803); Csi205k18 (Sc0001650, Scaffold1875); Csi271P8 (xpSc0053428, Scaffold1785); Csi274K14 (Scaffold52, Sc0004598); Csi205H20 (Sc0000683, Scaffold1357).]


Fig. S2. Alignment of the CSS and CSA assembled scaffolds to 18 sequenced CSS BACs. In each comparison, the upper line represents the BAC sequence, with its ID above, and the lower line shows the CSS (left panel) or CSA (right panel) scaffold, with its ID below. The orange blocks show aligned regions between Sanger-sequenced BACs and scaffolds.

 

[Fig. S3 image: bar chart. Y-axis: Number of genes (0 to 54,000). Bars: A. chinensis, C. s. sinensis, C. s. assamica, A. trichopoda, E. guineensis, T. cacao, V. vinifera, P. persica, P. trichocarpa, A. thaliana, M. truncatula, C. arabica. Legend: Single-copy orthologs, Multiple-copy orthologs, Unique paralogs, Other orthologs, Unclustered genes.]


Fig. S3. Clusters of orthologous and paralogous gene families in tea and 10 other fully sequenced plant genomes. Only the longest isoform of each gene was used. Gene clusters (families) were identified using the OrthoMCL package with default parameters.

[Fig. S4 image: dated phylogenetic tree with pie charts. Time axis: 183, 151, 119, 100, 79, 10, 0 million years ago. Tips: Elaeis guineensis, Vitis vinifera, Populus trichocarpa, Arabidopsis thaliana, Theobroma cacao, Medicago truncatula, Prunus persica, C. sinensis var. assamica, C. sinensis var. sinensis, Actinidia chinensis, Coffea arabica, Amborella trichopoda. Root label: MRCA (11127). Branch labels, gene families Expansion / Contraction (branch assignment not recoverable from the text layer): +2128 / -3520; +795 / -2135; +5239 / -404; +1931 / -1595; +605 / -1346; +6 / -8; +14 / -80; +1354 / -3895; +856 / -1446; +26 / -1; +28 / -63; +20 / -211; +1020 / -1690; +1810 / -1001; +882 / -507; +3221 / -1866; +842 / -268; +692 / -1953; +78 / -251; +217 / -183; +27 / -28; +614 / -2033.]


Fig. S4. Expansion and contraction of gene families. The green and red numbers indicate the numbers of expanded and contracted gene families, respectively. Conserved gene families are indicated in blue in the pie charts. MRCA stands for the most recent common ancestor.

[Fig. S5 image. Tracks: Grape (labels G1 to G19) and Tea plant. Axes labeled in bp (0 to 3000 and 0 to 5000).]


Fig. S5. The tea WGD identified by analysis of gene collinearity between tea and grape orthologs. Collinear gene blocks between the tea and grape genomes were selected where each block contained at least 3 genes, with non-syntenic gaps of fewer than 5 genes. The grape genes are arranged according to their gene order. Grape gene sets on different chromosomes are displayed as blue lines, and the tea collinear gene blocks are shown as red blocks. These results indicate that CSS carries ~2 duplicates of grape genome orthologs.
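The block-selection rule in this caption (at least 3 genes per block, gaps of fewer than 5 genes) can be illustrated with a deliberately simplified sketch. It is not the collinearity tool used in the study: it assumes the ortholog pairs are already known, that each gene is identified by its rank in gene order on a single grape chromosome and a single tea scaffold, and the function name `collinear_blocks` and the example pairs are hypothetical.

```python
# Simplified illustration of collinear-block filtering; not the authors' tool.

def collinear_blocks(anchors, min_genes=3, max_gap=5):
    """Group (grape_rank, tea_rank) ortholog pairs into collinear blocks.

    Consecutive pairs stay in the same block when fewer than `max_gap`
    genes lie between them in both genomes. Blocks with fewer than
    `min_genes` pairs are discarded.
    """
    blocks, current = [], []
    for pair in sorted(set(anchors)):
        if current:
            grape_gap = pair[0] - current[-1][0] - 1
            tea_gap = abs(pair[1] - current[-1][1]) - 1
            if 0 <= grape_gap < max_gap and 0 <= tea_gap < max_gap:
                current.append(pair)
                continue
            # Gap too large: close the current block if it is big enough.
            if len(current) >= min_genes:
                blocks.append(current)
        current = [pair]
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    # Hypothetical ortholog pairs, given as gene-order ranks.
    pairs = [(1, 10), (2, 11), (4, 13), (20, 40), (21, 41), (30, 5)]
    for block in collinear_blocks(pairs):
        print(block)
```

In this toy input only the first three pairs form a block that passes both thresholds; the remaining pairs are either too few or too far apart.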

[Fig. S6 image: histogram. X-axis: Divergence time (mya), 0 to 300. Y-axis: Percentage of analyzed gene pairs, 0 to 18. Annotations: 30 - 40 mya and 90 - 100 mya.]


Fig. S6. Genome duplication in tea plant. The Ks values calculated for the 2-member gene clusters were converted to divergence times. The y-axis shows the percentage of gene clusters at each divergence time. mya, million years ago.
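The conversion from Ks to divergence time follows the standard relation T = Ks / (2r), where r is the synonymous substitution rate per site per year. The sketch below shows the conversion and the binning into percentages. The rate value is a placeholder assumption, since the rate used by the authors is not stated in this caption, and the function names and example Ks values are hypothetical.

```python
# Sketch of Ks-to-time conversion; the rate is an assumed placeholder value.
from collections import Counter

ASSUMED_RATE = 6.5e-9  # substitutions per synonymous site per year (assumption)


def ks_to_mya(ks, rate_per_site_per_year=ASSUMED_RATE):
    """Divergence time in million years: T = Ks / (2 * r)."""
    return ks / (2.0 * rate_per_site_per_year) / 1e6


def percent_by_bin(times_mya, bin_width=10):
    """Percentage of gene pairs falling in each time bin (keyed by bin start)."""
    counts = Counter(int(t // bin_width) * bin_width for t in times_mya)
    total = len(times_mya)
    return {start: 100.0 * n / total for start, n in sorted(counts.items())}


if __name__ == "__main__":
    # Hypothetical Ks values for paralogous gene pairs.
    ks_values = [0.42, 0.45, 0.48, 1.20, 1.25]
    times = [ks_to_mya(ks) for ks in ks_values]
    print(percent_by_bin(times))
```

Peaks in the resulting histogram correspond to bursts of duplication; their positions on the time axis scale inversely with the assumed rate.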

[Fig. S7 image: dated phylogenetic tree. Tips: Elaeis guineensis, Vitis vinifera, Populus trichocarpa, Arabidopsis thaliana, Theobroma cacao, Medicago truncatula, Prunus persica, Actinidia chinensis, Coffea arabica, Amborella trichopoda, C. sinensis var. sinensis. Node ages shown (mya): 103, 108, 107, 111, 116, 80, 105, 118, 152, 183.]


Fig. S7. Divergence time between CSS and 10 other plant species. Phylogenetic relationships among these eleven plant species were analyzed using 320 single-copy genes. Protein sequences of single-copy gene families were first aligned using MUSCLE (http://www.ebi.ac.uk/Tools/msa/muscle/). The coding sequences of the genes were extracted based on the alignment results and concatenated to generate a supergene for each species. A phylogenetic tree was constructed from the supergene sequences with MrBayes, run for 1,000,000 generations (1 sample per 100 generations). The best substitution model (GTR+gamma+I) was determined by Modeltest. Using Amborella trichopoda as the out-group, two independent runs supported the same topology. The number beside each node indicates the calculated divergence time (mya, million years ago) of each lineage separation.

[Fig. S8 image: line plot. X-axis: 4DTv distance (corrected for multiple substitutions), 0 to 1.5. Y-axis: Percentage of gene pairs, 2.5 to 12.5. Series: C.sinensis var.sinensis vs C.sinensis var.sinensis; C.sinensis var.sinensis vs T.cacao; C.sinensis var.sinensis vs V.vinifera.]


Fig. S8. Distribution of 4DTv distances between syntenic orthologous genes among the CSS, grape, and cacao genomes. The 4DTv (transversion rate at fourfold degenerate sites) values of the identified homologous blocks in tea-vs-tea, tea-vs-grape and tea-vs-cacao comparisons were calculated with the HKY substitution model.
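For readers unfamiliar with the statistic, the sketch below computes an uncorrected 4DTv for one pair of aligned coding sequences: the fraction of fourfold degenerate third-codon positions at which the two sequences differ by a transversion. It does not apply the HKY correction mentioned in the caption, it assumes the standard genetic code and in-frame, gap-free alignments, and the function name `raw_4dtv` and the example sequences are hypothetical.

```python
# Uncorrected 4DTv for one gene pair; the HKY correction is NOT implemented.

# First two bases of the eight fourfold degenerate codon families
# (Leu CTN, Val GTN, Ser TCN, Pro CCN, Thr ACN, Ala GCN, Arg CGN, Gly GGN).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}


def raw_4dtv(cds_a, cds_b):
    """Fraction of shared fourfold degenerate sites that show a transversion."""
    sites = transversions = 0
    for i in range(0, min(len(cds_a), len(cds_b)) - 2, 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        # Both codons must belong to the same fourfold degenerate family.
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue
        base_a, base_b = codon_a[2], codon_b[2]
        if base_a not in "ACGT" or base_b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        sites += 1
        # A transversion swaps a purine for a pyrimidine or vice versa.
        if (base_a in PURINES) != (base_b in PURINES):
            transversions += 1
    return transversions / sites if sites else float("nan")


if __name__ == "__main__":
    # Hypothetical aligned coding sequences (Ala-Gly in both).
    print(raw_4dtv("GCTGGA", "GCAGGG"))
```

In the toy pair, the first codon differs by a transversion (T vs A) and the second by a transition (A vs G), so one of two fourfold sites counts.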

[Fig. S9 image. Species columns: Tea plant, Kiwifruit, Coffee, Cacao, Arabidopsis, Poplar, Grape. Time axis (Myr): 20, 40, 60, 80, 100, 120, >130. Marked events: γWGD, WGT. Genes: PAL, C4H, 4CL, ANR, ANS, F3H, F3’H, FLS, LAR, CHI, DFR, SCPL1A, CHS, F3’5’H, UGT84A, ADC, GDH, GS, GOGAT, NMT.]


Fig. S9. Evolution of secondary metabolite-associated genes in seven representative plant species. A total of 20 genes involved in plant secondary metabolism were investigated. Duplication events for each gene pair are shown along a timeline ranging from 0 to 200 million years ago.

[Fig. S10 image: co-expression network. Structural gene nodes: SCPL, ANS, CHS, DFR, FLS, PAL. Transcription factor family nodes: Jumonji, SFLUG, SET, HMG, E2F-DP, CPP, mTERF, HB, Trihelix, TUB, DBP, NF-YC, GRF, G2-like, OFP, TCP, TRAF, C2C2-CO-like, AP2, ARR-B, BBR-BPC, ERF, BES1, Tify, GNAT, RAV, M-type, PLATZ, LIM, WRKY, MYB-related, SBP, zf-HD, DBB, NAC, AUX/IAA, C2C2-YABBY, MIKC, C2C2-Dof, C3H, LOB, bZIP, C2C2-GATA, MYB, NF-YB, bHLH, Orphans, PHD, C2H2, SNF2, B3.]


Fig. S10. Transcriptional regulation of catechin biosynthetic genes. A co-expression network connecting structural genes in catechin biosynthesis with transcription factors (TFs) provides insights into the regulation of catechin biosynthetic genes. The color-filled hexagons represent the structural genes associated with catechin biosynthesis that were highly (green) or weakly (red) expressed in bud and leaf. Expression correlations between TFs (colored solid circles) and catechin-related genes (colored solid hexagons) are shown as colored lines (Pearson’s correlation test, P ≤ 1e-6).

[Fig. S11 image: multiple sequence alignment in seven blocks. Row labels in each block: MtGSIIa, ZmGS1a, CsGSII-1.1, GMAS, CsTS1, PtGS.]


Fig. S11. Alignment of ZmGSIa and MtGSIIa, with CsTSI, CsGSII-1.1, and Pseudomonas taetrolens PtGS. Blue: Metal coordination; Yellow: Glutamate binding; Red: ATP binding; Purple: Ammonia binding.


[Fig. S12 image labels. y-axis: Theanine content (mg.g-1 dry weight). x-axis: RT, ST, FL, FR, AB, YL, ML, OL.]

Fig. S12. Theanine content in different tissues of tea cultivar Shuchazao. Theanine content (dry-weight basis) was measured by HPLC in eight tissues of tea cultivar Shuchazao: apical buds (AB), young leaves (YL), mature leaves (ML), old leaves (OL), young stems (ST), flowers (FL), young fruits (FR) and tender roots (RT).


 

[Fig. S13 image labels. Genes: LAR: TEA027582.1; SCPL: TEA034028.1; DFR: TEA032730.1; ANS: TEA015762.1; ANR: TEA022960.1; UGT84A: TEA026127.1. Rows: qRT-PCR, FPKM. PCC values (in extracted order): 0.99, 0.96, 0.94, 0.90, 0.85, 0.79. Color scale: 0.2, 0.4, 0.6, 0.8.]

 

Fig. S13. Validation of the expression of six key genes in catechin biosynthesis by quantitative real-time PCR. The six selected genes were LAR (TEA027582.1), DFR (TEA032730.1), ANS (TEA015762.1), ANR (TEA022960.1), UGT84A (TEA026127.1) and SCPL1A (TEA034028.1). Pearson correlation coefficient (PCC) values comparing the expression levels of the selected genes from qRT-PCR analysis with those from transcriptome analysis are given as numerals in the red squares.
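As a minimal sketch of how a PCC between the two measurements of one gene can be computed, the example below correlates a qRT-PCR profile with an FPKM profile across tissues. The function name `pearson_cc` and the numbers are hypothetical and chosen only for illustration; they are not data from this study.

```python
import math

def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression profiles for one gene across eight tissues
# (illustrative numbers only, not data from the paper).
qrt_pcr = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1]
fpkm = [10.5, 19.8, 31.0, 39.5, 52.0, 61.2, 69.9, 80.3]

pcc = pearson_cc(qrt_pcr, fpkm)
print(f"PCC = {pcc:.2f}")
```

Because the PCC is invariant to linear rescaling, the relative qRT-PCR levels and the FPKM values can be compared directly without putting them on a common scale.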


[Fig. S14 image labels. Panel letters: A, B, C, D. Theanine content panel: y-axis Theanine content (mg.g-1 dry weight), x-axis Day after EA treatment (1, 3, 6, 9, 12); legend Control, Treatment. Gene expression panels: TEA015198.1, TEA011593.1, TEA004987.1, TEA004798.1; y-axis Relative level, x-axis Day after EA treatment (1, 3, 6, 9, 12); legend Control, Treatment.]

 

Fig. S14. Induced expression of TS and GS genes involved in theanine biosynthesis and changes in theanine content after treatment of tea cutting seedlings with an aqueous solution of ethylamine (EA) hydrochloride. Gene expression and theanine content at each time point were measured in three biological replicates, and the average values are listed in the table.


 

[Fig. S15 tree panels and leaf labels (bootstrap values omitted):
PAL: TEA003137.1, TEA023243.1, TEA024587.1, TEA003374.1, TEA014056.1, AT3G53260.1
C4H: TEA034002.1, TEA034001.1, TEA014864.1, TEA016772.1, AT2G30490.1
4CL: TEA027829.1, TEA002100.1, TEA025906.1, TEA034012.1, AT3G21230.1
CHI: TEA033031.1, TEA022883.1, TEA018689.1, TEA013101.1, TEA033023.1, TEA034003.1, AT5G05270.1
F3’5’H: TEA034051.1, TEA034021.1, TEA026294.1, TEA013315.1, CAI54277
FLS: TEA016601.1, TEA034025.1, TEA010328.1, TEA010326.1, TEA006643.1, AT5G08640.2
F3H: ACA81427, ACA81428, AT3G51240.1, TEA023790.1, TEA034016.1
LAR: TEA027582.1, TEA026458.1, TEA021535.1, CAI26308
DFR: TEA023829.1, TEA024758.1, TEA024762.1, TEA010588.1, TEA034018.1, TEA032730.1, AT5G42800.1
ANR: TEA022960.1, TEA009266.1, ADD51353.1, XP_002317270.1, AT1G61720.1
ANS: TEA010322.1, TEA015769.1, TEA015762.1, AT4G22870.2
CHS: TEA018665.1, TEA034042.1, TEA034044.1, TEA034043.1, TEA023331.1, TEA023333.1, TEA023340.1, TEA034011.1, TEA034019.1, AT5G13930.1]


[Fig. S15 (continued) tree panel and leaf labels (bootstrap values omitted):
SCPL: TEA034036.1, TEA034032.1, TEA034031.1, TEA023451.1, TEA034033.1, TEA034034.1, TEA034039.1, TEA034055.1, TEA027270.1, TEA034056.1, TEA020540.1, TEA034017.1, TEA010715.1, TEA034049.1, TEA034028.1, TEA016463.1, TEA016469.1, TEA009664.1, TEA000223.1, TEA023432.1, TEA023444.1, AEE31604.2, AEE35436.1]

 

Fig. S15. Phylogenetic trees of the genes involved in the catechin biosynthetic pathway. A maximum likelihood (ML) tree for each catechin biosynthetic gene was constructed using MEGA 7 with 100 bootstrap replicates.
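The bootstrap values on these trees come from the general nonparametric bootstrap procedure: alignment columns are resampled with replacement, a tree is inferred from each resampled alignment, and the support for a clade is the percentage of replicate trees that contain it. The sketch below illustrates only the resampling and the support calculation; it is not MEGA 7's implementation, the ML tree inference step is omitted, and the sequence names, the toy alignment and the replicate clade sets are hypothetical.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate)."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def clade_support(replicate_clades, clade):
    """Percentage of replicate trees that contain the given clade.

    Each replicate tree is represented as a set of frozensets of leaf names.
    """
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return 100.0 * hits / len(replicate_clades)

# Hypothetical toy alignment (not sequences from the paper).
alignment = {
    "seqA": "ATGCCGTA",
    "seqB": "ATGCCGTT",
    "seqC": "ATGACGTA",
    "seqD": "TTGACGAA",
}

rng = random.Random(0)  # fixed seed so the resampling is reproducible
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]

# Hypothetical clade sets, standing in for trees inferred from four replicates.
toy_replicate_clades = [
    {frozenset({"seqA", "seqB"}), frozenset({"seqC", "seqD"})},
    {frozenset({"seqA", "seqB"}), frozenset({"seqC", "seqD"})},
    {frozenset({"seqA", "seqC"}), frozenset({"seqB", "seqD"})},
    {frozenset({"seqA", "seqB"}), frozenset({"seqC", "seqD"})},
]
support = clade_support(toy_replicate_clades, frozenset({"seqA", "seqB"}))
print(f"support for (seqA, seqB): {support:.0f}")
```

In a real analysis each entry of `replicates` would be passed to a tree-inference program, and the resulting trees would supply the clade sets.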


[Fig. S16 tree leaf labels (bootstrap values omitted; scale bar 0.1):
Coffee: Cc01 g02900, Cc01 g02890, Cc01 g00620, Cc01 g00540, Cc00 g09150, Cc01 g00460, Cc05 g05420, Cc01 g02180, Cc01 g02210, Cc00 g33110, Cc01 g02230, Cc00 g14730, Cc09 g07000, Cc02 g09350, Cc09 g06950, Cc01 g00720, Cc09 g06970, Cc00 g24720, Cc09 g06960, Cc00 g30850
Cacao: Tc08 g002470, Tc08 g002490, Tc01 g002750, Tc00 g067960, Tc02 g006610, Tc10 g001800, Tc00 g008060, Tc10 g001820, Tc02 g030140, Tc02 g016270
Tea plant: TEA015791.1, TEA015176.1, TEA022559.1, TEA028051.1, TEA028050.1, TEA031962.1, TEA032424.1, TEA017731.1, TEA014699.1, TEA010250.1, TEA014705.1]

Fig. S16. Phylogenetic tree of the N-methyltransferase (NMT) genes of tea plant, cacao and coffee. A maximum likelihood (ML) tree of the NMT genes was constructed using MEGA 7 with 1000 bootstrap replicates.