aiming off the target: studying repetitive dna using target ......2020/12/10  · 109 (novak et al.,...

31
1 Aiming off the target: studying repetitive DNA using target capture sequencing reads 1 2 Lucas Costa 1 , André Marques 2 , Chris Buddenhagen 3 , William Wayt Thomas 4 , Bruno Huettel 5 3 Veit Schubert 6 , Steven Dodsworth 7 , Andreas Houben 6 , Gustavo Souza 1 , Andrea Pedrosa- 4 Harand 1 5 6 1 Laboratory of Plant Cytogenetics and Evolution, Department of Botany, Federal University of Pernambuco, 7 Recife-PE, Brazil 8 2 Max Planck Institute for Plant Breeding Research, Cologne, Germany 9 3 AgResearch, Plant Functional Biology, Ruakura, New Zealand 10 4 New York Botanical Garden, Bronx, New York, United States of America 11 5 Max Planck Genome Centre Cologne, Max Planck Institute for Plant Breeding Research, Cologne, Germany 12 6 Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany 13 7 School of Life Sciences, University of Bedfordshire, Luton, UK 14 15 [email protected] 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for this this version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515 doi: bioRxiv preprint

Upload: others

Post on 05-Feb-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

  • 1

    Aiming off the target: studying repetitive DNA using target capture sequencing reads 1

    2

    Lucas Costa1, André Marques2, Chris Buddenhagen3, William Wayt Thomas4, Bruno Huettel5 3

    Veit Schubert6, Steven Dodsworth7, Andreas Houben6, Gustavo Souza1, Andrea Pedrosa-4

    Harand1 5

    6

    1 Laboratory of Plant Cytogenetics and Evolution, Department of Botany, Federal University of Pernambuco, 7

    Recife-PE, Brazil 8

    2 Max Planck Institute for Plant Breeding Research, Cologne, Germany 9

    3 AgResearch, Plant Functional Biology, Ruakura, New Zealand 10

    4 New York Botanical Garden, Bronx, New York, United States of America 11

    5 Max Planck Genome Centre Cologne, Max Planck Institute for Plant Breeding Research, Cologne, Germany 12

    6 Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany 13

    7 School of Life Sciences, University of Bedfordshire, Luton, UK 14

    15

    [email protected] 16

    17

    18

    19

    20

    21

    22

    23

    24

    25

    26

    27

    28

    29

    30

    31

    32

    33

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    mailto:[email protected]://doi.org/10.1101/2020.12.10.419515

  • 2

    SUMMARY 34

    • With the advance of high-throughput sequencing (HTS), reduced-representation 35

    methods such as target capture sequencing (TCS) emerged as cost-efficient ways of 36

    gathering genomic information. As the off-target reads from such sequencing are 37

    expected to be similar to genome skims (GS), we assessed the quality of repeat 38

    characterization using this data. 39

    • For this, repeat composition from TCS datasets of five Rhynchospora (Cyperaceae) 40

    species were compared with GS data from the same taxa. 41

    • All the major repetitive DNA families were identified in TCS, including repeats that 42

    showed abundances as low as 0.01% in the GS data. Rank correlation between GS and 43

    TCS repeat abundances were moderately high (r = 0.58-0.85), increasing after filtering 44

    out the targeted loci from the raw TCS reads (r = 0.66-0.92). Repeat data obtained by 45

    TCS was also reliable to develop a cytogenetic probe and solve phylogenetic 46

    relationships of Rhynchospora species with high support. 47

    • In light of our results, TCS data can be effectively used for cyto- and phylogenomic 48

    investigations of repetitive DNA. Given the growing availability of HTS reads, driven 49

    by global phylogenomic projects, our strategy represents a way to recycle genomic data 50

    and contribute to a better characterization of plant biodiversity. 51

    52

    Keywords: Genome Skimming, holocentric, Reduced-representation sequencing, 53

    RepeatExplorer, Rhynchospora, satellite DNA, transposable elements 54

    55

    56

    57

    58

    59

    60

    61

    62

    63

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 3

    INTRODUCTION 64

    One intriguing aspect of eukaryotic genomes is the staggering 60,000-fold variation in 65

    DNA content among species (Elliott & Gregory, 2015). Advances in genomics have led to the 66

    discovery that most of this diversity is the result of variable amounts of repetitive DNA, 67

    commonly divided into tandem repeats and dispersed repeats (Weiss-Schneeweiss et al., 2015; 68

    Elliott & Gregory, 2015). Tandemly distributed satellite DNAs are remarkable for their fast 69

    evolution and variance in abundance and structure at all hierarchical levels (Novák et al., 2017; 70

    Ávila Robledillo et al., 2018). As for dispersed repetitive sequences, transposable elements 71

    (TE) are especially abundant in flowering plant genomes (Jurka et al., 2011; Galindo-González 72

    et al., 2017). Retrotransposons, particularly the ones possessing a long terminal repeat (LTR-73

    retrotransposons) account for most of this abundance (Weiss-Schneeweiss et al., 2015) with 74

    two major superfamilies being recognized (Ty1/copia and Ty3/gypsy) based on the order of 75

    the protein coding domains, and further divided into a number of lineages according to 76

    phylogenetic distance (Neumann et al., 2019). 77

    Contrasting with the previous notion that repetitive DNA was no more than “junk 78

    DNA”, cytogenomic studies in the last few decades helped to uncover possible roles for tandem 79

    and dispersed repeats. Specific satellite DNAs and retrotransposons have been found to have a 80

    functional and/or structural role in centromeres (Cheng et al., 2002; Nagaki et al., 2003; 81

    Houben & Schubert, 2003; Marques et al., 2015; Macas et al., 2015; Ribeiro et al., 2017). The 82

    role of LTR-retrotransposons in genome size variation led to investigations correlating these 83

    to heterochromatin distribution, ecological variables, community structure and plant 84

    distribution (Guignard et al., 2016; Van-Lume et al., 2017; Lyu et al., 2018; Souza et al., 2019). 85

    The activity of transposable elements in the host genome can also cause modifications to gene 86

    regulation and the formation of retrogenes, generating morphological innovations and 87

    impacting speciation processes (Schrader & Schmitz, 2019). Moreover, lineage-specific 88

    satellite DNAs have been widely used as efficient cytomolecular markers, allowing the 89

    identification of chromosome pairs and elucidating chromosome rearrangement and 90

    duplication events (Koo et al., 2011; Čížková et al., 2013; Ávila Robledillo et al., 2018; Ribeiro 91

    et al., 2020) . 92

    The fast evolution of repetitive DNA, with many satellite families being genus- or 93

    species-specific, impair their use in phylogenetic studies, as concerted evolution and 94

    homogenisation reduces sequence variability for comparative studies across taxa (Macas et al., 95

    2015; Mascagni et al., 2020; Ribeiro et al., 2020). Their abundances, however, have 96

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 4

    phylogenetic signal, as demonstrated by a method to reconstruct phylogenetic relationships 97

    using the abundance of different repetitive elements (Dodsworth et al., 2015). This method has 98

    proven useful to elucidate relationships in different groups of plants and animals (Dodsworth 99

    et al., 2017; Bolsheva et al., 2019; Martín-Peciña et al., 2019). Other methods have assessed 100

    the usefulness of repeat-based phylogenetic analysis. More recently, it was demonstrated that 101

    sequence similarity measures of repeated sequences can also be used as characters to resolve 102

    phylogenetic relationships (Vitales et al., 2019). In a similar approach, assembly and alignment 103

    free (AAF) methods can be applied to high complexity fractions of the genome, such as 104

    repetitive DNA, in a phylogenomic framework (Fan et al., 2015; Sarmashghi et al., 2019). 105

    In order to identify and characterize the diversity of repeats in a genome, the 106

    RepeatExplorer pipeline was created, using a graph-based clustering algorithm to group high-107

    copy sequences based on similarity, with posterior identification of protein-coding domains 108

    (Novak et al., 2013). RepeatExplorer can identify repetitive elements with approximately 0.1× 109

    genome coverage, most commonly known as genome skimming (GS), which is often sufficient 110

    to study high-copy nuclear and organellar DNA (Straub et al., 2012; Dodsworth, 2015; 111

    Dodsworth et al., 2019). The GS method is just one of several “reduced representation” 112

    methods of high-throughput sequencing. Throughout the last decade, a number of these 113

    sequencing methods have been proposed, generating high quality sequencing data at decreasing 114

    costs, such as restriction site-associated sequencing (RAD-seq, Eaton et al. 2016) and 115

    transcriptome sequencing (RNA-seq, Wang et al. 2017). 116

    Another widely used reduced representation method is target capture sequencing 117

    (TCS), in which several genomic probes are designed to “capture” and enrich specific low-118

    copy coding regions of the nuclear genome (Albert et al., 2007; Gnirke et al., 2009). These 119

    probes can be designed based on conserved regions retrieved from the alignment of several 120

    genomes of a divergent set of organisms (e.g. Anchored Hybrid Enrichment, Lemmon et al., 121

    2012) or by comparing transcriptomic data to search for a conserved set of orthologs across a 122

    group (Johnson et al., 2019). Since the development of TCS (Albert et al., 2007), many sets of 123

    probes have been developed and applied in both plants and animals (Cosart et al., 2011; 124

    Faircloth et al., 2012; Mandel et al., 2014; Ilves & López-Fernández, 2014; Sass et al., 2016; 125

    Heyduk et al., 2016; Schmickl et al., 2016). Moreover, the nature of these probe design 126

    approaches have allowed the development of universal probe sets, with the potential to be used 127

    across all of the angiosperms (e.g. Buddenhagen et al. 2016; Johnson et al. 2019). In addition 128

    to the advantage of being universal, high recovery rate of target regions (or the number of 129

    enriched targeted loci in the final library) is often achievable with very low “enrichment 130

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 5

    efficiency” (percentage of library reads successfully mapped to a target sequence) (Johnson et 131

    al., 2019), meaning that a TCS enriched library will often contain a high number of “off-target” 132

    reads. The use of these off-target reads, in combination with the enriched reads, has been 133

    referred to as “Hyb-Seq” (hybrid sequencing, Weitemier et al., 2014). As the off-target reads 134

    are often rich in high-copy DNA, Hyb-Seq approaches have been used to reconstruct whole 135

    plastomes and ribosomal DNA (Weitemier et al., 2014; Schmickl et al., 2016). 136

    To check whether off-target reads could also be used for identifying, and potentially 137

    quantifying, repetitive DNA, we selected the sedge Rhynchospora Vahl as a model. 138

    Rhynchospora is one of the largest genera of Cyperaceae Juss., comprising approximately 350 139

    species, but it is under-studied from a phylogenetic point of view (Thomas et al., 2009; 140

    Buddenhagen et al., 2016). Cytologically, Rhynchospora has been of great interest due to its 141

    holocentric chromosomes, which present the centromere dispersed along the sister chromatids 142

    rather than localized in a primary constriction (Bureš et al., 2013). Moreover, cytogenomic 143

    studies on R. pubera (Vahl) Boeckeler have led to the discovery of the first centromere-specific 144

    satellite DNA reported in a holocentric organism, Tyba (Marques et al., 2015). Subsequent 145

    studies confirmed the presence of Tyba in other Rhynchospora species (Ribeiro et al., 2017). 146

    Nevertheless, other non-centromeric satellites found in Rhynchospora species showed the 147

    typical block-like pattern on localized chromosomal regions (Ribeiro et al., 2017). 148

    Large-scale repeat analysis covering all major clades of Rhynchospora would provide 149

    valuable insights into the repeat evolution of this genus. Recently, using a set of probes 150

    developed by Anchored Hybrid Enrichment (Buddenhagen et al., 2016), a number of 151

    Rhynchospora species were sequenced using a TCS approach. Here, we assessed the quality of 152

    repeat characterization in five Rhynchospora species from this dataset. As a considerable part 153

    of the off-target reads can be sequences close to the original targeted loci (Dodsworth et al., 154

    2019), we searched for sequencing bias by comparing the target results with results obtained 155

    from genome skimming (i.e. unenriched libraries). We specifically addressed three questions: 156

    1) Can we characterize the repetitive DNA fraction of Rhynchospora genomes using TCS data, 157

    compared to GS data?; 2) Can we develop cytological markers from TCS clustering data?; and 158

    3) Can we use TCS data to reconstruct repeat-based phylogenetic trees? 159

    160

    MATERIALS AND METHODS 161

    162

    Plant material and sequence data 163

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 6

    Individuals of Rhynchospora cephalotes (L.) Vahl and R. exaltata Kunth were collected 164

    near the towns of Jacaraú (Paraíba, Brazil, voucher UFP87625) and Jaqueira (Pernambuco, 165

    Brazil, voucher JPB51537), respectively. These individuals were cultivated in (i) the 166

    experimental garden of the Laboratory of Plant Cytogenetic and Evolution at the Universidade 167

    Federal de Pernambuco (Brazil), (ii) the greenhouse of the Max Plank Institute for Plant 168

    Breeding Research (Cologne, Germany) and (iii) the greenhouse of the Leibniz Institute of 169

    Plant Genetics and Crop Plant Research (IPK Gatersleben, Germany). 170

    We downloaded available short read archive data of R. pubera (Vahl) Boeckeler (Marques 171

    et al., 2015, BioProject PRJEB9643) from the NCBI GenBank (www.ncbi.nlm.nih.gov). 172

    Genome skimming sequences of R. globosa (Kunth) Roem. & Schult. and R. tenuis Link were 173

    obtained from Ribeiro et al. (2017) and deposited on GenBank under BioProject 174

    PRJNA672922. TCS reads (150 bp) for R. cephalotes, R. exaltata, R. globosa, R. pubera and 175

    R. tenuis were obtained from Buddenhagen (2016) and deposited on GenBank under BioProject 176

    PRJNA672127. 177

    178

    DNA extraction and sequencing 179

    Genomic DNA from R. cephalotes and R. exaltata was extracted from leaves with 180

    NucleoBond HMW DNA kit (Macherey and Nagel, Düren, Germany). Quality was assessed 181

    with Agilent TapeStation and the gDNA was quantified by Qubit BR assay (Thermo). An 182

    Illumina-compatible library was then prepared from 400 ng input gDNA with an NEBNext 183

    Ultra™ II FS DNA Library Prep Kit for Illumina (New England Biolabs) with a total of four 184

    PCR cycles to introduce dual indexed barcodes. Libraries were sequenced in the Max Planck 185

    Genome Centre Cologne on a HiSeq2500 system with 2× 250 bp rapid mode using a HiSeq 186

    Rapid PE Cluster and Rapid SBS v2 kit. The new sequence data was deposited on GenBank 187

    under BioProject PRJNA672693. 188

    189

    Genome size measurements 190

    Novel DNA content measurements for R. cephalotes and R. exaltata were estimated by 191

    flow cytometry. Sample preparation was done according to Loureiro et al. (2007). Young 192

    leaves of each of the studied plants were chopped simultaneously with their respective 193

    reference standard, Solanum lycopersicum cv. Stupicke (2C = 1.96 pg, Dolezel et al., 1992) for 194

    R. cephalotes and Raphanus sativus L. cv. Saxa (2C = 1.11 pg, Dolezel et al., 1992) for R. 195

    exaltata in a Petri dish (kept on ice) containing 2 mL of Woody Plant Buffer (WPB). The 196

    sample was then filtered through a 30-µm disposable mesh filter (CellTrics, SYSMEX, 197

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    http://www.ncbi.nlm.nih.gov/https://doi.org/10.1101/2020.12.10.419515

  • 7

    Norderstedt, Germany) with following addition of 50 μg/mL propidium iodide (from a stock 198

    of 1 mg/mL; Sigma-Aldrich) and 50 µg/mL RNase (Sigma-Aldrich). Nine replicates per 199

    species were made. The samples were measured in a CyFlow Space flow cytometer 200

    (SYSMEX) equipped with a green laser (532 nm). Histograms of relative fluorescence were 201

    obtained using the software Flomax v.2.3.0. (SYSMEX, Norderstedt, Germany). Mean 202

    fluorescence and coefficient of variation were assessed at half of the fluorescence peak. The 203

    absolute DNA content (pg/2C) was calculated multiplying the ratio of the G1 peaks by the genome 204

    size of the internal standard. 205

    206

    Filtering of TCS reads 207

    As our aim was to characterize the repeat fraction of the sequenced species, we were 208

    only interested in the off-target reads, whose abundance is inversely proportional to the 209

    efficiency of target sequence enrichment. To get rid of the target reads, we filtered the raw TCS 210

    data by mapping them to a set of 256 sequences representing consensus sequences of the target 211

    loci that were enriched in the Rhynchospora dataset (Buddenhagen, 2016), saving the 212

    unmapped reads for the repeat characterization. Two different mapping algorithms were used 213

    for comparison: the Geneious read mapper v6.0.3, with custom sensitivity settings (60% 214

    Maximum mismatch per read, Index world length = 12, Maximum ambiguity = 8, Kearse et 215

    al., 2012) and the BowTie2 v2.4.1 mapper with high sensitivity settings (0 to 800 insert size, 216

    report all matches, Langmead & Salzberg, 2012), both implemented in the software Geneious 217

    v.7.1.9 (Kearse et al., 2012). Additionally, we uploaded a FASTA file with the 256 target 218

    sequences to RepeatExplorer (https://repeatexplorer-elixir.cerit-sc.cz/) prior to the analysis as 219

    a Custom Repeat Database (Novak et al., 2013). With this option we could exclude clusters of 220

    enriched gene sequences mistakenly identified as repeats from the analysis. In summary, we 221

    ended with one GS dataset (comprising of previously published datasets for R. globosa, R. 222

    pubera and R. tenuis and the newly sequenced R. cephalotes and R. exaltata data) and four 223

    different “target datasets” for each species: 1) Raw target capture reads; 2) Reads left after 224

    mapping with Geneious read mapper; 3) Reads left after mapping with BowTie Mapper and 225

    4) Reads left after exclusion of “enriched gene clusters” annotated according to a custom 226

    Repeat Database (RE). 227

    228

    In silico Repeat Analysis 229

    In order to compare the repeat composition observed in all different datasets, we 230

    employed the RepeatExplorer pipeline (Novak et al., 2013). Reads from all datasets of R. 231

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://repeatexplorer-elixir.cerit-sc.cz/https://doi.org/10.1101/2020.12.10.419515

  • 8

    cephalotes, R. exaltata, R. globosa, R. pubera and R. tenuis were uploaded to the platform, 232

    filtered by quality with default settings (95% of bases equal to or above the quality cut-off 233

    value of 10) and interlaced. Clustering was performed with default settings of 90% similarity 234

    over a 55% minimum sequence overlap. The Find RT Domains tool and additional database 235

    searches (Genbank) were used to identify protein domains for repeat annotation, and graph 236

    layouts of individual clusters were examined interactively using the SeqGrapheR tool (Novak 237

    et al., 2013). 238

    Although the entire set of reads from each dataset was uploaded to RepeatExplorer, we 239

    used the Read Sampling option on the clustering analysis to manually input the number of reads 240

    to be analysed, accounting for 0.13× of the genome of each species (Table 1). Only for R. 241

    globosa was this not possible, as no information on genome size was available for this species. 242

    In this case, we ran tests with 500,000, 1,000,000 and 2,000,000 reads. However, independently 243

    of the number of reads used as input, only 200,061 reads were analysed, always with similar 244

    results. Clusters with at least 0.01% genome abundance were automatically annotated and 245

    manually checked. The genome proportion of high copy repeats was estimated based on the 246

    number of reads of the annotated clusters. 247

    248

    Table 1 – Genome size (2C), number of reads from GS (genome skimming) and raw TCS datasets, percentage of 249

    mapped reads on Bowtie Mapper and Geneious Read Mapper datasets, percentage of reads identified on “target 250

    clusters” of RepeatExplorer (RE) dataset and number of reads analysed for all datasets on each species. 251

    Species 1C

    (Mbp)

    Number of

    GS read

    pairs

    Number of

    raw TCS

    reads d

    % of filtered TCS reads* Read pairs

    analysed

    BowTie Geneious RE

    Rhynchospora cephalotes

    (L.) Vahl 356.97 11,737,291 6,932,932 11.77 23.12 16.80 600,000

    Rhynchospora exaltata

    Kunth 244.5 21,384,261 5,151,874 13.02 20.65 18.45 422,199

    Rhynchospora globosa

    (Kunth) Roem. & Schult. - 4,000,000

    c 3,713,422 4.10 14.95 8.82 200,061

    Rhynchospora pubera

    (Vahl) Boeckeler 1,613.7

    a 4,000,000

    a 4,164,982 3.96 12.29 8.66 2,797,919

    Rhynchospora tenuis

    Link 381.42

    b 4,000,000

    c 4,826,084 10.18 20.64 6.36 661,128

    aMarques et al., 2015;

    bRibeiro et al,. 2018;

    cRibeiro et al., 2017;

    dBuddenhagen 2016 252

    *% of target reads idenfied by the different strategies 253

    254

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 9

    We used the TAREAN tool (Novák et al., 2017) available in the RepeatExplorer 255

    pipeline to annotate satellite DNAs. Satellites were named based on previous publications or 256

    by using the species abbreviation followed by SAT, a number based on the decreasing order of 257

    abundance and a hyphen followed by the number of basepairs of the monomer. The consensus 258

    monomer sequences of the identified satellite DNAs of each species were compared using 259

    DOTTER (Sonnhammer & Durbin, 1995). 260

    261

    Testing Method Performance 262

    To assess if the abundances of the different repeats observed in our raw and filtered 263

    target capture datasets were comparable, we compared their abundances with the ones observed 264

    in the GS datasets. For this, we used the abundance values of the individual repeat families 265

    (Table S1). As the data did not follow a normal distribution, we used Spearman’s rank 266

    correlation. The analysis was undertaken with the package stats implemented in the software 267

    R v4.0.2 (R Core Team, 2019). Correlation plots were constructed with the R package ggplot2 268

    (Wickham, 2016). 269

    270

    Repeat amplification, probe labelling and in situ hybridization 271

    Primers for the R. cephalotes Tyba variant found in the TCS datasets (see results) were 272

    designed based on the most conserved region of the consensus sequence (F: 5’-273

    AAGCTATTTGAATGCAATTATGTGC; and R: 5’AGCGTTTCTAGCCACATTTGA). 274

    Genomic DNA (40 ng) of R. cephalotes was used for PCR reaction with 1× PCR buffer, 2 mM 275

    MgCl2, 0.1 mM of each dNTP, 0.4 μM of each primer, 0.025 U Taq polymerase (Qiagen) and 276

    water. The PCR conditions were as follow: 94°C 2 min, 30 cycles of 94°C 50 s, 58°C 50 s and 277

    72°C 1 min and 72°C 10 min. PCR products were labelled with Atto488-dUTP (Jena 278

    Bioscience) with a nick translation labelling kit (Jena Bioscience). 279

    Mitotic chromosomes of R. cephalotes were prepared from root tips, pre-treated in 2 280

    mM 8-hydroxyquinoline at 10°C for 20 h and fixed in ethanol: acetic acid (3:1 v/v) for 2 h at 281

    room temperature and stored at −20°C. Fixed root tips were digested with 2% cellulose, 2% 282

    pectinase and 2% pectolyase in citrate buffer (0.01 M sodium citrate and 0.01 M citric acid) 283

    for 120 min at 37°C and squashed in a drop of 45% acetic acid. Fluorescent in situ hybridization 284

    was performed as described by Aliyeva-Schnorr et al., (2015). The hybridization mixture 285

    contained 50% (v/v) formamide, 10% (w/v) dextran sulfate, 2×SSC, and 5 ng/μl of the probe. 286

    Slides were denatured at 75°C for 5 min, and the final stringency of hybridization was 76%. 287

    288

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 10

    Immuno-FISH 289

    To visualize the centromeres of R. cephalotes, we performed immunostaining of the 290

    centromere-specific histone variant CENH3 with polyclonal antibodies developed for R. 291

    pubera (RpCENH3, Marques et al., 2015). Mitotic preparations were made from root 292

    meristems fixed in 4% paraformaldehyde in Tris buffer (10 mM Tris, 10 mM EDTA, 100 mM 293

    NaCl, 0.1% Triton, pH 7.5) for 5 minutes on ice under vacuum and for another 25 minutes only 294

    on ice. After washing twice in Tris buffer, the roots were chopped in LB01 lysis buffer (15 mM 295

    Tris, 2 mM Na2EDTA, 0.5 mM spermine 4HCl, 80 mM KCl, 20 mM NaCl, 15 mM β-296

    mercaptoethanol, 0.1% Triton X-100, pH 7.5), filtered through a 50 μm filter (CellTrics, 297

    Sysmex), diluted 1:10, and subsequently, 100 μl of the diluted suspension were centrifuged 298

    onto microscopic slides using a Cytospin3 (Shandon, Germany) as described by Jasencakova 299

    et al. (2001). Immuno-FISH with anti-RpCENH3 antibodies and the Tyba repeat was 300

    performed according to Ishii et al. (2015). We used rabbit anti-RpCENH3 (diluted 1:200) as 301

    primary antibody and detected it with Cy3-conjugated anti-rabbit IgG (Dianova) secondary 302

    antibody (diluted 1:200). Slides were incubated overnight at 4°C and washed three times in 303

    1×PBS before the secondary antibody were applied. 304

    305

    Microscopy 306

    For widefield microscopy, we used an epifluorescence microscope BX61 (Olympus) 307

    equipped with a cooled CCD camera (Orca ER, Hamamatsu). To achieve super-resolution of 308

    ~120 nm (with a 488 nm laser excitation), we applied spatial structured illumination 309

    microscopy (3D-SIM) using a 63×/1.40 Oil Plan-Apochromat objective of an Elyra PS.1 310

    microscope system and the software ZENBlack from Carl Zeiss GmbH (Weisshart et al., 311

    2016). 312

    313

    Comparative repeat phylogenomics 314

    We employed a repeat abundance-based phylogenetic inference method (see details in 315

    Dodsworth et al., 2015) to assess if repeat abundance identified in our TCS reads could be used 316

    to resolve phylogenetic relationships, using one of our filtered datasets (BowTie) and the GS 317

    dataset for comparison. First, we concatenated reads for our five species with 0.065× coverage, 318

    with species-specific codes for each set of reads, and ran a comparative clustering analysis 319

    (simultaneous clustering of all species on the dataset) on RepeatExplorer with default settings 320

    (Novak et al., 2013). As the sequences were coded with the species names, we could identify 321

    the number of reads that each species contributed to each of the generated clusters, which is 322

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 11

    proportional to genome. Parsimony analysis using repeat abundances as quantitative characters 323

    was undertaken as described by Dodsworth et al. (2015). 324

    To access the phylogenetic potential of repetitive elements based on sequence 325

    similarity, we used the alignment and assembly free (AAF) approach (Fan et al., 2015) using 326

    all reads identified as repeats by RepeatExplorer in the Bowtie dataset. AAF constructs 327

    phylogenies directly from unassembled genome sequence data, bypassing both genome 328

    assembly and alignment. Thus, it calculates the statistical properties of the pairwise distances 329

    between genomes, allowing it to optimize parameter selection and to perform bootstrapping. 330

    In order to compare our repeat abundance-based phylogeny with a nuclear marker-331

    based phylogeny, we extracted the aligned sequences of 256 loci gathered by target capture 332

    (Buddenhagen, 2016) of our five Rhynchospora species. For simplification, we used the most 333

    general model of DNA substitution GTR + I + G (Abadi et al., 2019). Phylogenetic 334

    relationships were inferred using Bayesian Inference (BI) as implemented on BEAST v.1.8.3 335

    (Drummond & Rambaut, 2007). Two independent runs with four Markov Chain Monte Carlo 336

    (MCMC) were conducted, sampling every 1,000 generations for 10,000,000 generations. Each 337

    run was evaluated in TRACER v.1.7 (Rambaut et al., 2018) to assess MCMC convergence and 338

    a burn-in of 25% was applied. We then obtained the consensus phylogeny and clade posterior 339

    probabilities with the “sumt” command. 340

    341

    RESULTS 342

    343

    Efficiency of target sequences filtering 344

    We generated GS data and genome size estimates for Rhynchospora cephalotes and R. 345

    exaltata to add to the already sequenced data of the other three Rhynchospora here analysed in 346

    order to compare repeat composition from TCS and GS data (Table 1). The Geneious datasets 347

    presented a higher number of filtered reads than the BowTie datasets (Table 1). R. pubera 348

    presented the smallest number of filtered reads among all five species, with both the BowTie 349

    and Geneious filters (3.96% and 12.29% respectively). R. cephalotes had the highest number 350

    of filtered reads among the Geneious datasets, while R. exaltata presented the largest 351

    proportion among the BowTie datasets. Using the Custom Repeat Database option of 352

    RepeatExplorer, the highest amount of “target clusters” was found on R. exaltata (18.45%) and 353

    the lowest amount was found on R. tenuis (6.36%). 354

    355

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 12

    Repetitive DNA content of different datasets 356

    To evaluate the quality of repeat characterization from TCS datasets, the genomic 357

    proportion of different repetitive element lineages was compared with those observed in the 358

    GS datasets (Table 2, Fig. 1). Generally, the proportion of the total repeat fraction observed in 359

    the GS datasets was smaller than the ones observed in all the TCS datasets (Table 2, Figure 360

    S1). The raw datasets presented the highest values for total repetitive fraction in almost all 361

    species, with the exception of R. globosa, in which the Geneious dataset showed a total of 362

    49.41% repeat proportion against 46.74% on the raw dataset. This is also reflected in the 363

    difference of number of clusters representing at least 0.01% of total genomic proportion formed 364

    in the clustering analysis. While the GS datasets presented cluster numbers varying from 155 365

    to 294, filtered TCS datasets ranged from 462 to 589 clusters and raw TCS datasets ranged 366

    from 628 to 751. These differences are mostly due to the discrepancy in the proportion of 367

    unclassified repetitive elements in the different datasets (Table 2, Fig. S1). Unclassified repeat 368

    proportion on GS varied from 6.21% in R. exaltata to 14.70% in R. globosa, while in TCS it 369

    accounted for to up to 43.63% (raw TCS of R. exaltata). Overall, proportion of unclassified 370

    repeats was smaller in all filtered datasets when compared to raw TCS (Fig. 2a). Additional 371

    mapping to the original 256 target regions and separated BLAST searches did not produce any 372

    matches for the unclassified clusters, suggesting that these could be either repetitive or non-373

    repetitive accidentally enriched genomic regions, too fragmented to be annotated. 374

    375

    Table 2 – Genomic abundance of repeat types for all datasets in the five analysed Rhynchospora species 376

    Species Dataset

    To

    tal

    Rep

    eat

    fra

    ctio

    n

    Un

    cla

    ssif

    ied

    Rep

    eats

    5S

    rD

    NA

    35

    S r

    DN

    A

    Sa

    tell

    ites

    Cla

    ss I

    /

    Un

    c. L

    TR

    s

    Cla

    ss I

    /

    LT

    R T

    y1

    /co

    pia

    Cla

    ss I

    /

    LT

    R T

    y3

    /gy

    psy

    Cla

    ss I

    /

    Pa

    rare

    trov

    iru

    s

    Cla

    ss I

    /LIN

    E

    Cla

    ss I

    I/

    Su

    bcl

    ass

    1

    Cla

    ss I

    I/

    Su

    bcl

    ass

    2

    R.

    cep

    ha

    lote

    s

    GS 14.81 7.39 0.06 0.40 3.76 0.66 0.69 0.95 0.29 0 0.58 0.03

    Raw 43.08 38.45 0.11 0.21 2.84 0.51 0.16 0.45 0.05 0.06 0.24 0

    Geneious 33.38 28.54 0.15 0.33 2.58 0.77 0.29 0.32 0.08 0 0.32 0

    BowTie 34.75 30.29 0.13 0.30 2.52 0.37 0.27 0.47 0.13 0.01 0.26 0

    RE 31.17 25.92 0.13 0.25 3.44 0.61 0.19 0.29 0.05 0 0.29 0

    R.

    glo

    bo

    sa

    GS 38.91 14.70 0.04 0.45 4.19 3.25 1.93 12.51 0.16 0.01 1.67 0

    Raw 46.74 27.05 0.11 0.76 1.63 0.84 2.68 12.12 0.17 0.12 1.26 0

    Geneious 49.42 25.20 0.10 0.79 1.86 1.08 2.96 15.50 0.20 0.13 1.60 0

    BowTie 45.59 23.84 0.12 0.69 1.74 0.39 2.89 14.25 0.18 0.11 1.38 0

    RE 42.13 20.52 0.12 0.84 1.79 0.93 2.94 13.29 0.19 0.13 1.38 0

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 13

    R.

    ten

    uis

    GS 17.99 8.23 0.02 0.38 2.61 0.37 5.64 0.47 0 0.03 0.24 0

    Raw 27.35 19.53 0.08 0.58 3.06 0.54 1.96 0.74 0.04 0.08 0.74 0

    Geneious 24.41 15.59 0.05 0.68 3.45 0.71 2.22 0.77 0.04 0.07 0.83 0

    BowTie 23.79 15.36 0.08 0.69 3.19 0.66 2.13 0.71 0.04 0.05 0.88 0

    RE 22.37 14.13 0.08 0.62 3.28 0.54 2.00 0.80 0.04 0.09 0.79 0

    R.

    exa

    lta

    ta

    GS 27.25 6.21 0.10 2.14 16.10 0 2.14 0.44 0 0.07 0.05 0

    Raw 49.87 43.63 0.10 1.10 0.91 0.57 1.43 1.72 0 0.13 0.28 0

    Geneious 44.28 33.53 0.16 1.31 3.69 0.67 2.01 2.13 0.02 0.11 0.65 0

    BowTie 43.67 33.96 0.11 1.04 3.39 0.60 2.24 1.63 0 0.09 0.61 0

    RE 41.58 31.53 0.12 1.20 3.48 0.65 1.94 1.99 0.01 0.15 0.51 0

    R.

    pu

    bera

    GS 26.67 9.87 0 0.61 3.87 2.12 6.78 0.85 0.46 0.22 1.89 0

    Raw 39.11 23.48 0 1.82 1.04 0.48 6.74 4.08 0.46 0.51 0.43 0.07

    Geneious 39.11 19.09 0 1.99 1.17 0.67 7.53 6.51 0.55 0.59 1.01 0

    BowTie 36.92 20.44 0 2.04 0.91 0.31 7.26 3.81 0.50 0.55 1.1 0

    RE 38.31 18.66 0 2.82 0.33 0.51 7.23 6.62 0.49 0.54 1.04 0.07

    377

    378

    379

    Figure 1 – Barplots representing genomic abundance of classified repeat types identified in every dataset of 380

    Rhynchospora cephalotes (a), R. exaltata (b), R. globosa (c), R. pubera (d) and R. tenuis (e). Bar colours represent 381

    different repeat types according to the caption at the lower right corner. 382

    383

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 14

    384

    Figure 2 – Comparison of genomic proportions of unclassified elements (a) and correlation index (r) with genome 385

    skimming data (b) between raw and filtered Target Capture datasets (RepeatExplorer, BowTie and Geneious) of 386

    all Rhynchospora species. 387

    388

    The filtering strategies did not largely impact satellite DNA abundance in the analysed 389

    species, with the exception of R. exaltata, in which it was possible to identify ~4× more satellite 390

    reads in the BowTie and Geneious datasets than in the raw datasets. Also in R. exaltata, there 391

    was a huge discrepancy between the satellite abundance observed in the GS dataset when 392

    compared to all TCS datasets (Fig. 1). However, the two satellite DNAs responsible for this 393

    abundance difference were also found in all TCS datasets (Table S2, Fig. S2), although in 394

    smaller proportions. The amount of satellite DNA was generally lower in the TCS datasets, 395

    with the exception of R. tenuis, in which all TCS datasets presented a small increase in satellite 396

    abundance when compared to GS (Table 2, Fig. 1). Although in some species some of the 397

    satellites found in GS data were not present in every dataset, the most abundant for each species 398

    could be identified in all of the TCS datasets. Similarly, some satellites found in TCS datasets 399

    were not found in the GS datasets, possibly being low-abundance satellites accidentally 400

    enriched (Table S2, Fig. S2). 401

    For mobile elements, there was a general agreement in the order of abundance of repeat 402

    types found in the GS and TCS datasets, with filtered datasets showing increase in the 403

    proportion of annotated elements when compared to the raw TCS dataset (Table 2, Fig. 1), with 404

    a few exceptions. For example, in R. exaltata, LTRs from the Ty3/gypsy superfamily were 405

    more abundant in the raw TCS, BowTie and RE datasets than in the GS datasets (Table 2, Fig. 406

    1). We also compared the abundances at lineage level (Table S1). Patterns of abundance of 407

    LTR families from GS and TCS datasets were similar, with most of the genomic abundance of 408

    Ty1/copia and Ty3/gypsy superfamilies being result of the amplification of up to four main 409

    lineages (Table S1). Generally, LTRs found in GS with abundance as low as 0.01% could also 410

    be identified in the target datasets. Surprisingly, in all five species, a greater diversity of LTR 411

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 15

    retroelements was observed in the target capture datasets when compared to GS (Table S1). 412

    This led to a few interesting discrepancies, such as in R. pubera, where target capture datasets 413

    showed high abundance of Ty3/gypsy/Retand (1 to 1.3%) and Ty3/gypsy/Tekay elements (0.46 414

    to 0.50%), despite those not being found in GS data. 415

    To statistically compare the results obtained in the different datasets, we checked for a 416

    correlation between the classified repeat abundances observed (at lineage level, when possible) 417

    in all target capture datasets with the ones observed in the GS datasets (Fig. 3). Although all 418

    tests showed significant correlations (p < 0.05), the strength of the correlation varied depending 419

    on the dataset. Raw TCS datasets had the weakest correlation in all five species, with filtering 420

    of the targeted sequences generally improving the correlation with the GS dataset (Fig. 2b). 421

    The strongest correlations for R. globosa (r = 0.92), R. pubera (r = 0.74) and R. tenuis (r = 422

    0.79) were observed with the Geneious dataset, while for R. cephalotes and R. exaltata the best 423

    correlation was observed on the BowTie dataset (r = 0.78 and 0.75). The RE filtering for R. 424

    globosa and R. tenuis did not improve the correlation when compared to the raw TCS dataset 425

    (r = 0.85 and r = 0.66 respectively). However, it also showed as strong a correlation as BowTie 426

    for R. cephalotes (r = 0.78) and as Geneious for R. pubera (r = 0.74). 427

    428

    Figure 3 – Correlation between repeat abundances observed on genome skimming and target capture datasets of 429

    Rhynchospora species. Genome skimming abundance values are represented on the x axis of each plot, while 430

    target dataset abundances are on the y axis. Spearman’s correlation index (r) for each case is plotted in the lower 431

    right corner. 432

    433

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 16

    Chromosomal localization of the satellite DNA found in the TCS dataset 434

    To test whether it was possible to use the TCS data to investigate the chromosomal 435

    repeat distribution by FISH, we chose the most abundant repetitive element found in the R. 436

    cephalotes TCS datasets. This repeat was a 172-bp satellite DNA with 60% sequence similarity 437

    to Tyba of R. pubera (Marques et al., 2015). The in situ hybridization pattern of the R. 438

    cephalotes Tyba (RcTyba) variant was similar to the distribution reported in R. pubera. Small 439

    foci appeared in interphase nuclei, and a continuous line along both condensed chromatids 440

    occurred at all pro- and metaphase chromosomes. Via immuno-FISH using a CENH3-specific 441

    antibody and RcTyba repeats, respectively, the holocentric centromere structure of R. 442

    cephalotes has been confirmed due to the co-localization of CENH3 and RcTyba (Fig. 4). 443

    444

    445

    Figure 4 – Co-localization of R. cephalotes Tyba repeats and CENH3 in interphase nuclei (a), prometaphase (b) 446

    and metaphase chromosomes (c) via 3D-SIM imaging. Both Tyba and CENH3 clearly indicate the presence of 447

    holocentromeres in the condensed mitotic chromosomes. 448

    449

    Repeat abundance and structure found in TCS data reflect phylogenetic 450

    relationships 451

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 17

    We used repeat abundances obtained by comparative clustering analysis of BowTie 452

    datasets of our five Rhynchospora species to reconstruct phylogenetic relationships. In the 453

    comparative clustering analysis of the GS dataset, 1,754,326 concatenated reads were analysed, 454

    forming 450 clusters with at least 0.01% genomic abundance. For the BowTie dataset, 455

    1,186,605 of concatenated reads were analysed, with 582 clusters representing at least 0.01% 456

    of total genomic abundance. Repeat composition varied among species, with the largest 457

    clusters of each species being almost or completely absent in the others (Fig. 5a). By using the 458

    first 150 most abundant clusters of the comparative analysis, we were able to reconstruct the 459

    phylogenetic relationships among the five Rhynchospora species for both the BowTie (Fig. 5b) 460

    and GS datasets (Fig. 5c) with high bootstrap support (mean BS = 100 and 98.3 respectively). 461

    Using the reads from all clusters identified in the BowTie dataset comparative analysis, the 462

    AAF analysis yielded the same relationships with high bootstrap support (mean BS = 100, Fig. 463

    4d). Branch lengths of the abundance-based analysis were significantly higher than for the AAF 464

    analysis. Despite this, the species relationships observed in the repeat-based phylogenies were 465

    congruent with the ones retrieved in the Bayesian analysis with 256 concatenated target loci, 466

    with R. cephalotes + R. exaltata forming a clade sister to R. pubera + R. tenuis, and R. globosa 467

    sister to both clades (Fig. 5e). 468

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 18

    469

    Fig. 5 – Phylogenomics of Rhynchospora species using genome skimming and target capture datasets a) Graphic 470

    representation of the 30 most abundant clusters originated from the comparative clustering analysis with the 471

    BowTie dataset. Height of rectangles represent the genomic abundance of each cluster, with colours indicating 472

    the repeat type according to the caption. Asterisks (*) indicate the clusters that represent Tyba, which is absent in 473

    Rhynchospora globosa. Below, phylogenetic trees obtained by repeat abundance of the GS (b) and BowTie (c) 474

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 19

    datasets, AAF of total reads in repetitive clusters (d) and Bayesian inference based on 256 target regions (e). Star 475

    in b represents the only node with support below 100 (BS = 99). 476

    477

    DISCUSSION 478

    479

    TCS data can be used to characterize the repeat composition of a genome 480

    We were able to find most of the repeat diversity of five Rhynchospora species using 481

    filtered and unfiltered TCS reads. Depending on the filtering strategy employed, the 482

    Rhynchospora data used here showed low abundance of on-target reads, indicating a 483

    considerable proportion of off-target reads suitable for repeat analysis. Target-based 484

    sequencing approaches often have varied enrichment efficiency, with on-target enriched reads 485

    representing as low as 5% of the final genomic library (Johnson et al., 2019). This is 486

    particularly the case with universal kits that are designed to work across large taxonomic 487

    groups, at the cost of enrichment efficiency. Thus, the off-target sequence reads from such 488

    libraries can be used in a similar fashion to genome skimming sequencing, an approach often 489

    described as Hyb-Seq (Weitemier et al., 2014; Dodsworth et al., 2019). 490

    Our results show that there is a moderate rank correlation between the abundances of 491

    annotated repeats obtained by analysing genome skimming and raw target capture datasets, and 492

    that this correlation increases when analysing off-target reads only. The significant correlation 493

    indexes showed that although annotated repeat proportions of our filtered target datasets are 494

    not identical to those observed in GS, repeat composition and order of abundance was highly 495

    similar. Although the Geneious dataset presented the highest proportion of filtered reads and 496

    highest correlation index with GS data for three of the five species, the BowTie and RE 497

    approaches were also sufficient for increasing the correlation index in most cases. Therefore, 498

    any of the filtering strategies presented here can produce sufficiently accurate results in regards 499

    to repeat identification and order of abundance when compared with the traditional GS 500

    approach. It is important to note that, although the patterns of repeat abundance are highly 501

    similar, the numerical difference observed for some repeat classes shows that combining GS 502

    and TCS data may produce inconsistent results, and exact quantification from these data should 503

    be taken with caution. 504

    Although the total abundance of satellite DNAs varied between GS and TCS datasets, 505

    we were able to identify the most abundant satellite DNA families of all five Rhynchospora 506

    species in all target datasets. While some low-abundance satellites (

  • 20

    abundance) found in GS data were not found in the target datasets, others with abundances as 508

    low as 0.09% were found in all datasets, suggesting that satellite abundance does not 509

    necessarily affect its presence in the off-target reads (Table S2). More importantly, various 510

    satellites found in the TCS datasets were not found in the GS dataset (Table S2), probably being 511

    low-abundant satellites accidentally enriched in the sequencing process. Low-abundant 512

    satellites may sometimes not be detected by RepeatExplorer, requiring additional filtering to 513

    be detected (Ruiz-Ruano et al., 2016), which could explain the absence of these satellites in 514

    our GS datasets. 515

    One of the benefits of using Rhynchospora as a model for this study was the fact that it 516

    possesses satellite DNAs with varying chromosomal distributions. Tyba clusters are dispersed, 517

    forming a linear distribution along the metaphase holocentromeres of R. pubera, R. ciliata, R. 518

    cephalotes and R. tenuis (Marques et al., 2015; Ribeiro et al., 2017; this study). However, other 519

    satellites in Rhynchospora, such as RgSAT1-186 in R. globosa form localized blocks (Ribeiro 520

    et al., 2017). Off-target sequences used in Hyb-Seq are often adjacent to the targeted gene 521

    regions (Weitemier et al., 2014; Dodsworth et al., 2019), rasing the possibility that our analysis 522

    would preferentially identify widespread repeats such as Tyba, which are interspersed with 523

    genic regions (Marques et al., 2015). However, RgSAT1-186, identified as sub-terminal 524

    clusters in the chromosomes of R. globosa (Ribeiro et al., 2017), was identified as the most 525

    abundant cluster on all the R. globosa TCS datasets, showing that the chromosomal distribution 526

    of a satellite DNA was not interfering in the randomness of the off-target reads (Fig. S1). 527

    As well as confirming its presence in the target datasets of R. pubera and R. tenuis, we 528

    were able to find novel Tyba variants in R. cephalotes and in R. exaltata, with 60% and 57% 529

    sequence similarity to RpTyba, respectively. R. globosa did not present any Tyba variant, in 530

    concordance with the results of Ribeiro et al. (2017). The abundance of Tyba in R. pubera was 531

    very low in the target datasets (~0.14%) compared to the GS results (~2.8% here, 3.6% in 532

    Marques et al. 2015). In our R. pubera TCS datasets, the most abundant tandem repeat was 533

    RpSAT5-287, which appeared as only the fifth most abundant in the GS dataset (Table S2). 534

    This satellite was not found in the previous R. pubera characterization and may indicate the 535

    potential to discover additional low-abundant sequences using TCS datasets. Although fast 536

    rates of evolution for satellite DNAs may lead to intraspecific abundance variation (Ceccarelli 537

    et al., 2011), this 20-fold difference is probably too high to be a product of differences between 538

    the individuals used for each sequencing method. Satellite abundances estimated by TAREAN 539

    depend on several factors, such as sequence coverage, monomer homogeneity and similarity 540

    with other repeats (Novák et al., 2017). As abundance information gathered by short reads can 541

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 21

    grossly underestimate the true abundance of this repeat type (i.e. Ribeiro et al., 2020), caution 542

    is needed when interpreting TCS-yielded genomic abundances, though similar caution is 543

    needed with GS data as well. 544

    Transposable elements, particularly LTR-retrotransposons, are often the largest fraction 545

    of repetitive DNA in plants (Galindo-González et al., 2017). In our GS datasets, 546

    LTR−Ty1/copia was the most abundant repeat type in three out of five species. The TCS 547

    datasets yielded similar results to GS, with a few discrepancies, especially in R. exaltata 548

    (predominance of Ty3/gypsy instead of Ty1/copia) and R. pubera (significantly higher 549

    proportion of Ty3/gypsy than in GS). In order to do a more detailed comparison, we used the 550

    individual lineage abundance values for our correlation analysis. The high correlation rates 551

    indicate that different LTR lineages, although varying in abundance, contributed similarly to 552

    the repetitive fraction in GS and filtered target datasets. Our results show that in addition to 553

    being able to find the majority of amplified lineages of LTR retrotransposons, we could also 554

    find lineages with abundances as low as 0.01% in the target datasets. We also find a higher 555

    diversity of LTR lineages in the target datasets than in the GS data, with a few of these “extra” 556

    lineages being over-abundant when compared to the GS dataset (Table S1). These lineages 557

    were absent in the GS results probably due to masking by the highly abundant satellite DNA 558

    clusters, which were underestimated in some of our TCS datasets. Nonetheless, the fact that 559

    we could identify even low abundance LTRs in the target datasets, coupled with the moderate 560

    correlation with the GS-yielded abundances, indicate that off-target reads from target 561

    sequencing may be sufficient to identify most of the LTR retrotransposons in a genome. 562

    563

    TCS data can be used to develop cytogenetic probes 564

    We tested whether we could use repeat information obtained from TCS data to develop probes 565

    for cytogenetic techniques such as fluorescent in situ hybridization (FISH). Highly abundant 566

    repetitive elements are frequently chosen for FISH experiments, as they are easier to visualize 567

    on chromosomes than low abundance repeats and can often be important components of 568

    predominantly heterochromatic and centromeric regions (Marques et al., 2015 Bilinski et al., 569

    2017; Ávila Robledillo et al., 2018). In our Rhynchospora species, the most abundant cluster 570

    in all datasets was a satellite DNA. For R. tenuis and R. globosa we found the same satellites 571

    found previously by Ribeiro et al. (2017) as the most abundant satellites (Tyba and RgSAT1-572

    186 respectively). 573

    Similar to previous results on other Rhynchospora (Marques et al., 2015; Ribeiro et al., 574

    2017), the Tyba variant from R. cephalotes presented a dispersed distribution in interphase 575

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 22

    nuclei and a line-like pattern along the sister chromatids of condensed chromosomes. Co-576

    localization with CENH3 further confirmed the holocentromere specific localization of Tyba. 577

    The conservation of the holocentromeric distribution of RcTyba on R. cephalotes strengthens 578

    its putative role in centromere function as proposed previously (Marques et al., 2015; Ribeiro 579

    et al., 2017). It also points to a remarkably old origin for this satellite and its holocentromeric 580

    association, sharing a common ancestor between 35-46 My (95% credible interval; 581

    Buddenhagen, 2016). A large-scale survey of Tyba in the entire Rhynchospora genus could 582

    help to further elucidate the evolution of this satellite and its association with the 583

    holocentromere. In light of our results, we believe that using off-target reads from target 584

    capture-based phylogenetic studies could be an additional way to find potentially informative 585

    cytological markers across a broader sampling and investigate their evolution in a broader 586

    context. 587

    588

    TCS data can be used to construct repeat-based phylogenies 589

    In order to test if repeat abundances observed in target datasets are accurate enough to 590

    reconstruct phylogenetic relationships, we applied the methodology described by Dodsworth 591

    et al. (2015) to our BowTie and GS datasets as well as using an assembly and alignment free 592

    method (Fan et al., 2015) using the total set of reads output by the comparative repeat analysis 593

    on the BowTie dataset. Dodsworth et al.’s approach takes into account the assumption that, as 594

    repeat abundance changes primarily through random genetic drift (Jurka et al., 2011), they can 595

    be used as selection-free characters for phylogenetic reconstruction. On the other hand, AAF 596

    methods have been shown to identify potentially useful markers for taxonomic resolution from 597

    genome skimming datasets (Bohmann et al., 2020). Both analyses were successful in 598

    reconstructing the major relationships between the five Rhynchospora species with maximal 599

    bootstrap support. Our results not only corroborate that AAF methods can be useful for 600

    repetitive sequence data, but also show that it can be employed on the off-target portion of TCS 601

    data. These sequence similarity-based approaches (e.g. Vitales et al. 2019) may be more 602

    appropriate for groups where no genome skimming data is available for verifying the accuracy 603

    of repeat abundance in the target dataset, when abundance-based approaches provide less 604

    resolved trees or when genome sizes are unknown. 605

    The phylogenetic relationships retrieved by the repeat abundance and AAF-based 606

    phylogeny were not only congruent with a Bayesian analysis of the 256 target regions, but also 607

    with recent studies in the genus based on other phylogenetic data (Buddenhagen et al., 2016; 608

    Ribeiro et al., 2018). Our results show the potential of repeats from off-target reads to be used 609

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 23

    as an additional phylogenetically informative dataset, from a completely different part of the 610

    genome typically used for phylogenetic studies. Target capture-based sequencing already 611

    offers the opportunity to construct robust phylogenies with hundreds of informative markers 612

    and also to assemble whole plastomes and other organellar DNAs via the off-target reads to 613

    infer phylogenetic relationships, and nuclear-organellar discordance (Dodsworth et al., 2019). 614

    Repeat-based phylogenies offer an additional strategy, based on the same TCS datasets, 615

    potentially uncovering nuclear intragenomic (in)congruence, while further increasing the 616

    usefulness of TCS datasets. The robustness of the repetitive DNA information obtained from 617

    our target datasets can prove useful in a variety of phylogenomic approaches, such as 618

    similarity-based repeat phylogenies (Vitales et al., 2019) as well as the alignment-free and 619

    abundance-based methods presented here. The ability to characterise genome composition and 620

    develop cytogenetic markers from TCS-derived repeat data add further value to these datasets, 621

    and insight into the genomic processed underpinning evolution in plants. 622

    623

    ACKNOWLEGMENTS 624

    This study was supported in part by the Coordenacão de Aperfeicoamento de Pessoal de Nıvel 625

    Superior–Brasil (CAPES, Finance Code 001), CAPES-PRINT project number 626

    88887.363884/2019-00 (LC), and CNPq (Conselho Nacional de Desenvolvimento Científico e 627

    Tecnologico; grant number 141037/2018-0 to LC). The authors are also grateful to Dr. 628

    Magdalena Vayo (Facultad de Agronomía, Uruguay) for providing comments and suggestions 629

    for the manuscript and to Msc. Erton Almeida for the collection of Rhynchospora cephalotes. 630

    631

    AUTHORS CONTRIBUTION 632

    LC and APH conceived the ideas and designed the study with important input from AM, GS, 633

    AH and SD; LC, AM, CB, BH, WT and VS collected data; LC, AM and GS performed data 634

    analysis; LC wrote the manuscript with comments from all the authors. 635

    636

    DATA AVAILABILITY 637

    The data that supports the findings of this study is openly available in the NCBI GenBank at 638

    [www.ncbi.nlm.nih.gov] under BioProjects PRJEB9643, PRJNA672922 and PRJNA672127. 639

    640

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    http://www.ncbi.nlm.nih.gov/https://doi.org/10.1101/2020.12.10.419515

  • 24

    REFERENCES 641

    Abadi S, Azouri D, Pupko T, Mayrose I. 2019. Model selection may not be a mandatory 642

    step for phylogeny reconstruction. Nature Communications 10: 934. 643

    Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, Richmond TA, 644

    Middle CM, Rodesch MJ, Packard CJ, et al. 2007. Direct selection of human genomic loci 645

    by microarray hybridization. Nature Methods 4: 903–905. 646

    Aliyeva-Schnorr L, Beier S, Karafiátová M, Schmutzer T, Scholz U, Doležel J, Stein N, 647

    Houben A. 2015. Cytogenetic mapping with centromeric bacterial artificial chromosomes 648

    contigs shows that this recombination-poor region comprises more than half of barley 649

    chromosome 3H. The Plant Journal 84: 385–394. 650

    Ávila Robledillo L, Koblížková A, Novák P, Böttinger K, Vrbová I, Neumann P, 651

    Schubert I, Macas J. 2018. Satellite DNA in Vicia faba is characterized by remarkable 652

    diversity in its sequence composition, association with centromeres, and replication timing. 653

    Scientific Reports 8. 654

    Bilinski P, Albert PS, Berg JJ, Birchler J, Grote M, Lorant A, Quezada J, Swarts K, 655

    Yang J, Ross-Ibarra J. 2017. Parallel Altitudinal Clines Reveal Adaptive Evolution Of 656

    Genome Size In Zea mays. 657

    Bohmann K, Mirarab S, Bafna V, Gilbert MTP. 2020. Beyond DNA barcoding: The 658

    unrealized potential of genome skim data in sample identification. Molecular Ecology 29: 659

    2521–2534. 660

    Bolsheva NL, Melnikova NV, Kirov IV, Dmitriev AA, Krasnov GS, Amosova АV, 661

    Samatadze TE, Yurkevich OYu, Zoshchuk SA, Kudryavtseva AV, et al. 2019. 662

    Characterization of repeated DNA sequences in genomes of blue-flowered flax. BMC 663

    Evolutionary Biology 19: 49. 664

    Buddenhagen CE. 2016. A view of Rhynchosporeae (Cyperaceae) diversification before and 665

    after the application of anchored phylogenomics across the angiosperms. 666

    Buddenhagen C, Lemmon AR, Lemmon EM, Bruhl J, Cappa J, Clement WL, 667

    Donoghue MJ, Edwards EJ, Hipp AL, Kortyna M, et al. 2016. Anchored Phylogenomics 668

    of Angiosperms I: Assessing the Robustness of Phylogenetic Estimates. Evolutionary Biology. 669

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 25

    Bureš P, Zedek F, Markova M. 2013. Holocentric Chromosomes. In: Plant Genome 670

    Diversity Volume 2. Vienna: Springer Vienna, 187–204. 671

    Ceccarelli M, Sarri V, Caceres ME, Cionini PG. 2011. Intraspecific genotypic diversity in 672

    plants (P Donini, Ed.). Genome 54: 701–709. 673

    Cheng Z, Dong F, Langdon T, Ouyang S, Buell CR, Gu M, Blattner FR, Jiang J. 2002. 674

    Functional Rice Centromeres Are Marked by a Satellite Repeat and a Centromere-Specific 675

    Retrotransposon. The Plant Cell 14: 1691–1704. 676

    Čížková J, Hřibová E, Humplíková L, Christelová P, Suchánková P, Doležel J. 2013. 677

    Molecular Analysis and Genomic Organization of Major DNA Satellites in Banana (Musa 678

    spp.) (K Kashkush, Ed.). PLoS ONE 8: e54808. 679

    Cosart T, Beja-Pereira A, Chen S, Ng SB, Shendure J, Luikart G. 2011. Exome-wide 680

    DNA capture and next generation sequencing in domestic and wild species. BMC Genomics 681

    12: 347. 682

    Dodsworth S. 2015. Genome skimming for next-generation biodiversity analysis. Trends in 683

    Plant Science 20: 525–527. 684

    Dodsworth S, Chase MW, Kelly LJ, Leitch IJ, Macas J, Novak P, Piednoel M, Weiss-685

    Schneeweiss H, Leitch AR. 2015. Genomic Repeat Abundances Contain Phylogenetic 686

    Signal. Systematic Biology 64: 112–126. 687

    Dodsworth S, Jang T-S, Struebig M, Chase MW, Weiss-Schneeweiss H, Leitch AR. 688

    2017. Genome-wide repeat dynamics reflect phylogenetic distance in closely related 689

    allotetraploid Nicotiana (Solanaceae). Plant Systematics and Evolution 303: 1013–1020. 690

    Dodsworth S, Pokorny L, Johnson MG, Kim JT, Maurin O, Wickett NJ, Forest F, 691

    Baker WJ. 2019. Hyb-Seq for Flowering Plant Systematics. Trends in Plant Science 24: 692

    887–891. 693

    Dolezel J, Sgorbati S, Lucretti S. 1992. Comparison of three DNA fluorochromes for flow 694

    cytometric estimation of nuclear DNA content in plants. Physiologia Plantarum 85: 625–631. 695

    Drummond AJ, Rambaut A. 2007. BEAST: Bayesian evolutionary analysis by sampling 696

    trees. BMC Evolutionary Biology 7: 214. 697

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 26

    Eaton DAR, Spriggs EL, Park B, Donoghue MJ. 2016. Misconceptions on Missing Data in 698

    RAD-seq Phylogenetics with a Deep-scale Example from Flowering Plants. Systematic 699

    Biology: syw092. 700

    Elliott TA, Gregory TR. 2015. What’s in a genome? The C-value enigma and the evolution 701

    of eukaryotic genome content. Philosophical Transactions of the Royal Society B: Biological 702

    Sciences 370: 20140331. 703

    Faircloth BC, McCormack JE, Crawford NG, Harvey MG, Brumfield RT, Glenn TC. 704

    2012. Ultraconserved Elements Anchor Thousands of Genetic Markers Spanning Multiple 705

    Evolutionary Timescales. Systematic Biology 61: 717–726. 706

    Fan H, Ives AR, Surget-Groba Y, Cannon CH. 2015. An assembly and alignment-free 707

    method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics 708

    16: 522. 709

    Galindo-González L, Mhiri C, Deyholos MK, Grandbastien M-A. 2017. LTR-710

    retrotransposons in plants: Engines of evolution. Gene 626: 14–25. 711

    Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, 712

    Giannoukos G, Fisher S, Russ C, et al. 2009. Solution hybrid selection with ultra-long 713

    oligonucleotides for massively parallel targeted sequencing. Nature Biotechnology 27: 182–714

    189. 715

    Guignard MS, Nichols RA, Knell RJ, Macdonald A, Romila C-A, Trimmer M, Leitch 716

    IJ, Leitch AR. 2016. Genome size and ploidy influence angiosperm species’ biomass under 717

    nitrogen and phosphorus limitation. New Phytologist 210: 1195–1206. 718

    Heyduk K, Trapnell DW, Barrett CF, Leebens-Mack J. 2016. Phylogenomic analyses of 719

    species relationships in the genus Sabal (Arecaceae) using targeted sequence capture. 720

    Biological Journal of the Linnean Society 117: 106–120. 721

    Houben A, Schubert I. 2003. DNA and proteins of plant centromeres. Current Opinion in 722

    Plant Biology 6: 554–560. 723

    Ilves KL, López-Fernández H. 2014. A targeted next-generation sequencing toolkit for 724

    exon-based cichlid phylogenomics. Molecular Ecology Resources 14: 802–811. 725

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 27

    Ishii T, Sunamura N, Matsumoto A, Eltayeb AE, Tsujimoto H. 2015. Preferential 726

    recruitment of the maternal centromere-specific histone H3 (CENH3) in oat (Avena sativa L.) 727

    × pearl millet (Pennisetum glaucum L.) hybrid embryos. Chromosome Research 23: 709–728

    718. 729

    Jasencakova Z, Meister A, Schubert I. 2001. Chromatin organization and its relation to 730

    replication and histone acetylation during the cell cycle in barley. Chromosoma 110: 83–92. 731

    Johnson MG, Pokorny L, Dodsworth S, Botigué LR, Cowan RS, Devault A, Eiserhardt 732

    WL, Epitawalage N, Forest F, Kim JT, et al. 2019. A Universal Probe Set for Targeted 733

    Sequencing of 353 Nuclear Genes from Any Flowering Plant Designed Using k-Medoids 734

    Clustering (S Renner, Ed.). Systematic Biology 68: 594–606. 735

    Jurka J, Bao W, Kojima KK. 2011. Families of transposable elements, population structure 736

    and the origin of species. Biology Direct 6: 44. 737

    Kearse M, Moir R, Wilson A, Stones-Havas S, Cheung M, Sturrock S, Buxton S, 738

    Cooper A, Markowitz S, Duran C, et al. 2012. Geneious Basic: An integrated and 739

    extendable desktop software platform for the organization and analysis of sequence data. 740

    Bioinformatics 28: 1647–1649. 741

    Koo D-H, Hong CP, Batley J, Chung YS, Edwards D, Bang J-W, Hur Y, Lim YP. 2011. 742

    Rapid divergence of repetitive DNAs in Brassica relatives. Genomics 97: 173–185. 743

    Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nature 744

    Methods 9: 357–359. 745

    Lemmon AR, Emme SA, Lemmon EM. 2012. Anchored Hybrid Enrichment for Massively 746

    High-Throughput Phylogenomics. Systematic Biology 61: 727–744. 747

    Loureiro J, Rodriguez E, Dolezel J, Santos C. 2007. Two new nuclear isolation buffers for 748

    plant DNA flow cytometry: a test with 37 species. Annals of Botany 100: 875–888. 749

    Lyu H, He Z, Wu C-I, Shi S. 2018. Convergent adaptive evolution in marginal 750

    environments: unloading transposable elements as a common strategy among mangrove 751

    genomes. New Phytologist 217: 428–438. 752

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 28

    Macas J, Novák P, Pellicer J, Čížková J, Koblížková A, Neumann P, Fuková I, Doležel 753

    J, Kelly LJ, Leitch IJ. 2015. In Depth Characterization of Repetitive DNA in 23 Plant 754

    Genomes Reveals Sources of Genome Size Variation in the Legume Tribe Fabeae (A 755

    Houben, Ed.). PLOS ONE 10: e0143424. 756

    Mandel JR, Dikow RB, Funk VA, Masalia RR, Staton SE, Kozik A, Michelmore RW, 757

    Rieseberg LH, Burke JM. 2014. A Target Enrichment Method for Gathering Phylogenetic 758

    Information from Hundreds of Loci: An Example from the Compositae. Applications in Plant 759

    Sciences 2: 1300085. 760

    Marques A, Ribeiro T, Neumann P, Macas J, Novák P, Schubert V, Pellino M, Fuchs J, 761

    Ma W, Kuhlmann M, et al. 2015. Holocentromeres in Rhynchospora are associated with 762

    genome-wide centromere-specific repeat arrays interspersed among euchromatin. 763

    Proceedings of the National Academy of Sciences 112: 13633–13638. 764

    Martín-Peciña M, Ruiz-Ruano FJ, Camacho JPM, Dodsworth S. 2019. Phylogenetic 765

    signal of genomic repeat abundances can be distorted by random homoplasy: a case study 766

    from hominid primates. Zoological Journal of the Linnean Society 185: 543–554. 767

    Mascagni F, Vangelisti A, Giordani T, Cavallini A, Natali L. 2020. A computational 768

    comparative study of the repetitive DNA in the genus Quercus L. Tree Genetics & Genomes 769

    16: 11. 770

    Nagaki K, Talbert PB, Zhong CX, Dawe RK, Henikoff S, Jiang J. 2003. Chromatin 771

    immunoprecipitation reveals that the 180-bp satellite repeat is the key functional DNA 772

    element of Arabidopsis thaliana centromeres. Genetics 163: 1221–1225. 773

    Neumann P, Novák P, Hoštáková N, Macas J. 2019. Systematic survey of plant LTR-774

    retrotransposons elucidates phylogenetic relationships of their polyprotein domains and 775

    provides a reference for element classification. Mobile DNA 10. 776

    Novák P, Ávila Robledillo L, Koblížková A, Vrbová I, Neumann P, Macas J. 2017. 777

    TAREAN: a computational tool for identification and characterization of satellite DNA from 778

    unassembled short reads. Nucleic Acids Research 45: e111–e111. 779

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 29

    Novak P, Neumann P, Pech J, Steinhaisl J, Macas J. 2013. RepeatExplorer: a Galaxy-780

    based web server for genome-wide characterization of eukaryotic repetitive elements from 781

    next-generation sequence reads. Bioinformatics 29: 792–793. 782

    R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, 783

    Austria: R Foundation for Statistical Computing. 784

    Rambaut A, Drummond AJ, Xie D, Baele G, Suchard MA. 2018. Posterior 785

    Summarization in Bayesian Phylogenetics Using Tracer 1.7 (E Susko, Ed.). Systematic 786

    Biology 67: 901–904. 787

    Ribeiro T, Buddenhagen CE, Thomas WW, Souza G, Pedrosa-Harand A. 2018. Are 788

    holocentrics doomed to change? Limited chromosome number variation in Rhynchospora 789

    Vahl (Cyperaceae). Protoplasma 255: 263–272. 790

    Ribeiro T, Marques A, Novák P, Schubert V, Vanzela ALL, Macas J, Houben A, 791

    Pedrosa-Harand A. 2017. Centromeric and non-centromeric satellite DNA organisation 792

    differs in holocentric Rhynchospora species. Chromosoma 126: 325–335. 793

    Ribeiro T, Vasconcelos E, dos Santos KGB, Vaio M, Brasileiro-Vidal AC, Pedrosa-794

    Harand A. 2020. Diversity of repetitive sequences within compact genomes of Phaseolus L. 795

    beans and allied genera Cajanus L. and Vigna Savi. Chromosome Research 28: 139–153. 796

    Ruiz-Ruano FJ, López-León MD, Cabrero J, Camacho JPM. 2016. High-throughput 797

    analysis of the satellitome illuminates satellite DNA evolution. Scientific Reports 6. 798

    Sarmashghi S, Bohmann K, P. Gilbert MT, Bafna V, Mirarab S. 2019. Skmer: assembly-799

    free and alignment-free sample identification using genome skims. Genome Biology 20: 34. 800

    Sass C, Iles WJD, Barrett CF, Smith SY, Specht CD. 2016. Revisiting the Zingiberales: 801

    using multiplexed exon capture to resolve ancient and recent phylogenetic splits in a 802

    charismatic plant lineage. PeerJ 4: e1584. 803

    Schmickl R, Liston A, Zeisek V, Oberlander K, Weitemier K, Straub SCK, Cronn RC, 804

    Dreyer LL, Suda J. 2016. Phylogenetic marker development for target enrichment from 805

    transcriptome and genome skim data: the pipeline and its application in southern African 806

    Oxalis (Oxalidaceae). Molecular Ecology Resources 16: 1124–1135. 807

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 30

    Schrader L, Schmitz J. 2019. The impact of transposable elements in adaptive evolution. 808

    Molecular Ecology 28: 1537–1549. 809

    Sonnhammer ELL, Durbin R. 1995. A dot-matrix program with dynamic threshold control 810

    suited for genomic DNA and protein sequence analysis. Gene 167: GC1–GC10. 811

    Souza G, Costa L, Guignard MS, Van-Lume B, Pellicer J, Gagnon E, Leitch IJ, Lewis 812

    GP. 2019. Do tropical plants have smaller genomes? Correlation between genome size and 813

    climatic variables in the Caesalpinia Group (Caesalpinioideae, Leguminosae). Perspectives in 814

    Plant Ecology, Evolution and Systematics 38: 13–23. 815

    Straub SCK, Parks M, Weitemier K, Fishbein M, Cronn RC, Liston A. 2012. Navigating 816

    the tip of the genomic iceberg: Next-generation sequencing for plant systematics. American 817

    Journal of Botany 99: 349–364. 818

    Thomas WmW, Araújo AC, Alves MV. 2009. A Preliminary Molecular Phylogeny of the 819

    Rhynchosporeae (Cyperaceae). The Botanical Review 75: 22–29. 820

    Van-Lume B, Esposito T, Diniz-Filho JAF, Gagnon E, Lewis GP, Souza G. 2017. 821

    Heterochromatic and cytomolecular diversification in the Caesalpinia group (Leguminosae): 822

    Relationships between phylogenetic and cytogeographical data. Perspectives in Plant 823

    Ecology, Evolution and Systematics 29: 51–63. 824

    Vitales D, Garcia S, Dodsworth S. 2019. Reconstructing Phylogenetic Relationships Based 825

    on Repeat Sequence Similarities. bioRxiv. 826

    Wang H-J, Li W-T, Liu Y-N, Yang F-S, Wang X-Q. 2017. Resolving interspecific 827

    relationships within evolutionarily young lineages using RNA-seq data: An example from 828

    Pedicularis section Cyathophora (Orobanchaceae). Molecular Phylogenetics and Evolution 829

    107: 345–355. 830

    Weisshart K, Fuchs J, Schubert V. 2016. Structured Illumination Microscopy (SIM) and 831

    Photoactivated Localization Microscopy (PALM) to Analyze the Abundance and Distribution 832

    of RNA Polymerase II Molecules on Flow-sorted Arabidopsis Nuclei. BIO-PROTOCOL 6. 833

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515

  • 31

    Weiss-Schneeweiss H, Leitch AR, McCann J, Macas J. 2015. Exploring the repeats’ 834

    landscape and its impact on genome evolution and plant diversification. In: Germany: Koeltz 835

    Scientific Books. 836

    Weitemier K, Straub SCK, Cronn RC, Fishbein M, Schmickl R, McDonnell A, Liston 837

    A. 2014. Hyb-Seq: Combining Target Enrichment and Genome Skimming for Plant 838

    Phylogenomics. Applications in Plant Sciences 2: 1400042. 839

    Wickham H. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New 840

    York. 841

    842

    Supporting Information 843

    Table S1 – Detailed annotation of repetitive elements at lineage level for all datasets on all 844

    Rhynchospora species. 845

    Table S2 – Name and genomic abundance of the satellites found in all datasets of the five 846

    Rhynchospora species. 847

    Figure S1 - Barplots of genomic abundance of unclassified repeats in every dataset of all 848

    analysed Rhynchospora species. 849

    Figure S2 – Dotplot comparison of satellites found in all datasets of all analysed 850

    Rhynchospora species 851

    preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. The copyright holder for thisthis version posted December 11, 2020. ; https://doi.org/10.1101/2020.12.10.419515doi: bioRxiv preprint

    https://doi.org/10.1101/2020.12.10.419515