accurate identification of adenosine deamination · 2016-08-04 · iii acknowledgments i would like...

Accurate Identification of Adenosine Deamination

by

Gavin Walter Wilson

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Molecular Genetics University of Toronto

© Copyright by Gavin Walter Wilson 2016


Accurate Identification of Adenosine Deamination with RNA-seq

Gavin Walter Wilson

Doctor of Philosophy

Molecular Genetics University of Toronto

2016

Abstract

The eukaryotic transcriptome is further diversified by post-transcriptional processing, including

alternative splicing and RNA editing. The latter includes the modification of adenosine to inosine

(A-to-I) within structured transcripts. Uniquely, inosine has similar base-pairing properties to

guanine, which can have downstream consequences to RNA secondary structure or RNA-RNA

interactions. While RNA editing events were typically characterized on a gene-by-gene basis,

advancements in high throughput RNA-sequencing technologies have allowed A-to-I editing to

be investigated on a global scale. However, accurate identification of A-to-I edits on a

transcriptome-wide scale is confounded by artifacts introduced by reverse transcription,

sequencing, and computational alignment, all of which can lead to false positive signals. RNA-

seq read alignment for the purposes of RNA editing calls can be affected by the accuracy of

spliced, gapped, multi-mapped, and mismatch alignments. To address these challenges, I

developed RNASequel, a software package that runs as a post-processing step in conjunction

with an RNA-seq aligner. I benchmarked the accuracy of RNASequel using a combination of

human-derived simulated and biological datasets and demonstrated a clear improvement in all

four of the aforementioned accuracy metrics compared to current RNA-seq alignment tools.

Next, I utilized RNASequel to identify clusters of A-to-I hyper-editing in 91 C.

elegans samples using a novel algorithm designed to mitigate common sources of false positive

calls that are difficult to address during read alignment. This resulted in the most comprehensive

map of RNA editing in C. elegans to date with 197,890 sites within 10,941 clusters. I then

explored the localization of the clusters to genetic features and heterochromatin. Collectively,

these data show the extensive editing events in the worms, while concurrently demonstrating the

utility of RNASequel.


Acknowledgments

I would like to express my sincere thanks to my supervisor Dr. Lincoln Stein; without his expertise in bioinformatics, the work presented in this thesis would not have been possible. I would also like to thank my committee members: Dr. Ben Blencowe, Dr. Michael Brudno and Dr. Quaid Morris. Their boundless insight, feedback and much-needed pressure to wrap up my projects were essential for the completion of my doctoral studies. I would like to thank my friend and colleague Ewan “RNA has G:U base-pairs” Gibb. The quality of my scientific writing would not be where it is today without his thoughtful and thorough advice and suggestions. My friendship with Ewan has resulted in many of the fondest memories from my graduate school experience in my masters and doctoral degrees. Many thanks to my colleagues and friends: Nardin Samuel, Faiyaz Notta, Marc Perry, Shirley Tam, and Quang Trinh. Their friendship and constant scientific dialog have been one of the highlights of my doctoral experience. Finally, I would like to thank Nick Provart and Fritz Roth for their helpful comments that have improved the quality of my thesis.


Every day is a new day. It is better to be lucky. But I would rather be exact. Then when luck comes you are ready.

Ernest Hemingway – The Old Man and the Sea


Table of Contents

Acknowledgments
Table of Contents
List of Tables
List of Figures
Abbreviations
1 Background
1.1 The Dynamic Eukaryotic Transcriptome
1.2 A-to-I Editing in Caenorhabditis elegans
1.3 Nucleic Acid Sequencing
1.4 High Throughput Sequencing
1.5 Illumina Sequencing Artifacts
1.6 RNA Sequencing
1.7 High-Throughput RNA Sequencing
1.7.1 RNA-seq Library Preparation
1.8 RNA-seq Library Preparation Challenges
1.9 Sequence Alignment Algorithms
1.10 RNA-seq Read Alignment
1.10.1 Segmentation Approaches
1.10.2 Seed and Extend Approaches
1.11 Current Challenges Mapping RNA-seq Pairs
1.11.1 Identifying RNA editing with RNA-seq
1.11.2 Identifying RNA edits without sequencing
1.12 Thesis Objectives
2 Accurate RNA-seq Realignment with RNASequel
2.1 Results:
2.1.1 Developing an Accurate RNA-seq Realignment Tool
2.1.2 RNASequel realignment leads to improved alignment accuracy
2.1.3 Realignment to a splice junction database improves spliced read accuracy
2.1.4 RNASequel realignment improves alignments with insertions and deletions
2.1.5 RNASequel realignment increases mismatch tolerance and accuracy
2.1.6 RNASequel execution speed and memory requirements
2.1.7 RNASequel realignment improves alignment characteristics on biological datasets
2.1.8 RNASequel realignment generates more robust RNA editing calls
2.2 Discussion:
2.3 Methods:
2.3.1 Reference genome and annotations
2.3.2 Biological Datasets:
2.3.3 Alignment Protocols:
2.3.4 RNASequel Realignment
2.3.5 Splice Junction Definitions and Alignment Scoring
2.3.6 Splice Junction Discovery and Splice Junction Index Generation
2.3.7 Contiguous and Spliced Read Alignment
2.3.8 Estimating the Empirical Fragment Size Distribution
2.3.9 Resolving Read Pair Alignments
2.3.10 Simulated Dataset Benchmarking
2.3.11 Identifying Putative Adenosine to Inosine RNA editing events
3 Identifying RNA Hyper-editing in C. elegans
3.1 Background
3.2 Results
3.2.1 Improvements to the RNASequel Aligner
3.2.2 RNA-seq sample processing and alignment
3.2.3 Accurate and sensitive identification of hyper-editing
3.2.4 Comparison with other studies
3.2.5 Clusters are enriched for non-coding elements
3.2.6 Clustered A-to-I edit replication and properties
3.2.7 A Global map of A-to-I editing
3.2.8 Intronic edits are depleted near splice-sites
3.2.9 Intergenic edits and antisense transcripts
3.2.10 3’-UTR clusters and poly(A) Sites
3.2.11 Identifying putative A-to-I dependent amino acid changes
3.3 Discussion
3.4 Methods
3.4.1 C. elegans gene annotations and reference sequences
3.4.2 Samples
3.4.3 RNA-seq preprocessing and alignment
3.4.4 Whole Genome Alignment and Variant Calling
3.4.5 Identifying potential A-to-I editing events
3.4.6 Annotating edits and clusters
3.4.7 Chromosomal maps
3.4.8 Detection recurrent A-to-I editing events within splice sites, polyadenylation signals, and coding regions
3.5 Appendix
4 Discussion
References


List of Tables

Table 3.1 Mapping Rates.

Table 3.2 A-to-I clustered edit recurrence rates.

Table 3.3 Recurrent A-to-I edits that overlap splice sites

Table 3.4 Recurrent A-to-I edits that overlap annotated poly(A) signals

Table 3.5.1 C. elegans samples processed in this study

Table 3.5.2 Clustered A-to-I edits (Supplementary File: Wilson_Gavin_W_201606_PhD_worm_edits.txt)

Table 3.5.3 Extended 3’-UTRs (Supplementary File: Wilson_Gavin_W_201606_PhD_utr_extensions.xlsx)


List of Figures

Figure 1.1 Timeline of sequencing technology developments.

Figure 1.2 RNA-seq Library Preparation

Figure 1.3 RNA-seq Alignment Overview.

Figure 1.4 RNA-seq Alignment Strategies.

Figure 2.1 RNA-sequel realignment schematic.

Figure 2.2 Simulated dataset alignment rates.

Figure 2.3 Example simulated dataset alignment

Figure 2.4 Example simulated dataset alignment

Figure 2.5 Spliced read alignment rates for the simulated datasets

Figure 2.6 The number of correct splice junctions identified in each read stratified by the total number of true splice junctions for the simulated datasets.

Figure 2.7 Alignment characteristics for the first simulated dataset

Figure 2.8 Alignment characteristics for the second simulated dataset

Figure 2.9 Positional biases of false positives, false negatives, and true positives for splice junctions and gaps for the first simulated dataset.

Figure 2.10 Positional biases of false positives, false negatives, and true positives for splice junctions and gaps for the second simulated dataset

Figure 2.11 Positional biases of false positive, false negative and true positive mismatches for both simulated datasets.

Figure 2.12 Alignment rates for biological datasets with matched somatic SNP calls.

Figure 2.13 Alignment rates for 25 additional biological datasets.

Figure 2.14 Application of the RNASequel fragment size estimation and verification algorithm to the alignments produced by Tophat and STAR.

Figure 2.15 YH SNV and edit call comparisons.

Figure 2.16 GM12878-1 SNV and edit call comparisons.

Figure 2.17 GM12878-2 SNV and edit call comparisons.


Figure 2.18 Comparing SNV and edit call rates for STAR Ann. with two passes and STAR Ann. + RNASequel for the 25 other biological datasets.

Figure 2.19 Comparing the differences in edit calls after removing likely false positive alignments for 25 other biological datasets.

Figure 2.20 Identifying alignment issues that cause false positive variant calls for YH, GM12878-1 and GM12878-2.

Figure 2.21 Identifying alignment issues that cause false positive variant calls for the other 25 biological datasets.

Figure 3.1 Alignment rates for the C. elegans samples processed in this study

Figure 3.2 Summary variant identification and filtering steps.

Figure 3.3 Comparison of variant call rates for clustered and singleton edits.

Figure 3.4 Number of clustered A-to-I and non-A-to-I versus the number of uniquely mapped reads.

Figure 3.5 A-to-I edit call comparison with other studies.

Figure 3.6 A-to-I editing association with genetic elements.

Figure 3.7 A-to-I hyper-edit recurrence stratified by the overlapping genetic element and recurrence rate.

Figure 3.8 Properties of clusters by the dominant base and repeat type of the edits contained in the cluster.

Figure 3.9 Global A-to-I cluster localization.

Figure 3.10 Chromosomal distribution of clustered edits.

Figure 3.11 Global Pearson correlations between chromatin marks, A-to-I edits, and genetic features.

Figure 3.12 A-to-I editing events may have been missed within introns.

Figure 3.13 Saturation analysis of A-to-I hyper-edits.

Figure 3.14 Properties of edit clusters and repeat elements within introns.

Figure 3.15 Properties of intergenic edits antisense to annotated genetic elements.

Figure 3.16 Localization of 3’-UTR edits with respect to poly(A) sites.


Abbreviations

ssRNA single-stranded RNA

dsRNA double-stranded RNA

RT Reverse Transcriptase

PCR Polymerase Chain Reaction

bp Base Pair

nt Nucleotide

cDNA Complementary DNA

rRNA Ribosomal RNA

RNA-seq High-throughput whole transcriptome sequencing

A, C, G, T, U, I Adenosine, Cytosine, Guanine, Thymine, Uracil, Inosine

dNTP deoxyribonucleoside tri-phosphate

rNTP ribonucleoside tri-phosphate

ddNTP di-deoxyribonucleoside tri-phosphate

PAGE polyacrylamide gel electrophoresis

SW, NW Smith-Waterman, Needleman-Wunsch

BWT Burrows Wheeler Transform

SNP, SNV Single Nucleotide Polymorphism, Single Nucleotide Variant

RBP, dsRBP RNA binding protein, double-stranded RNA binding protein

lncRNA Long non-coding RNA



1 Background

Nucleic acids are the blueprints of biological life on earth; they encode the regulatory and coding

potential of an organism’s genome. Since the discovery of the structures of DNA and RNA

molecules there has been an intense scientific effort to develop technologies to sequence DNA

and RNA molecules (1). Sequencing technologies have evolved from being low-throughput and

labor intensive to high-throughput and automated (Figure 1.1) (2-4). This has been an

incremental process where previous technological advances are integrated or improved to create

new sequencing technologies. These innovations have progressed the capabilities of sequencing

from a single transfer-RNA to individual genes and finally to whole viral, bacterial and

eukaryote genomes and transcriptomes. Within the last decade the increase in throughput and

decrease in cost has been staggering (5-7). With the rise of high-throughput sequencing

technologies such as Illumina, the cost of sequencing a human genome is below ten thousand

dollars. This has permitted population scale genome sequencing consisting of thousands of

genomes. Sequencing technologies are quickly reaching a point where it is possible to sequence

a whole human genome for less than one thousand dollars. This is in stark contrast to the first

draft human genome sequence published in 2001 which was estimated to cost nearly three billion

dollars (8, 9). This has led to a new analysis bottleneck: storing, processing, mapping,

and analyzing the tremendous amount of sequencing data. The cost of the analysis is quickly

approaching the cost of the sequencing (7). Novel algorithms, tools and optimizations have been and

continue to be required to meet the aforementioned challenges.

While historically individual RNA species were sequenced, current technologies have facilitated

sequencing RNA on a global scale (10). Whole transcriptome sequencing has been crucial to our

understanding of the cellular transcriptome by increasing sensitivity compared to microarrays,

serial analysis of gene expression, and expressed sequence tag sequencing (10, 11). This has led

to the identification of novel or infrequent regulatory events and transcripts such as RNA editing

and long non-coding RNA (12-14). However, this sensitivity has revealed a new challenge:

separating biologically relevant transcriptional events from spurious events (biological noise).


Figure 1.1. Timeline of sequencing technology developments. (A) Major milestones in the development of sequencing technologies are indicated along the timeline. (B) High-throughput sequencer read lengths; machines are listed in the legend. For machines that produce paired-end reads, the length of a single read is indicated. (C) Number of reads produced. (D) Throughput in gigabases (number of reads × read length / 10^9); for machines that produce paired-end reads, the read length was doubled. Note that a logarithmic y-axis is used for panels (C) and (D). Data retrieved from Nederbragt, Lex (2012): developments in NGS. Figshare: http://dx.doi.org/10.6084/m9.figshare.100940. Note that only a single data point is shown for the Sanger 3730xl and the Megabase 4500.
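The throughput calculation used for panel (D) of Figure 1.1 can be written out as a short worked example. This is only a sketch: the function name `throughput_gigabases` and the run size in the usage line are hypothetical and are not taken from the figure's data.

```python
def throughput_gigabases(num_reads, read_length, paired_end=False):
    """Throughput in gigabases: number of reads x read length / 10^9.

    For paired-end machines the read length is doubled, following the
    convention stated in the Figure 1.1 caption.
    """
    effective_length = read_length * 2 if paired_end else read_length
    return num_reads * effective_length / 1e9


# Hypothetical run: 200 million read pairs with 100 nt per read.
print(throughput_gigabases(200_000_000, 100, paired_end=True))  # 40.0
```

Doubling the read length for paired-end runs counts both mates of each fragment, which is why a paired-end run reports twice the throughput of a single-end run with the same read count.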


1.1 The Dynamic Eukaryotic Transcriptome

One of the fundamental biological processes within a cell is the transcription of gene products.

This process is tightly regulated and catalyzed by the RNA polymerase family of proteins.

Transcriptional regulation is carried out at multiple levels, including chromatin state and

transcription factor binding (15). Perturbations in the regulation of a gene or genes can lead to

developmental disorders and cancer (15). After transcription there is another set of regulatory

layers, the so-called post-transcriptional regulatory mechanisms (16, 17). Some of these

mechanisms occur co-transcriptionally while the nascent RNA is being transcribed (17-21),

while others occur post-transcriptionally and include: 5’-end capping, 3’-polyadenylation, RNA

splicing, RNA editing, RNA interference, and nonsense mediated decay (16, 17).

Transcripts were originally thought to have their function encoded within a single contiguous

sequence. However, this assumption was challenged by the observation that eukaryotic nuclear

transcripts were much longer than the corresponding cytoplasmic transcripts (22-24). This led to

the identification of intron sequences that “split” the functional nucleotides of a transcript. The

introns are excised from the pre-mRNA transcript by the spliceosome complex co- and post-

transcriptionally (19). Two of the key sequence motifs are present at the 5’- and 3’-ends of the

intron sequence; these are called the 5’- and 3’-splice sites, respectively (25-27). Combinations

of different splice sites can be used to generate different transcript isoforms in a process known

as alternative splicing (25, 28). Another important sequence feature of introns is the branch point

site that is used to form an intron lariat by ligating the 5’-end of the intron to a conserved

adenosine within the branch point site (26, 27). Finally, conserved sequence motifs within the

exonic and intronic sequences are bound by trans-acting splicing regulatory factors that can

promote or repress the inclusion of an exon in the final transcript (28). Perturbations in any of the

conserved motifs or the expression of splicing regulatory factors can lead to developmental

defects or promote disease development such as cancer (29, 30).

RNA editing is a post-transcriptional modification of RNA molecules, in which specific

nucleotides are deaminated (31, 32). Two types of nucleotide deamination have been identified:

adenosine to inosine (A-to-I) and cytidine to uridine (C-to-U) deamination (33, 34). Inosine

preferentially base pairs with cytidine and has less energetically favorable base pairs with uracil


and adenosine (35). Uridine preferentially base pairs with adenosine and has a less energetically

favorable base-pair with guanosine (35). A-to-I and C-to-U editing can have a number of

downstream effects on the modified RNA molecule since the deaminated base has altered base

pairing. The effects of editing include: (i) consequences to base pairing and stability, (ii) changes

in translated amino acid sequences due to changes in amino acid codons within the RNA

transcript, and (iii) effects on alternative splicing (34, 36-38). C-to-U deamination is an

integral part of host viral defense against dsRNA substrates (33).

A-to-I editing is the dominant form of editing in metazoans, and it is catalyzed by adenosine

deaminases that act on RNA (ADARs), which target double stranded RNA (dsRNA) (31, 32, 34,

38). ADAR proteins are only found in metazoan species (39). ADAR proteins do not exhibit

sequence specificity, but do show flanking sequence preferences (40). ADARs have two

primary modes of RNA editing: selective editing, which targets dsRNAs with specific structures

to promote the editing of specific bases, and promiscuous editing, where long stem-loop structures

are edited randomly (34). This editing can occur co-transcriptionally, suggesting a potential

regulatory role for post-transcriptional regulatory pathways such as alternative splicing and

localization due to A-to-I edits altering the secondary structure and sequence motifs within the

targeted transcript (19, 21, 38). ADARs play a critical role in development and the central

nervous system (CNS) (34, 38). ADAR knockouts in mice result in a lethal phenotype, while

knockouts in Drosophila melanogaster and Caenorhabditis elegans remain viable (41-44).

Selective A-to-I editing is critical for a glutamine to arginine substitution in the glutamate

receptor-2 (GluR2) protein, which alters Ca2+ permeability through the AMPA receptor (45,

46). The depletion of the glutamine to arginine substitution in humans contributes to the

development of sporadic amyotrophic lateral sclerosis (ALS) and the lethal phenotype in mice

(42, 47, 48).
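The glutamine to arginine recoding described above can be illustrated with a small sketch. It assumes that inosine is decoded as guanosine, consistent with its guanine-like base pairing, and that the edited glutamine codon is CAG, as commonly described for the GluR2 Q/R site. The function name `recode` and the way the codon table is built are illustrative choices, not part of the thesis.

```python
from itertools import product

# Standard genetic code, built from the conventional U, C, A, G ordering.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def recode(codon, edited_position):
    """Return (original, edited) amino acids for an A-to-I edit.

    edited_position is 0-based within the codon. The edited inosine is
    treated as guanosine when the codon is decoded (an assumption).
    """
    if codon[edited_position] != "A":
        raise ValueError("A-to-I editing only applies to adenosine")
    edited = codon[:edited_position] + "G" + codon[edited_position + 1:]
    return CODON_TABLE[codon], CODON_TABLE[edited]


# CAG (glutamine) edited at the middle base is decoded as CGG (arginine).
print(recode("CAG", 1))  # ('Q', 'R')
```

The same function shows that not every edit is recoding: an edit at the third position of CAA (glutamine) yields CAG, which is still glutamine.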

The extent of A-to-I editing and its role in cellular function have not been fully explored.

However, the majority of RNA editing is associated with transposable elements capable of

folding into dsRNA structures such as Alu elements in humans (49, 50). Since ADAR proteins

can relax dsRNA structures they could affect the binding of any dsRNA specific protein (51, 52).

Furthermore, inosine could affect the binding of proteins recognizing conserved motifs that are

edited by ADAR proteins. One of the best examples of this is the competition between ADAR

proteins and the RNAi pathway for dsRNA substrates that was observed in C. elegans (53, 54).


Another example is the auto-regulation of ADAR2 in rats by altering its splicing patterns through

the modification of an AA dinucleotide splice site to AI, which acts as a canonical splice acceptor

site (55). There are other examples of A-to-I editing affecting RNA stability, RNA translation,

RNA localization, miRNA, antiviral protection, and heterochromatic gene silencing (17, 34, 38,

56-61).

1.2 A-to-I editing in Caenorhabditis elegans

C. elegans has two ADAR genes (adr-1 and adr-2) with only adr-2 being catalytically active. D.

melanogaster encodes a single ADAR protein. The C. elegans adr-1 gene has a truncation of its

catalytic domain and plays a regulatory role for RNA editing (44, 62). Unlike mammals, ADAR

knockouts in C. elegans remain viable but have a reduced lifespan, chemotaxis defects and

transgene silencing. In Drosophila, knockout of the sole ADAR gene leads to viable flies with

normal lifespans but they exhibit strong behavioral defects (63). Finally, in murine models both

ADAR proteins are required for viability; ADAR1 is required for erythropoiesis and ADAR2 is

required for proper function of the AMPA receptor (42, 64).

ADAR proteins appear to act co-transcriptionally implying that editing on the pre-mRNA

molecule can compete with other dsRNA-binding proteins. In support of this, ADAR proteins

have been found to alter small RNA expression (53, 54). Small RNAs are non-coding RNA

molecules with a size of 21-26 bp. There are multiple classes of small RNAs in C. elegans and

these include: microRNAs, endogenous small interfering RNAs, and piwi-interacting RNAs

(65). These small RNA molecules participate in multiple regulatory pathways including

translation and gene expression (65). Knockouts of RNAi pathway components lead to the

suppression of ADAR knockout phenotypic defects (66). Furthermore, adr-1, adr-2 double

knockouts have perturbed small-RNA expression suggesting that both ADAR proteins can

compete with RNAi for dsRNA substrates (53, 54). This competition may be an early

mechanism to distinguish exogenous and endogenous sources of dsRNA. In this model,

endogenous dsRNAs undergo RNA editing, which inhibits their processing by the RNAi

machinery. This does appear to be the case in higher-order eukaryotes, where knockouts of

dsRNA sensing pathways suppress ADAR1 knockout phenotypes in mice (67). ADAR

proteins also commonly target intronic dsRNA structures (53, 68-70). RNA editing or ADAR

binding may contribute to the regulation of circular RNA biogenesis or alternative splicing (71).


A-to-I editing in C. elegans clusters into regions of hyper-editing to produce transcripts with 30

or more edits (44, 53, 69, 70). The clusters of hyper-editing are typically found within non-

coding DNA elements including introns, 3’-UTRs, and intergenic sequences (53, 68, 69). These

regions are associated with sources of dsRNA including inverted repeats and transposons (34,

53, 72). Finally, hyper-edited regions tend to be localized to the arms of the autosomal

chromosomes (68).
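The notion of a hyper-editing cluster can be made concrete with a simple distance-based grouping of candidate edit positions. This sketch is not the clustering algorithm developed in this thesis: the function name `cluster_sites` and the thresholds `max_gap` and `min_sites` are hypothetical values chosen only for illustration.

```python
def cluster_sites(positions, max_gap=50, min_sites=3):
    """Group candidate edit positions into clusters.

    Neighbouring sites are placed in the same cluster when they are at
    most max_gap bases apart; clusters with fewer than min_sites sites
    are discarded. Both thresholds are illustrative assumptions.
    Returns a list of (start, end, number_of_sites) tuples.
    """
    clusters = []
    current = []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the current cluster.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    # Flush the final cluster, if it is large enough.
    if len(current) >= min_sites:
        clusters.append((current[0], current[-1], len(current)))
    return clusters


sites = [100, 120, 135, 160, 900, 2000, 2010, 2030]
print(cluster_sites(sites))  # [(100, 160, 4), (2000, 2030, 3)]
```

The isolated site at position 900 is dropped, which mirrors the distinction between clustered and singleton edits: a lone mismatch is more likely to be a sequencing or alignment artifact than one of many nearby A-to-G mismatches.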

1.3 Nucleic Acid Sequencing

Nucleic acid sequencing has been an important area of research since the discovery of the

structure of DNA and RNA in 1962 (1, 73). The seminal methods developed during the early

days of nucleic acid sequencing have been used extensively in the current generation of

sequencers. The primary goals in developing nucleic acid sequencing have been to reduce costs, decrease

sequencing time, and to increase sequencing read throughput and read lengths.

The first gene to be completely sequenced was the 77 nt yeast alanine transfer-RNA (tRNA-Ala) in 1965 by Holley

et al. (73). They utilized complete and partial digestions with a set of ribonucleases with known

specificity combined with ion-exchange chromatography to resolve the fragment sizes. This was

improved by using two-dimensional polyacrylamide gel electrophoresis of degradation products

(2-4 nts) labeled with the radioactive isotope 32P (74). This method facilitated the sequencing of

the first RNA gene, the ~460 nts RNA bacteriophage MS2 coat protein, and later the entire 3,569

nt MS2 RNA genome (75, 76). These mechanical sequencing methods were laborious and did

not scale well to larger genes and genomes.

The fundamental innovation that led to DNA sequencing was primer extension, which used

sequencing by extension and combined three important observations (77): 1)

oligodeoxynucleotide sequences (primers) can be annealed to template DNA to prime synthesis

with DNA polymerase (78, 79); 2) radiolabelled dNTPs can be used in the primer extension

reaction to extend a specific primer (78); 3) primer extension can be terminated by not using all

four of the dNTPs (78). The next major development took advantage of primer extension to

develop the “plus and minus” DNA sequencing method that used sequencing by extension to

sequence the 5,386 nt bacteriophage ΦX174 genome (first DNA genome sequenced) (80, 81).

In parallel to Sanger and Coulson, Maxam and Gilbert developed a simplified chemical

sequencing method capable of sequencing dsDNA or ssDNA in 1977 (82). The method consisted

of treatment of DNA 32P radiolabelled at one of its 5’-ends. The radiolabelled fragments were

treated with one of four different reagents that cleaved DNA at A+G nucleotides, G nucleotides,

C nucleotides, or C+T nucleotides. The fragments were then resolved in a PAGE gel to sequence

more than 100-200 bp per reaction. This method was easier to use compared to the “plus and

minus” method since it required only four reactions rather than eight and it could use both

dsDNA and ssDNA as template. Similar chemical sequencing methods were developed and

applied to sequencing 3’-radiolabelled RNA (83).

The Sanger sequencing method was developed in 1977 and represents one of the most important

innovations in nucleic acid sequencing (84). Sanger sequencing uses sequencing by synthesis

and chain-terminating dideoxynucleotides to sequence each of the standard dNTPs

independently. Each reaction consisted of the standard dNTPs, ssDNA template, a 5’-32P labelled

primer, and a lower concentration of one of the four ddNTPs. The ddNTPs lack a 3’-hydroxyl

preventing the formation of phosphodiester bonds between nucleotides, causing DNA

polymerase to terminate extension at the ddNTP. The four reactions were originally resolved on

a PAGE slab gel; issues resolving ssDNA sequences with secondary structure were

later rectified by using thin denaturing polyacrylamide sequencing gels (85). The next

adaptation came by replacing the need for radiolabelled primers by labelling each of the four

standard ddNTPs with a different coloured fluorescent dye (86, 87). Dye-terminator based DNA

sequencing consisted of a mixture of the standard dNTPs and all four of the dye-labelled

ddNTPs. The sample could then be resolved in a single sequencing lane rather than four. The

sequences could be read with optics and automated software algorithms that greatly sped up the

sequencing process and increased sample throughput. These methods were further improved with

recombinant DNA polymerases and improved dye technologies. One downside of Sanger

sequencing was that it required an ssDNA template, which was resolved by the development of

ssDNA bacteriophage based cloning vectors (88). As the quality of sequencing reagents

increased the limiting factor became the sequencing gel which had a maximum read length of up

to ~700 bp. Furthermore, increasing the number of samples in a sequencing gel (up to 96) caused

difficulty with sample loading, lane tracking and the potential for overlapping bands between

samples (89).
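
The chain-termination principle described above can be illustrated with a minimal Python sketch. It assumes an idealized reaction in which every possible terminated fragment is produced, and it uses a short hypothetical sequence; the function names `termination_fragments` and `read_gel` are illustrative and do not come from any sequencing software. Each ddNTP reaction yields fragments ending at every position of its base, and the sequence is read back by ordering the fragments from shortest to longest, as on a gel.

```python
def termination_fragments(synthesized: str) -> dict[str, list[int]]:
    """Map each ddNTP base to the fragment lengths it can terminate.

    `synthesized` is the strand made by primer extension (5'->3'). A fragment
    of length i arises when the matching ddNTP is incorporated at position i.
    """
    lanes = {base: [] for base in "ACGT"}
    for i, base in enumerate(synthesized, start=1):
        lanes[base].append(i)
    return lanes


def read_gel(lanes: dict[str, list[int]]) -> str:
    """Read the sequence by ordering all bands from shortest to longest."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = termination_fragments("GATTACA")  # hypothetical example sequence
print(lanes)
print(read_gel(lanes))
```

With dye-terminator chemistry the four lanes collapse into one, but the ordering of fragments by length is the same.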

There was a need to separate each sample into its own miniature gel, which was accomplished by

using polyacrylamide-filled capillaries, which eventually increased the read length to 1,000 bp

(90-93). The next major increase in throughput was the ability to replace the gel within a

capillary by using non-crosslinked polyacrylamide to sequence another sample and to arrange the

capillaries into arrays with 96 to 384 capillaries per sequencing machine (94, 95). The

sequencing reaction steps and loading of the capillaries could be automated permitting increased

throughput. Automated sequencing processes could read the sequence and produce base

qualities for each sequenced nucleotide to indicate confidence in the base. The discovery of

thermostable polymerases and the development of PCR permitted the chain-termination reaction

to be subject to repeated cycles of denaturation, primer annealing, and extension (96-98). Cycle

sequencing reduced the amount of primer wasted during sequencing, increased the yield of the

terminated sequencing products and eliminated the requirement for ssDNA templates.

1.4 High Throughput Sequencing

The huge cost and effort involved with sequencing the draft human genome continued the push

for more sequencing throughput and reduced sequencing costs. This led to the development of

three novel cyclic sequencing methods that took advantage of innovative optical, microfluidic

and biochemical sequencing technologies: 1) 454 pyrosequencing (454); 2) Solexa / Illumina

sequencing by synthesis; 3) SOLiD sequencing by ligation. These methods heralded a revolution

in DNA sequencing and were capable of producing hundreds of megabases to terabases of data

in a single experiment. Common to all of these methods is the need to fragment a DNA library

into pieces that are sized according to the sequence read length and process. The complete

sequencing experiment can be completed in a week or two, costs significantly less than Sanger

sequencing, and produces base quality scores similar to automated Sanger sequencing methods.

The 454 sequencer was the first commercially available high-throughput sequencing system. The

system involves the fragmentation of a DNA library, adapter ligation and then the binding of

individual fragments to microbeads (99, 100). The ssDNA substrates are clonally amplified on

each bead using emulsion PCR. The beads are mixed with sequencing primer and DNA

polymerase to prepare them for nucleotide extension. The beads are deposited into picoliter-sized

wells on a fabricated flow cell. Each flow cell contains 1.6 million wells that have enough space

for a single bead. The beads in each well are sequenced by pyrosequencing using ATP

sulfurylase, luciferase, and apyrase (100). The ATP sulfurylase catalyzes the conversion of

pyrophosphate to ATP in the presence of adenosine 5’-phosphosulfate, and luciferase uses this ATP

to oxidize luciferin, producing visible light. Each of the four standard dNTPs is then washed in turn across

the slide and the light is captured using a specialized camera. Unincorporated dNTPs and leftover

ATP are degraded by the apyrase. The dNTP wash can be repeated up to one hundred times

producing 1,000,000 reads up to 500 bp in length. A typical sequencing experiment can produce

up to 700 megabases.
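
The repeated dNTP washes can be sketched as an idealized "flowgram". The following Python example is a simplified model, not the 454 base-calling software: it assumes a noise-free signal that is exactly proportional to the number of bases incorporated in a flow, and the flow order `TACG`, the function names and the template are illustrative assumptions.

```python
def flowgram(template: str, flow_order: str = "TACG", cycles: int = 4) -> list[tuple[str, int]]:
    """Simulate idealized pyrosequencing signals.

    `template` is written as the strand being synthesized (5'->3'). Each flow
    washes one dNTP; the signal is the number of bases incorporated, so a
    homopolymer run gives a proportionally stronger flash of light.
    """
    pos = 0
    signals = []
    for _ in range(cycles):
        for base in flow_order:
            run = 0
            while pos < len(template) and template[pos] == base:
                run += 1
                pos += 1
            signals.append((base, run))
    return signals


def decode(signals: list[tuple[str, int]]) -> str:
    """Rebuild the sequence from (base, signal) pairs."""
    return "".join(base * n for base, n in signals)


signals = flowgram("TTAGGC")  # hypothetical template
print(signals[:4])
print(decode(signals))
```

In real data the signal for long homopolymers is not perfectly linear, so the run length has to be estimated rather than read off directly.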

Illumina sequencing utilizes sequencing by synthesis with reversible termination to sequence

ssDNA fragments (101). There are two modes of Illumina sequencing: single-end and paired-end

sequencing. For single-end sequencing only one end of the template fragment is sequenced,

while for paired-end both ends of the template are sequenced in two steps. Paired-end

sequencing is the most common method of Illumina sequencing for genome and RNA

sequencing, while single-end sequencing is common for methods with small fragment templates

such as small-RNA-seq and ChIP-seq (10, 102).

Illumina libraries are prepared by ligating sequencing adapters to randomly generated and size

selected DNA fragments (101). The adapter-ligated fragments are amplified using adapter

specific PCR primers and a low number of cycles. The amplified fragments are size-selected

using beads or agarose gel and the fragment sizes are selected to minimize read overlap when

sequencing both ends of the fragment. Instead of clonally amplifying the DNA on beads using

emulsion PCR, the fragments are hybridized onto a lane within a flow cell (101, 103, 104). Each

lane of the flow cell is coated with oligonucleotides that are complementary to the sequencing

adapter. The library fragments are hybridized to the lane taking into consideration the density of

ssDNA molecules that can be sequenced within the lane. The hybridized ssDNA molecules are

amplified using bridge amplification to create clusters of ~1000 molecules (101, 103, 105). The

clustering step is essential for the visualization of the sequencing reaction. The clusters are then

linearized to ssDNA, a sequencing primer is annealed and recombinant DNA polymerase is

added to the flow cell. Illumina sequencing uses reversibly terminated dNTPs that are 3’-

labelled with a fluorescent dye (101, 106). Each of the four dNTPs is labelled with a different

colour and the flow cell is flooded with all four of the dNTPs at once. DNA polymerase

incorporates the dNTPs and the flow cell is imaged to detect the colour emitted by each cluster.

The reversible terminator and fluorescent label are then cleaved off to free the 3’-OH group for the

next round of extension. A key innovation with Illumina sequencing is that clusters can be

regenerated, which permits the sequencing of the opposite end of the DNA fragment to produce

paired-end reads.

Currently, the Illumina HiSeq 4000 can sequence two flow cells each consisting of eight lanes.

Each lane produces ~300 million 2 x 150 bp paired-end reads in 3.5 days. The MiSeq is a

miniature version of Illumina’s HiSeq that can produce 25 million 2x300 bp paired-end reads in

a single lane flow cell. The cost of sequencing reagents and simplicity of the sequencing reaction

has made Illumina sequencing the method of choice compared to 454 and SOLiD sequencing.

1.5 Illumina Sequencing Artifacts

The huge throughput offered by Illumina sequencing is not without issues. The most common

issues with Illumina sequencers are errors that lead to false positive mismatches within the

sequenced read and GC content biases (4, 107-109). False positive mismatches occur due to two

sequencing issues, the first being cross-talk and the second phasing. Cross-talk errors occur due

to spectral overlap between the fluorescently labelled nucleotides. These cross-talk issues

manifest as A-to-C and G-to-T mismatches due to their similar emission spectra. Phasing occurs

due to the loss of synchrony of the sequencing reactions in a cluster due to issues with the

Illumina reversible-sequencing chemistry. Phasing can be subdivided into pre-phasing and

phasing errors. Phasing occurs due to the incomplete removal of the reversible terminator and

fluorescent label in a proportion of the molecules in a cluster, which causes some of the

sequencing reactions to lag behind. Pre-phasing occurs when sequencing of a molecule in a

cluster advances further than the rest of the molecules through the incorporation of

nucleotides without proper terminators. These mismatch errors are usually marked as bases with

low base quality scores and tend to get progressively worse towards the 3’-end of the sequenced

molecule. This is reflected as a general degradation of base quality scores towards the end of the

read. Finally, the PCR amplification step during library preparation leads to a lower abundance

of GC-rich and GC-poor fragments (110, 111). While Illumina has mitigated these issues over

time, they have not been completely resolved.
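
The progressive loss of synchrony within a cluster can be sketched with a small Python model. This is a simplified illustration under stated assumptions, not Illumina's base-calling model: each molecule is assumed to lag by one base with probability `p_lag` (phasing) or jump ahead by one base with probability `p_lead` (pre-phasing) independently at every cycle, and the rates used below are hypothetical.

```python
def in_phase_fraction(cycles: int, p_lag: float, p_lead: float) -> list[float]:
    """Fraction of molecules in a cluster that are in sync after each cycle."""
    # offset from the ideal position -> fraction of molecules at that offset
    offsets = {0: 1.0}
    fractions = []
    for _ in range(cycles):
        updated: dict[int, float] = {}
        steps = ((-1, p_lag), (0, 1.0 - p_lag - p_lead), (1, p_lead))
        for offset, frac in offsets.items():
            for step, prob in steps:
                updated[offset + step] = updated.get(offset + step, 0.0) + frac * prob
        offsets = updated
        fractions.append(offsets[0])
    return fractions


# Hypothetical per-cycle rates for a 150-cycle read.
fractions = in_phase_fraction(150, p_lag=0.002, p_lead=0.001)
print(round(fractions[0], 4), round(fractions[-1], 4))
```

Under this model the in-phase fraction falls with every cycle, which is consistent with base qualities degrading towards the 3’-end of the read.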

1.6 RNA Sequencing

One of the difficulties with the understanding of gene structure and function relationships within

the eukaryotic transcriptome is the high level of transcript post-processing, which includes RNA

editing, alternative splicing, and polyadenylation (16, 112, 113). Prediction of these events is

difficult using genome sequencing information alone (114). There is a need to determine and

quantify what RNAs are present within a sample and the extent of their post-transcriptional

modifications on a global scale. In general the ideal RNA sequencing experiment would: 1)

quantify the overall expression of a gene, 2) infer the RNA isoforms encoded within a gene,

3) quantify the expression of each individual isoform, and 4) identify expressed variants, RNA

editing, and other post-transcriptional modifications. Sequencing technology is a viable method

to determine all of these aspects of RNA. By quantitatively sequencing the total RNA content of a

cell it is possible to calculate gene expression, isoform structures and isoform expression (115,

116). The majority of sequencing innovation is focused on sequencing DNA; however, the

discovery of reverse transcriptase (RT) in 1970, which acts to reverse transcribe RNA into

cDNA, provided a means to apply DNA sequencing technology to RNA (117, 118).

Using RT to generate cDNA introduces issues compared to standard DNA sequencing that need

to be considered: 1) RT’s lower fidelity, 2) the presence of an RNase H domain, and 3) RT

template switching effects (119-122). The lower fidelity of RT is attributed to the lack of a

proofreading activity, which leads to a higher error rate compared to DNA polymerases. The

RNase H domain can degrade RNA templates and first-strand DNA synthesis products during

long incubations. The RNase H domain can also limit full-length cDNA synthesis. Recombinant

RT proteins have improved fidelity and carry an inactive RNase H domain (122). RT template

switching occurs when the RT enzyme switches templates either inter- or intra-molecularly and

continues reverse transcription (119, 120). This can lead to the identification of false-positive

fusions or splicing events. The effect and rate of reverse transcriptase template switching have not

been fully explored; however, it has been shown to cause false positive trans-splicing, gene

fusion and cis-splicing events (119, 120). Collectively, the discovery of RT and the

advancements in DNA sequencing have provided a foundation for the development of novel

methods that permitted the identification and / or the quantification of post-transcriptional

regulation. Other important sequencing techniques include: expressed sequence tag (EST)

sequencing, serial analysis of gene expression (SAGE), high-throughput quantitative RT-PCR

based RNA isoform quantification, and RNA-seq (118, 123-127). These methods have permitted

scientists to evaluate the dynamics of gene expression and post-transcriptional regulation using

DNA sequencing technologies.

The eukaryotic transcriptome is complex due to its dynamic nature, which includes tissue and

cell type specific expression, alternative mRNA isoforms, overlapping transcripts and antisense

transcripts (15). The accurate identification of isoform structures within eukaryotic organisms is

challenging due to the presence of introns, which can range in size from a few hundred base pairs to

upwards of 500,000 base pairs (128). Alternative splicing confounds this because a gene can

have multiple isoforms that cannot be easily inferred from the genome sequence. The earliest

methods for sequencing the transcribed portion of an organism’s genome took advantage of the

poly(A) tails present on the majority of mRNAs. The key steps involved reverse transcription

with poly(T) primers that would anneal to the 3’-poly(A) tail of an RNA transcript and

synthesize a complementary DNA sequence (118, 124, 125, 127, 129). The cDNA sequences are

then cloned into vectors and transformed into the appropriate organism for replication. Individual

clones can be sequenced using Sanger sequencing to produce reads up to 1 kb in length. This

method is known as cDNA cloning with expressed sequence tag (EST) sequencing and was, and

continues to be, used to profile alternative splicing, allele specific expression, gene structure, and

RNA editing. As of this writing the NCBI’s dbEST has over 74 million EST sequences from a

variety of eukaryotic species (130). EST sequencing has been an essential complement to

genome sequencing. Genome sequencing studies generally combine both genome and EST

sequencing to better annotate and predict gene structures. EST sequencing has three major

shortcomings: first, it is at best semi-quantitative; second, the full transcript is not sequenced; and

third, some cDNA sequences are recalcitrant to cloning.

The lack of quantitative information from EST sequencing limits the technique’s ability to

support the analysis of gene expression and post-transcriptional regulation. Methods have been

developed to quantify the RNA content of a cell such as microarrays but they are dependent on

gene annotations, which change over time as new genes and isoforms are discovered. The number

of probes that can be printed on a slide further limits microarrays (limit ~2 million). A

quantitative RNA method based on sequencing solves this issue since the sequences can be

remapped as genome annotations improve. The first method to quantify gene expression on a

global scale with sequencing was the serial analysis of gene expression (SAGE) method (123,

131). SAGE consists of reverse transcribing an RNA sample using poly(T) primers, fragmenting

the cDNA, and isolating short (14-20 bp) tags. The tags are concatenated together into

concatemers that are 800 bp to 1 kbp in length. The concatemers are cloned and sequenced using

Sanger sequencing. Finally, the tags can be mapped back to the genome and the relative

abundance of the tags can be used to infer the abundance of the gene. SAGE provides a method

to analyze global gene expression; however, there are some disadvantages. Generating the

concatemer tags is a laborious process, short tags can have ambiguity in their mapping

back to the genome, the resolution of SAGE is low without sequencing many concatemer clones,

and isoform structures cannot be inferred or quantified. Variations of the SAGE tag-sequencing

method have been developed; these include cap analysis of gene expression (CAGE) and

longSAGE that uses 21 bp tags (131).
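
The tag-counting idea behind SAGE can be sketched in a few lines of Python. This is an illustrative simplification: the tags are assumed to have already been extracted from the sequenced concatemers, and the tag-to-gene lookup table, tag sequences and gene names are hypothetical. Tags that map to more than one gene are set aside, which mirrors the mapping ambiguity of short tags noted above.

```python
from collections import Counter


def tag_abundance(
    tags: list[str], tag_to_genes: dict[str, tuple[str, ...]]
) -> tuple[dict[str, float], int]:
    """Return relative gene abundance from uniquely mapping tags,
    plus the number of tags discarded as ambiguous."""
    counts: Counter = Counter()
    ambiguous = 0
    for tag in tags:
        genes = tag_to_genes.get(tag, ())
        if len(genes) == 1:
            counts[genes[0]] += 1
        elif len(genes) > 1:
            ambiguous += 1
    total = sum(counts.values())
    relative = {gene: n / total for gene, n in counts.items()} if total else {}
    return relative, ambiguous


# Hypothetical tags (shortened for readability) and lookup table.
tags = ["AAAC"] * 3 + ["GGGT"] + ["CCCA"] * 2
lookup = {"AAAC": ("geneA",), "GGGT": ("geneB",), "CCCA": ("geneB", "geneC")}
print(tag_abundance(tags, lookup))
```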

1.7 High-throughput RNA Sequencing

High-throughput sequencing technologies such as Illumina sequencing allow the global profiling

of the RNA content of a cell at single nucleotide resolution (10, 132-134). This whole

transcriptome sequencing approach - referred to as RNA-seq - is quantitative for gene and

isoform abundance estimates (10, 133). The single nucleotide resolution can be utilized for the

de novo identification of gene and isoform structure, splice junctions, RNA editing sites, RNA

methylation, and allele specific expression (12, 115, 116, 135, 136). RNA-seq has led to

fundamental discoveries in biology such as the identification of novel RNA classes including

long non-coding RNA and circular RNAs, and the extent of genes alternatively spliced in the

human transcriptome (116, 137-139).

The preparation of an RNA-seq library consists of four steps: 1) RNA purification and size

selection, 2) random fragmentation, 3) cDNA synthesis, and 4) sequencing adapter ligation. Some of

these steps may be combined, the order of steps changed or the protocol modified, and additional

RNA selection and depletion steps may be used. The library preparation can also be modified to

preserve strand information for each fragment (140). The method used to generate the library

should be tailored to the experimental goals of the study and taken into consideration for the

downstream analysis of the sequencing data. The RNA-seq library is commonly sequenced as a

paired-end library with the Illumina sequencer.

1.7.1 RNA-seq Library Preparation

The first step when preparing an RNA-seq library is to determine whether poly(A)-selected,

poly(A)-depleted or total RNA is to be sequenced (141, 142). Poly(A) selection uses a poly(T)

column or magnetic beads to select RNAs with poly(A) tails. This eliminates the majority of non-

poly(A) RNA, including rRNA. However, some non-poly(A) transcripts with A-rich regions may

also be selected. Poly(A)-depletion consists of collecting the non-hybridized RNA after poly(A)-

selection, while total RNA does not use any enrichment methods. For both poly(A)-depleted and

total RNA samples, ribosomal RNAs are extremely abundant and must be depleted. Common

methods include duplex-specific nuclease treatment, subtractive hybridization and magnetic bead

based rRNA depletion (133, 143, 144). Total RNA and poly(A)-depleted RNA samples include a

greater abundance of pre-mRNAs, repeat elements, erroneously spliced transcripts and transcripts

that are targets for RNA editing. The next step can either be random fragmentation or cDNA

synthesis depending on the protocol. After double stranded cDNA fragments are generated,

sequencing adapters are ligated to the fragments, amplified and sequenced following the standard

Illumina paired-end DNA sequencing protocol. Strand information for the RNA can be

preserved with modifications to existing protocols, but these add a layer of complexity

and create an additional source of errors and bias (140). There have been various commercial

“kits” developed by different vendors to encapsulate all of the library steps into a simplified

protocol. These kits have been optimized for different RNA input levels from single cells to

tissue samples, strand information preservation, and for poly(A)-selection or total RNA

sequencing.

The most commonly used library preparation method is the Illumina TruSeq protocol. The

TruSeq protocol uses two rounds of poly(A) selection followed by incubation at 94°C with

divalent cations to fragment the RNA and random hexamers to prepare it for first-strand cDNA

synthesis (Figure 1.2). Reverse transcriptase is used to generate the first cDNA strand. Second

strand cDNA is synthesized using DNA polymerase with RNase H to degrade the RNA template.

The fragments are then blunted and tailed with a single 3’-A and the Illumina sequencing Y-

adapters are ligated onto the fragments. The fragments are then sequenced following Illumina’s

paired-end sequencing protocol, which includes a linear PCR amplification step.

1.8 RNA-seq Library Preparation Challenges

In addition to the sequencing issues and biases associated with the Illumina sequencing platform

there are a number of issues unique to the preparation of RNA-seq libraries that can confound

down-stream analysis and / or obscure biologically relevant signals (141). These issues are

primarily due to biases during library enrichment, fragmentation and reverse transcription (141,

145, 146). Some of these biases can be addressed computationally, while others are more difficult to

correct for and should be considered before making any biological conclusions.

Figure 1.2. RNA-seq library preparation schematic. For a description of the steps see section 1.7.1. Panels: poly(A) selection and fragmentation; first strand cDNA synthesis; second strand cDNA synthesis; adapter ligation; linear amplification; PCR amplification; paired-end sequencing (Read 1, Read 2, the unsequenced portion of the fragment, and fragment size).

Poly(A)-selection can lead to a fragment bias towards the 3’-end of transcripts and lower read

coverage at the 5’-ends of transcripts (10). The end-bias can affect gene-expression

quantification, variant calling and splice junction identification (147). This is particularly

problematic for transcripts with low expression. Biological noise is more pervasive in deeply

sequenced total RNA and poly(A) depleted samples and may contain additional transcripts that

are erroneously expressed or spliced (148, 149). These libraries also have greater abundances of

intronic sequences due to incompletely spliced pre-mRNAs, and repeat elements such as

transposons. These additional sequences may not be relevant for the experiment and reduce the

sensitivity for transcripts with low expression levels. The library selection may contain

contaminating genomic DNA if the library is not properly DNase treated; this can lead to the

false positive coverage of regions that are not normally expressed (149).

Library fragmentation leads to a read bias towards longer transcripts since a longer transcript will

have more potential fragments than a smaller transcript (145, 147). This leads to a long transcript

having more sequenced pairs than a smaller transcript with similar expression levels.
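
The length bias can be made concrete with a short Python sketch. It rests on a simplifying assumption that the expected number of fragments from a transcript is proportional to its abundance multiplied by its length, and it ignores the fragment size distribution and other biases; the function name, transcript names and counts are hypothetical.

```python
def length_normalized_abundance(
    counts: dict[str, int], lengths: dict[str, int]
) -> dict[str, float]:
    """Divide fragment counts by transcript length, then rescale to sum to 1."""
    rates = {name: counts[name] / lengths[name] for name in counts}
    total = sum(rates.values())
    return {name: rate / total for name, rate in rates.items()}


# Two hypothetical transcripts expressed at the same level: the longer one
# yields four times as many fragments simply because it is four times longer.
counts = {"long_tx": 400, "short_tx": 100}
lengths = {"long_tx": 4000, "short_tx": 1000}
print(length_normalized_abundance(counts, lengths))
```

After dividing by length the two transcripts receive equal relative abundance, whereas the raw counts would suggest a four-fold difference.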

Reverse transcriptase is not error correcting and has a higher error rate than high-fidelity DNA

polymerases (121). These are true mismatch errors that are not due to spectral overlap during the

sequencing reaction, so they may have high base qualities. This can lead to increased false

positive mismatch calls compared to DNA libraries. This is particularly problematic for the

identification of RNA editing and variants. RT can switch templates intra- or inter-molecularly

during reverse transcription of structured regions leading to the generation of chimeric transcripts

(119, 120). These transcripts can cause the identification of false positive splicing events and

gene fusions. Random hexamer first-strand synthesis priming can lead to biases in the nucleotide

composition at the beginning of sequencing reads (145, 146, 150). The annealing of random

hexamer primers with mismatches can lead to mismatches in the sequencing reads compared to

the RNA fragment.

1.9 Sequence Alignment Algorithms

As sequencing technology produces longer and more numerous sequencing reads it has become

increasingly challenging to align or map these sequences to a reference genome, or assemble

them into contiguous sequences. The accurate mapping or alignment of the sequencing reads to a

database is essential for the downstream analysis of the sequencing data. One of the seminal

developments for sequence alignment algorithms was the development of the “global”

Needleman-Wunsch and “local” Smith-Waterman dynamic programming based alignment

algorithms (151, 152). These algorithms permit the alignment of two different sequences with

mismatches, gaps (insertions and deletions) and a reasonable computational complexity. The key

difference between the NW and SW algorithms is that the NW algorithm aligns the two

sequences completely from end to end (global alignment), while the SW algorithm can produce

smaller internal alignments (local alignment). Both of these algorithms were further refined by

the development of the affine gap penalty modification (153). The affine gap penalty uses

separate scores for opening and extending an insertion or deletion; this is more biologically

relevant since a longer gap can be generated by a single mutational event. Vectorizing the

algorithms using state-of-the-art CPU instructions further increased the performance of the SW

and NW algorithms (154-156). These methods can also be sped up using heuristics to reduce the

alignment space to the most relevant regions of the aligned sequences (157, 158). For example,

reducing the alignment space by limiting the number and size of insertions or deletions can

increase performance with a small reduction in sensitivity.
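
The difference between global and local alignment, and the effect of an affine gap penalty, can be sketched with a compact dynamic programming routine. The Python example below is a minimal illustration using the standard three-matrix recurrences for affine gaps; it returns only the optimal score (no traceback), it is not vectorized, and the scoring values are arbitrary assumptions rather than parameters from any particular aligner. A gap of length L costs `gap_open + L * gap_extend`.

```python
def align_score(a: str, b: str, match: int = 2, mismatch: int = -1,
                gap_open: int = -3, gap_extend: int = -1,
                local: bool = False) -> float:
    """Optimal global (NW-style) or local (SW-style) score with affine gaps."""
    n, m = len(a), len(b)
    neg = float("-inf")
    H = [[0.0] * (m + 1) for _ in range(n + 1)]  # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]  # ends with b[j-1] opposite a gap
    F = [[neg] * (m + 1) for _ in range(n + 1)]  # ends with a[i-1] opposite a gap
    if not local:
        # Global alignment must pay for gaps at the sequence ends.
        for i in range(1, n + 1):
            H[i][0] = gap_open + i * gap_extend
        for j in range(1, m + 1):
            H[0][j] = gap_open + j * gap_extend
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(E[i][j - 1] + gap_extend,
                          H[i][j - 1] + gap_open + gap_extend)
            F[i][j] = max(F[i - 1][j] + gap_extend,
                          H[i - 1][j] + gap_open + gap_extend)
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(diag, E[i][j], F[i][j])
            if local:
                # A local alignment may restart anywhere, so scores never go below 0.
                H[i][j] = max(H[i][j], 0.0)
                best = max(best, H[i][j])
    return best if local else H[n][m]


print(align_score("GATTACA", "GATACA"))                          # global
print(align_score("TTTGATTACATTT", "CCGATTACACC", local=True))   # local
```

The only differences between the two modes are the boundary initialization, the floor at zero, and where the final score is read from.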

The next challenge for sequence alignment was to map reads to a database of known sequences

or chromosomes. The SW and NW algorithms would spend a significant amount of

computational time in regions that may not contain a relevant alignment, which became

infeasible for large databases or numerous sequencing reads. An algorithm was needed to

identify regions and sequences that are likely to contain a valid alignment rather than brute-

forcing the alignment across the whole database. Many tools have been developed to solve this

problem with various speed and sensitivity tradeoffs; these tools include FASTA (159),

BLAST (160), BLAT (161), and BWA (162). These tools offered a substantial speed improvement

for database searching compared to the SW or NW algorithms.

Early heuristic methods were developed to reduce the search space for sequence alignments by

populating a hash table of k-grams derived from a sequence database (159, 163). The k-gram

table uses a collection of k-mers of length k and a step size s that indicates the distance between

each k-mer; for example, with s = 2, every other k-mer would be retained. The hash table can be

constructed in linear time, has a customizable memory footprint and offers constant lookup

speed. A query sequence can be used to search the hash table by looking up the positions of all

the k-grams across the query sequence. Heuristics can be used to choose which regions should be

investigated for alignment. The k-size can influence mismatch tolerance, sensitivity and

performance of the algorithm. Smaller k values will be more sensitive, however, there will be

more potential regions of interest. The step size can be used to reduce the memory footprint and

increase performance of the hash table with a reduction in sensitivity. The typical memory

footprint for a DNA hash table using a two-bit encoding for the DNA alphabet is approximately

$4(4^k + N/s)$ bytes, where N is the number of nucleotides in the database, k is the k-mer length and s is the step size. One of the

advantages of hash tables is that they can be constructed on machines with limited memory. One

disadvantage of hash tables for searches with low sequence similarity is that the k-size needs to

be small. To mitigate this a variant hash table utilizing “spaced seeds” that are generated using

non-continuous k-grams from a defined set of patterns is used to increase sensitivity without

reducing the k-mer size (164).
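
A k-mer hash table with a step size can be sketched in Python as follows. This is a minimal illustration, not the index of any specific aligner: a dictionary stands in for the hash table, candidate regions are chosen by a simple vote on the implied alignment start, and the memory estimate assumes 4-byte entries for both the 4^k bucket table and the N/s stored positions. The function names and sequences are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int, step: int = 1) -> dict[str, list[int]]:
    """Store the positions of every `step`-th k-mer of the reference."""
    index = defaultdict(list)
    for pos in range(0, len(reference) - k + 1, step):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_offsets(index: dict[str, list[int]], query: str, k: int) -> dict[int, int]:
    """For each implied alignment start, count the query k-mers that support it."""
    votes: dict[int, int] = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for rpos in index.get(query[qpos:qpos + k], ()):
            votes[rpos - qpos] += 1
    return dict(votes)


def index_bytes(k: int, n_bases: int, step: int = 1, entry_bytes: int = 4) -> int:
    """Rough memory estimate: bucket table plus stored positions."""
    return entry_bytes * (4 ** k + n_bases // step)


reference = "TTGACGTACCAGGT"  # hypothetical reference
query = "ACGTACC"             # occurs at reference position 3
print(candidate_offsets(build_kmer_index(reference, k=4, step=1), query, k=4))
print(candidate_offsets(build_kmer_index(reference, k=4, step=2), query, k=4))
print(index_bytes(k=12, n_bases=3_000_000_000))
```

With a step size of 2 the same region is still found, but it is supported by fewer seeds, which illustrates the trade-off between memory and sensitivity.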

FASTA (159) was the first alignment tool to use hashing, heuristics and the SW algorithm to

generate gapped alignments. BLAST (160) was the first tool to use a hash table of the query

sequence to scan a large database for potential alignments. BLAST performance was further

increased with a modified version of the SW algorithm called X-drop that extends regions of

interest in each direction and stops if the alignment score drops below a threshold (158). BLAST

can be used to search very large databases such as GenBank.
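
The X-drop stopping rule can be illustrated with a deliberately simplified Python sketch. It is not BLAST's implementation: the extension is ungapped, runs in one direction only, and the scoring values, threshold and function name are illustrative assumptions. Extension stops once the running score falls more than `x_drop` below the best score seen so far, and the best-scoring prefix is reported.

```python
def xdrop_extend(query: str, target: str, q_start: int, t_start: int,
                 x_drop: int = 5, match: int = 1, mismatch: int = -2) -> tuple[int, int]:
    """Extend a seed to the right without gaps.

    Returns (best_score, length_of_best_extension).
    """
    score = best = 0
    best_len = 0
    length = 0
    while q_start + length < len(query) and t_start + length < len(target):
        same = query[q_start + length] == target[t_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # the score has dropped too far; stop extending
    return best, best_len


print(xdrop_extend("ACGTACGTAAAA", "ACGTACGTCCCC", 0, 0, x_drop=3))
print(xdrop_extend("ACGTTCGT", "ACGTACGT", 0, 0, x_drop=3))
```

In the first call the extension is abandoned shortly after the sequences diverge; in the second, a single mismatch is tolerated because the score recovers before the threshold is exceeded.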

A similar tool to BLAST, the BLAT (161) program, uses a hash table of all non-overlapping k-

grams in a database sequence and is capable of mapping cDNA sequences across introns. BLAT

maps reads across introns by stitching together anchors derived from seeds and takes advantage

of conserved splice-site information. BLAT is useful for the alignment of sequences derived

from cDNA libraries.

PatternHunter (164) was the first alignment tool to use a spaced-seed based hash table approach.

This offered an improved sensitivity and speed compared to traditional hash table approaches.
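
The idea of a spaced seed can be sketched in Python. The example is a toy illustration and does not reproduce PatternHunter's seed or data structures: the pattern `1110111` (six sampled positions and one ignored position), the sequences and the function names are all assumptions made for the example. A contiguous seed of the same weight is shown for comparison.

```python
from collections import defaultdict


def spaced_key(seq: str, pos: int, pattern: str) -> str:
    """Keep only the bases that fall under the '1' positions of the pattern."""
    return "".join(seq[pos + i] for i, flag in enumerate(pattern) if flag == "1")


def build_spaced_index(reference: str, pattern: str) -> dict[str, list[int]]:
    """Hash every window of the reference by its spaced key."""
    index = defaultdict(list)
    for pos in range(len(reference) - len(pattern) + 1):
        index[spaced_key(reference, pos, pattern)].append(pos)
    return index


def seed_hits(reference: str, query: str, pattern: str) -> list[tuple[int, int]]:
    """Return (query_pos, reference_pos) pairs whose spaced keys agree."""
    index = build_spaced_index(reference, pattern)
    hits = []
    for qpos in range(len(query) - len(pattern) + 1):
        for rpos in index.get(spaced_key(query, qpos, pattern), ()):
            hits.append((qpos, rpos))
    return hits


reference = "GGACGTTCAGG"  # hypothetical reference
query = "ACGATCA"          # matches reference[2:9] except for one mismatch
print(seed_hits(reference, query, "1110111"))  # spaced seed, weight 6
print(seed_hits(reference, query, "111111"))   # contiguous seed, weight 6
```

Here the mismatch falls on the ignored position of the spaced seed, so the spaced seed still produces a hit while every contiguous 6-mer of the query overlaps the mismatch and finds nothing.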

Tools optimized for long reads (>200 bp) were not optimal for the alignment of the short (35 –

150 bp) paired-end reads generated by high-throughput sequencers such as Illumina (163).

Moreover, existing tools were too slow to map the millions of reads generated by high-

throughput sequencers. Tools optimized for high-throughput read alignment generally map each

read from a pair independently and join the hits together using the fragment size distribution.

Two common approaches are used to construct an index (usually a hash table) to map a database

of reads to a database of reference sequences: 1) indexing the reads and scanning the genome; 2)

indexing the genome and scanning the reads (160, 163). The latter has become the dominant

indexing method since it is faster to index the genome once and use it for each experiment rather

than index the reads for each experiment.
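
Joining independently mapped mates using the fragment size can be sketched as a simple filter. The Python example below is a minimal illustration under several assumptions: both mates map to the same reference sequence, read 1 lies on the forward strand and read 2 on the reverse strand, hits are given as leftmost coordinates, and a hard size window stands in for a fragment size distribution. The function name and coordinates are hypothetical.

```python
def pair_hits(hits1: list[int], hits2: list[int], read_length: int,
              min_fragment: int, max_fragment: int) -> list[tuple[int, int, int]]:
    """Return (read1_start, read2_start, fragment_size) for compatible hit pairs."""
    pairs = []
    for p1 in hits1:
        for p2 in hits2:
            # Fragment spans from the start of read 1 to the end of read 2.
            fragment = p2 + read_length - p1
            if min_fragment <= fragment <= max_fragment:
                pairs.append((p1, p2, fragment))
    return pairs


# Each mate has two candidate positions; only one combination is consistent
# with a 200-500 bp fragment.
print(pair_hits([1000, 50000], [1250, 90000], read_length=100,
                min_fragment=200, max_fragment=500))
```

For RNA-seq data the allowed distance would also have to account for introns between the two mates, which this sketch does not attempt.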

The first short read sequencing aligners used hash tables. Eland (Cox, unpublished) and MAQ

(165), two of the first short read aligners, indexed the reads and scanned the genome using a seed

and extend approach in a method similar to BLAST. MAQ permitted a configurable number of

mismatches per read while Eland was limited to two mismatches. As the number of reads and

their lengths increased it became infeasible to index all of the reads. The next generation of short

read aligners indexed the genome rather than the reads. These tools included SOAP (166) and

BFAST (167), which used hash tables and a seed and extend approach. To further increase

sensitivity for mismatches and gaps, short read alignment tools that used spaced seeds were

developed. These included SHRiMP (168), which uses a vectorized version of the SW

algorithm. Among the disadvantages of these hash-based alignment tools is that maximal

performance requires a large memory footprint. For example, MAQ requires ~14GB of memory

for the human genome and SHRiMP, with its multiple spaced-seed indexes, requires ~48GB of

memory. Furthermore, having a constant k-gram size limits sensitivity. Clearly, a new

memory-efficient and sensitive index structure was needed.

Around 1999, bioinformaticians began exploring other full-text index options. These included an

index structure called the suffix tree, which consists of all the suffixes of the database sequence

stored in a tree (169, 170). The suffix tree can be used to find the occurrences of a query of any

length in linear time, and it also permits occurrence searching with mismatches. This is in

contrast to hash tables, which can only match strings of a specific k-size. Suffix trees were

previously used for whole genome comparisons (e.g., MUMmer (171)); however, their memory

requirements were too large (typically ~12.5 bytes per base) (172). The next advancement came

by replacing the suffix tree with a suffix array, which is an array of all the suffixes in the

database sorted in lexicographic order (173, 174). Suffix arrays only require 4 bytes per base, but

lose the ability to solve some of the complex string matching problems compared to suffix trees.

Exact matches can be looked up using a suffix array in linear time with the proper auxiliary data

structures. Additional data structures can be used to supplement a suffix array to solve the same

set of problems as a suffix tree (174). Suffix arrays still required too much memory; roughly

48GB for the human genome, which was much more memory than most computers had in 2008;

therefore, an even smaller index was needed.
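
As a small illustration of the suffix array, the Python sketch below builds the array by sorting suffix start positions and locates exact matches with two binary searches. It is a toy under stated assumptions: the construction is the naive sort rather than a linear-time algorithm, and the search costs O(m log n) because it omits the auxiliary structures mentioned above; the function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, sorted lexicographically (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return sorted start positions of exact matches of pattern, via binary search on sa."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "GATTACAGATTA"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ATTA"))  # [1, 8]
```

Unlike a hash table with a fixed k-size, the same array answers queries of any length, since all suffixes sharing a given prefix occupy one contiguous block of the array.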

The next major advancement came with the invention of the Burrows-Wheeler Transform (BWT) and

the FM-index (175, 176). The BWT is a transformation of the reference sequence that can be

derived from the suffix array. The BWT is utilized by the FM-index with additional auxiliary

data structures including: a down-sampled suffix array, an occurrence data structure and a

constant sized character count table (162, 163, 177, 178). The FM-index permits linear time

pattern matching with low memory requirements. For a reference of $N$ bases, the BWT can be stored in $N/4$ bytes for 2-bit

encoded DNA sequences. The sampled suffix array typically uses $4N/2^{k}$ bytes, where $k$ defines

the sampling rate of the suffix array. A smaller $k$ increases performance at the expense of memory.

The occurrence data structure can take up between $N/2$ and $N$ bytes of memory. Increasing the

sampling rate does incur a performance cost since additional processing is required to find the

positions of a given occurrence. The FM-index can be used to find occurrences of strings with

mismatches but the performance is reduced as the number of mismatches increases beyond two

(178).
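
The backward search that underlies FM-index pattern matching can be sketched as follows. This Python example is a teaching aid under explicit assumptions: it keeps the full suffix array and a full occurrence table rather than the down-sampled versions described above, it supports exact matching only, and all names are hypothetical rather than taken from any aligner.

```python
def bwt_from_suffix_array(text):
    """Return (BWT string, suffix array) for text terminated with the sentinel '$'."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix
    return bwt, sa


def build_fm(bwt):
    """Build the character count table C and the occurrence table Occ."""
    alphabet = sorted(set(bwt))
    counts, total = {}, 0
    for c in alphabet:
        counts[c] = total          # number of characters smaller than c
        total += bwt.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)  # count of c in bwt[:i+1]
    return counts, occ


def backward_search(pattern, counts, occ, n):
    """Return the half-open suffix array interval [lo, hi) of suffixes prefixed by pattern."""
    lo, hi = 0, n
    for ch in reversed(pattern):   # one step per pattern character, right to left
        if ch not in counts:
            return 0, 0
        lo = counts[ch] + occ[ch][lo]
        hi = counts[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


text = "GATTACAGATTA"
bwt, sa = bwt_from_suffix_array(text)
counts, occ = build_fm(bwt)
lo, hi = backward_search("ATTA", counts, occ, len(bwt))
positions = sorted(sa[i] for i in range(lo, hi))
print(positions)  # [1, 8]
```

The search performs one table lookup per pattern character, which is the linear-time behaviour referred to above. In a memory-efficient index, the final step of converting the interval to positions would walk the BWT until a sampled suffix array entry is reached, which is where the sampling rate affects speed.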

The first aligner to use an FM-index was BWT-SW (163, 177), which was a proof of concept that

an FM-index in conjunction with the SW algorithm can be used to generate local alignments with

mismatches and gaps. This implementation was slower than BLAST and BLAT, but was seminal

in demonstrating the utility of the FM-index for DNA alignment. BWT-SW led to the

development of multiple alignment tools that use the FM-index including SOAP2 (179), Bowtie

(178) and BWA (162). SOAP2 splits the read into three seeds and identifies their positions using

an FM-index. SOAP2 is limited to two mismatches or short gaps. Bowtie generates gapless

global alignments using a seed-and-extend strategy with an FM-index; this permits very fast

execution. Bowtie uses a base-quality-aware backtracking algorithm to find seeds

with a maximum of three mismatches. The seeds are then extended using the FM-index to

produce global alignments; Bowtie stops when a valid alignment is found. Bowtie2 (156) is a

refinement of Bowtie that uses a vectorized SW or NW algorithm to generate local or global

alignments, respectively; this permits Bowtie2 to map reads with more mismatches and across

gaps. Finally, BWA uses a backtracking algorithm similar to Bowtie but supports gaps and

higher numbers of mismatches at the expense of performance. BWA-mem (180) is a refinement

to BWA that uses the FM-index to find seeds and an SW and X-drop-like algorithm to extend the

seeds and generate alignments. These alignment tools tend to require less memory, run faster

and have similar or better sensitivity compared to hash-table-based alignment tools.

1.10 RNA-seq Read Alignment

Paired-end RNA-seq read alignment is difficult compared to contiguous read alignment due to

the non-contiguous nature of mRNA transcripts (34, 38, 181). Critically, RNA-seq aligners must

be able to identify short exonic alignments in regions that can be interspersed with introns that

can reach hundreds of kilobases in length, with the longest surpassing 500kb

(Figure 1.4) (182). High-throughput sequencing pairs are generally short (<150 bp each) and

derived from longer DNA fragments, making it challenging to identify exonic alignments due to

potentially very short exonic overlaps on either side of the splice junction. Mapping across splice

junctions is further challenged by the general deterioration of base quality (i.e., higher mismatch

rate) at 3’-read ends, which can lead to the identification of false positive splice junctions and

variant calls (108).

RNA-seq alignment tools are generally divided into three main categories: 1) exon-first, 2) seed-

and-extend and 3) hybrid tools (182, 183). Exon-first tools map reads to the genome

contiguously using a traditional high-throughput sequencer aligner and then the unmapped reads

are processed for splice junctions. These methods rely on global alignments and may miss

spliced alignments with short exonic overlaps and alignments that have a suboptimal contiguous

alignment. Seed-and-extend methods map reads in a method similar to BLAT where anchors are

stitched together using a DNA index. Seed-and-extend approaches perform best when identifying

novel splice junctions (184). Hybrid methods combine an exon-first alignment with a seed-and-

extend method. An important difference between RNA-seq aligners is their capability to identify

non-canonical (i.e., non-GT-AG) splice junctions. There have been more than thirty tools and

methods developed to map reads across annotated splice junctions, discover novel splice

junctions de novo and map reads across combinations of annotated and novel splice junctions

(183, 184).

Figure 1.3. RNA-seq alignment overview. A paired-end read must be mappable across introns. An example read pair that maps across an intron in two places is illustrated. Examples of insertions, deletions, mismatches and repeat alignments that must be supported are illustrated.

RNA-seq aligners must not only be able to map reads across splice junctions, but must also

determine whether read-pairs map concordantly and in the correct orientation (183, 184). Generally,

for read-pairs to have a concordant alignment they should map to the same chromosome and map

to opposite strands. Pairs that do not have this orientation may have arisen due to library

preparation artefacts, genome rearrangements or false positive alignments. Most tools map read

pairs independently and subsequently their alignments are combined to generate concordantly

mapped pairs. The majority of RNA-seq alignment tools consider a pair concordantly mapped if

its reads are in the correct orientation and within a user-defined distance, which typically defaults to

500kb (183, 184). This long paired alignment distance threshold can lead to the false

identification of read pairs that map between genes or transcriptional units.
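
The simple concordance rule described above can be written down directly. The Python sketch below is a hedged illustration: the `Alignment` record, its field names and the coordinate conventions are assumptions made for the example, and only the 500kb default comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    chrom: str
    start: int    # leftmost aligned reference position (0-based)
    end: int      # one past the rightmost aligned reference position
    strand: str   # "+" or "-"


def is_concordant(r1, r2, max_distance=500_000):
    """Same chromosome, opposite strands, facing each other, within max_distance."""
    if r1.chrom != r2.chrom or r1.strand == r2.strand:
        return False
    fwd, rev = (r1, r2) if r1.strand == "+" else (r2, r1)
    if fwd.start > rev.start:      # the forward-strand read must lie upstream
        return False
    return rev.end - fwd.start <= max_distance


read1 = Alignment("chr1", 1_000, 1_100, "+")
near_mate = Alignment("chr1", 1_300, 1_400, "-")
far_mate = Alignment("chr1", 401_000, 401_100, "-")

print(is_concordant(read1, near_mate))                    # True
print(is_concordant(read1, far_mate))                     # True: ~400kb apart, still accepted
print(is_concordant(read1, far_mate, max_distance=1_000)) # False
```

The second call shows the weakness noted above: with a 500kb threshold, a mate placed roughly 400kb away is accepted as concordant even though it may lie in a different gene or transcriptional unit.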

The first RNA-seq alignment methods were designed for single-end reads and used databases of

annotated splice junctions (116, 138). Modern RNA-seq aligners use one or more of the

following methods to map reads across splice junctions: 1) the use of a splice-junction database,

2) segmentation based alignments, and/or 3) seed-and-extend algorithms (Figure 1.2) (183, 184).

The first method tends to give the highest specificity and sensitivity if all of the splice junctions

are known; however, if the splice junction database is incomplete, the sensitivity and specificity

can be reduced. The latter two methods are capable of the de novo identification of splice

junctions, while the first is only capable of identifying novel splice junctions involving known

splice donors and acceptors. De novo methods increase sensitivity by identifying splice junctions

missing from databases of annotated splice junctions. These tools usually limit the maximum

size of a novel intron to 500kb. Utilizing gene annotations generally produces better quality

spliced alignments, since the number of splice junctions that must be discovered is reduced

(184). The alignment tools with the highest sensitivity and specificity are hybrid methods that

integrate existing splice-junction annotations with the identification of de novo splice junctions

(184).

Splice-junction databases are the first and most accurate method to map RNA-seq reads across

annotated splice junctions (116, 138). This method consists of generating a database of synthetic

sequences where each entry consists of the concatenation of the exonic sequences upstream and

downstream of the splice junction. The length of the sequence flanking the splice junction is

usually set to the length of the sequencing read. The reads can then be mapped to this splice

junction database using a traditional contiguous read aligner such as Bowtie or BWA (162, 178).

Contiguous alignment tools have been engineered to sensitively map alignments with

mismatches and gaps but do not have the ability to align reads across splice junctions. This

method produces the most accurate spliced read alignments since accurate and error-tolerant

contiguous read aligners can be used for alignment. This prevents the identification of false

positive splice sites that may be generated due to poor mismatch and gap tolerance during

alignment. Candidate novel splice junctions can be detected by populating the database with

every combination of annotated 5’- and 3’- splice sites derived from the same strand (138). The

combinations can be limited to genes or to a specified maximum intron size. This method does

not permit the de novo detection of splice junctions using unknown splice sites but may aid in the

identification of alternative splicing events. There are several drawbacks to using splice-junction

databases. The first drawback is that as sequencing reads get longer there is a greater probability

that they will span more than one splice junction, and these reads will be incorrectly mapped.

Another issue is that if a splice junction is missing from the database, a contiguous aligner may

generate a false positive alignment by forcing an alignment across a different splice junction by

incorporating incorrect mismatches and gaps. Finally, reads that cross novel splice junctions may

not be mapped at all, leading to false negatives.

Figure 1.2. RNA-seq Alignment Strategies. The three most commonly used RNA-seq alignment strategies are illustrated. Alignments using a splice-junction database (left column) where a set of synthetic sequences representing the spliced-mRNA are used for alignment. Alignments where an indexing strategy such as a hash table or suffix array are used to find exonic alignments that are “stitched” together to form spliced alignments (center column). Split read alignment algorithms (right column) where the read sequence is split into N pieces and each piece is independently mapped to a splice junction database and/or the genome reference.
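
The construction of a splice-junction database from annotated splice sites can be sketched as follows. This Python example is a minimal illustration under stated assumptions: a single strand, a toy genome in which introns are written in lower case purely for readability, and hypothetical coordinate conventions (a donor is the position just past the upstream exon, an acceptor is the first base of the downstream exon).

```python
def junction_sequences(genome, donors, acceptors, flank, max_intron):
    """Build synthetic junction sequences for every donor/acceptor combination.

    Each entry concatenates `flank` exonic bases upstream of the donor with
    `flank` exonic bases downstream of the acceptor. Combinations are limited
    to introns of at most `max_intron` bases.
    """
    database = {}
    for donor in donors:
        for acceptor in acceptors:
            intron_length = acceptor - donor
            if 0 < intron_length <= max_intron:
                upstream = genome[max(0, donor - flank):donor]
                downstream = genome[acceptor:acceptor + flank]
                database[(donor, acceptor)] = upstream + downstream
    return database


# exon 1 (0-6), intron 1 (6-12), exon 2 (12-18), intron 2 (18-24), exon 3 (24-30)
genome = "ACGTAC" + "gtaaag" + "TTGCAA" + "gtccag" + "GGATCC"
donors, acceptors = [6, 18], [12, 24]

database = junction_sequences(genome, donors, acceptors, flank=4, max_intron=20)
print(database)
# {(6, 12): 'GTACTTGC', (6, 24): 'GTACGGAT', (18, 24): 'GCAAGGAT'}
```

The entry `(6, 24)` is a candidate exon-skipping junction formed from annotated sites that are not paired in the annotation; lowering `max_intron` removes it. Reads mapped to these synthetic sequences with a contiguous aligner would then be converted back to genomic coordinates using the stored donor and acceptor positions.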

1.10.1 Segmentation Approaches

Segmentation based RNA-seq alignment approaches were the first alignment tools that were

widely accessible and capable of functioning on commodity hardware. Tophat (185) was the first

popular exon-first method; it can align read-pairs, use gene annotations and identify de

novo splice junctions (185). Tophat uses a multi-step approach in which the reads are first

mapped contiguously to the genome with Bowtie (178) and unmapped reads are retained.

Regions with coverage are then assembled into islands, assessed for potential GT-AG splice

sites, joined together and filtered to remove false positives. The retained splice junctions are used

to form a database of synthetic splice junctions. Next, the unmapped reads are mapped

against the splice-junction database using Bowtie and resolved back to genomic coordinates.

Individual reads from a pair are processed using the Tophat method independently and are then

assessed for concordant alignments using a specified insert size distribution. The biggest

disadvantage of using coverage islands to find splice junctions is that there is no support for

non-canonical splice junctions; in addition, false positives are common since GT-AG sequences can

occur without being functional splice junctions. Finally, Tophat has no support for insertions or

deletions, which can lead to additional false positive splice junction predictions. However, the

two-step alignment process of mapping reads to a splice-junction database after a phase of novel

junction discovery is important and is commonly used today by RNA-seq alignment tools. The

method of mapping read-pairs independently and then assessing whether their alignments are

concordant is also common practice.

Segmentation-based RNA-seq alignment tools rely heavily on contiguous high-throughput

aligners. Novel splice junctions are identified de novo by splitting unmapped reads into multiple

segments and then mapping them independently to the genome sequence. The split read

alignments are used to infer splice junctions based on patterns of their alignment. Two popular

segmentation based RNA-seq aligners are MapSplice (186) and a revised version of Tophat

named Tophat2 (115, 185, 187). The primary differences between Tophat2 and MapSplice are

the usage of gene annotations and the patterns of segmented-reads they can utilize to identify

novel splice junctions (183). These segmentation-based algorithms are advantageous in that they

are less challenging to implement, since they depend on existing contiguous alignment tools for

the majority of the alignments.

Tophat2 (187) uses a three-step approach to map RNA-seq reads. The first step is only used

when gene annotations are available: Tophat2 uses Bowtie2 to map reads to a modified version

of a splice-junction database that includes complete cDNA sequences of all annotated transcripts.

The second step consists of mapping the unmapped reads to the genome sequence. Finally, the

third step collects the remaining unmapped reads, segments them into 25 bp non-overlapping

segments and then independently maps the segments to the genome. Only two cases of segment

alignments are considered: an internal segment (i.e., a segment that is not the first or

last segment) is unmapped, or a pair of consecutive segments does not map contiguously to the

genome. These two segment alignment cases are used to identify novel splice junctions by

searching for splice sites near the segments and joining them. A

splice junction database is generated with the discovered splice junctions and the unmapped

segments are mapped to this database and stitched together with the other segments. This

algorithm has several drawbacks. Tophat2 can miss novel exons shorter than the segment length,

as well as splice junctions with suboptimal segment alignments or segments with multiple splice

junctions.
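
The two segment patterns described above can be illustrated with a short Python sketch. It is not Tophat2's code: the function names, the representation of segment alignments as a list of genomic start positions (with `None` for an unmapped segment) and the example coordinates are all assumptions made for the illustration.

```python
def split_read(read, segment_length=25):
    """Cut a read into consecutive, non-overlapping segments."""
    return [read[i:i + segment_length] for i in range(0, len(read), segment_length)]


def junction_evidence(segment_starts, segment_length=25):
    """Report the two segment patterns that suggest a nearby splice junction.

    segment_starts holds the genomic start of each segment, or None if unmapped.
    """
    evidence = []
    last = len(segment_starts) - 1
    # Case 1: an internal segment (neither first nor last) failed to map.
    for i, start in enumerate(segment_starts):
        if start is None and 0 < i < last:
            evidence.append(("unmapped_internal", i))
    # Case 2: two consecutive mapped segments are not adjacent on the genome.
    for i in range(last):
        a, b = segment_starts[i], segment_starts[i + 1]
        if a is not None and b is not None and b != a + segment_length:
            evidence.append(("non_contiguous_pair", i))
    return evidence


segments = split_read("ACGT" * 25)                    # a 100 bp read
print(len(segments))                                  # 4
print(junction_evidence([100, 125, None, 5175]))      # [('unmapped_internal', 2)]
print(junction_evidence([100, 125, 3150, 3175]))      # [('non_contiguous_pair', 1)]
print(junction_evidence([100, 125, 150, 175]))        # []
```

The sketch also makes the stated drawbacks visible: an exon shorter than one segment cannot produce a mapped segment of its own, and a segment containing two junctions simply appears as unmapped.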

MapSplice (186) uses a hybrid segmentation and seed-and-extend algorithm without gene

annotations to map reads across splice junctions. All of the reads are split into 25 bp non-

overlapping segments. The segments are mapped to the genome sequence and used as anchors

for a seed-and-extend approach to identify novel splice junctions. MapSplice is more sensitive at

splice junction discovery than Tophat2 since it does not rely solely on segment alignments.

Nonetheless, the majority of splice junctions are annotated and MapSplice does not take

advantage of gene annotations, which reduces its accuracy compared to Tophat2 with

annotations.

1.10.2 Seed and Extend Approaches

Seed-and-extend based RNA-seq alignment tools have extended the ideas of BLAT (161) and

increased their performance. The two most commonly used seed-and-extend based RNA-seq

aligners are GSNAP (188) and STAR (189). These two tools are both capable of mapping

paired-end reads and are sensitive to mismatches and splice junctions; however, they differ

greatly in their implementation, performance and gapped alignment capabilities.

GSNAP (188) was one of the first seed-and-extend based RNA-seq alignment tools and utilizes a

hash table with a k-size of 12 and step size of 3 by default. GSNAP does not have to load the

entire index in memory but instead uses disk-based memory-mapping leading to a very small

memory footprint. However, using disk based memory-mapped indexes does lead to a massive

performance decrease compared to tools that use in-memory indexes. GSNAP uses a hash table

to find candidate regions and merges the alignments to generate spliced alignments across GT-AG,

GC-AG and AT-AC splice junctions. GSNAP is capable of mapping reads with a single gap that

can be longer than those supported by most RNA-seq aligners with the exception of Tophat2.

STAR (189) is, at the time of writing, a state-of-the-art RNA-seq alignment tool that is both very fast

and sensitive. STAR uses a suffix array as its index, which permits very fast searches. Its main

drawback is that it requires more than 40GB of memory for a human sized genome. STAR uses

the suffix array to find Maximal Mappable Prefixes (MMPs) and iteratively build sets of candidate

alignment regions. Unlike most RNA-seq alignment tools, STAR processes both reads from a

pair at the same time when generating the regions. The regions are stitched together to produce

alignments that can identify non-canonical splice junctions, multiple mismatches and a single

gap. STAR is also the fastest RNA-seq aligner and can map more than 300 million read-pairs

per hour with 6 threads.
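
The idea of a Maximal Mappable Prefix can be illustrated with the toy Python sketch below. It is only a conceptual aid and not STAR's algorithm: it finds exact matches by naive scanning instead of a suffix array, ignores mismatches, scoring and paired reads, and uses hypothetical function names and sequences.

```python
def maximal_mappable_prefix(read, genome, start=0):
    """Longest prefix of read[start:] that occurs exactly in genome.

    Returns (length, positions); naive scanning stands in for a suffix array lookup.
    """
    length, positions = 0, []
    while start + length < len(read):
        candidate = read[start:start + length + 1]
        found = [i for i in range(len(genome) - len(candidate) + 1)
                 if genome.startswith(candidate, i)]
        if not found:
            break
        length += 1
        positions = found
    return length, positions


def mmp_chain(read, genome):
    """Repeatedly take the MMP of the unmapped remainder of the read."""
    pieces, start = [], 0
    while start < len(read):
        length, positions = maximal_mappable_prefix(read, genome, start)
        if length == 0:
            start += 1          # base not present in the genome at all; skip it
            continue
        pieces.append((start, length, positions))
        start += length
    return pieces


exon1, intron, exon2 = "TGCATCGAAG", "GTACCTTCAG", "CCATGGATTC"
genome = exon1 + intron + exon2
read = exon1[4:] + exon2[:6]     # spans the exon 1 / exon 2 junction

print(mmp_chain(read, genome))   # [(0, 6, [4]), (6, 6, [20])]
```

In this example the first prefix ends at genomic position 10 and the second begins at position 20, so the two pieces flank the GT...AG intron and could be stitched into a single spliced alignment.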

Neither GSNAP nor STAR executes full SW or NW alignments, so both may miss optimal

alignments. These tools tend to be sensitive; however, they both emit high numbers of

false-positive splice junction predictions and aligned mismatches (184).

1.11 Current challenges mapping RNA-seq pairs

Common RNA-seq alignment artefacts in combination with library preparation biases can

obscure biologically important signals. Alignment issues are caused by three primary

mechanisms: the interdependence between mismatch, gap and splice junction alignments on

accuracy; repeat alignment sensitivity; and resolving whether paired-end reads map

concordantly. The interdependence between gaps, splice junctions and mismatches is complex,

and there is a fine balance between preferring an alignment with mismatches, gaps, splice

junctions, or combinations of the above. For example, consider a contiguous alignment with a

mismatch or gap versus a splice junction with neither gaps nor mismatches. The latter could be a

false positive splice junction and the former could correspond to false positive SNVs. A false

negative splice junction prediction can lead to alignments with gaps and mismatches within an

intron sequence rather than a splice. These issues tend to be dependent on how the alignment tool

scores a splice junction, gap and mismatch. Alignment tools such as STAR (189) and Tophat2

(187) do not penalize GT-AG splice junctions. This can increase the false positive rate

for splice junctions, because a spliced alignment may be preferred over a contiguous alignment with mismatches. A low gap penalty can

also lead to incorrect alignments; for example, an aligner may prefer a GT-AG splice

junction combined with a gap over a non-canonical splice junction.
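
The dependence of the preferred alignment on the scoring scheme can be shown with a small worked example. The Python sketch below uses a deliberately simplified linear score; every penalty value is a hypothetical number chosen for illustration and does not reflect the defaults of STAR, Tophat2 or any other tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Penalties:
    mismatch: int = 4
    gap: int = 6
    canonical_junction: int = 0      # GT-AG junctions unpenalized in this scenario
    noncanonical_junction: int = 8


def score(read_length, mismatches=0, gaps=0, canonical=0, noncanonical=0,
          penalties=Penalties()):
    """Simplified alignment score: read length minus the sum of event penalties."""
    return (read_length
            - penalties.mismatch * mismatches
            - penalties.gap * gaps
            - penalties.canonical_junction * canonical
            - penalties.noncanonical_junction * noncanonical)


# Scenario 1: contiguous alignment with two mismatches vs. a clean GT-AG spliced alignment.
free_splice = Penalties()
costly_splice = Penalties(canonical_junction=12)
print(score(100, mismatches=2, penalties=free_splice),
      score(100, canonical=1, penalties=free_splice))     # 92 100 -> spliced wins
print(score(100, mismatches=2, penalties=costly_splice),
      score(100, canonical=1, penalties=costly_splice))   # 92 88  -> contiguous wins

# Scenario 2: GT-AG junction plus a gap vs. a non-canonical junction.
low_gap = Penalties(gap=6)
high_gap = Penalties(gap=10)
print(score(100, gaps=1, canonical=1, penalties=low_gap),
      score(100, noncanonical=1, penalties=low_gap))      # 94 92 -> GT-AG + gap wins
print(score(100, gaps=1, canonical=1, penalties=high_gap),
      score(100, noncanonical=1, penalties=high_gap))     # 90 92 -> non-canonical wins
```

In both scenarios the read and the genome are unchanged and only the penalties differ, yet the winning alignment switches, which is the interdependence described above.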

Eukaryotic genomes have multiple classes of repeat elements including retrotransposons, DNA

transposons, miniature inverted-repeat transposable elements, and paralogous genes. For

example, 12% of the C. elegans genome and 45% of the human genome is composed of

transposable elements (8, 190-192). Some of the transposon classes are extensively repeated

throughout the genome; for example, Alu elements represent ~10% of the human genome and

occur more than 1 × 10^6 times. The C. elegans genome contains 3,327 Ce000087 insertions

comprising ~1.32% of the genome (190). Accurately mapping reads derived from these repeat

elements is difficult but important for the identification of RNA editing. Incorrect handling of

repeat alignments, such as missed alignments to paralogous regions or repeat elements, has

led to the false positive identification of RNA edits (193). RNA-seq tools also tend to have low

repeat tolerances, preferring performance to sensitivity. Finally, as previously mentioned, many

RNA-seq alignment tools use a simple method to determine if read-pairs map concordantly,

which may lead to false positive read-pair alignments due to sequencing library preparation

artefacts or repeat regions.

1.11.1 Identifying RNA editing with RNA-seq

Adenosine deamination and cytidine deamination can be identified by aligning RNA-seq data to

the DNA sequence from the same individual. For example, A-to-I editing appears as an A-to-G

or T-to-C mismatch depending on the strand of the targeted RNA (194). However, the

differentiation between real RNA editing events and false positives is challenging. Mitigation of

common sequencing errors, alignment artifacts, and the removal of somatic DNA mismatches

are essential for the accurate identification of RNA edits (12). For example, the initial

identification of putative non-canonical RNA editing has more recently been demonstrated to

arise from false positives derived from sequencing and alignment artifacts (193, 195). Mismatch

tolerant alignments are also essential since RNA editing can occur in clusters of hyper-editing

with 30 A-to-I edits in a single 2x100 bp read (69).
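
The strand dependence of the observed mismatch can be expressed in a few lines. The Python sketch below is an illustration only; the function names are hypothetical, bases are given on the reference (forward) genomic strand, and the example says nothing about how candidate sites are subsequently filtered.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def rna_level_change(ref_base, read_base, transcript_strand):
    """Translate a mismatch seen on the forward genomic strand to the transcript strand."""
    if transcript_strand == "-":
        ref_base, read_base = COMPLEMENT[ref_base], COMPLEMENT[read_base]
    return f"{ref_base}>{read_base}"


def is_a_to_i_candidate(ref_base, read_base, transcript_strand):
    """Inosine is read as guanosine, so A-to-I editing appears as A>G on the transcript."""
    return rna_level_change(ref_base, read_base, transcript_strand) == "A>G"


print(is_a_to_i_candidate("A", "G", "+"))  # True: A-to-G on a forward-strand transcript
print(is_a_to_i_candidate("T", "C", "-"))  # True: T-to-C on a reverse-strand transcript
print(is_a_to_i_candidate("T", "C", "+"))  # False
```

A strand-aware check of this kind requires knowing the transcript strand, for example from a stranded library or from gene annotations; without it, A-to-G and T-to-C mismatches cannot be distinguished from one another.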

1.11.2 Identifying RNA edits without sequencing

The presence of inosine bases in an RNA molecule can be verified using a sequencing-based

method termed inosine chemical erasure (ICE) (196). Inosine bases are chemically modified with

acrylonitrile to produce N1-cyanoethylinosine. The cyanoethylated inosine residues lead to the

arrest of reverse transcriptase extension during cDNA synthesis. A typical experiment involves

reverse transcription with specific primers using a normal RNA sample and an acrylonitrile

treated sample. The cDNA is subsequently PCR amplified and both reactions are sequenced

using Sanger sequencing. If inosine is present in the targeted RNA the sequencing data should

detect A-to-G mismatches in the normal sample and an absence of A-to-G mismatches in the

acrylonitrile treated sample. The ICE method provides a complementary method to eliminate

false positive mismatches due to sequencing errors.

1.12 Thesis Objectives

My thesis is concerned with the development of methods for the accurate alignment of RNA-seq

pairs and the identification of SNVs and RNA editing. Reflecting these goals, my work has

three major objectives: 1) designing an accurate RNA-seq alignment system, 2) benchmarking

the system against current RNA-seq aligners and 3) utilizing this tool to identify RNA editing in

the nematode C. elegans. In Chapter Two of my thesis I focus on the development and

benchmarking of RNASequel, an RNA-seq alignment post-processing tool. The third chapter of

my thesis involves improving RNASequel's performance and using it to accurately profile RNA

hyper-editing in C. elegans.

In Chapter Two, I describe the development and implementation of RNASequel, a software

package that runs as a post-processing step in conjunction with an RNA-seq aligner and

systematically corrects common alignment artifacts. Its key innovations are a two-pass splice

junction alignment system that includes de novo splice junctions and the use of an empirically

determined estimate of the fragment size distribution when resolving read pairs. I demonstrate

that RNASequel produces improved alignments when used in conjunction with STAR (189) or

Tophat2 (187) using two simulated human datasets (184). In addition, I show that RNASequel

improves the identification of adenosine to inosine RNA editing sites on human-derived

biological datasets. The strength of this software lies in applications requiring the accurate

identification of variants in RNA sequencing data, the discovery of RNA editing sites and the

analysis of alternative splicing.

In Chapter Three, I report improvements to RNASequel and demonstrate the utility of the

program by identifying the most comprehensive map of C. elegans RNA editing to date. The

original version of RNASequel required large amounts of temporary disk space,

inhibiting its usefulness for the analysis of multiple datasets in parallel. In this chapter I describe

a modified version of RNASequel designed to eliminate its temporary disk space requirements;

this leads to a performance increase. The primary modification is that RNASequel uses BWA-

mem (180) as a software library rather than requiring four individual alignments per sample. I

used this new version to profile RNA hyper-editing in 91 C. elegans RNA-seq datasets. I used

this map of A-to-I editing to verify that edits are generally associated with non-coding RNA,

repeat elements and heterochromatin.

2 Accurate RNA-seq Realignment with RNASequel

This work was published in Nucleic Acids Research (197).

The current generation of RNA-seq paired-end aligners suffers from shortcomings that obscure

biologically important signals, or which give rise to false signals. For example, the initial

identification of putative non-canonical RNA editing has more recently been demonstrated to

arise from false positives derived from sequencing and alignment artifacts (193).

A typical RNA-seq experiment consists of sequencing both ends of a cDNA fragment to generate

two reads (a read pair) separated by a variable length of sequence. The accurate alignment of

these read pairs is essential to the downstream analysis of an RNA-seq experiment, but RNA-seq

read alignment is challenging due to the non-contiguous nature of mRNA transcripts (181).

Critically, RNA-seq aligners must be able to identify exonic alignments in regions that can be

interspersed with introns that can reach hundreds of kilobases in length (182). To solve this issue,

paired-end RNA-seq alignment methods typically apply a distance cutoff to exclude discordantly

mapped pairs. However, these cutoffs tend to be arbitrary and very liberal. For example, many

algorithms consider mapped pairs to be concordant up to a maximum distance of 500kb, which is

sufficiently high to catch the rare very long intron, but is also prone to classifying as concordant the

more common case of discordant pairs that are mapped incorrectly.

To facilitate the mapping of spliced reads while attempting to minimize common systematic

errors, various RNA-seq alignment methodologies have been developed. These methods include

tools that are dependent on, and optimized for, a specific short read alignment tool such as

Bowtie or BWA (156, 162, 178, 185, 186, 198-200). Other tools implement their own alignment

algorithms that may not be as accurate as traditional short read alignment tools, or which are less

tolerant to gaps and mismatches (188, 189). RNA-seq alignment methods also differ in their

usage of pre-existing splice junction databases. Most methods perform better when a splice

junction database is provided, but this hinders the identification of novel splice junctions, and

may not be feasible for less well-characterized species (184, 200). In addition, few splice

junction aware RNA-seq aligners are able to recognize and handle transcripts that span more

than one splice junction or contain a novel combination of existing junctions.

Other common artifacts that lead to issues with spliced alignments include 1) the identification of

false positive splice junction alignments due to short alignment overlaps on one side of the splice

junction, which is compounded by the reduction of base quality at read ends; 2) false positive

splice junctions due to reverse transcriptase and PCR template switching and splicing noise; and 3)

splice junctions that are missed because the read has been incorrectly aligned to an intron

sequence rather than across a splice junction (119, 120, 184, 200, 201). These artifacts contribute

to false positives for calling insertions, deletions, splice junctions and mismatches. For example,

many false positive sites in predicted RNA edits tend to be located near splice sites due to

incorrectly spliced alignments (12, 193). These are compounded by issues relating to library

preparation such as errors generated by reverse transcription and random hexamer priming (146).

In general, RNA-seq aligners have a low default tolerance for insertions, deletions and

mismatches, which together increase the number of unmapped bases (soft clipping) at read ends

and miss alignments to regions with a high mismatch rate. Finally, poor repeat tolerance can also

lead to false positive mismatch calls by aligning a read pair to one paralogous gene while

missing the alignment to another.

One common method to compensate for spliced alignment artifacts is to execute a two-pass

alignment scheme (185, 200). A two-pass alignment consists of two steps: 1) the alignment of the

reads to known splice junctions and the reference genome for the identification of novel splice

junctions; and 2) the generation of a new index file including all, or a subset of, high-confidence

novel splice junctions, against which the reads are realigned. This can drastically improve the spliced alignment of reads with short

exonic overlaps.
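
The selection of junctions between the two passes can be sketched as a simple filter. The Python example below is a hedged illustration: the observation tuples, the function name and both thresholds (`min_reads`, `min_overhang`) are hypothetical and are not taken from any published tool.

```python
from collections import defaultdict


def high_confidence_junctions(observations, annotated, min_reads=2, min_overhang=8):
    """Choose the junctions to include in the second-pass index.

    observations: (chrom, donor, acceptor, overhang) tuples from first-pass spliced
    alignments, where overhang is the shorter exonic overlap of the supporting read.
    Annotated junctions are always kept.
    """
    support = defaultdict(list)
    for chrom, donor, acceptor, overhang in observations:
        support[(chrom, donor, acceptor)].append(overhang)
    keep = set(annotated)
    for junction, overhangs in support.items():
        # Require several reads, at least one of which is well anchored on both sides.
        if len(overhangs) >= min_reads and max(overhangs) >= min_overhang:
            keep.add(junction)
    return keep


first_pass = [
    ("chr1", 100, 500, 20), ("chr1", 100, 500, 3),    # supported and well anchored
    ("chr1", 900, 1500, 4), ("chr1", 900, 1500, 5),   # only short overhangs
    ("chr2", 10, 90, 30),                             # a single supporting read
]
annotated = {("chr1", 2000, 2600)}

second_pass_index = high_confidence_junctions(first_pass, annotated)
print(sorted(second_pass_index))
# [('chr1', 100, 500), ('chr1', 2000, 2600)]
```

Note that the read with a 3 bp overhang across the `chr1:100-500` junction would be hard to place de novo, but once that junction is in the second-pass index it can be aligned contiguously to the synthetic junction sequence.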

To address the common causes of systematic artifacts in RNA-seq library preparation,

sequencing and alignment I have constructed an RNA-seq realignment program called

RNASequel. RNASequel utilizes the spliced-read output of any read mapper and de novo splice

junction identification algorithm to perform an error-tolerant realignment (Figure 2.1). It takes

advantage of an empirically determined fragment size distribution and annotated and novel splice

junctions to predict if a read pair maps concordantly. I have tested RNASequel against

STAR (189) and Tophat2 (187) for de novo splice junction prediction using real and simulated

datasets, and find increases in sensitivity and decreases in false positive predictions. I also show

that RNASequel has improved repeat alignment sensitivity that improves the identification of

potential single nucleotide variants and RNA editing sites.
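
The use of an empirically determined fragment size distribution to resolve read pairs can be illustrated with the toy Python sketch below. It is not RNASequel's implementation: the binning scheme, the probability floor, the function names and the example sizes are all assumptions, and fragment sizes are assumed to be measured in transcript coordinates (introns excluded).

```python
from collections import Counter


def fragment_size_probabilities(sizes, bin_width=50):
    """Empirical probability of each fragment size bin, from confidently mapped pairs."""
    counts = Counter(size // bin_width for size in sizes)
    total = sum(counts.values())
    return {bin_id: n / total for bin_id, n in counts.items()}


def best_pairing(candidates, probabilities, bin_width=50, floor=1e-9):
    """Pick the candidate (label, fragment_size) with the highest empirical probability."""
    return max(candidates,
               key=lambda c: probabilities.get(c[1] // bin_width, floor))


# Fragment sizes observed for pairs with a single unambiguous placement.
observed = [180, 190, 195, 200, 200, 205, 210, 220, 250, 300]
probabilities = fragment_size_probabilities(observed)

# Two candidate placements for the same read pair.
candidates = [
    ("mate aligned 48kb downstream, no junction", 48_215),
    ("mate aligned across a known junction", 215),
]
print(best_pairing(candidates, probabilities))
# ('mate aligned across a known junction', 215)
```

Both candidates would pass a fixed 500kb distance cutoff, whereas the empirical distribution strongly favours the placement whose implied fragment size matches the library.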

RNASequel is implemented in C++ and is available under the GNU General Public License from:

(https://github.com/GWW/RNASequel).

[Figure 2.1 flowchart: Spliced Read Aligner; Filter Splice Junctions (with Gene Annotations); Generate Splice Junction Database; Read 1 and Read 2 Genome and Spliced Alignments; Merge Read 1 and Read 2 Alignments; Estimate Fragment Size Distribution; Resolve and Output Read Pairs]

Figure 2.1. RNASequel realignment schematic. A spliced read aligner is used to identify sample-specific novel splice junctions that are used to generate a splice junction index. Read 1 and read 2 from each read pair are independently mapped to the genome and splice junction index using a contiguous read aligner. Low quality alignments are removed, the genomic and splice junction alignments are merged and the read pairs are resolved using an empirically determined fragment size distribution.


2.1 Results:

2.1.1 Developing an Accurate RNA-seq Realignment Tool.

I have developed RNASequel, an accurate and error-tolerant paired-end RNA-seq realignment

tool, which functions as a post-processing step attached to an RNA-seq alignment algorithm. My

implementation allows the user to utilize his or her preferred aligner, and future-proofs the tool:

it can be used to improve the accuracy of any current or future RNA-seq alignment software that

emits its results in standard BAM format. The tool refines the splice junction predictions prior to

realignment by removing junctions that experience has shown are likely to be false positives, for

example junctions found only in the end of reads or junctions found with repeat alignments. To

improve paired-end alignment accuracy the reads from each pair are independently mapped to

the genome sequence (genomic index) and a database of splice junctions (splice junction index)

(Figure 2.1). An advantage of aligning the reads independently to the genome and splice

junction index is the reduction of indexing time and the disk space usage, since indexing the

reference sequence can take a long time and require gigabytes of disk space while indexing the

RNASequel-generated splice junction database is comparatively fast and produces indexes

of ~100 MB. These four alignments can be performed in parallel using a computational cluster. The

genomic and splice junction database alignments for each read are merged and alignments are

discarded based on user-configured filtering parameters, which can be customized based on the

required repeat tolerance defined by the user. Lastly, I refine paired-end read analysis by

validating that each potential read pair alignment falls within an empirically determined fragment

size distribution. This is in contrast to most spliced alignment methods that consider a read pair

concordant if it aligns within a preset distance. The advantage of this method is that it improves

the detection of concordant read pairs and repeat-mapped pairs.
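
The concordance check against an empirical fragment size distribution can be sketched as follows. This is a minimal sketch under stated assumptions, not RNASequel's implementation: the function names, the percentile cutoffs (0.5 and 99.5) and the half-open intron coordinates are all hypothetical choices made for the example.

```python
from typing import Sequence, Tuple


def estimate_fragment_range(
    sizes: Sequence[int], lower_pct: float = 0.5, upper_pct: float = 99.5
) -> Tuple[int, int]:
    """Empirical fragment size interval taken from observed pair sizes.

    The percentile cutoffs are illustrative assumptions.
    """
    if not sizes:
        raise ValueError("need at least one observed fragment size")
    ordered = sorted(sizes)
    last = len(ordered) - 1
    lo = ordered[int(round(last * lower_pct / 100.0))]
    hi = ordered[int(round(last * upper_pct / 100.0))]
    return lo, hi


def spliced_fragment_size(
    start: int, end: int, introns: Sequence[Tuple[int, int]]
) -> int:
    """Genomic span [start, end) minus introns that lie fully inside it.

    Introns are half-open (start, end) intervals and assumed not to overlap.
    """
    size = end - start
    for intron_start, intron_end in introns:
        if start <= intron_start and intron_end <= end:
            size -= intron_end - intron_start
    return size


def is_concordant(fragment_size: int, size_range: Tuple[int, int]) -> bool:
    """A pair is concordant when its fragment size lies in the interval."""
    lo, hi = size_range
    return lo <= fragment_size <= hi
```

For example, a pair spanning 5,200 bp of genome across a 4,900 bp intron has an inferred fragment size of 300 bp, which a fixed genomic distance cutoff could not distinguish from a genuinely distant pair.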

2.1.2 RNASequel realignment leads to improved alignment accuracy.

To benchmark RNASequel realignment I tested two different de novo splice junction prediction

tools, Tophat2 and STAR with gene model annotations (Tophat2 Ann. and STAR Ann.) and

without annotations (Tophat2 and STAR). The novel splice junctions identified from each of

these tools were used for realignment with RNASequel. I also compared RNASequel realignment

against STAR with two passes where the splice junctions predicted in the first pass are used to

generate a new index for a second pass (STAR Two Pass and STAR Ann. Two Pass). Finally, to


benchmark RNASequel without de novo splice junctions, RNASequel was used with gene

annotations alone in a single pass alignment (Annotation Only). I chose Tophat2 because of its

popularity as one of the first RNA-seq alignment tools and STAR for its use within the

ENCODE project, its high accuracy and its fast alignment rate (184).

To determine the alignment characteristics of Tophat2, STAR and RNASequel I utilized two

simulated datasets that were previously used in an RNA-seq alignment benchmarking study

(184). Engström et al. generated the simulated alignments for both datasets using the

Benchmarker for Evaluating the Effectiveness of RNA-Seq Software (BEERS) (200). This

software was designed to avoid gene model biases and generate read-pairs

with a normal fragment size distribution, novel splice junctions, mismatches, and gaps. BEERS

combined gene annotations from 11 sources to generate alternatively spliced gene isoforms

including intron retention events. An empirical distribution of gene expression scores from a

biological dataset was used to determine the underlying expression of each of the genes. Read-

pairs were then randomly generated for each gene to mimic the randomly selected expression

level and to have a fragment size that matches a normal distribution. Mismatches, insertions, and

deletions were randomly introduced into the read-pairs at user-defined rates and additional

modeling of base-call errors and quality scores was performed by simNGS. Finally, one of the 11

annotation databases (Ensembl for this work) can be used to provide alignment tools with known

splice junctions.

Each dataset has roughly 3.7 x 10^7 2x76 bp read pairs (184). The second of the two simulated

datasets was generated with a higher mismatch (~3x more), gap (~5x more) and novel splice

junction rate (1.5x more).
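
Two ingredients of such a simulation, normally distributed fragment sizes and randomly placed mismatches, can be illustrated with a toy sketch. This is not BEERS or simNGS; the function names and all parameter values are hypothetical and serve only to make the idea concrete.

```python
import random
from typing import List, Tuple


def simulate_fragment_sizes(
    n: int, mean: float, sd: float, read_len: int, rng: random.Random
) -> List[int]:
    """Draw n fragment sizes from a normal distribution.

    Draws shorter than one read length are rejected and redrawn.
    """
    sizes: List[int] = []
    while len(sizes) < n:
        size = int(round(rng.gauss(mean, sd)))
        if size >= read_len:
            sizes.append(size)
    return sizes


def add_mismatches(seq: str, rate: float, rng: random.Random) -> Tuple[str, int]:
    """Substitute each base with probability `rate`; return (read, n_mismatches)."""
    bases = "ACGT"
    out = []
    n_mismatches = 0
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in bases if b != base]))
            n_mismatches += 1
        else:
            out.append(base)
    return "".join(out), n_mismatches
```

Because the true origin and the introduced errors of every read are known, each reported alignment can later be scored against this ground truth.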

The simulated datasets were generated to closely match a biological experiment; however, they

do fall short for the following two reasons: 1) the simulated alignment may not be the optimal

alignment due to a simulated read mapping to other genomic locations, and 2) the simulated

alignments do not include all of the artifacts in an RNA-seq experiment such as template

switching. Despite these shortcomings simulated datasets are important when benchmarking

RNA-seq tools because they provide a ground truth permitting the identification of correct and

incorrect alignments with combinations of splice junctions, mismatches and gaps.


Overall, RNASequel improved the number of reads that perfectly recapitulated the simulated

alignment; this was especially the case for the second simulated dataset (Figure 2.2A/B). For the

first simulated dataset RNASequel alignments produced the highest number of perfect

alignments, ~90% versus 80-87% for the other methods and with the second simulated dataset

RNASequel identified 12-20% more perfect alignments. The performance of the algorithms with

and without gene annotations was similar for both simulated datasets. Finally, Tophat2 had the

fewest partial alignments and the highest number of singleton alignments, likely due

to one read in the pair having more mismatches than Tophat2’s default cutoff. For both simulated

datasets RNASequel realignment demonstrated an increased repeat sensitivity; the number of

correct alignments to repetitive elements was typically ~4x higher for the first simulated dataset

and ~2x higher for the second dataset. This improved alignment accuracy is also reflected in

example regions from both simulated datasets (Figure 2.3, 2.4).


Figure 2.2. Simulated dataset alignment rates. Alignment rates as percentages of the total number of pairs for the first (A) and second (B) simulated datasets with the indicated alignment methods. For a description of the alignment types see the benchmarking methods description.


Figure 2.3. (A) Alignment view of chr1: 155105667 – 155110998 bp for simulated dataset 1 comparing STAR Ann. with STAR Ann. plus RNASequel. The color of each alignment indicates how the alignment compared to the true alignment as indicated by the legend. Read pairs that were perfectly aligned by both tools are not shown. (B) The summary for all of the alignments in the indicated region.


Figure 2.4. (A) Alignment view of chr20: 61439314 - 61475112 bp for simulated dataset 2 comparing STAR Ann. with STAR Ann. plus RNASequel. The color of each alignment indicates how the alignment compared to the true alignment as indicated by the legend. Read pairs that were perfectly aligned by both tools are not shown. (B) The summary for all of the alignments in the indicated region.


2.1.3 Realignment to a splice junction database improves spliced read accuracy

A major challenge for de novo splice junction identification is that a single pass alignment

scheme may incorrectly align reads with short exonic alignments because the true splice junction

has not been discovered. To mitigate this issue I applied a filtering scheme to identify and

remove false positives that occurred due to repetitive region mappings, splice junctions occurring

exclusively in the ends of a read and/or non-canonical splice motifs. To maximize my ability to

align reads across multiple splice junctions I supplemented the sample-specific splice junction index

with groups of novel and annotated splice junctions that could be spanned by a single sequencing

read. For both simulated datasets, realignment with RNASequel or STAR with two passes

increased the number of perfectly mapped spliced reads by 2-10% (Figure 2.5). When gene

annotations were present the number of perfect alignments increased by 4-10%. This was

particularly evident for reads that spanned multiple splice junctions, which demonstrates the

utility of my splice index alignment (Figure 2.6). RNASequel realignment had the lowest

number of incorrect spliced alignments and the highest number of perfect alignments compared

to STAR. The rate of incorrect alignments was higher when using Tophat2 for de novo splice

junction predictions. This may be due to Tophat2’s higher false negative rate. The importance of

including de novo splice junctions for alignment is highlighted by examining RNASequel using

only gene annotations, which had the highest number of incorrect spliced reads. The increase in

perfect spliced reads was more pronounced for the second simulated dataset, where the number of

perfect alignments was increased by ~10% and the number of failed alignments decreased by 5%

without annotations and 2% with annotations for RNASequel realignment versus STAR with two

passes. Overall, RNASequel realignment had the highest precision for both annotated and novel

splice junctions (Figure 2.7A/B, 2.8 A/B). For annotated splice junctions RNASequel

realignment had the highest recall for both simulated datasets and comparable precision. The

increase was small for the first simulated dataset, but 7-30% higher for the second simulated

dataset. As expected, the recall and precision were highest when gene model annotations were

supplied.
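
The three filtering criteria named in this section (support from repeat alignments only, support from read ends only, and non-canonical motifs) can be sketched as a simple predicate. The `JunctionEvidence` record, the `min_overhang` threshold and the motif set are assumptions made for illustration and are not taken from the RNASequel source.

```python
from dataclasses import dataclass

# Canonical and common semi-canonical donor-acceptor pairs (assumed set).
CANONICAL_MOTIFS = {"GT-AG", "GC-AG", "AT-AC"}


@dataclass
class JunctionEvidence:
    motif: str             # donor-acceptor dinucleotides, e.g. "GT-AG"
    unique_reads: int      # supporting reads that map uniquely
    repeat_reads: int      # supporting reads that are multi-mapped
    max_overhang: int      # largest shorter-side exonic overlap among reads
    annotated: bool = False


def keep_junction(
    junction: JunctionEvidence,
    min_overhang: int = 8,
    require_canonical: bool = True,
) -> bool:
    """Return True if a junction should enter the splice junction database."""
    if junction.annotated:
        return True                      # annotated junctions are always kept
    if junction.unique_reads == 0:
        return False                     # seen only in repeat alignments
    if junction.max_overhang < min_overhang:
        return False                     # seen only at the ends of reads
    if require_canonical and junction.motif not in CANONICAL_MOTIFS:
        return False                     # non-canonical splice motif
    return True
```

Raising `min_overhang` or requiring canonical motifs favors precision over recall, which is the tradeoff reported for novel junctions in this section.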


Figure 2.5. Spliced read alignment rates for the first simulated dataset (A) and the second simulated dataset (B). Perfect spliced alignments have all of the correct splice junctions, partial alignments have at least one correct splice junction and no incorrect splice junctions and failed alignments are unmapped reads or reads with an incorrect splice junction.
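
The three categories defined in this caption can be expressed as a small classifier. This is a sketch of the definitions only; the function name and the representation of junctions as hashable tuples are assumptions, and a mapped read that reports no junction at all is treated here as failed, which the caption does not state explicitly.

```python
from typing import Hashable, Iterable, Optional


def classify_spliced_read(
    true_junctions: Iterable[Hashable],
    called_junctions: Optional[Iterable[Hashable]],
) -> str:
    """Classify a spliced read as 'perfect', 'partial' or 'failed'.

    called_junctions is None for an unmapped read. Intended for reads
    with at least one true splice junction.
    """
    if called_junctions is None:
        return "failed"                 # unmapped read
    true_set = set(true_junctions)
    called = set(called_junctions)
    if called - true_set:
        return "failed"                 # at least one incorrect junction
    if called == true_set:
        return "perfect"                # every true junction recovered
    if called:
        return "partial"                # some correct, none incorrect
    return "failed"                     # mapped without any junction (assumption)
```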


Figure 2.6. The number of correct splice junctions identified in each read stratified by the total number of true splice junctions for the first simulated dataset (A) and the second simulated dataset (B). Colored bars indicate the number of correctly identified junctions.


Figure 2.7. Alignment characteristics for the first simulated dataset. The recall and precision as a percentage of the number of correctly aligned reads for annotated junctions (A), novel junctions (B), insertions (C), and deletions (D). The alignment algorithms used are indicated according to the legend and the arrows indicate the improvement by RNASequel and are colored according to the legend. (E) Receiver-operator curve (ROC) demonstrating the relationship of correctly called sequence variants (Y axis) to the number of falsely-called variants (X axis) for each read pair across each of the alignment methods. Note that the X-axis scale is false positive variant calls per 100,000 reads.


Figure 2.8. Alignment characteristics for the second simulated dataset. The recall and precision as a percentage of the number of correctly aligned reads for annotated junctions (A), novel junctions (B), insertions (C), and deletions (D). The alignment algorithms used are indicated according to the legend and the arrows indicate the improvement by RNASequel and are colored according to the legend. (E) Receiver-operator curve (ROC) demonstrating the relationship of correctly called sequence variants (Y axis) to the number of falsely-called variants (X axis) for each read pair across each of the alignment methods. Note that the X-axis scale is false positive variant calls per 100,000 reads.


For the identification of novel splice junctions, RNASequel had a slightly lower recall rate due to

my filtering scheme, but a ~3-5% higher precision than STAR for the first simulated dataset. The

slight decrease in recall and the increase in precision demonstrates the tradeoff when applying a

filtering scheme to novel splice junctions prior to realignment. For the second simulated dataset,

RNASequel realignment increased the recall by 6-23% and the precision by 2-4%. I examined

the false negative splice junction alignments, and observed that a large proportion of them (23-60%) were

within 15 bp of the 3’ end of the read sequence. These may have been missed due to the

simulated read quality degradation near the 3’ ends (Figure 2.9, Figure 2.10).

In summary, by generating a splice junction database and mapping the reads with an accurate

error tolerant realignment I have increased the splice junction accuracy, especially in the case of

datasets with high error rates.
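
Precision and recall as used throughout this section can be computed from the sets of true and called events. The sketch below works on sets of junctions; the function name is hypothetical, and the figures in this chapter count events per read, so this set-based version is a simplification.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall(
    true_items: Iterable[Hashable], called_items: Iterable[Hashable]
) -> Tuple[float, float]:
    """Return (precision, recall) for called events against the ground truth."""
    true_set = set(true_items)
    called = set(called_items)
    true_positives = len(true_set & called)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    return precision, recall
```

Filtering novel junctions before realignment removes called events: false positives leave the denominator of precision, while any true junction that is filtered lowers recall.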


Figure 2.9. The proportion of false positives (red), false negatives (blue) and true positives (green) by their position across each read sequence for junctions (first column), insertions (second column) and deletions (third column) from the first simulated dataset.


Figure 2.10. The proportion of false positives (red), false negatives (blue) and true positives (green) by their position across each read sequence for junctions (first column), insertions (second column) and deletions (third column) from the second simulated dataset.


2.1.4 RNASequel realignment improves alignments with insertions and deletions

Gapped alignments are a challenge for RNA-seq alignment. For example, a higher gap tolerance

threshold can result in additional false positive splice junction predictions by inserting a gap to

bridge an alignment to an incorrect splice junction. Furthermore, false positive gaps can be

inserted within an alignment that incorrectly aligns to an intronic sequence. To overcome this I

have combined RNASequel’s accurate splice junction indexing strategy with a gap tolerant

alignment using BWA-mem followed by a trimming of alignments that map to intron sequences.

Using this approach, RNASequel increased the gap recall by ~20% compared to STAR and

Tophat2 (Figure 2.7C/D, Figure 2.8 C/D). The insertion precision was comparable between all

of the methods used while the deletion precision after RNASequel realignment was ~20-25%

higher compared to STAR. For each of the alignment algorithms the false negatives for

insertions and deletions tended to occur in the first and last 10 bp of each read where aligners are

more likely to soft clip the alignment rather than insert a gap (Figure 2.9, Figure 2.10).

Intriguingly, STAR alignments produced a higher percentage of false positive deletions through

the middle of the read compared to Tophat2 and RNASequel realignment. Tophat2 had a slightly

higher false positive rate near the read ends due to using an underlying global rather than local

alignment algorithm.

The effect of RNASequel’s increased gap tolerance is to reduce read artifacts such as

mismatches and incorrect splice junction calls due to incorrect gapped alignment.
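
The idea of trimming alignment ends that run a few bases into an intron can be sketched as follows. The exact rules used by RNASequel are not described here, so this is an assumed formulation: the function name, the `max_overhang` threshold and the half-open genomic coordinates are hypothetical.

```python
from typing import Iterable, Tuple


def trim_intron_overhang(
    aln_start: int,
    aln_end: int,
    exon_ends: Iterable[int],
    exon_starts: Iterable[int],
    max_overhang: int = 8,
) -> Tuple[int, int, int, int]:
    """Clip short alignment ends that extend past a known exon boundary.

    The alignment covers [aln_start, aln_end). exon_ends are half-open exon
    end coordinates (the first intronic base); exon_starts are the first
    exonic bases after an intron. Returns the new start, the new end and the
    number of bases clipped on the left and right.
    """
    clipped_left = clipped_right = 0
    for exon_end in exon_ends:
        over = aln_end - exon_end          # bases aligned into the intron
        if aln_start < exon_end and 0 < over <= max_overhang:
            clipped_right = max(clipped_right, over)
    for exon_start in exon_starts:
        over = exon_start - aln_start      # intronic bases before the exon
        if exon_start < aln_end and 0 < over <= max_overhang:
            clipped_left = max(clipped_left, over)
    new_start = aln_start + clipped_left
    new_end = aln_end - clipped_right
    if new_start >= new_end:               # nothing would remain; leave as is
        return aln_start, aln_end, 0, 0
    return new_start, new_end, clipped_left, clipped_right
```

Clipped bases would be reported as soft-clipped rather than as mismatches or gaps, so they cannot contribute spurious variant calls at the intron boundary.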

2.1.5 RNASequel realignment increases mismatch tolerance and accuracy

High mismatch tolerance for RNA-seq alignment can lead to an increase in sensitivity, but it can

also lead to more false positive splice junction alignments or alignments that should be spliced

but are contiguously aligned into an intron sequence. The RNASequel splice junction filtering

step helps reduce some of these false positives while my attempt to trim alignments that overlap

splice sites near the read ends reduces many false positives. The simulated datasets are

dominated by alignments with low numbers of mismatches and to assess the performance of the

tools and RNASequel on read pairs with high and low levels of mismatches, I plotted the number


of true positive and false positive mismatches stratified by the true number of mismatches in

each read pair (Figure 2.7E, Figure 2.8E). RNASequel realignment had the highest mismatch

recall and precision compared to the other tools. Tophat2 had the lowest mismatch accuracy due

to a low mismatch tolerance by default. As observed in the splice junction and gap tests, the

majority of the false negative and false positive mismatches were near the ends of reads,

particularly the 3-prime end of the read (Figure 2.11). This is due to the higher number of

mismatches near the 3-prime end of the read from the simulated read quality degradation. It

should be noted that I could have improved the other tools’ accuracy by hand-optimizing their

alignment parameters, but I felt that the default parameters represented a typical laboratory use

case. Furthermore, adjusting the alignment tools' mismatch parameters may lead to undesirable

alignment artifacts, for example, a higher false positive spliced read alignment rate.
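
The stratification of mismatch calls by the true number of mismatches in each read pair can be sketched as follows. The input representation (sets of mismatch positions per pair) and the function name are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def stratify_mismatch_calls(
    pairs: Iterable[Tuple[Set[int], Set[int]]],
) -> Dict[int, Dict[str, int]]:
    """Tally true/false positive and false negative mismatches per stratum.

    Each element of `pairs` is (true_positions, called_positions) for one
    read pair; the stratum is the true number of mismatches in that pair.
    """
    table: Dict[int, Dict[str, int]] = defaultdict(
        lambda: {"tp": 0, "fp": 0, "fn": 0}
    )
    for true_positions, called_positions in pairs:
        row = table[len(true_positions)]
        row["tp"] += len(true_positions & called_positions)
        row["fp"] += len(called_positions - true_positions)
        row["fn"] += len(true_positions - called_positions)
    return dict(table)
```

Plotting cumulative true positives against false positives from the highest stratum downward gives curves comparable in spirit to panel E of Figures 2.7 and 2.8.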


Figure 2.11. The proportion of false positives (red), false negatives (blue) and true positives (green) mismatches by their position across each read sequence for the first simulated dataset (first column) and the second simulated dataset (second column).


2.1.6 RNASequel execution speed and memory requirements.

RNASequel realignment is reasonably fast. The splice junction index generation takes less than

15 minutes with a single thread of execution and less than 1 GB of memory. The reference and

splice junction alignment steps are dependent on the chosen alignment tool; for BWA-mem this

takes 2-3 hours per 100M reads with 16 threads for the reference alignment and 1 hour per 100M

reads for the splice junction index alignment. BWA-mem uses 40GB of memory for both

alignment types. The merge step processes ~35M pairs per hour with 8 threads and uses 20GB of

memory. It should be noted that all four of the BWA-mem alignments could be parallelized on a

computational cluster decreasing the RNASequel processing time substantially. As a comparison

STAR processes roughly 50M pairs per hour with 8 threads and ~60GB of memory. Tophat2

processes roughly 8M pairs per hour with 8 threads and <20GB of memory.
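
The rates quoted above can be combined into a rough wall-clock estimate. This back-of-envelope sketch assumes that the BWA-mem rates apply to the individual reads of one mate file, that the first-pass spliced aligner is timed separately, and that index generation takes its 15-minute upper bound; the function and parameter names are hypothetical.

```python
def estimate_hours(
    n_pairs: float,
    ref_hours_per_100m_reads: float = 2.5,
    parallel: bool = False,
) -> float:
    """Rough RNASequel realignment time in hours, excluding the first pass."""
    reads_per_mate = n_pairs                      # one read per pair in each mate file
    ref = reads_per_mate / 100e6 * ref_hours_per_100m_reads   # one mate vs genome
    junction = reads_per_mate / 100e6 * 1.0       # one mate vs junction index
    merge = n_pairs / 35e6                        # ~35M pairs per hour
    index = 0.25                                  # < 15 minutes
    # Sequentially, both mates need both alignments; on a cluster the four
    # alignments run side by side and the slowest one sets the pace.
    align = max(ref, junction) if parallel else 2 * (ref + junction)
    return index + align + merge
```

Under these assumptions, 100M pairs take roughly 10 hours when the four alignments run one after another and under 6 hours when they run in parallel, with the merge step then accounting for about half of the total.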

2.1.7 RNASequel realignment improves alignment characteristics on biological datasets.

Simulated datasets do not capture all of the potential sources of errors present in a biological

RNA-seq library. For example, there may be reads derived from spurious transcripts in non-

coding regions of the genome such as pseudogenes. There are also other sequencing errors

unique to a biological dataset such as reverse transcriptase template switching (119, 120). To

compare the alignment accuracy of RNASequel to Tophat2 and STAR, I applied my program to

three biological datasets, one derived from a lymphoblastoid cell line (YH) and two replicates

derived from the lymphoblastoid cell line GM12878 (142, 202). The YH RNA-seq sample used a

library that was poly(A) and ribosomal RNA depleted and was deeply sequenced to a depth of

~400M 2x90 bp pairs. The GM12878 samples were sequenced to a depth of ~100M 2x75 bp

poly(A) selected pairs. For all three samples RNASequel realignment led to the concordant

mapping of more read pairs. For the YH sample, realignment with RNASequel led

to the concordant mapping of ~90% of the read pairs while Tophat2 mapped ~60% and STAR

mapped ~84% (Figure 2.12A). For GM12878-1 the paired alignment rates were ~80% for STAR

and RNASequel while Tophat2 mapped ~48% (Figure 2.12B). Finally, for the GM12878-2

sample RNASequel mapped ~80% of the pairs, while STAR mapped ~70% and Tophat2 mapped

~45% (Figure 2.12C). In all three of the cases RNASequel identified 0.3% to 6% more pairs as

repeat mapping compared to STAR and Tophat2. For the YH dataset STAR with two passes

mapped a similar number of repeat pairs to RNASequel while mapping 2-3 times fewer for the


GM12878-1 dataset. To further investigate the read mapping improvements conferred by

RNASequel I compared STAR Ann. plus RNASequel to STAR Ann. with two passes with an

additional 25 poly(A) RNA-seq samples from the ENCODE project (Figure 2.13). On average

RNASequel mapped 2.75% more pairs and identified an average of 5.5% more repeat mapped

pairs.


Figure 2.12. Alignment rates for the YH (A), GM12878-1 (B) and GM12878-2 (C) RNA-seq datasets. Pairs were considered aligned if they were indicated as properly mapped by the alignment algorithm and discordant otherwise. Pairs where only one of the reads from the pair was mapped are indicated as singletons.


RNASequel realignment attempts to predict whether a read pair is concordant using the

empirically determined fragment size distribution, splice junction predictions and gene

annotations. To compare the effect of this on paired alignment I used my fragment size

determination algorithm on the alignments produced by STAR and Tophat2 to predict whether

the paired alignments have a valid fragment size using the junctions predicted by the tool and

gene annotations. I found that ~1-2% of the pairs uniquely mapped by STAR and Tophat2 had a

fragment size outside of the empirical range determined by my algorithm (Figure 2.14). It

should be noted that Tophat2 does take advantage of a user-provided fragment size mean and

standard deviation. These numbers were also similar for repeat alignments where all or a subset

of the alignments had a fragment size that was not within the empirically determined

distribution. These represent a small proportion of the alignments and include cases where the

fragment size was outside of the tail of the fragment size distribution, cases with missing splice

junction annotations, and false positive alignments. For STAR, ~60-80% of the unique pairs and

~20-40% of the repeat pairs fall within 50 bp of my confidence interval. However, these alignments

can contribute to artifacts in downstream analysis, especially when identifying variant or RNA

editing calls.

Figure 2.13. Comparing paired-end read alignment rates for 25 ENCODE RNA-seq datasets with STAR Ann. with two passes and STAR Ann. with RNASequel. The types of alignments are indicated by the legend.


Figure 2.14. Application of the RNASequel fragment size estimation and verification algorithm to the alignments produced by Tophat2 and STAR. The percentage of the total number of read-pairs in the YH (A), GM12878-1 (B) and GM12878-2 (C) RNA-seq dataset is indicated. For unique alignments where the alignment did not fall within the empirically estimated range (Unique -> Unmapped), for repeat alignments where all of the alignments failed to fall within the estimated range (Repeat -> Unmapped), for repeat alignments where all but one of the alignments did not fall within the range (Repeat -> Unique), and cases where a subset of the repeat alignments did not fall within the range (Repeat -> Repeat).
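
The four categories in this caption follow from counting how many candidate alignments of a pair pass the fragment size check. The sketch below encodes that logic; the function name and the "Unchanged" label for pairs whose alignments all pass are assumptions added for completeness.

```python
from typing import Sequence, Tuple


def reclassify_pair(
    fragment_sizes: Sequence[int], size_range: Tuple[int, int]
) -> str:
    """Reclassify a read pair after fragment size verification.

    fragment_sizes holds one inferred fragment size per candidate alignment
    of the pair (one entry for a unique pair, several for a repeat pair).
    """
    if not fragment_sizes:
        raise ValueError("pair has no alignments")
    lo, hi = size_range
    total = len(fragment_sizes)
    passing = sum(lo <= size <= hi for size in fragment_sizes)
    if passing == total:
        return "Unchanged"
    if total == 1:
        return "Unique -> Unmapped"
    if passing == 0:
        return "Repeat -> Unmapped"
    if passing == 1:
        return "Repeat -> Unique"
    return "Repeat -> Repeat"
```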


2.1.8 RNASequel realignment generates more robust RNA editing calls.

In vertebrates, the ADAR family of enzymes is responsible for the conversion of adenosine to

inosine (A-to-I) in RNA (203). This type of RNA editing is thought to be used as a regulatory

mechanism (204). In RNA-sequencing, A-to-I edits manifest either as A-to-G or T-to-C

substitutions depending on the strand of the transcript. The identification of RNA editing sites

using RNA-seq is difficult due to a number of sequencing and alignment artifacts. To

demonstrate the degree to which RNASequel realignment improves RNA editing calls I

compared my realignment algorithm with Tophat2 and STAR with and without gene model

annotations. The potential nucleotide changes were then filtered to remove common sources of

false positives including alignments to tandem repeats and changes biased to the ends of reads. I

removed somatic polymorphisms (if available) or common polymorphisms in dbSNP (if genome

annotations were not available). Prior to filtering, for the YH and GM12878 datasets RNASequel

realignment yielded comparable numbers (+/-1-3%) of A-to-I changes to STAR. Compared to Tophat2,

RNASequel yielded 20% more A-to-I changes for the YH dataset and 8-10% fewer changes for

the GM12878 datasets (Figure 2.15, 2.16, 2.17). For non-A-to-I changes RNASequel yielded 4-

11% fewer changes compared to STAR and 23-40% fewer compared to Tophat2. I also

compared the total SNV calls between STAR Ann. with RNASequel and STAR Ann. with two

passes for 25 additional ENCODE RNA-seq samples. I found an average decrease in the number

of A-to-I calls by 0.52% and a decrease in non-A-to-I calls by 3.7% (Figure 2.18, 2.19). These

results suggest that RNASequel realignment leads to fewer potential false positives prior to

filtering than STAR and Tophat2. These results are also consistent with my simulated dataset

results that demonstrated the reduction in false positive mismatch calls facilitated by RNASequel

realignment compared to Tophat2 and STAR.
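
Two of the rules used in this section, the strand-dependent appearance of A-to-I edits and the coverage and allele frequency thresholds given in the captions of Figures 2.15 to 2.18, can be sketched as follows. The function names are hypothetical, and the handling of sites with unknown strand is an assumption.

```python
def is_a_to_i_candidate(ref_base: str, alt_base: str, transcript_strand: str) -> bool:
    """A-to-I edits appear as A-to-G on the + strand and T-to-C on the - strand."""
    change = (ref_base.upper(), alt_base.upper())
    if transcript_strand == "+":
        return change == ("A", "G")
    if transcript_strand == "-":
        return change == ("T", "C")
    # Strand unknown: accept either orientation (assumption).
    return change in {("A", "G"), ("T", "C")}


def passes_thresholds(
    coverage: int,
    alt_count: int,
    min_coverage: int = 10,
    min_alt_fraction: float = 0.10,
) -> bool:
    """Require 10x coverage and at least 10% alternative allele frequency."""
    return coverage >= min_coverage and alt_count / coverage >= min_alt_fraction
```

A T-to-C change on a plus-strand transcript is therefore counted as a non-A-to-I change, which is why the non-A-to-I category is a useful proxy for false positives.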

After filtering potential false positives I observed that RNASequel and STAR found similar

somatic SNV calls (~1% difference) (Figure 2.15, 2.16, 2.17). For Tophat2 alignments

RNASequel realignment yielded 20-40% more somatic SNV and dbSNP calls. I also observed an

average 3.1% reduction in dbSNP calls for ENCODE samples (Figure 2.18, 2.19). For A-to-I

calls I observed a comparable number of changes between STAR and RNASequel for the YH

dataset (~0.1-1% increase after realignment) and ~4-10% fewer changes for the GM12878

datasets. For Tophat2 alignments I found 2-3 times as many A-to-I calls. For non-A-to-I

changes I observed a 15-25% decrease in the number of calls after RNASequel realignment

compared to STAR and 1.4-3 times as many compared to Tophat2. For the 25 ENCODE

datasets I found an average of 7.3% fewer A-to-I changes and 10.4% fewer non-A-to-I changes.

Combined together the simulated and biological results suggest that RNASequel realignment

yields fewer false positive SNV calls compared to STAR due to RNASequel realignment

reducing the number of non-A-to-I changes. Furthermore, for the YH dataset I found more

somatic SNVs, suggesting an improved false negative rate compared to STAR and Tophat2.

Tophat2 uses a global alignment algorithm and low mismatch tolerance that leads to a higher

false negative rate for reads with more than 2 mismatches and a higher false positive rate at the

read end for reads with few mismatches. In conclusion, RNASequel realignment demonstrates a

general reduction in the number of false positives with minimal effect on the false negative rate.


Figure 2.15. YH SNV and edit call comparisons. (A) The number of unfiltered changes with 10x coverage and at least 10% alternative allele frequency for A-to-I (Total A-to-I) and non-A-to-I (Total Non-A-to-I), the total number of calls that overlapped a genomic SNV (Genome), the number of retained A-to-I changes and non-A-to-I changes. (B) The difference in the number of calls for alignments with STAR compared to alignments with RNASequel and the difference after repeats and pairs with incorrect fragment sizes were removed (labeled with clean suffix). For STAR with two passes, the alignment rate for RNASequel with STAR as a single pass is used for comparison.



Figure 2.16. GM12878-1 SNV and edit call comparisons. (A) The number of unfiltered changes with 10x coverage and at least 10% alternative allele frequency for A-to-I (Total A-to-I) and non-A-to-I (Total Non-A-to-I), the total number of calls that overlapped a genomic SNV (Genome), the number of retained A-to-I changes and non-A-to-I changes. (B) The difference in the number of calls for alignments with STAR compared to alignments with RNASequel and the difference after repeats and pairs with incorrect fragment sizes were removed (labeled with clean suffix). For STAR with two passes, the alignment rate for RNASequel with STAR as a single pass is used for comparison.



Figure 2.17. GM12878-2 SNV and edit call comparisons. (A) The number of unfiltered changes with 10x coverage and at least 10% alternative allele frequency for A-to-I (Total A-to-I) and non-A-to-I (Total Non-A-to-I), the total number of calls that overlapped a genomic SNV (Genome), the number of retained A-to-I changes and non-A-to-I changes. (B) The difference in the number of calls for alignments with STAR compared to alignments with RNASequel and the difference after repeats and pairs with incorrect fragment sizes were removed (labeled with clean suffix). For STAR with two passes, the alignment rate for RNASequel with STAR as a single pass is used for comparison.


Figure 2.18. Comparing SNV and edit call rates for STAR Ann. with two passes and STAR Ann. + RNASequel. The number of unfiltered changes with 10x coverage and at least 10% alternative allele frequency for A-to-I (Total A-to-I) and non-A-to-I (Total Non-A-to-I), the total number of calls that overlapped a dbSNP entry (dbSNP), the number of retained A-to-I changes and non-A-to-I changes for the 25 ENCODE RNA-seq samples.


Finally, to explore the features of RNASequel realignment that led to improved SNV and RNA

editing calls I assessed the impact of RNASequel’s improved repeat sensitivity and fragment size

estimation algorithms. To assess the impact of repeat mapped reads on calling of RNA editing

sites, I collected the union of reads that mapped across any of the variant sites by either the base

alignment program or the alignment program with RNASequel. Read pairs that were multi-

mapped by one tool and uniquely mapped by the other were discarded and the edit sites were

called and filtered again. To assess the impact of my fragment size determination algorithm on

identifying concordant read pairs I removed uniquely mapped reads that did not have a valid

fragment size as determined by my algorithm. I found that within the union of SNV supporting

alignments RNASequel marked 4-10% of the reads as multi-mapping, compared to STAR and

Tophat2, which marked ~1% as multi-mapping (Figure 2.20). STAR with two passes had 2.7%

multi-mapped read rate for the YH sample. I also identified more alignments marked as singleton


Figure 2.19. Comparing the differences in SNV and edit call rates for STAR Ann. with two passes and STAR Ann. + RNASequel before (All Pairs) and the difference after repeats and pairs with incorrect fragment sizes were removed (Cleaned Pairs). The difference in the number of unfiltered changes with 10x coverage and at least 10% alternative allele frequency for A-to-I (Total A-to-I) and non-A-to-I (Total Non-A-to-I), the total number of calls that overlapped a genomic SNV (Genome), dbSNP entry (dbSNP), the number of retained A-to-I changes and non-A-to-I changes for the 25 ENCODE RNA-seq samples.

63

compared to STAR (1-10% versus 0%) and fewer than Tophat2 (1-9% versus 17-35%). For the

25 ENCODE samples I observed an average of 13% multi-mapped reads with RNASequel

versus 1.3% with STAR with two passes (Figure 2.21). RNASequel realignments mapped more

pairs where 0.6-4% of the reads were unmapped compared to STAR and Tophat2 where 4-25%

of the pairs were unmapped. RNASequel also mapped more of the alignments than STAR for the
25 ENCODE datasets (1% versus 4% unmapped). A portion of the alignments identified by STAR as

concordant pairs were marked as discordant pairs by RNASequel (0.1-0.8% of the alignments).

Tophat2 marked the highest proportion of reads as discordant but this was also the case for the

simulated and full set of reads for the biological datasets. Finally, my fragment size estimation

algorithm identified ~1% of the reads mapped as unique by STAR or Tophat2 as being

discordant. After removing the alignments that were marked as unique by STAR and reads

marked as discordant with my fragment size determination algorithm the difference in the

number of calls was lessened or increased in favor of RNASequel (Figure 2.15, 2.16, 2.17,

2.19). For example, the number of non-A-to-I edits identified by STAR is reduced after

removing reads that were marked as repeat mapping by either STAR or RNASequel.
Collectively, these results imply that the improvements in alignment characteristics,
particularly increased repeat sensitivity and improved identification of concordantly mapped read
pairs, lead to an improved alignment for the purposes of calling SNVs and RNA edits.


[Figure 2.20, panels A, B and C. Alignment categories shown: Repeat Pair (%), Discordant Pair (%), Fragment Fail (%), Singleton (%), Repeat Singleton (%), Unmapped (%)]

Figure 2.20. Comparison of the alignment type between the union of all reads that support a genomic SNV, dbSNP entry, retained A-to-I change, or retained non-A-to-I change for YH (A) GM12878-1 (B) and GM12878-2 (C). The bar on the left indicates the percentage of alignment types for the labeled tool, the bar on the right indicates the alignment rate for the tool with RNASequel realignment. For STAR with two passes, the alignment rate for RNASequel with STAR as a single pass is used for comparison.


[Figure 2.21. Alignment categories shown: Repeat Pair (%), Discordant Pair (%), Fragment Fail (%), Singleton (%), Repeat Singleton (%), Unmapped (%)]

Figure 2.21. Comparison of the alignment type between the union of all reads that support a genomic SNV, dbSNP entry, retained A-to-I change, or retained non-A-to-I change for the 25 ENCODE RNA-seq datasets. The type of alignment being measured is indicated on the y-axis and each bar represents the percentage of the union of all reads.


2.2 Discussion:

By systematically mitigating common artifacts that occur during RNA-seq library preparation

and alignment, RNASequel increases the recall of splice junctions, gaps and mismatches while

decreasing the false discovery rate. When applied to the challenging problem of RNA editing

identification, the RNASequel post-processing method reduces the number of apparent false

positives without adversely affecting sensitivity. I have found that using RNASequel in

combination with STAR provides the best accuracy metrics. Crucially, I show that despite my

higher error tolerance, I identify fewer non-canonical edits compared to STAR on a biological

dataset. This implies that many potential RNA editing calls are due to systematic alignment

errors that can be mitigated with RNASequel realignment thereby strengthening the

interpretation of biological datasets. STAR is also preferred because it has better performance

characteristics than Tophat2. RNASequel realignment is agnostic to the underlying aligners used

for splice junction prediction and contiguous read alignment leading to an adaptable RNA-seq

alignment tool that can take advantage of new alignment methods. In future work, I plan to
investigate methods to improve the performance and disk space usage of RNASequel by
calling the underlying contiguous aligner as a library. I also plan to investigate methods to capture

aligned pairs that fall within the tail of the fragment size distribution to increase the number of

concordantly mapped pairs. The improvements facilitated by RNASequel realignment are useful

for the analysis of alternative splicing, gene and isoform expression, sequence variant calling and

RNA editing.

2.3 Methods:

2.3.1 Reference genome and annotations

I downloaded human genome build GRCh37 reference sequences and annotations from the

UCSC Genome Browser (genome.ucsc.edu) and created a gene annotation GTF file

(http://www.ensembl.org/info/website/upload/gff.html) from the knownGene and

knownIsoform tables (205). Reference sequences and annotations for chromosomes 1-22, X, Y

and M were used for GTF and fasta sequence construction.


2.3.2 Biological Datasets:

The lymphoblastoid derived ribominus and poly(A) depleted RNA-seq datasets were

downloaded from SRA043767 at the NCBI short read archive (142). The individual lanes were

merged together, yielding a total of 421,836,549 2x90 base pair reads. A GFF list of genomic
single nucleotide variants for the individual from whom the cell line was derived was downloaded from
(206): http://yh.genomics.org.cn/download.jsp

The GFF was lifted over to hg19 using the UCSC LiftOver tool and the hg18 to hg19 LiftOver

chain (205).

ENCODE long poly(A) paired-end RNA-seq datasets (202) for 27 samples were downloaded

from:

http://hgdownload.cse.ucsc.edu/goldenPath/mm9/encodeDCC/wgEncodeCshlLongRnaSeq/

Each replicate was analyzed independently.

GM12878 genome raw sequencing pairs were downloaded from the 1000 genomes project (207)

using SRA accession ERX000170 and mapped to the human reference genome using BWA mem

and the following parameters: “-k 15 -a -B 2 -M”. The alignment was piled up and any change
with at least 10x coverage and a 10% alternative allele frequency was retained.

2.3.3 Alignment Protocols:

I used STAR (189) version 2.3e and Tophat2 (187) version 2.10 with Bowtie2 (156) 2.1.0 for

spliced alignment with their default parameters with the following exceptions: for STAR

alignments with annotations the UCSC “genes.gtf” was provided with the “--sjdbGTFfile”

parameter. For two pass STAR alignments the “SJ.out.tab” file produced by the first pass was

provided to the genome generation step with the “--sjdbFileChrStartEnd” parameter.

For Tophat2 alignments the insert size was specified as “-r 78 --mate-std-dev 30” for the

simulated datasets, “-r 20 --mate-std-dev 40” for the lymphoblastoid dataset, and
“-r 59 --mate-std-dev 40” for the GM12878 datasets. For annotated Tophat2 alignments the genes.gtf file

was provided using the “-G” parameter.


2.3.4 RNASequel Realignment

To generate a splice junction index the following command was used:

rnasequel transcriptome -g genes.gtf -r genome.fa -n {read size} -b denovo_alignment.bam -o tx

The transcriptome index consists of two files: a text file containing the junction locations relative to
the genome sequence and a fasta file containing the junction sequences. The read size used for index
generation is specified with the -n option. For alignments using only de novo splice junctions
(no annotations) the -g option was not included, and for alignments using annotations alone the -b
option was not included.

The individual reads from each pair were then mapped using BWA mem version 0.7.8 (198)

with the commands indicated below. It should be noted that these commands could be run in

parallel in a computational cluster.

BWA-mem indexing:

bwa index genome.fa

bwa index tx.fa

Reference genome alignment:

bwa mem -L 2,2 -k 15 -a -t 8 -B 2 {index} {reads1 or 2} | samtools view -bS - > {ref 1 or 2.bam}

Splice junction index alignment:

bwa mem -L 2,2 -c 20000 -M -k 15 -a -t 8 -B 2 tx.fa {read 1 or 2} | samtools view -bS -F 4 - > {juncs 1 or 2.bam}

Merging the pairs:


rnasequel merge -r genome.fa -g genes.gtf -j tx.txt ref1.bam -o align.bam ref2.bam juncs1.bam juncs2.bam

2.3.5 Splice Junction Definitions and Alignment Scoring

I defined a canonical splice junction as any splice junction with the following motifs: GT-AG,

GC-AG, GC-TG, GC-AA, GC-GG, GT-TG, GT-AA, AT-AC, AT-AA, and AT-AG. The strand

of a splice junction was inferred based on gene annotations and the aforementioned splicing

motifs. Alignments were scored using the following scoring penalties: gap open = -8, gap

extension = -1, splice junction = -4, match = 3, mismatch = -3. For spliced alignments an extra

alignment penalty was added for each splice junction. A penalty of -3 was applied for GT-AG

splice junctions, -6 for other canonical splice motifs and -9 for all other splice motifs. To reduce

the chances of choosing an alignment with a long intron over an alignment with a shorter intron

and a lower score I applied an additional penalty for splice junctions with introns over a pre-

defined length (arbitrarily set at 64kb by default). For these long introns I applied a penalty of
-(log2(isize) - 12), where isize is the size of the intron.
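The scoring scheme above can be illustrated with a minimal Python sketch. This is not the RNASequel source code: the function names are hypothetical, "64kb" is read as 64 x 1024 bp, and the per-junction penalty of -4 is assumed to be added to the motif-specific penalty rather than replaced by it.

```python
import math

# Scoring constants quoted from the text.
MATCH, MISMATCH = 3, -3
GAP_OPEN, GAP_EXTEND = -8, -1
SPLICE_JUNCTION = -4
CANONICAL_MOTIFS = {
    "GT-AG", "GC-AG", "GC-TG", "GC-AA", "GC-GG",
    "GT-TG", "GT-AA", "AT-AC", "AT-AA", "AT-AG",
}
# Assumption: the 64kb default is interpreted as 64 * 1024 bp.
LONG_INTRON = 64 * 1024


def junction_penalty(motif, intron_size):
    """Return the (negative) penalty for a single splice junction."""
    penalty = SPLICE_JUNCTION
    if motif == "GT-AG":
        penalty += -3
    elif motif in CANONICAL_MOTIFS:
        penalty += -6
    else:
        penalty += -9
    # Extra penalty for long introns: -(log2(isize) - 12).
    if intron_size > LONG_INTRON:
        penalty += -(math.log2(intron_size) - 12)
    return penalty


def alignment_score(matches, mismatches, gap_opens, gap_extensions, junctions=()):
    """Score an alignment; junctions is an iterable of (motif, intron_size)."""
    score = (matches * MATCH + mismatches * MISMATCH
             + gap_opens * GAP_OPEN + gap_extensions * GAP_EXTEND)
    for motif, intron_size in junctions:
        score += junction_penalty(motif, intron_size)
    return score
```

Under these assumptions a 90 bp read that matches perfectly across one GT-AG junction with a short intron scores 90 x 3 - 4 - 3 = 263.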

2.3.6 Splice Junction Discovery and Splice Junction Index Generation

The splice junction databases combined reference annotations (if available) and novel splice

junction predictions from Tophat2 or STAR (if used). Only the novel splice junctions meeting

the following criteria were retained (used for analysis): 1) The splice junction must be observed

at least 8 bp away from the ends of at least one read; 2) there are at least 2 different alignment

positions mapping across the pair; 3) the predicted intron size is at least 21 bp and no more than

500 kb in length. For each novel junction I added to the database N base pairs of flanking

sequence on each side of the junction, where N should be chosen based on the size of a

sequencing read, for my case I used 76 or 90. To handle cases in which a read could span

multiple splice junctions, I supplemented my index by including multiple splice junctions on the

same annotated strand if a sequence of length N could span one or more downstream junctions.

Splice junctions with an ambiguous strand were considered on both strands. Finally, redundant

sets of spanning splice junctions were removed to minimize the database size. The splice

junction index can then be used with any contiguous read mapper.
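The three retention criteria for novel splice junctions can be written as a short predicate. The sketch below is illustrative only: the `NovelJunction` record and its field names are hypothetical, and 500 kb is taken to mean 500,000 bp.

```python
from dataclasses import dataclass


@dataclass
class NovelJunction:
    intron_size: int         # predicted intron length in bp
    best_overhang: int       # over all supporting reads, the largest distance (bp)
                             # between the junction and the nearest read end
    distinct_positions: int  # number of different alignment positions across the junction


def keep_novel_junction(junction, min_overhang=8, min_positions=2,
                        min_intron=21, max_intron=500_000):
    """Apply the three filtering criteria described in the text."""
    return (junction.best_overhang >= min_overhang
            and junction.distinct_positions >= min_positions
            and min_intron <= junction.intron_size <= max_intron)
```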


2.3.7 Contiguous and Spliced Read Alignment

For mapping reads to the GRCh37 reference genome (contiguous alignments) and splice junction

indexes, I chose BWA-mem version 0.7.8 for its speed and accuracy. However any read mapper

can be used. Read 1 and read 2 from each pair were independently mapped to the reference

genome and the splice junction index. For each splice junction alignment, I resolved the

alignments back to the genomic co-ordinates and removed contiguous alignments. To avoid

alignment artifacts that occur due to reads improperly aligning to intronic sequences, alignments

were trimmed if they overlapped a splice site within 6 base pairs of the end of the alignment. For

each alignment the score was calculated as described above and I defined the minimum

alignment score to be 2×(!"#$%&' !"#$#); any alignments with a score less than this were

discarded. The retained alignments for read 1 and read 2 were then paired by identifying every

potential alignment combination that matched the following criteria: 1) the alignments were on

the same chromosome; 2) the alignments were in the correct orientation; and 3) the distance

between the read pair was less than 1Mb.

2.3.8 Estimating the Empirical Fragment Size Distribution

As noted earlier, the current generation of RNA-seq aligners use an arbitrary cutoff to remove

read pairs that map too far away from each other. RNASequel uses two different methods to

solve this problem in a more disciplined manner. Only read-pairs that mapped uniquely after

discarding alignments that had a score less than (the highest alignment score) – 12 were used for

fragment size estimation. I used a score difference of twelve, which equates to four mismatches

with my default mismatch penalty of three. This number can be adjusted if an increased repeat

sensitivity is desired. In the case in which a gene annotation file is available for the organism

under study, I estimated the expected fragment size distribution from the annotated gene model

introns. For organisms with gene annotations I identified pairs that mapped to long exons (>250

bp) that should be larger than the insert size of the library or pairs that mapped to single isoform

genes (115). In the case in which gene annotations were unavailable, I used maximum distance

criteria of 1,500 bp between the read pairs. In both cases I set a size cutoff to 1,500 bp and

required at least 100,000 fragment size observations. Both methods for estimating the fragment

size distribution may lead to rare cases where an intron is retained and the fragment size is

overestimated. Moreover, there is the possibility that a small fraction of the read-pairs are


aligned incorrectly with long fragment sizes. To compensate, the empirical distribution was

normalized and a confidence interval retaining the smallest 99% of the observations was applied.
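As a rough illustration of the last step, the following Python sketch builds an empirical distribution from observed fragment sizes and finds the size below which the smallest 99% of observations fall. The function name and return format are hypothetical, and the selection of suitable read pairs (long exons, single-isoform genes) is assumed to have happened upstream.

```python
from collections import Counter


def fragment_size_cutoff(sizes, max_size=1500, min_observations=100_000,
                         keep_fraction=0.99):
    """Return (distribution, cutoff), or None when too few observations exist.

    distribution maps each retained fragment size to its fraction of all
    observations; cutoff is the smallest size at which the cumulative
    fraction reaches keep_fraction.
    """
    counts = Counter(s for s in sizes if 0 < s <= max_size)
    total = sum(counts.values())
    if total == 0 or total < min_observations:
        return None
    cumulative = 0
    cutoff = None
    for size in sorted(counts):
        cumulative += counts[size]
        if cumulative / total >= keep_fraction:
            cutoff = size
            break
    distribution = {s: c / total for s, c in counts.items() if s <= cutoff}
    return distribution, cutoff
```

Observations above the cutoff are simply dropped in this sketch, which mirrors the intent of discarding rare, overestimated fragment sizes.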

2.3.9 Resolving Read Pair Alignments

To identify potential concordant read pairs I examined all of the potential combinations between

the alignments for read 1 and read 2 that were correctly oriented, mapped to the same

chromosome, and were less than 1 Mbp apart. For each of these potential pairs every potential

fragment size using different combination of splice junctions between the pairs were compared

to the empirically determined fragment size distribution. Each potential fragment was then

assigned a score of 10 x | (normalized fragment distribution score) / (maximum fragment

distribution score) | + (read 1 alignment score) + (read 2 alignment score). The highest scoring

pair was marked as primary; any pair with a score difference of less than twelve was marked as

secondary and the remaining alignments were discarded. The score difference when calling

repeat alignments should be carefully chosen based on the desired repeat tolerance, for my

purpose I found that twelve, which is equivalent to a difference of 4 mismatches was reasonable.

If no valid pairs were found using the fragment size distribution and the potential read pair was

uniquely aligned it was outputted and marked as discordant. Furthermore, I implemented two

different fallback methods depending on whether or not gene annotations were provided. Both of

these methods are optional and deactivated by default. In the case where gene annotations were

provided if both pairs mapped within the same annotated gene and were less than a user-defined

distance apart they were considered concordant. If no gene annotations were provided I

considered a pair concordant if the distance between the pair was at least a user-defined distance

apart. For alignments where there were no valid alignments for one of the reads in a pair I

reduced the score difference threshold to six, since I am only examining a single read rather than

both reads in a pair. The highest scoring singleton alignment was marked as primary and the

remaining alignments were marked as secondary.
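The pair scoring and the primary/secondary labelling can be sketched as follows. This is a simplified illustration rather than the RNASequel implementation: the names are hypothetical, each candidate is reduced to a single fragment size, and candidates whose fragment size has no support in the empirical distribution are assumed to be invalid.

```python
def pair_score(fragment_density, max_density, read1_score, read2_score):
    """10 x |normalized fragment score / maximum fragment score| + both read scores."""
    return 10 * abs(fragment_density / max_density) + read1_score + read2_score


def classify_pairs(candidates, distribution, max_score_gap=12):
    """Label candidate pairs as primary or secondary.

    candidates: list of (fragment_size, read1_score, read2_score).
    distribution: dict mapping fragment size to normalized frequency.
    Returns a list of (label, score, candidate), best first; candidates that
    score twelve or more below the best pair are discarded.
    """
    max_density = max(distribution.values())
    scored = []
    for candidate in candidates:
        size, score1, score2 = candidate
        density = distribution.get(size, 0.0)
        if density == 0.0:
            continue  # assumption: unsupported fragment sizes are not valid pairs
        scored.append((pair_score(density, max_density, score1, score2), candidate))
    if not scored:
        return []
    scored.sort(key=lambda item: item[0], reverse=True)
    best_score = scored[0][0]
    labelled = [("primary", best_score, scored[0][1])]
    for score, candidate in scored[1:]:
        if best_score - score < max_score_gap:
            labelled.append(("secondary", score, candidate))
    return labelled
```

An empty result corresponds to the case in the text where no valid pair is found and the fallback or discordant handling applies.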

2.3.10 Simulated Dataset Benchmarking

The simulated datasets were previously used to benchmark the accuracy of RNA-seq alignment
programs (184). The datasets were downloaded from ArrayExpress using the accession number

E-MTAB-1728 (184) and alignments that mapped to “random” and “NA” chromosomes were

removed. To simplify the comparison of alignment pipeline outputs to the “ground truth” of the


simulated datasets, I removed read pairs if either read had an edit distance of 25 or more. I left-

shifted gaps, trimmed spliced alignments with less than 8 base pairs of exonic overlap at the read

ends and converted spliced alignments into deletions for predicted introns with a length less than

21 bp. For repeat mapped alignments I considered only the primary alignment. An alignment was

considered perfect if the paired alignment exactly matched the true alignment. Partial alignments

overlapped the true alignment but may have been soft clipped (unmapped sequence at the 5’- or

3’- end of a read) or included alternate insertions or deletions. Singleton alignments were

classified as paired-alignments in which either read 1 or 2 was unmapped. For spliced read

alignment comparisons I counted a junction as correct only if the junction was present in the true

alignment. A spliced alignment was considered partially correct if it contained at least one of the

correct junctions and no incorrect junctions (this also encompasses the case in which the

alignment contained all of the correct junctions but some of them were lost due to soft clipping).

Finally, alignments that were mapped but did not meet the criteria for a perfect or partial

alignment were marked as failed.
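The junction-level part of this classification can be expressed compactly. The sketch below is a simplification that compares only sets of splice junctions and ignores alignment positions, soft clipping and indels, so its "all_junctions" category is weaker than the "perfect" category defined above; the function and label names are hypothetical.

```python
def classify_spliced_alignment(true_junctions, observed_junctions):
    """Compare observed junctions with the simulated truth.

    Junctions are (chromosome, intron_start, intron_end) tuples.
    """
    truth, observed = set(true_junctions), set(observed_junctions)
    if observed == truth:
        return "all_junctions"
    # At least one correct junction and no incorrect junctions.
    if observed and observed <= truth:
        return "partial"
    return "failed"
```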

2.3.11 Identifying Putative Adenosine to Inosine RNA editing events

The reads from the poly(A)-depleted YH lymphoblastoid cell line were mapped with the same

alignment algorithm combinations as the benchmarking datasets. The alignments were retained if

they had no more than two aligned ambiguous bases and no more than 10 soft clipped bases at

either end. The retained alignments were then searched for potential edits using the following

criteria to discard low quality calls: 1) positions mapping to tandem repeats using trf (208) or low

complexity and simple regions according to RepeatMasker were discarded, 2) for positions

overlapping an inverted repeat annotated by einverted (209) or a repeat element identified by

RepeatMasker I used a less stringent coverage criteria and required at least 10x coverage and a

10% alternative allele frequency, for positions with no repeat overlap I required 16x coverage

and a 20% alternative allele frequency, 3) at least one of the reads supporting the alternative base
was outside of the first and last 8 base pairs of the read ends, 4) potential changes for which
more than 90% of the supporting reads contained an insertion or deletion were removed, 5)
potential sites where more than 70% of the supporting alignments contained different kinds of

mismatches were discarded. After removing low quality calls, I also discarded changes found in

the UCSC Genome Browser “Common SNP” track, which is derived from dbSNP v141 if no


genome sequence was available. For the GM12878 and YH datasets SNPs that were called from

genome sequencing data were discarded (see Supplementary Methods) (206, 207).
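The repeat-dependent coverage thresholds in criterion 2 can be summarized in a small Python sketch. The function name and arguments are hypothetical, and the repeat annotation lookup (einverted, RepeatMasker) is assumed to be done elsewhere.

```python
def passes_coverage_filter(depth, alt_reads, overlaps_repeat):
    """Apply the coverage and alternative allele frequency thresholds.

    Positions overlapping an inverted repeat or repeat element need 10x
    coverage and a 10% alternative allele frequency; all other positions
    need 16x coverage and a 20% alternative allele frequency.
    """
    if depth <= 0:
        return False
    min_depth, min_frequency = (10, 0.10) if overlaps_repeat else (16, 0.20)
    return depth >= min_depth and alt_reads / depth >= min_frequency
```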


3 Identifying RNA Hyper-Editing in C. elegans

3.1 Background

Recently, there have been three studies that have profiled RNA hyper-editing on a global scale in

C. elegans with RNA-seq (68, 69, 210). These studies have used poly(A)-selected, total RNA

samples and/or immunoprecipitated RNA. These aforementioned studies used differing edit

calling methods and these methods are difficult to replicate without the original source code.

These issues prevent the construction of a comprehensive and consistent database of RNA

editing sites in C. elegans. To mitigate this, I have mapped and identified A-to-I edits and hyper-

edited regions in 91 C. elegans paired-end RNA-seq samples derived from the aforementioned

studies and additional modENCODE RNA-seq samples (Table 1) (68, 69, 210, 211). These

libraries include 10 adr-2(lf) strains that lack A-to-I editing activity as negative controls.

I have engineered an updated version of RNASequel designed for fast sample processing and

high-mismatch tolerance to align the RNA-seq reads. I have also designed a sensitive RNA

editing identification pipeline to generate, to my knowledge, the most comprehensive map of

adr-2 dependent RNA editing. Consistent with previous reports, my expanded map of A-to-I
edits is strongly associated with non-coding RNA, inverted repeats, transposons and

heterochromatin. Finally, I compiled a list of potential edits within coding exons that lead to

amino acid changes. This expanded and refined reference database of ADAR2-dependent edits

will provide a critical resource for future studies on the biological role of adenosine deamination

in C. elegans.

3.2 Results

To comprehensively detect C. elegans RNA hyper-edited clusters I downloaded 91 RNA-seq

datasets from various stages and library-selections (Table 3.5.1). Included in the samples are 10

negative controls where ADAR2 (adr2(lf)) is catalytically inactive. It is expected that these

samples will not have an enrichment of A-to-I edits and that any A-to-I edits observed will be

false positives. To efficiently map the RNA-seq samples I built a new version of RNASequel that

is faster, more mismatch tolerant and does not produce temporary alignment files. This version

facilitated the quick processing and accurate alignment of the samples. Next, I developed an edit

identification pipeline capable of the sensitive detection of hyper-edited clusters, while


maintaining a low rate of non-canonical changes. To validate this pipeline, I demonstrate an

enrichment of edit calls in wildtype samples, a depletion of edits in adr2(lf) samples, and a high

overlap of edit calls with previous studies. Finally, I explored the association of the discovered

hyper-editing regions with repeat elements, heterochromatin, introns and 3’-UTRs.

3.2.1 Improvements to the RNASequel Aligner

To improve the performance and disk space usage of RNASequel for the analysis of multiple

samples and the detection of hyper-editing I made several modifications to the program:

1. The first modification altered the splice-junction indexing generation tool to work with

piped output, which eliminated the need for the temporary alignment file generated by

STAR resulting in run-time performance increases of the novel splice junction detection

step.

2. To eliminate potential false negative mismatch calls caused by false positive spliced
alignments that avoid mismatches, the stringency of the splice-junction filtering step was
also increased. Because the majority of C. elegans introns are smaller than human introns,
I set the maximum intron size to 32kb and tightened the filtering parameters to require at
least 5 unique pairs outside the first and last aligned 15 base pairs of the read in order
for the novel splice junction to be retained.

3. During the alignment step I replaced the four BWA-mem alignments with direct library

calls to BWA mem, eliminating intermediate alignment bam files. I also implemented a

pre-filter step using BWA-mem that discards pairs mapping to rRNA sequences. This

eliminated the four temporary bam files created by the BWA-mem alignment and the

need to run four alignments in parallel and increased performance. To permit alignments

with multiple-mismatches the score filter previously developed with RNASequel was

removed.

4. To resolve issues with fragment sizes falling near the right-tail of the distribution, the

fragment size distribution was smoothed by averaging the distribution within a sliding

window with 5 bp at each side. The original 0.99 confidence interval cutoff was extended

until the observation count was below: 0.05 x |the observation count at the 0.99

confidence interval cutoff|.
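The smoothing and tail extension in modification 4 can be sketched in Python as below. This is an illustration under stated assumptions, not the RNASequel code: the function names are hypothetical, the sliding window is truncated at the ends of the distribution, the 0.99 cutoff is located on the raw counts, and the tail threshold is evaluated on the smoothed counts.

```python
def smooth_counts(counts, half_window=5):
    """Average counts within a window of half_window bp on each side.

    counts[i] is the number of observations with fragment size i.
    """
    n = len(counts)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        window = counts[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def extended_cutoff(counts, keep_fraction=0.99, tail_ratio=0.05, half_window=5):
    """Extend the 0.99 cutoff until the smoothed count falls below
    tail_ratio x (smoothed count at the original cutoff)."""
    smoothed = smooth_counts(counts, half_window)
    total = sum(counts)
    cumulative = 0
    cutoff = len(counts) - 1
    for size, count in enumerate(counts):
        cumulative += count
        if cumulative >= keep_fraction * total:
            cutoff = size
            break
    threshold = tail_ratio * smoothed[cutoff]
    while cutoff + 1 < len(smoothed) and smoothed[cutoff + 1] >= threshold:
        cutoff += 1
    return cutoff
```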


3.2.2 RNA-seq sample processing and alignment

Accurately mapping RNA-seq pairs derived from C. elegans samples is paramount for accurate

A-to-I edit calling. One aspect of nematode biology that complicates RNA-seq alignment is the

presence of trans-spliced leader sequences at the 5’-end of many C. elegans mRNA transcripts

(212). These trans-spliced leaders cannot be mapped as splice junctions since they are derived

from independent gene products that may not even be on the same chromosome. These

sequences can be misaligned upstream of the spliced-leader acceptor site causing false positive

mismatch calls. To mitigate this I trimmed the spliced leader sequences off the ends of read-pairs

for each sample.

The 91 C. elegans RNA-seq samples were mapped to the C. elegans genome and the E. coli

OP50 genome with STAR and the improved version of RNASequel (Figure 3.1, Table 3.1). In

general the proportion of uniquely mapped pairs was highest for the poly(A) samples. The total

RNA samples had more repeat mapped pairs, rRNA contamination and unmapped pairs

compared to the poly(A) samples.

Table 3.1. Mapping Rates

Sample           Pairs        Unique   Repeat   Singletons   Repeat Singletons   rRNA     Unmapped
Poly(A) mean     32,572,076   83.37%   2.51%    1.13%        0.07%               9.09%    3.83%
Poly(A) STD      30,007,517   14.82%   1.34%    0.77%        0.1%                14.63%   4.79%
Total RNA mean   51,625,071   45.61%   4.93%    1.79%        0.65%               34.25%   12.78%
Total RNA STD    50,422,695   28.69%   4.56%    1.17%        0.91%               28.08%   12.71%
adr2(lf) mean    70,928,625   86.06%   4.25%    0.12%        0.03%               8.47%    1.06%
adr2(lf) STD     28,275,978   24.38%   0.96%    0.2%         0.06%               23.52%   0.98%


Figure 3.1. Alignment rates for the samples processed in this study. The samples are grouped by whether they are poly(A) selected, total RNA, or adr2(lf) mutants. The color bar indicates the developmental stage of the sample, the first horizontal bar chart indicates the rates for each alignment type, and the second horizontal bar chart indicates the number of clustered A-to-I and non-A-to-I variants.

[Figure 3.1 sample groups: Poly(A) N = 44; Total RNA N = 37; adr2(lf) N = 10]


3.2.3 Accurate and sensitive identification of hyper-editing

One of the challenges associated with identifying hyper-edited regions in C. elegans is that they

are commonly associated with low abundance transcripts in RNA-seq experiments; for example,

transposons and introns (53, 68, 69, 210). Therefore, a pipeline to filter and identify RNA editing

in regions of low coverage is essential. The downside of reducing coverage requirements is that

sequencing errors may be misinterpreted as edit calls. To mitigate this issue I have developed a

hyper-editing identification pipeline that filters common sources of false positives and clusters the

edits into hyper-edited regions. Common sources of false positives and how they were mitigated

are indicated:

1. Sequencing errors that are reflected by low base quality scores were discarded by requiring bases to have a base-quality score of at least 25.

2. False positives due to the degradation of read quality towards the 3’-end of sequencing reads and incorrectly mapped reads associated with splice junctions and spliced leaders. These errors generally occur near the read ends and I eliminated many of them by discarding variants with alignment support that were exclusively in the first and last 5 bp of the read.

3. SNVs identified from genome sequencing data were discarded.

4. Error prone alignments across tandem repeat elements can lead to false positive variants. Removing potential variant calls that overlap tandem repeats mitigated these issues.
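Filters 1 and 2 above can be combined into a single check per candidate variant, as in the following Python sketch. The function name and input format are hypothetical; read offsets are assumed to be 0-based positions within the read.

```python
def keep_variant(observations, min_base_quality=25, end_margin=5):
    """Decide whether a candidate variant survives filters 1 and 2.

    observations: list of (offset_in_read, read_length, base_quality), one
    entry per read supporting the alternative base.
    The variant is kept when at least one supporting base has sufficient
    quality and lies outside the first and last end_margin bp of its read.
    """
    good = [(offset, length) for offset, length, quality in observations
            if quality >= min_base_quality]
    return any(end_margin <= offset < length - end_margin
               for offset, length in good)
```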

The alignments for all 91 samples were individually piled-up and variants with at least 5x

coverage and a frequency of at least 5% were retained for processing with my filtering pipeline

(Figure 3.2). The majority of variants were discarded in the base quality and read-end filtering

steps. After removing common false positive calls I applied a clustering step to select for variants

of the same type that occur within close proximity (< 100 bp). Variants that did not cluster were

retained as singletons. Variants that fell into clusters were labeled as clustered and counted

individually. To validate these calls I checked for significant enrichment for A-to-I changes from

the adr2(wt) compared to adr2(lf) samples. I observed a significant enrichment for all of the

comparisons and a higher significance for clustered edits (p < 0.0001) compared to singleton

edits (p < 0.05 and p < 0.001) (Figure 3.3A). This verifies previous observations that ADAR2 is

the sole source of A-to-I editing in C. elegans and demonstrates that my pipeline does find

enrichment in A-to-I changes.
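The clustering step can be illustrated with a short Python sketch. It assumes single-linkage chaining, in which a variant joins a cluster when it lies less than 100 bp from the previous variant of the same type on the same chromosome; the function name and the variant encoding (for example "A>G" for an A-to-I change) are hypothetical, and strand handling is omitted.

```python
def cluster_variants(variants, max_gap=100):
    """Group variants of the same type that lie within max_gap bp of each other.

    variants: iterable of (chromosome, position, change) tuples.
    Returns (clusters, singletons): clusters is a list of lists of variants,
    singletons is a flat list of variants that did not cluster.
    """
    by_type = {}
    for chrom, pos, change in variants:
        by_type.setdefault((chrom, change), []).append(pos)

    clusters, singletons = [], []
    for (chrom, change), positions in sorted(by_type.items()):
        positions.sort()
        run = [positions[0]]
        # A trailing None acts as a sentinel that flushes the final run.
        for pos in positions[1:] + [None]:
            if pos is not None and pos - run[-1] < max_gap:
                run.append(pos)
                continue
            members = [(chrom, p, change) for p in run]
            if len(members) > 1:
                clusters.append(members)
            else:
                singletons.extend(members)
            run = [pos]
    return clusters, singletons
```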


Figure 3.2. Summary of variant identification and filtering steps. For each sample the number of edits discarded for the base quality, read-end, genome and tandem repeat steps are indicated as well as the number of clustered and singleton edits retained. The changes are split into A-to-I and non-A-to-I changes. Note the A-to-I counts are divided by 10^3 and the non-A-to-I changes are divided by 10^4.

[Figure 3.2 sample groups: Poly(A) N = 44; Total RNA N = 37; adr2(lf) N = 10]


I compared the numbers of A-to-I and non-canonical variant calls for clustered edits and edits

that occurred as singletons. I observed that the majority of the clustered A-to-I changes for

adr2(wt) were enriched compared to the adr2(lf) samples (Figure 3.3 B,C). For singleton edits,

the percentage of A-to-I changes was markedly lower than clustered edits (median < 40%

compared to a median >90% for clustered edits), which suggests a higher false positive rate.

There is the possibility that a proportion of these mismatches are due to wobble base-pairing

with inosine that leads to A-to-T or T-to-A mismatches (213). For example, the “N2e-DMM402-

N2eall_L1-V” sample had 87,049 singleton A-to-I variants and 116,604 singleton non-canonical

variants compared to 21,605 clustered A-to-I changes and 281 clustered non-canonical changes.

Furthermore, the number of singleton A-to-I changes was strongly correlated with the number of

non-canonical changes, while clustered edits were weakly correlated (R² = 0.86 and 0.36

respectively), which suggests that singleton edits may be associated with the overall sequencing

error rate for the sample. Based on these observations I have chosen to focus on clustered edits

which have a lower calculated false positive rate and are more likely to have a direct effect on

RBP binding and / or RNA secondary structure.


Typically, an increase in the number of uniquely mapped read pairs led to an increase in the
number of detected variants (poly(A) R² = 0.56, total RNA edits R² = 0.26, and other changes
R² = 0.02 and 0.04 respectively) (Figure 3.4). The total RNA samples had significantly more
clustered edits per unique pair than the poly(A) samples, 1.79 x 10^-4 +/- 7.14 x 10^-5 versus 5.40 x
10^-4 +/- 5.88 x 10^-4 average edits per unique pair (P = 8.83 x 10^-8; two-tailed Mann-Whitney U).


Figure 3.3. Comparison of variant call rates for clustered and singleton edits. (A) Boxplots showing the percent of potential A-to-I changes for clustered and singleton variant calls . P-values were calculated using a two-tailed Mann-Whitney U test (*P < 0.05, ***P < 0.001, ****P < 0.0001) . Scatter plots comparing the number of A-to-I and non-A-to-I changes for clustered (B) and singleton (C) calls. Note that the axis in B and C are scaled in thousands.

[Figure 3.3 panel annotations: R² = 0.83; R² = 0.36]


Interestingly, the most edits within a sample are observed in the “N2 4” total RNA sample which

had 53,227 identified editing sites. The adr2(lf) samples had the lowest average edits per unique
pair at 1.41 x 10^-6 +/- 1.57 x 10^-6.

The clustered A-to-I changes from all of the adr2(wt) samples were merged yielding 197,890 A-

to-I edit sites within 10,941 clusters (Table 3.5.2 Supplementary File:

Wilson_Gavin_W_201606_PhD_worm_edits.txt). To investigate the contribution of repeat

alignments to potential A-to-I edit calls I used my filtering pipeline on alignments including

uniquely mapped pairs and primary repeats (i.e. the highest scoring alignment for each
multi-mapped pair), and all alignments regardless of their repeat status. This led to a substantial
increase in the number of A-to-I changes: 273,582 in 12,598 clusters for primary repeats and

Figure 3.4. Number of clustered A-to-I and non-A-to-I versus the number of uniquely mapped reads. The figures in the 2nd column are zoomed in versions of the figures in the first column. Sample types are indicated in the legend.

[Figure 3.4 panel annotations: Poly(A) R² = 0.56, Total RNA R² = 0.26, adr2(lf) Poly(A) R² = 0.92; PolyA R² = 0.20, Total RNA R² = 0.03, adr2(lf) Poly(A) R² = 0.68]

83

409,008 edits in 15,271 clusters for all repeats. This led to substantial increase in the number of

reported A-to-I changes 273,582 in 12,598 clusters for primary repeats and 409,008 edits in

15,271 clusters for all repeats. Collectively, these data suggest the number of edits may be

underestimated due to ambiguous read mappings.

3.2.4 Comparison with other studies

To validate my reported edit sites I compared the calls identified in this study for uniquely mapped reads, primary alignments and repeat alignments with those reported in two previous studies (Figure 3.5). I found a 78-82% overlap with the Zhao et al. study and a 36-74% overlap with the sites identified by Whipple et al. (68, 69). The lower overlap with the Whipple et al. dataset may be due to differential handling of overlapping read pairs: my alignment pipeline discards paired alignments with an alignment length less than the read size, while Whipple et al. included them in their study. The differences between my calls and the other studies may also be due to their different alignment and edit calling protocols. For example, Zhao et al. call edits using a single supporting read while I require at least three. Nonetheless, I identify ~155,000 additional editing sites derived from uniquely mapped reads, which is substantially more than the aforementioned two studies, likely due to the additional samples processed in this study and my attempts to maximize sensitivity.

3.2.5 Clusters are enriched for non-coding elements

Early reports have suggested the majority of ADAR2-dependent RNA editing clusters occur in

non-coding sequences (53, 68, 69). These regions are defined as 3’-UTRs, introns, lncRNAs, and

transposable elements. My work has identified two issues that may confound correct editing calls

in C. elegans, largely stemming from incorrect genome annotation. First, I have determined by

RNA-seq analysis that the majority of C. elegans 3’-UTR’s are consistently longer than reported.

Second, a number of the hyper-edited clusters are predicted to occur within unannotated introns.

Figure 3.5. A-to-I clustered edit call comparison with other studies. Overlap between the clustered edit calls in this study (red), Zhao et al. (green) and Whipple et al. (blue). The size of the circles and overlap are proportional to the number of edits. Edits are shown for (A) unique alignments, (B) primary repeat alignments, i.e. the highest scoring alignment for each pair, and (C) repeat alignments.

These introns can be within known genes or within intergenic regions. To address the UTR

length issue, I extended the annotated transcripts using RNA-seq coverage (see methods), while

the unannotated intron issue was managed by constructing a list of novel introns that were

supported by at least 5 unique pairs in at least 5 samples. This resulted in the identification of

10,568 novel splice junctions and the extension of at least 10 bp in 16,101 3’-UTR’s, correcting

earlier annotations in WormBase (Table 3.5.3 Supplementary File:

Wilson_Gavin_W_201606_PhD_utr_extensions.xlsx). The clustered edit calls were then

compared to the novel 3’-UTR positions, introns and gene annotations based on the inferred

strand of the edit to identify the type of base modified.
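
The novel intron criterion described above (at least 5 unique pairs in at least 5 samples) can be illustrated with a minimal Python sketch. This is not the code used in this study: the function name, the input layout (a per-sample dictionary of junction counts) and the toy coordinates are assumptions made for illustration only.

```python
from collections import defaultdict


def select_novel_introns(junction_counts, annotated, min_pairs=5, min_samples=5):
    """Return unannotated junctions with enough support across samples.

    junction_counts: {sample_name: {(chrom, start, end): unique_pair_count}}
    annotated: set of (chrom, start, end) junctions already in the annotation
    (hypothetical data layout, chosen for this sketch)
    """
    samples_supporting = defaultdict(int)
    for counts in junction_counts.values():
        for junction, n_pairs in counts.items():
            # A sample supports a junction only if it has enough unique pairs.
            if junction not in annotated and n_pairs >= min_pairs:
                samples_supporting[junction] += 1
    return sorted(j for j, n in samples_supporting.items() if n >= min_samples)


# Toy input: one annotated junction, one well-supported novel junction,
# and one junction with too few pairs per sample.
annotated = {("I", 100, 250)}
junction_counts = {
    f"sample{i}": {("I", 100, 250): 40, ("II", 500, 900): 7, ("X", 30, 80): 4}
    for i in range(5)
}
novel = select_novel_introns(junction_counts, annotated)
print(novel)
```

Counting supporting samples rather than summing pairs across samples keeps a single deeply sequenced sample from introducing a junction on its own.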

To explore the genetic elements and repeat types associated with clustered edits, I further

stratified the types of elements edited by their overlap with inverted repeats and transposons. It

should be noted that exonic overlap indicates overlap with coding exons while ncExonic

indicates overlap with non-coding exons. Next, I calculated the rate of clustered editing for each

type using a novel strategy to estimate rates of editing for each base / repeat type and also

expressed the data as a percentage (see methods) (Figure 3.6). Higher rates of edits (1-3 orders

of magnitude greater) were generally observed for elements associated with inverted repeats and

transposons. The majority of edits (~51.65%) were associated with intronic sequences,

followed by intergenic sequences (~32.68%), 3’-UTR’s (6.11%), and cistronic sequences (2.84%).

Finally, 4.86% of clustered edits were within annotated exons and may be associated with non-

coding isoforms or transcripts misclassified as coding.

3.2.6 Clustered A-to-I edit replication and properties

The majority of editing sites (64.7%) did not recur in another sample and only ~2.5% recurred in

10 or more samples (Table 3.2). Intronic, cistronic, and 3’-UTR derived edits had the highest

rates of recurrence, while exonic and mixed derived edits had the lowest. The low rate of

recurrence may be due to the rare frequency of edits and the low abundance of their RNA

molecules. This is confounded by the wide variety of samples used in this study and the

heterogeneous nature of whole worm samples. In general, edit sites with a high edit frequency in at least one sample recurred more often than sites with a low edit frequency (Figure 3.7A,C). A similar result was observed for read coverage: when an edit site had high coverage in at least one sample, it recurred more often (Figure 3.7B,C).

Figure 3.6. A-to-I editing association with genetic elements. (A) Box plots of the rate of editing within the specified base and colored repeat element type (see Methods for rate calculation). IR = Inverted Repeat, Tpn = Transposon. Means are indicated with stars and medians are indicated with horizontal lines. Samples with 0 observations for the base / repeat type combination are not included. (B) The percent of the total number of samples included for each rate type. (C) The percent of the total number of edits represented by each combination.

Table 3.2. A-to-I clustered edit recurrence rates

Genetic Element  1 Sample  2-4 Samples  5-9 Samples  10-19 Samples  20+ Samples
All              59.05%    29.30%       7.68%        2.62%          1.35%
Exonic           86.37%    11.31%       1.56%        0.38%          0.38%
ncExon           62.72%    24.97%       6.86%        2.43%          3.02%
5'-UTR           68.72%    20.30%       5.77%        3.35%          1.86%
3'-UTR           54.57%    23.76%       9.14%        6.95%          5.58%
Intronic         51.18%    35.28%       9.73%        2.88%          0.92%
Cistronic        39.24%    27.11%       12.50%       9.72%          11.42%
Mixed            86.66%    11.05%       1.87%        0.29%          0.29%
Intergenic       69.50%    24.15%       4.77%        1.14%          0.44%

To determine the extent of hyper-editing within clusters I counted the number of edits per read pair for each sample. The median number of edits per read pair was ~5 for each cluster type (Figure 3.8A,B). The most edits observed within a read pair was 29, and this was only observed 6 times in 4 different clusters. For example, there were three pairs with 29 edits within the 3’-UTR of gmn-1; this region has a 686 bp cluster with 165 unique edit sites. The majority of clusters had between 5 and 64 edits per cluster (Figure 3.8B,C).

Figure 3.7. A-to-I hyper-edit recurrence stratified by the overlapping genetic element and recurrence rate. (A) The maximum edit frequency observed for the samples supporting the edit. (B) The maximum log2 coverage. (C) The number of edits in each group for the previous plots. Recurrence rates are indicated in the legend.

3.2.7 A Global map of A-to-I editing

Previous studies have suggested that clustered edits are associated with heterochromatin due to their localization to the arms of the autosomal chromosomes (68). To more rigorously test this association I created chromosomal maps showing their localization in relation to repetitive elements, heterochromatin marks including H3K9 methylation and HPL-2 binding, gene density and intron lengths (Figure 3.9, 3.10A, 3.11) (214, 215). In general the majority of intronic and intergenic edits were localized to the arms of autosomal chromosomes, which are correlated with inverted repeats, transposons, long introns, and heterochromatin marks and anticorrelated with gene density. The opposite is observed for 3’-UTR clusters, which were more uniformly distributed across the chromosomes and are enriched for gene density and depleted for long introns, inverted repeats and transposons. The correlations between clustered edits and other genomic features were lower than expected; for example, the correlation between intronic edit density and inverted repeat density was only 0.40. This may be due to edits being relatively sparse and to the possibility that not every predicted inverted repeat is transcribed or properly folded into dsRNA. To test if the localization of the clustered edit types differed from the uniform distribution I used the Kolmogorov-Smirnov (K-S) test (Figure 3.10B). The p-values for all of the tests were less than 10⁻¹⁰, and the K-S test statistic was generally highest for intronic edits, suggesting they have the strongest difference from the uniform distribution.

Figure 3.8. Properties of clusters by the dominant base and repeat type of the edits contained in the cluster. IR = Inverted Repeat, Tpn = Transposon. (A) The average number of edits per read pair summed across all samples for each cluster. (B) The percent of clusters classified to each base and repeat combination. (C) The log2 (total number of edits per cluster) for each base and repeat combination. The total number of clusters for each base type are indicated at the bottom.
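
To make the comparison against a uniform distribution concrete, the one-sample K-S statistic can be computed as in the sketch below. It is an illustration only: the function name and toy positions are assumptions, the p-value calculation is omitted, and it is not the implementation used to produce Figure 3.10B.

```python
def ks_statistic_uniform(positions, chrom_length):
    """One-sample K-S statistic of edit positions against Uniform(0, chrom_length).

    positions: zero-based coordinates of edits on one chromosome
    chrom_length: chromosome length in bp
    """
    # Rescale positions to [0, 1], where the uniform CDF is F(x) = x.
    xs = sorted(p / chrom_length for p in positions)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # Compare F(x) with the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d


# Evenly spread edits give a small statistic; edits packed into one
# chromosome arm give a statistic close to 1.
even = [(i + 0.5) * 10 for i in range(100)]
clustered = list(range(100))
d_even = ks_statistic_uniform(even, 1000)
d_clustered = ks_statistic_uniform(clustered, 1000)
print(d_even, d_clustered)
```

A larger statistic means the cumulative distribution of edits departs further from the diagonal expected under uniform placement.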

Figure 3.9. Global A-to-I cluster localization. Chromosomal maps (panels for chromosomes I, II, III, IV, V and X) illustrating edit cluster density (black lines) and the cumulative distribution of edits (red line). The maximum number of clusters in a 2kb bucket is indicated to the right of the boxes. Z-scores for repeat, gene and intron densities, and z-scores for heterochromatin markers are shown, with blue indicating positive scores and red negative scores (color scale: Z-score). For more detail see the methods.

Figure 3.10. Chromosomal distribution of clustered edits. (A) The cumulative density of each edit across each C. elegans chromosome. For comparison, a uniform distribution is included. The type of the edit is colored as per the legend. The heatmaps below the chromosomal maps are the normalized z-scores for HPL-2 from Figure 3.9. (B) The Kolmogorov-Smirnov (K-S) test statistic for each edit type compared to the uniform distribution. The lower the K-S test statistic, the more likely it is that the two samples have the same distribution.

Figure 3.11. Global Pearson correlations between chromatin marks, A-to-I edits, and genetic features. Only buckets with at least one edit cluster were included in the correlations. For edit clusters the edits were binned into 5kbp bins and the z-scores were calculated based on the number of clusters. For chromatin marks and genomic features the normalized z-scores were calculated as per the methods. Axis groups: Clustered Edits, Genomic Features, Chromatin Marks. Color scale: Pearson correlation from -1 to +1.

Despite generating the most comprehensive set of A-to-I RNA edits in worms I suspect that this map is not complete. To test the completeness of the map I focused on introns containing potential sources of dsRNA. I examined the proportion of introns with edits and inverted repeat structures stratified by their size (Figure 3.12). I found that a substantial fraction of introns with repeat structures do not have any called edits (40-90%). This may be due to the introns not being captured for sequencing, poor mappability, or that some inverted repeats may not be targets for RNA editing. However, this still suggests that there may be some intron edits that were missed in this analysis.

Figure 3.12. A-to-I editing events may have been missed within introns. The proportion of introns containing inverted repeats (IR Only), transposons (Tpn Only), both transposons and inverted repeats (Tpn and IR), transposons or inverted repeats (Tpn or IR), and A-to-I hyper-editing clusters (Edit Cluster). The percentage of all the introns at the specified size is indicated.

To further explore the comprehensiveness of my hyper-editing site database I examined how the number of newly identified sites saturates as wildtype adr-2 samples are added. The results are visualized with a rarefaction curve (Figure 3.13A). I observed that the total number of sites continued to increase, but the number of new sites identified with each additional sample declined. I speculated that this decline is in part due to samples with reduced sequencing depth and found that to be the case (Figure 3.13B). The ten samples with the highest number of uniquely mapped pairs represented 43.55% of uniquely mapped reads from all of the samples and 59.06% of the clustered A-to-I edits. New sites nonetheless continue to be identified with additional samples. Collectively, the intronic inverted repeats, multi-mapped reads, and the saturation analysis suggest that despite the large number of sites identified in this study the map of A-to-I edits in C. elegans is not complete.
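
The saturation analysis can be sketched as a running union of edit sites over an ordered list of samples. The following Python is a hypothetical illustration (the function name and the toy site identifiers are assumptions); it expects samples to be supplied in the desired order, for example sorted by number of mapped pairs.

```python
def rarefaction_curve(sample_sites):
    """Cumulative and newly added unique sites as samples are added in order.

    sample_sites: list of iterables of hashable site identifiers,
                  e.g. (chrom, position) tuples, one iterable per sample.
    Returns a list of (cumulative_unique_sites, new_sites) per sample.
    """
    seen = set()
    curve = []
    for sites in sample_sites:
        sites = set(sites)
        new = len(sites - seen)  # sites not observed in any earlier sample
        seen |= sites
        curve.append((len(seen), new))
    return curve


# Toy example: the third sample contributes no new sites.
toy = [
    {("I", 10), ("I", 20), ("II", 5)},
    {("I", 20), ("II", 5), ("X", 99)},
    {("X", 99)},
]
curve = rarefaction_curve(toy)
print(curve)
```

A curve whose per-sample increment approaches zero would indicate saturation; the results above describe increments that decline but do not reach zero.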

3.2.8 Intronic edits are depleted near splice-sites

In order to determine if intronic edits could disrupt splice sites, I profiled the positions of intronic edits relative to the splice sites of the edited introns. I stratified the intronic edits and examined the length of introns that contained clustered edits. Intron size was determined by calculating the smallest possible intron using both annotated and novel splice junctions. I found that the majority of intronic edit clusters occur within introns over 512 bp in length (Figure 3.14A). This is expected since larger introns may have more dsRNA substrates, including transposon insertions, which are required for ADAR2 editing. Given the promiscuous nature of ADAR2 editing, there is a distinct possibility that RNA editing events at conserved splice-site residues could directly impact intron splicing. The most conserved regions include the 5’- and 3’-splice sites at the intron / exon boundaries, the branch point site, and the polypyrimidine tract near the 3’-end of the intron (25, 28). There is an example of this occurring in rats, where ADAR2 autoregulates itself by editing one of its own 3’-splice sites (55). There are additional regulatory motifs that can be found across the intron, such as intron splice silencers and enhancers (25, 28).

Figure 3.13. Saturation analysis of A-to-I hyper-edits. Samples are sorted from highest to lowest numbers of repeat mapped pairs. (A) The percentage of unique A-to-I sites identified as the number of samples is increased for edits supported by uniquely mapped pairs, primary repeat pairs (the highest scoring alignment for an alignment with multiple mappings), and all of the mapped pairs. (B) The number of uniquely mapped pairs versus the percentage of the A-to-I edits identified.

To investigate the possibility of ADAR2 editing in worms affecting splice sites, I examined where clustered edits and repeat elements occur in relation to intronic splice-site signals (Figure 3.14B, C, D). A depletion of edits and repeat elements near 5’- and 3’-splice sites was generally observed, with less than 5% of the edit clusters falling within 50 bp of a splice site. Repeat elements were also depleted, with a rate of less than 10% near splice sites. The depletion of edits near splice sites suggests that ADAR2-dependent editing in C. elegans is selected against near splice sites. However, this does not exclude the possibility that edits could affect the rate of splicing by disrupting RNA secondary structures or splicing regulatory motifs present within the center of an intron.
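
The proximity measurement described above can be illustrated with a short sketch that computes the distance from an intronic edit to its nearest intron boundary and the fraction of edits within a window. The coordinate convention (zero-based, half-open introns), the function names and the toy values are assumptions for illustration, not the procedure used for Figure 3.14.

```python
def distance_to_nearest_splice_site(pos, intron_start, intron_end):
    """Distance (bp) from an edit to the closest boundary of its intron.

    Assumes zero-based coordinates with the intron spanning
    [intron_start, intron_end), so the last intronic base is intron_end - 1.
    """
    if not intron_start <= pos < intron_end:
        raise ValueError("edit is not inside the intron")
    return min(pos - intron_start, intron_end - 1 - pos)


def fraction_near_splice_site(edits, window=50):
    """Fraction of edits within `window` bp of either splice site.

    edits: list of (pos, intron_start, intron_end) tuples
    """
    distances = [distance_to_nearest_splice_site(*e) for e in edits]
    return sum(d <= window for d in distances) / len(distances)


# Toy example: three of four edits lie within 50 bp of an intron boundary.
toy_edits = [
    (1010, 1000, 2000),
    (1500, 1000, 2000),
    (1980, 1000, 2000),
    (1949, 1000, 2000),
]
near = fraction_near_splice_site(toy_edits)
print(near)
```

Distinguishing the 5’- from the 3’-splice site, as in Figure 3.14C and D, would additionally require the strand of the transcript.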

To determine if there are any recurrent edits that overlap splice-site signals, I constructed a high

confidence set of edits that overlapped splice site signals. I searched my database of singleton

and clustered A-to-I edits for variants that overlapped splice sites and recurred in at least 5

samples (see Methods 3.4.8). I found 3 potential A-to-I changes that overlapped splice sites, 2 of

these were singleton edits and 1 was clustered (Table 3.3). Intriguingly, all of the edits were

identified in canonical GT-AG splice junctions where the adenosine in the 3’-splice acceptor is

targeted for editing. It is conceivable that edits within a 3’-splice site could abolish its activity.

Therefore, I checked to see if there were detectable counts for the splice junctions in the wildtype

and adr-2(lf) samples. For the edited splice site on chromosome II there were no detectable

counts for the splice junction in any of the samples, which suggests that it may be an annotation

error. For the splice site edit on chromosome X, the number of reads mapping across the splice

junction in all of the samples including the adr-2(lf) samples had counts greater than 10. Finally,

for the splice junction on chromosome V, there were high (>50) counts detected in four of the

flow sorted neuron samples from modENCODE. These samples do not have matched adr-2(lf)

samples, so it remains possible that the edited splice site is still functional in these samples. It would be of

interest to further study the splicing of this intron with qRT-PCR.

Table 3.3. Recurrent A-to-I edits that overlap splice sites

Ref  Position^a  Edit  Junction^a                Gene            Poly(A) Samples  Poly(A) Stages     Total RNA Samples  Total RNA Called
II   12795452    T>C   12795450, 1279599, GTAG   cpt-1           6                Embryo, L1, L4     8                  Embryo, L1, L2, L4
V    11081533    A>G   11072729, 11081535, GTAG  act-1           1                Embryo             7                  Embryo, L1, L2
X    13155479    T>C   13155477, 13155580, GTAG  WBGene00008601  3                Embryo, L1, Dauer  4                  L1

a Position is zero-based
b The splice site based on the strand of the transcript, the edited position is bold and underlined.

Figure 3.14. Properties of edit clusters and repeat elements within introns. (A) The length distribution for all introns including novel introns identified in this study and the sizes of introns containing at least one edit cluster. (B) The frequency of clustered edits and repeat elements across length-normalized introns. (C,D) The cumulative frequency for the distance (bp) of a clustered edit or repeat element to the nearest 5’- (C) or 3’- (D) splice site; for repeat elements that overlapped the splice site a distance of -1 was assigned.

3.2.9 Intergenic edits and antisense transcripts

Since the majority of the RNA-seq libraries processed in this study were not strand-specific I

was unable to calculate rates of antisense transcription. This leaves the possibility that some of

the observed edits are antisense to known genetic features. Using the base annotation scheme I

developed it is possible to quantify how many of the intergenic clustered edits observed are

antisense to another base type. I found that 51.67% of the intergenic edits (37,498 out of 72,574)

are antisense to an annotated base (Figure 3.15). The majority of the antisense transcripts are to

introns associated with inverted repeats and transposons (~20%). The remaining edits were

antisense to 3’-UTR’s (6.39%), coding exons (9.77%), and cistrons (2.36%).

3.2.10 3’-UTR clusters and poly(A) Sites

The 3’-UTR’s of coding mRNA’s are highly structured non-coding RNA sequences that have previously been shown to be targets of ADAR-dependent RNA editing (70, 216). There is the possibility that editing could affect alternative polyadenylation, where a proximal or distal poly(A) site could be preferentially used for 3’-UTR cleavage due to A-to-I editing (113). For example, A-to-I editing within the 3’-UTR of the human gene EAAT2 can directly impact the poly(A) signals by activating a cryptic poly(A) site (217). Furthermore, edits that occur downstream of poly(A) sites may not be included in the polyadenylated transcript because the downstream sequence would be cleaved off prior to addition of adenosine nucleotides.

Figure 3.15. Properties of intergenic edits antisense to annotated genetic elements. The element type and total percent of the intergenic edits represented by that type are indicated at the bottom. The repeat element associations are indicated as per the legend.

In this study I have identified 13,346 clustered edits within 3’-UTR’s, of which 10,484 (78.03%) overlapped a 3’-UTR with at least one poly(A) site. These edits fell within 567 3’-UTR’s that had at least one clustered edit and at least one poly(A) site. On average 20% of the edits within a 3’-UTR occurred upstream of the first poly(A) site. An additional 20% of the edits occurred between the first and last poly(A) site and 60% of the edits occurred downstream of the last poly(A) site. To profile the position of edits in more detail I examined the position of edits with respect to poly(A) sites for 3’-UTR’s with 1 to 5 poly(A) sites (Figure 3.16A). I observed that in each case the majority of the edits occurred after the last poly(A) site. This trend continued for all of the 3’-UTR’s with poly(A) sites and edits (Figure 3.16B,C). The length of UTR sequence after the last poly(A) site tends to be longer than the length of 3’-UTR before the first poly(A) site and between the first and last poly(A) site. These data support the notion that the majority of 3’-UTR edits may not be present in the polyadenylated transcript.
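
The before / between / after classification used above can be sketched as follows. The function names, the strand handling and the toy coordinates are assumptions made for illustration; the sketch is not the code behind Figure 3.16.

```python
from collections import Counter


def classify_edit(edit_pos, polya_sites, strand="+"):
    """Place an edit relative to the poly(A) sites of its 3'-UTR.

    Returns "before" (upstream of the first site), "after" (downstream of
    the last site) or "between". Coordinates are genomic; on the minus
    strand the transcript runs toward lower coordinates.
    """
    low, high = min(polya_sites), max(polya_sites)
    if strand == "-":
        if edit_pos > high:
            return "before"
        if edit_pos < low:
            return "after"
    else:
        if edit_pos < low:
            return "before"
        if edit_pos > high:
            return "after"
    return "between"


def edit_fractions(edit_positions, polya_sites, strand="+"):
    """Fraction of a UTR's edits falling in each of the three regions."""
    counts = Counter(classify_edit(p, polya_sites, strand) for p in edit_positions)
    total = len(edit_positions)
    return {k: counts[k] / total for k in ("before", "between", "after")}


# Toy UTR on the plus strand with poly(A) sites at 100 and 200.
fractions = edit_fractions([50, 150, 250, 260, 270], [100, 200])
print(fractions)
```

With a single poly(A) site the "between" region collapses to the site itself, so edits fall almost entirely into "before" or "after".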

Finally, to profile whether edits could directly modify the poly(A) site hexamer I looked for edits

that overlapped annotated poly(A) sites. I found 128 poly(A) sites with at least one edit in the

hexamer sequence. Some of the poly(A) sites identified had more than one A-to-I edit: 24 of the

sites (18.8%) had two edits, 6 sites had 3 edits (~5%), and 2 sites had 4 edits (1.5%). To build a

high confidence set of edits that could affect poly(A) signals, I searched my database of singleton

and clustered A-to-I edits for variants that recurred in at least 5 samples (see Methods 3.4.8).

After selecting for highly replicated edits I retained 3 out of the 128 clustered edits and found 0

singleton edits (Table 3.4). Interestingly, these edits tended to recur more frequently in the total

RNA samples, which suggests that the edits may ablate the poly(A) signal preventing detection

in poly(A)-selected samples. It is possible that editing within these poly(A) sites may affect

polyadenylation and 3’-UTR cleavage.

Table 3.4. Recurrent A-to-I edits that overlap annotated poly(A) signals

Ref  Position^a  Edit  Poly(A) signal  Gene            Poly(A) Samples  Poly(A) Stages  Total RNA Samples  Total RNA Called
I    11953624    A>G   11953626        WBGene00011060  1                Embryo          5                  Embryo, L1, L2
III  3317763     A>G   3317760         ral-1           0                -               7                  Embryo, L1, L2
IV   13380403    A>G   13380404        mau-8           3                L1, Adult       2                  L1

a Position is zero-based

3.2.11 Identifying putative A-to-I dependent amino acid changes

In humans, mice and flies, RNA editing dependent recoding of amino acids has been identified

as a developmental modulator of receptor activity (218). Currently, it is not known if amino acid

recoding occurs in C. elegans. To explore if this type of editing occurs, I designed a stringent

filtering scheme to identify recurrent edits and applied it to both singleton and clustered edit calls

(See Methods 3.4.8). I identified 76 A-to-I edits that lead to a nonsynonymous amino acid

substitution. Of these, 15 are clustered edits and 61 are singleton edits (Table 3.5.2). To

determine if these amino acid changes affect the translated protein in a functional way I looked

for changes within predicted PFAM domains. The majority of these edits (49 / 76) did not

overlap a predicted PFAM domain and the most frequently edited domain type was collagen with

three different edits. The most common recoding events are: serine to proline (12), leucine to

proline (9), and aspartic acid to glycine (6). The two most common changes both lead to

incorporation of the conformationally rigid proline amino acid (21 / 76 edits) into the peptide

chain, which could affect the functional properties of the protein. However, 15 of these do not

overlap predicted protein domains and the remaining 6 mapped to proteins that are not associated with the

known phenotypes of adr-2 knockouts. Further investigation into the effects of

these amino acid recoding events on the function of the encoded proteins would be worthwhile

if these edits validate using a targeted DNA and RNA sequencing approach.
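
Whether an edit recodes an amino acid can be checked by translating the codon before and after the A-to-G change, since inosine is read as guanine. The sketch below uses the standard genetic code; the function name and the example codons are assumptions for illustration and do not reproduce the filtering scheme of Methods 3.4.8.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def recode(codon, offset):
    """Translate a codon before and after an A-to-G change at `offset` (0-2).

    The codon is given on the coding strand as DNA letters.
    Returns (original_aa, edited_aa, is_nonsynonymous).
    """
    codon = codon.upper()
    if codon[offset] != "A":
        raise ValueError("A-to-I editing requires an adenosine at the edited position")
    edited = codon[:offset] + "G" + codon[offset + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[edited]
    return before, after, before != after


asp_to_gly = recode("GAT", 1)   # aspartic acid codon edited at its second base
lys_silent = recode("AAA", 2)   # third-position edit in a lysine codon
print(asp_to_gly, lys_silent)
```

For genes on the minus strand the edit appears as T>C in reference coordinates, so the codon would need to be taken from the transcript strand before applying this check.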

Figure 3.16. Localization of 3’-UTR edits with respect to poly(A) sites. UTR’s were only retained if they contained at least one clustered edit and at least one annotated poly(A) site. (A) The number of edits before the first poly(A) site and after subsequent poly(A) sites stratified by the number of poly(A) sites present in the 3’-UTR. (B) Boxplots of the distance between the start of the 3’-UTR and the first poly(A) site (Before), the distance between the first and last poly(A) sites (Between), and the distance between the last poly(A) site and the end of the 3’-UTR (After). (C) Volcano plots of the distribution of the percent of the edits within each UTR before the first poly(A) site (before), between the first and last poly(A) sites (between), and after the last poly(A) site (after). The median is indicated with a bar and the mean is indicated with an asterisk.

3.3 Discussion

ADAR-dependent hyper-editing has been previously profiled in C. elegans on a global scale

using RNA-seq. However, these studies were limited in scope, using a small number of samples

and inconsistent alignment and edit calling algorithms. To more robustly characterize hyper-

editing in C. elegans, I have constructed a consistent and accurate pipeline using my RNA-seq

realignment program RNASequel and a novel clustering algorithm. Using this algorithm I have

analyzed 91 C. elegans RNA-seq samples and identified 197,890 A-to-I edit sites within 10,941

hyper-edited clusters, generating the most comprehensive database of ADAR2-dependent RNA

editing in worms to date. My results validated previous reports that the majority of clustered

edits are associated with non-coding sequences and sources of structured RNA’s including long

introns, intergenic regions and 3’-UTR’s. These regions are commonly associated with inverted

repeat and transposable elements.

One of the prevailing challenges with determining the role of A-to-I editing in C. elegans is

determining which edits have a functional role. There is the possibility that a proportion of the

edits are a consequence of the transcription of structured RNA’s such as transposable elements

and inverted repeats and do not have a functional role. The observation that ADAR knockouts in

C. elegans do not affect transposon silencing provides further evidence that this may be the case

(219, 220). However, it remains possible that ADAR’s do play a role in transposon silencing within

somatic tissue, where transposable elements are active (221). The majority of the edits I observed

are associated with transposons and inverted repeats (74.62%). These edits are most common

within introns (51.65%) and intergenic regions (32.68%), which are locations that are likely to be

phenotypically neutral for transposon insertions.

The edits I observed within introns could affect splicing efficiency; however, I observed a depletion of somatic transposon insertions and edits near the exon-intron boundaries where the conserved splice-site signals occur. Moreover, I identified only three recurrent edits that may affect a splice site. This does not rule out the possibility that editing could affect intron splicing in some cases, for example, through the biogenesis of circular RNA’s or the efficiency of intron splicing due to the presence of dsRNA structures. Circular RNA’s occur when a pre-mRNA transcript is spliced back onto itself (71, 222). Studies have shown that introns associated with circular transcript biogenesis are enriched for RNA structures, complementary sequences, and RNA edits. Furthermore, these studies have shown that ADAR1 in humans is essential for circular

RNA biogenesis (71). Transposon mediated dsRNA structures within introns have been

demonstrated to promote the transition of neighboring exons from constitutively spliced to

alternatively spliced in humans (223, 224). There are examples of ADAR dependent edits within

these structures in humans but their effect on splicing has not been explored (223). Finally, it is

unknown whether intronic dsRNA structures and edits have an effect on C. elegans intron

splicing. In the future, it would be worthwhile to perform deep RNA-seq on triplicates derived from similar C. elegans stages with all single and double knockouts of adr-1 and adr-2. These could be used to test if ADAR-dependent RNA editing has an effect on the splicing of introns targeted for editing. This could provide insight into whether ADAR-dependent RNA editing within intronic dsRNA structures plays a role in transcript splicing or is a side effect of the dsRNA structures present in the intron without an effect on splicing. Furthermore, an RNase that digests linear and unstructured RNA’s could be used to enrich samples for circular RNA’s and probe whether ADAR proteins have an effect on their biogenesis (225).

I observed that ~6.11% of the clustered edits were within 3’-UTR’s. Edits within 3’-UTR’s could

affect alternative polyadenylation, translation, RNA localization, or turnover (70). ADAR-

dependent editing has been shown to be co-transcriptional and I have observed that 3’-UTR

editing can occur proximal, between, and distal to polyadenylation sites (17). Therefore, it is

possible that editing could play a role in the selection of the polyadenylation sites either by

relaxing secondary structures or altering RNA protein binding prior to polyadenylation. An

altered 3’-UTR length could alter the regulatory signals present in the UTR and lead to

additional or reduced miRNA or RNA binding protein sites (216, 226). There is an example of

this occurring in the EAAT2 pre-mRNA where A-to-I editing activates a cryptic polyadenylation

site (217). To resolve the role of A-to-I editing on alternative polyadenylation, I would perform

TAIL-seq on wildtype, adr-1, adr-2, and double knockouts to quantitatively compare 3’-UTR

use and possibly identify which edits are associated with differential polyadenylation sites (227).

TAIL-seq would also permit me to identify cryptic poly(A) sites that may be activated by RNA

editing.

I searched for potential clustered and singleton edits that could cause amino acid changes and identified 76 potential events. This is unexpected since these events have not been previously identified in C. elegans EST sequencing, which is the method that led to the identification of the AMPA receptor editing in humans and flies (228-230). It would be worthwhile to check whether these changes can be identified in C. elegans EST sequencing data. Furthermore, targeted sequencing of the edited transcript and genome sequence would be essential to determine if these are in fact real editing events. MS/MS sequencing of the protein peptides could then confirm that the amino acid change is incorporated into the final protein product. Finally, functional screens should be performed on the most compelling targets. If any of these edits are validated it would provide a much clearer view of when A-to-I editing induced amino acid recoding evolved.

On a global scale edits within intronic and intergenic regions appear to be enriched for

heterochromatin marks including H3K9 methylation and HPL-2 binding. Heterochromatin is

most prevalent on the arms of the autosomal chromosomes. Whether RNA editing participates in

heterochromatin deposition or is a consequence of a higher transposon and inverted repeat on the

autosomal arms is unknown. Previous studies have implicated small RNA’s in heterochromatin

deposition and adr knockouts have a marked dysregulation of small RNA processing (53, 54,

231, 232). There is evidence of ADAR-dependent RNA editing regulation transposon mediated

heterochromatic gene silencing in D. melanogaster through the editing of a long structured RNA

derived from the Hoppel transposon (56). This gene was verified as an ADAR target and deletion

of this transposon altered heterochromatic gene silencing. It would be interesting to perform

ChIP-seq for HPL-2 and H3K9 marks in wildtype and adr knockout backgrounds to look for

alterations in heterochromatin deposition.

Recent evidence in mice has demonstrated that RNA editing is important for the discrimination

of endogenous and exogenous sources of dsRNA (67). In this model, dsRNA derived from

exogenous sources is sensed by MDA5, leading to activation of the cytosolic dsRNA-sensing

pathways and the interferon response (67, 233). Conversely, endogenous sources of RNA that

harbor A-to-I edits are not sensed by MDA5. Furthermore, mice with embryonic lethal

knockouts of ADAR1 were rescued by the inactivation of MDA5. It is possible that RNA editing

in C. elegans has a similar role through ADAR’s role in suppressing the processing of A-to-I

edited transcripts by the RNAi pathway. The cytosolic RNAi machinery in C. elegans is essential

for anti-viral defense (53, 234).

This comprehensive and rigorous detection and analysis of A-to-I editing sites will be useful for

further studies investigating the functional role of ADAR proteins in C. elegans. In the future, I

hope to prepare and submit this work as an original publication to make the list of A-to-I edit

sites and clusters available to the scientific community.

3.4 Methods

3.4.1 C. elegans gene annotations and reference sequences.

The C. elegans reference, gene annotations, and cistron (operon) annotations were downloaded

from WormBase (235) release WS245. The reference sequences were supplemented with the E.

coli OP50 genome sequence to remove potential bacterial contamination due to the growth

medium.

3.4.2 Samples.

The RNA-seq samples processed in this study are listed in Table 3.5.1.

3.4.3 RNA-seq preprocessing and alignment.

Spliced leader sequences (SL1 and SL2) were downloaded from WormBase WS245 and were

trimmed from the ends of read pairs using an in-house developed program. For the RNA-seq datasets

downloaded from Whipple et al. we also trimmed Illumina universal sequencing adapters. We

used STAR (189) version 2.3e for the identification of novel splice junctions with the default

parameters along with a GTF file using the “--sjdbGTFfile” parameter. The output from STAR

was piped into the RNASequel transcriptome generation tool using the following command:

STAR --genomeDir star-index --readFilesIn r1.fastq.gz r2.fastq.gz --readFilesCommand zcat --outSAMstrandField intronMotif --genomeLoad NoSharedMemory --outFileNamePrefix star --outStd SAM | samtools view -bSu - | rnasequel transcriptome -g genes.gtf -n ${read_size} -r genome.fa -b - -o tx --skip-ambiguous --max-intron 32000 --skip MtDNA,OP50

Splice junctions identified in the MtDNA and E. coli OP50 genome were discarded, the

maximum intron size was limited to 32,000 bp, and non-canonical splice junctions were

discarded.

BWA mem indexes were then generated for each sample using the following commands:

bwa index tx.fa

bwa index genome.fa

Finally, RNASequel was used to remap the reads to the splice-junction index and reference

index:

rnasequel merge --filter rRNA.fa -r genome.fa -g genes.gtf -f tx.txt -o align r1.fastq.gz r2.fastq.gz bwa-index/genome.fa ./tx.fa

3.4.4 Whole Genome Alignment and Variant Calling.

The read pairs were aligned to the C. elegans genome using BWA mem:

bwa mem -M -a -t 8 -B 2 genome.fa r1.fastq.gz r2.fastq.gz | samtools view -bS - > pairs.bam

The alignments were then sorted using samtools. A minimum base quality was set at 5 for a

position to be counted. Positions with at least 10x coverage, an alternative allele frequency >

0.10 and an average base quality supporting the alternative allele of at least 25 were retained.
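
The retention thresholds for the genome sequencing data can be expressed as a small predicate. The following is a minimal sketch only: the PileupSite record and the function name are hypothetical and do not correspond to the in-house pile-up program, and it assumes that bases below the minimum quality are excluded before coverage and allele frequency are computed. Only the numeric thresholds are taken from the text.

```python
from dataclasses import dataclass

# Thresholds taken from the text; everything else is illustrative.
MIN_BASE_QUALITY = 5       # bases below this quality are not counted
MIN_COVERAGE = 10          # at least 10x coverage
MIN_ALT_FREQUENCY = 0.10   # alternative allele frequency must exceed this
MIN_MEAN_ALT_QUALITY = 25  # mean quality of bases supporting the alternative allele


@dataclass
class PileupSite:
    """Hypothetical per-position summary of a pile-up."""
    ref_qualities: list  # base qualities of reads showing the reference base
    alt_qualities: list  # base qualities of reads showing the alternative base


def keep_genomic_variant(site):
    """Return True if the site passes the coverage, frequency and quality thresholds."""
    ref = [q for q in site.ref_qualities if q >= MIN_BASE_QUALITY]
    alt = [q for q in site.alt_qualities if q >= MIN_BASE_QUALITY]
    coverage = len(ref) + len(alt)
    if coverage < MIN_COVERAGE or not alt:
        return False
    alt_frequency = len(alt) / coverage
    mean_alt_quality = sum(alt) / len(alt)
    return alt_frequency > MIN_ALT_FREQUENCY and mean_alt_quality >= MIN_MEAN_ALT_QUALITY
```

For example, a site with nine reference bases and two alternative bases, all at quality 30, passes, whereas the same site with a single alternative base does not, because its allele frequency of exactly 0.10 does not exceed the threshold.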

3.4.5 Identifying potential A-to-I editing events.

The RNA-seq alignments were piled up using an in-house program. The alignments were

retained if they had no more than two ambiguous mappings. Positions with a single alternate allele,

with a frequency greater than 0.05, a minimum base quality of 5 and at least 2x coverage were

retained. The retained alignments were then searched for potential edits using the following

criteria to discard low quality calls: 1) at least one uniquely mapped pair supporting the change

(to eliminate alignment artifacts due to singleton read alignments); 2) an average base quality of

at least 25 for alignments supporting the alternative allele; 3) positions mapping to tandem repeats

identified by trf (208), or to low complexity and simple regions according to RepeatMasker, were discarded;

4) the alternative base fell outside of the first and last 5 base pairs of at least one supporting

read; and 5) at least 10x coverage and an alternative allele frequency of less than

10% at the same position in the genome sequencing data. Variants of the same type were

clustered into regions linked by at least 1x coverage or by a read-pair spanning the region. Clusters were

retained if they had at least 5 variants, an average distance between the variants of less than 150 bp,

and at least 3 different uniquely mapped read-pairs with support for the clustered variant. For

clusters with an average distance between variants of 150 bp or more, the variant with the

longest distance at either end of the cluster was trimmed off until the average distance was less than

150 bp (retained) or the number of edits fell below 5 (discarded). After all of the samples were

individually processed, overlapping clusters from the adr-2(lf) and adr-2(wt) samples were merged

separately.
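
The end-trimming rule for clusters can be illustrated with a short sketch. This is not the pipeline's implementation: the function names are hypothetical, the rule is read as applying to clusters whose average spacing is 150 bp or more, ties between the two ends are broken arbitrarily by trimming the right end, and the requirement for three supporting read-pairs is omitted for brevity.

```python
def mean_gap(positions):
    """Average distance between consecutive variants in a sorted list of positions."""
    return (positions[-1] - positions[0]) / (len(positions) - 1)


def trim_cluster(positions, min_variants=5, max_mean_gap=150):
    """Trim outlying end variants until the cluster is dense enough.

    Returns the retained positions, or None if the cluster is discarded.
    Assumes min_variants is at least 2.
    """
    pos = sorted(positions)
    while len(pos) >= min_variants:
        if mean_gap(pos) < max_mean_gap:
            return pos  # dense enough: retained
        left_gap = pos[1] - pos[0]
        right_gap = pos[-1] - pos[-2]
        # Drop whichever end variant sits furthest from its neighbour.
        if left_gap > right_gap:
            pos = pos[1:]
        else:
            pos = pos[:-1]
    return None  # fewer than min_variants remain: discarded
```

For instance, trim_cluster([0, 10, 20, 30, 40, 1000]) drops the isolated variant at position 1000 and retains the remaining five, while a cluster of five variants spaced 500 bp apart is discarded after the first trim.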

3.4.6 Annotating edits and clusters.

Clusters and edits were annotated for their inverted repeat, transposon, gene and cistron (operon)

overlap. We included a merged set of novel splice junctions from all of the samples to annotate

edit clusters within novel introns. We extended the 3’-UTR of all transcripts that did not overlap

an operon until coverage reached 0 or the extension overlapped an annotated exon. The base

types of an edit position were inferred using the strand of the edit and gene annotations. If an edit

position overlapped more than one feature the feature was marked as ambiguous. For base-type

and repeat-type rates A’s on the plus strand, T’s on the minus strand and both A’s and T’s for

ambiguous positions were counted.

3.4.7 Chromosomal maps

The number of edited clusters of each base type was counted in 2kb bins and the cumulative

frequency of clustered edits with the same base type was also calculated.
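
The binning and cumulative frequency calculation can be sketched as below. This is an illustrative example under stated assumptions: the function names are hypothetical, each cluster is assigned to a bin by a single zero-based coordinate (for example its start), and one chromosome and one base type are processed at a time.

```python
from collections import Counter
from itertools import accumulate


def bin_counts(cluster_positions, chrom_length, bin_size=2000):
    """Count clusters per fixed-width bin along one chromosome."""
    n_bins = (chrom_length + bin_size - 1) // bin_size  # round up to cover the end
    counts = Counter(pos // bin_size for pos in cluster_positions)
    return [counts.get(i, 0) for i in range(n_bins)]


def cumulative_frequency(counts):
    """Running fraction of all clusters seen up to and including each bin."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [running / total for running in accumulate(counts)]
```

Under these assumptions, four clusters at positions 0, 1999, 2000 and 5000 on a 6 kb sequence give per-bin counts of 2, 1 and 1, and cumulative frequencies of 0.5, 0.75 and 1.0.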

Inverted repeat and transposon density were calculated by collapsing their respective annotations

and the proportion of bases overlapped by each type was calculated per 10kb window. Introns

were identified by merging genome annotations and novel junctions that had at least 5 unique

pairs mapping across them in 10% of the samples and the shortest intron was then identified by

collapsing the annotated and novel splice junctions. Gene density was calculated using all of the

annotated protein coding genes and binned into 10kb buckets. The bucket values were Box-Cox

transformed and then z-score normalized.

Normalized ChIP-chip data was downloaded from GEO with the accessions GSE58764 for HPL-

2 (214) and GSE26186 for embryonic H3K9 (215). The probe positions were converted to

WS245 using the WormBase remap script. ChIP-chip data was normalized to z-scores and the

median z-score was calculated for each 2kb window.
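
The z-score normalization and per-window median can be sketched as follows. This is a minimal illustration rather than the code used in this study: the function names are hypothetical, the population standard deviation is assumed, probes are represented as (position, signal) tuples for a single chromosome, and windows are non-overlapping.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def zscores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot z-score a constant signal")
    return [(v - mu) / sigma for v in values]


def window_medians(probes, window=2000):
    """Median z-score per window, keyed by window start.

    probes: list of (position, signal) tuples on one chromosome.
    """
    scores = zscores([signal for _, signal in probes])
    by_window = defaultdict(list)
    for (position, _), score in zip(probes, scores):
        by_window[position // window].append(score)
    return {index * window: median(vals) for index, vals in sorted(by_window.items())}
```

The returned dictionary maps each window start to the median z-score of the probes that fall within it, so windows without probes are simply absent.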

3.4.8 Detection of recurrent A-to-I editing events within splice sites, polyadenylation signals, and coding regions

Clustered and singleton A-to-I edits that overlapped either coding regions, splice sites, or

polyadenylation signals were retained for further analysis if they matched the following criteria:

1) at least 10X coverage and no overlap with transposable or inverted repeat elements, 2) at least

10X coverage in five of the adr-2(lf) samples with no evidence of the same variant, and 3)

replicated in at least five wildtype adr-2 samples.
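
The three recurrence criteria can be written as a single predicate. The sketch below is one possible reading and not the code used in this study: the SampleObservation record and function name are hypothetical, criterion 2 is interpreted as requiring at least five well-covered adr-2(lf) samples that lack the variant, and a wildtype sample is counted as replicating the edit only if it also has at least 10X coverage.

```python
from dataclasses import dataclass


@dataclass
class SampleObservation:
    """Hypothetical per-sample summary at one candidate edit position."""
    genotype: str   # "wt" or "lf" with respect to adr-2
    coverage: int   # read depth at the position
    alt_reads: int  # reads supporting the edited base


def is_recurrent_edit(observations, overlaps_repeat, min_coverage=10, min_samples=5):
    """Apply the repeat, knockout and wildtype replication criteria."""
    if overlaps_repeat:
        return False  # transposable or inverted repeat overlap is excluded
    clean_knockouts = sum(
        1 for o in observations
        if o.genotype == "lf" and o.coverage >= min_coverage and o.alt_reads == 0
    )
    edited_wildtypes = sum(
        1 for o in observations
        if o.genotype == "wt" and o.coverage >= min_coverage and o.alt_reads > 0
    )
    return clean_knockouts >= min_samples and edited_wildtypes >= min_samples
```

A stricter variant could additionally reject a position as soon as any knockout sample shows the edited base; the text does not specify which behaviour was used.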

To select for edits that may regulate splice site selection I searched for edits that overlapped the

first and last two base-pairs of annotated and de novo discovered introns on the correct strand

(see Methods 3.4.6 for de novo intron discovery).
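
The splice site overlap test can be illustrated with a small sketch. The names are hypothetical, and the example assumes zero-based, half-open intron coordinates (start, end) with a strand label; the actual coordinate convention of the pipeline is not stated in the text.

```python
def splice_site_positions(intron_start, intron_end):
    """Zero-based positions of the first two and last two bases of an intron.

    The intron is assumed to be half-open: [intron_start, intron_end).
    """
    return {intron_start, intron_start + 1, intron_end - 2, intron_end - 1}


def edit_hits_splice_site(edit_pos, edit_strand, introns):
    """True if the edit lies on a splice site dinucleotide of a same-strand intron.

    introns: iterable of (start, end, strand) tuples.
    """
    return any(
        strand == edit_strand and edit_pos in splice_site_positions(start, end)
        for start, end, strand in introns
    )
```

With an intron at (100, 200, "+"), edits at positions 100, 101, 198 and 199 on the plus strand are reported, whereas position 102, or any position on the minus strand, is not.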

For edits that overlapped polyadenylation signals I required the edit to overlap the six base pair

annotated polyadenylation signal in WormBase WS245.

To further filter edits within coding regions, I annotated putative amino acid changes with ANNOVAR

version 2015dec14 and only nonsynonymous changes were retained (236). I required the edit to

be replicated in at least one poly(A)-enriched sample. Finally, to annotate the domains where

potential amino acid changes occur, I downloaded the protein peptide sequences from

WormBase WS245 and predicted their domains using PFAM build 29.0 (December 2015) (237).

3.5 Appendix

Table 3.5.1. C. elegans Samples Processed in this Study

Sample Full Name Study Stage Library

BB2-adult1 adr-1(gv6) SRX335728 Adult Poly(A)

BB4-adult1 adr-1(gv6); adr-2(gv42) SRX335736 Adult Poly(A)

BB3-adult1 adr-2(gv42) SRX335732 Adult Poly(A)

N2e-YA_RZ-1 N2 1 SRS269392 Adult Total RNA

N2-adult1 N2 2 SRX335723 Adult Poly(A)

N2-adult8 N2 3 SRX335724 Adult Poly(A)

N2e-Ad_gonad-1-RZLI N2 Gonad SRS344182 Adult Total RNA

IP-N2-J2 N2; J2 SRR1581229 Adult Total RNA

IP-dcr-1-J2 IP dcr-1(XX); J2 SRR1581228 Adult Total RNA

N2e-DEntryDAF2-1-1 daf-2(e1370) Entry 1 SRS269109 Dauer Poly(A)

N2e-DEntryDAF2-4-1 daf-2(e1370) Entry 2 SRS269110 Dauer Poly(A)

N2e-DauerDAF2-2-1 daf-2(e1370) Entry 3 SRS269389 Dauer Poly(A)

N2e-DExitDAF2-3-1 daf-2(e1370) Exit 1 SRS269108 Dauer Poly(A)

N2e-DExitDAF2-6-1 daf-2(e1370) Exit 2 SRS269111 Dauer Poly(A)

N2e-DauerDAF2-5-1 daf-2(e1370) Exit 3 SRS269391 Dauer Poly(A)

IP-dcr-1-embryo-dcr IP dcr-1(XX); dcr SRR1581224 Embryo Total RNA

IP-rde-4-embryo-rde IP rde-4(XX); rde SRR1581225 Embryo Total RNA

N2e-4cell_EE_RZ-56 4 Cell SRS311761 Embryo Total RNA

BB2-embryo adr-1(gv6) SRX335725 Embryo Poly(A)

BB4-embryo adr-1(gv6); adr-2(gv42) SRX335733 Embryo Poly(A)

BB3-embryo adr-2(gv42) SRX335729 Embryo Poly(A)

N2e-E2-E8_sorted E cells E2-E8 SRS311762 Embryo Total RNA

N2e-DMM260_N2eref_EE Early Embryo Reference SRS344159 Embryo Total RNA

N2e-EE_DSN-51 Early embryos SRS311763 Embryo Total RNA

N2e-EE_RZ-54 Early embryos SRS311907 Embryo Total RNA

IP-N2-embryo-dcr IP N2; dcr SRR1581226 Embryo Total RNA

IP-N2-embryo-rde IP N2; rde SRR1581227 Embryo Total RNA

N2-embryo N2 SRX335693 Embryo Poly(A)

N2e-EE_50-0 N2 50-0 SRS258165 Embryo Poly(A)

N2e-DMM239_Z1Z4_Em Z1/Z4 SRS344160 Embryo Total RNA

BB2-L1 adr-1(gv6) SRX335726 L1 Poly(A)

BB4-L1 adr-1(gv6); adr-2(gv42) SRX335734 L1 Poly(A)

BB3-L1 adr-2(gv42) SRX335730 L1 Poly(A)

N2e-DMM383_all-nrn_L1 All Neuron 1 SRS344161 L1 Total RNA

N2e-DMM381_all-nrn_L1 All Neuron 2 SRS344162 L1 Total RNA

N2e-DMM383-all-nrn_L1-V All Neurons 3 SRS311746 L1 Total RNA

N2e-DMM391-all-nrn_L1-V All Neurons 4 SRS311749 L1 Total RNA

N2e-DMM387-NSML_NSMR-nrn_L1-V NSML and NSMR Neurons 1 SRS311748 L1 Total RNA

N2e-DMM386-NSML_NSMR-nrn_L1 NSML and NSMR Neurons 2 SRS308486 L1 Total RNA

N2e-DMM386-NSML_NSMR-nrn_L1-DSN NSML and NSMR Neurons 3 SRS308486 L1 Total RNA

N2e-DMM402_N2eall_L1-DSN N2 1 SRS311687 L1 Total RNA

N2e-DMM401_N2eall_L1-DSN N2 2 SRS311684 L1 Total RNA

N2e-DMM401-N2eall_L1-V N2 3 SRS311747 L1 Total RNA

N2e-DMM402-N2eall_L1-V N2 4 SRS311750 L1 Total RNA

N2-L1 N2 5 SRX335720 L1 Poly(A)

N2e-DMM389_NSM_L1 NSM neurons SRS344181 L1 Total RNA

N2e-DMM408_Amot_nrn_L2-DSN A motor neuron 1 SRS311688 L2 Total RNA



N2e-L2_DSN-50 N2 1 SRS344178 L2 Total RNA

N2-L2L3 N2 2 SRX335721 L2 Poly(A)

N2e-L2_RZ-53 N2 3 SRS311908 L2 Total RNA

N2e-pharyngeal N2 pharyngeal muscle SRS242498 L2 Poly(A)

BB2-L4 adr-1(gv6) SRX335727 L4 Poly(A)

BB4-L4 adr-1(gv6); adr-2(gv42) SRX335735 L4 Poly(A)

adar2-r1 adr-2(gv42) SRS706525 L4 Total RNA

BB3-L4 adr-2(gv42) 1 SRX335731 L4 Poly(A)

adar2-polyA adr-2(gv42) 2 L4 Poly(A)

wt-polyA N2 1 SRX707276 L4 Poly(A)

N2-L4 N2 2 SRX335722 L4 Poly(A)

wt-r1 N2 3 SRS706527 L4 Total RNA



tdp-polyA tdp-1(ok803) 1 SRX707279 L4 Poly(A)

tdp-R1 tdp-1(ok803) 2 SRS706529 L4 Total RNA

Table 3.5.2 A-to-I Edits predicted to affect amino acid changes within transcripts

Ref  Position (Zero-based)  Edit  Poly(A) Samples  Poly(A) Stages  Total RNA Samples  Total RNA Stages  Substitution  Gene  Transcripts  PFAM Domain

I 109754 A>G 2 Embryo 3 L1, Adult I64T rab-11.1 F53G12.1.1, F53G12.1.2 Ras

I 1701680 T>C 2 Embryo, Adult 3 Embryo, Adult S73P lsm-6 Y71G12B.14 LSM

I 1990465 A>G 4 Embryo, L1, L4 1 Adult L395P gap-3 Y20F4.3b -

I 1990465 A>G 4 Embryo, L1, L4 1 Adult L726P gap-3 Y20F4.3a -

I 4214436 A>G 11 Embryo, L1, L4, Adult, Dauer 1 Adult S295P WBGene00015976 C18E3.9a -

I 4214436 A>G 11 Embryo, L1, L4, Adult, Dauer 1 Adult S299P WBGene00015976 C18E3.9b -

I 8462478 T>C 1 Dauer 8 Embryo, L1, L2 S265P unc-120 D1081.2 -

I 9256992 T>C 14 Embryo, Dauer 0 T637A WBGene00009500 F36H2.3d -

I 9256992 T>C 14 Embryo, Dauer 0 T707A WBGene00009500 F36H2.3f -

I 9256992 T>C 14 Embryo, Dauer 0 T708A WBGene00009500 F36H2.3c -

I 9256992 T>C 14 Embryo, Dauer 0 T840A WBGene00009500 F36H2.3a -

I 9256992 T>C 14 Embryo, Dauer 0 T848A WBGene00009500 F36H2.3b -

I 9257111 A>G 17 Embryo, Dauer 2 L2, L4 V597A WBGene00009500 F36H2.3d Sushi

I 9257111 A>G 17 Embryo, Dauer 2 L2, L4 V667A WBGene00009500 F36H2.3f Sushi

I 9257111 A>G 17 Embryo, Dauer 2 L2, L4 V668A WBGene00009500 F36H2.3c Sushi

I 9257111 A>G 17 Embryo, Dauer 2 L2, L4 V800A WBGene00009500 F36H2.3a Sushi

I 9257111 A>G 17 Embryo, Dauer 2 L2, L4 V808A WBGene00009500 F36H2.3b Sushi

I 14829894 A>G 5 Embryo, Dauer 0 Y351H eva-1 F32A7.3b -

I 14829894 A>G 5 Embryo, Dauer 0 Y456H eva-1 F32A7.3a -

I 15020092 T>C 2 Embryo, Dauer 3 Embryo, L2 D640G dog-1 F33H2.1.1, F33H2.1.2 -

II 33340 T>C 7 Embryo, Dauer 0 S264P fbxc-54 F23F1.3 -

II 2443950 A>G 6 Embryo 0 D120G vab-19 T22D2.1 -

II 3899258 A>G 4 Embryo 1 Embryo E80G WBGene00018750 F53C3.6b -

II 3899258 A>G 4 Embryo 1 Embryo E179G WBGene00018750 F53C3.6a -

II 5675381 A>G 2 Embryo 4 Embryo, L1, L2, L4 V763A WBGene00016117 C25H3.8 -

II 5675849 A>G 3 Embryo 6 Embryo, L1, L2, L4 L709S WBGene00016117 C25H3.8 -

II 7153223 A>G 2 Dauer 4 L4, Adult D67G fust-1 C27H5.3.1, C27H5.3.2 -

II 7496628 A>G 5 Embryo 0 K40E spi-1 R10H1.4 -

II 8485988 A>G 11 Embryo, L2, Dauer 0 N18D gbh-2 M05D6.7 -

II 9056616 T>C 5 L1, L2, L4, Adult 0 K55E rpl-32 T24B8.1b.1, T24B8.1b.2, T24B8.1b.3 Ribosomal_L32e

II 9056616 T>C 5 L1, L2, L4, Adult 0 K108E rpl-32 T24B8.1a.2, T24B8.1a.3, T24B8.1a.1 Ribosomal_L32e

II 10158354 T>C 10 Embryo, Dauer 1 L1 E506G daf-19 F33H1.1c -

II 10158354 T>C 10 Embryo, Dauer 1 L1 E522G daf-19 F33H1.1e -

II 10158354 T>C 10 Embryo, Dauer 1 L1 E545G daf-19 F33H1.1d -

II 10158354 T>C 10 Embryo, Dauer 1 L1 E664G daf-19 F33H1.1a -

II 10158354 T>C 10 Embryo, Dauer 1 L1 E689G daf-19 F33H1.1b -

II 11096953 T>C 5 Embryo, L1, L2, Adult 0 K65E rpl-41 C09H10.2 Ribosomal_L44

II 11211721 T>C 13 Embryo, L4, Adult, Dauer 4 L1, L2, Adult I995T srap-1 T06D8.1c -

II 11211721 T>C 13 Embryo, L4, Adult, Dauer 4 L1, L2, Adult I1199T srap-1 T06D8.1a, T06D8.1b -

II 12075627 A>G 2 Embryo, Adult 4 Embryo, L1 S347P cnt-1 Y17G7B.15b -

II 12075627 A>G 2 Embryo, Adult 4 Embryo, L1 S431P cnt-1 Y17G7B.15a -

II 12896646 T>C 5 L1, L4, Adult 0 S226P clec-63 F35C5.6 VWA

II 13390908 A>G 4 Embryo, Dauer 1 L1 D117G WBGene00012995 Y48C3A.14a Toprim

II 13539119 T>C 4 Embryo, Dauer 1 Embryo K396E WBGene00013001 Y48E1B.2a -

III 940955 T>C 4 Embryo 1 Embryo N292D fbxa-59 T12B5.8 -

III 1872247 A>G 22 Embryo 6 Embryo, L2, Adult H260R WBGene00021444 Y39A3CR.3b -

III 1872247 A>G 22 Embryo 6 Embryo, L2, Adult H656R WBGene00021444 Y39A3CR.3a -

III 2698418 A>G 4 Embryo, L2 2 Embryo, Adult L87P WBGene00022174 Y71H2AM.9 -

III 4476646 T>C 1 Embryo 4 Embryo, L2, Adult Y203H rps-1 F56F3.5 Ribosomal_S3Ae

III 5277404 A>G 7 Embryo, Dauer 0 H417R toc-1 ZC395.3a.2, ZC395.3a.1 -

III 5277404 A>G 7 Embryo, Dauer 0 H435R toc-1 ZC395.3b -

III 5499876 T>C 5 Embryo, Dauer 0 X166Q WBGene00019838 R02F2.9 -

III 5783791 A>G 12 Embryo, Dauer 4 Embryo, L2 L136P WBGene00021345 Y37B11A.3 -

III 7985044 T>C 24 Embryo 5 Embryo, L2, Adult I635T clp-1 C06G4.2b.1, C06G4.2b.2 Calpain_III

III 7985044 T>C 24 Embryo 5 Embryo, L2, Adult I660T clp-1 C06G4.2d Calpain_III

III 7985044 T>C 24 Embryo 5 Embryo, L2, Adult I681T clp-1 C06G4.2a Calpain_III

III 8914116 A>G 4 Embryo, Dauer 2 L4, Adult N97D trxr-2 ZK637.10.2, ZK637.10.1 Pyr_redox_2

III 8973641 T>C 6 Embryo, Dauer 1 Adult I363T WBGene00011144 R08D7.4a.2, R08D7.4a.1 -

III 13780290 A>G 5 Embryo, L2 1 L4 K22E pot-3 3R5.1a, 3R5.1b POT1PC

III 13780291 A>G 3 Embryo, L2 3 L1, L2, L4 K22R pot-3 3R5.1a, 3R5.1b POT1PC

IV 256130 T>C 5 Embryo 1 Adult N243D WBGene00019825 R02D3.8 -

IV 992351 A>G 1 Dauer 7 Embryo, L1, L2, L4 T21A WBGene00021924 Y55F3AM.6b, Y55F3AM.6a zf-CCCH_3

IV 1300749 T>C 3 Embryo 6 L1, L2, L4 I531V WBGene00018776 F53H1.1d, F53H1.1c -

IV 1300749 T>C 3 Embryo 6 L1, L2, L4 I668V WBGene00018776 F53H1.1a, F53H1.1b -

IV 1963819 A>G 11 Embryo 2 Embryo, L2 I53V ulp-3 Y48A5A.2.2, Y48A5A.2.1 Peptidase_C48

IV 4308765 T>C 4 Embryo, Dauer 1 Embryo S227P set-9 F15E6.1 -

IV 4313266 A>G 5 Embryo 1 Embryo Q1100R set-9 F15E6.1 -

IV 4316338 A>G 2 Embryo, Dauer 3 L1 D1588G set-9 F15E6.1 -

IV 7026563 A>G 2 Embryo, L1 3 L1 D17G WBGene00015551 C06G3.5b A_deaminase

IV 7026563 A>G 2 Embryo, L1 3 L1 D59G WBGene00015551 C06G3.5a A_deaminase

IV 7781944 A>G 3 Embryo, Dauer 2 Embryo L124P sec-10 C33H5.9 Sec10

IV 10839352 A>G 3 Embryo 4 Embryo, L1, L2, L4 I555T WBGene00011720 T11G6.5b.1, T11G6.5b.2 -

IV 10839352 A>G 3 Embryo 4 Embryo, L1, L2, L4 I612T WBGene00011720 T11G6.5a -

IV 11509905 T>C 4 Embryo, Adult 3 L1 N89S WBGene00010848 M04B2.6 -

IV 12090044 A>G 4 Embryo, Dauer 1 L2 L465P mig-32 F11A10.3a -

IV 12142249 T>C 4 Embryo, Dauer 1 L1 I288V WBGene00007090 B0001.5 -

IV 12646747 A>G 4 Embryo, Dauer 1 L4 Y527C vha-7 C26H9A.1a V_ATPase_I

IV 12646747 A>G 4 Embryo, Dauer 1 L4 Y771C vha-7 C26H9A.1b V_ATPase_I

IV 12773897 A>G 1 L4 4 L1, L2 S5G unc-31 ZK897.1b, ZK897.1h, ZK897.1i, ZK897.1o, ZK897.1k, ZK897.1l, ZK897.1r, ZK897.1p, ZK897.1q, ZK897.1j, ZK897.1m, ZK897.1c, ZK897.1a, ZK897.1n -

V 6360107 A>G 2 L1 8 L1, L2 N199S WBGene00017430 F13H6.1a -

V 6360107 A>G 2 L1 8 L1, L2 N203S WBGene00017430 F13H6.1b.1, F13H6.1b.2 -

V 6502324 T>C 1 L1 6 L1, L2 L440P WBGene00020909 W01A11.1 -

V 8050237 A>G 7 Embryo 0 L186P col-43 ZC513.8 Collagen

V 8050247 T>C 7 Embryo 0 T183A col-43 ZC513.8 Collagen

V 8050382 T>C 8 Embryo, L2 1 L4 T138A col-43 ZC513.8 -

V 8050399 A>G 8 Embryo, L2 0 V132A col-43 ZC513.8 -

V 8268568 T>C 5 Embryo 1 Adult T108A asp-6 F21F8.7 Asp

V 9778852 A>G 2 Embryo, Dauer 3 L2, Adult S192P WBGene00009769 F46B6.4 -

V 11237607 A>G 6 Embryo, Dauer 0 S349P dvc-1 T19B10.6.2, T19B10.6.3, T19B10.6.1 -

V 11746239 A>G 5 Dauer 0 N1211S WBGene00011436 T04F3.1b -

V 11746239 A>G 5 Dauer 0 N1278S WBGene00011436 T04F3.1a -

V 12042436 A>G 8 Embryo, L2 0 I235V WBGene00008645 F10C2.4 DNA_pol_B_exo1

V 12866164 A>G 16 Embryo, Dauer 4 L2, L4, Adult Y51C WBGene00011591 T07F10.5.2, T07F10.5.1 -

V 13198692 A>G 1 L1 4 L1 V192A col-159 F57B1.3 Collagen

V 15197445 T>C 4 Embryo, Dauer 1 Embryo L856P plc-2 Y75B12B.6 -

V 18796242 A>G 2 L4, Dauer 3 L1 S376P nhr-170 C54E10.5 Hormone_recep

V 18796248 A>G 2 L4, Dauer 4 L1, L4 Y374H nhr-170 C54E10.5 Hormone_recep

V 18796287 A>G 2 L4, Dauer 3 L1 Y361H nhr-170 C54E10.5 Hormone_recep

V 18796290 A>G 3 L4, Dauer 2 L1 Y360H nhr-170 C54E10.5 Hormone_recep

X 4715660 T>C 4 Embryo, Dauer 1 L1 L118P rpl-25.1 F55D10.2 Ribosomal_L23

X 5462827 A>G 4 Embryo, L2, Dauer 1 Embryo N249D ddr-1 C25F6.4 -

X 6047311 A>G 4 Embryo 2 Embryo, L4 S113P pak-1 C09B8.7c -

X 6047311 A>G 4 Embryo 2 Embryo, L4 S159P pak-1 C09B8.7b -

X 6047311 A>G 4 Embryo 2 Embryo, L4 S162P pak-1 C09B8.7a -

X 14072009 A>G 22 Embryo 5 Embryo, L2, Adult E78G WBGene00007904 C33G3.4 -

X 14352031 T>C 1 Embryo 4 Embryo, L1 S393P wdr-5.2 K04G11.4 -

X 17459884 A>G 2 Embryo, Dauer 5 L1 I99V mlc-1 C36E6.3 -

4 Discussion

The advent of nucleic acid sequencing has led to an exponential increase in the information

available regarding the DNA and RNA content of a biological sample. DNA sequencing

technologies have progressed from sequencing single genes to gigabase-sized genomes. This

technology has also been adapted in tandem with improvements to DNA sequencing to sequence

the RNA content of a cell by taking advantage of reverse transcriptase proteins, permitting

scientists to sequence the expressed portion of a sample’s genome. RNA sequencing technology

evolved from EST and SAGE techniques to whole transcriptome sequencing using Illumina and

other high-throughput sequencers (10, 130, 131). RNA-seq permits quantitative transcriptome

profiling at single base-pair resolution, and experiments based on this technology have revealed

the tremendous complexity and dynamic nature of an organism’s transcriptome. For example, RNA-seq has

been used to identify and quantify RNA editing and alternative splicing on a global scale (12,

116, 138, 238). However, due to the relatively short read lengths of Illumina sequencers (<150

bp), the downstream analysis of RNA-seq experiments has been difficult. This is especially the

case when analyzing post-transcriptional regulatory events such as RNA editing and alternative splicing.

My first thesis objective was concerned with the development of an accurate RNA-seq alignment

program. Variant and repeat tolerant RNA-seq alignment is a challenging problem that can

impact the downstream interpretation of the data. RNA-seq alignment requires the alignment of

short exonic segments across introns that can span hundreds of kilobases. To date, the majority

of RNA-seq alignment tools have been concerned with the sensitive detection of splice junctions

with little emphasis on managing reads aligning to repeats and variants. Ideally, an RNA-seq

alignment tool must be capable of repeat tolerant and accurate gapped, mismatch and spliced-

read alignment. Accuracy for mismatch, gapped and spliced alignments is

interdependent. For instance, a read incorrectly mapped into the intronic sequence rather than

across a splice junction can lead to false positive mismatches, and an incorrect non-canonical splice

junction may be chosen rather than inserting a gap. Repeat alignment sensitivity is critical for the

accurate identification of mismatches such as RNA edits. Alignments incorrectly marked as

mapping to a single location could have additional mappings to similar sequences elsewhere.

These issues are also common when identifying RNA edits since they commonly occur in

transposable and repeat elements (34). One study that suffered from this issue used RNA-seq to

identify non-canonical RNA editing events (193). It was found that many of the non-canonical

events were due to the missed alignments to paralogous genes and incorrectly spliced read

alignments.

There has been a lack of freely available RNA-seq alignment algorithms dedicated to the

detection of SNVs and RNA edits. To address this I developed a novel post-processing strategy

called RNASequel that mitigates the aforementioned RNA-seq alignment issues. The primary

innovations implemented by RNASequel have led to an accurate and dynamic RNA-seq

alignment methodology. RNASequel is designed to be durable by not depending on any specific

tool for de novo splice junction detection and contiguous read alignment. Four primary features

have led to RNASequel’s highly accurate alignment: 1) a splice junction-database that

integrates novel splice junctions and is capable of handling reads that map across multiple splice

junctions; 2) independent mapping of the reads from each pair to maximize repeat sensitivity; 3)

the use of highly accurate contiguous read aligners; and 4) an algorithm engineered to empirically

determine the fragment size distribution and to identify repeat and concordant paired-

alignments.

To validate and benchmark the accuracy of the improvements facilitated by RNASequel versus

traditional RNA-seq alignment tools I compared RNASequel to two popular RNA-seq alignment

algorithms (Tophat2 and STAR). I utilized two simulated human RNA-seq datasets and 28

human-derived biological datasets. For the two simulated datasets I demonstrated that

RNASequel post-processing leads to a marked increase in the accuracy of mismatch, gap and

spliced alignments. I further show that RNASequel is more sensitive to repeat alignments and

that my fragment size estimation algorithm aids in the choice of the primary alignment. For the

biological datasets I demonstrate that RNASequel realignment leads to increased mapping rates, a

reduction in the number of non-canonical edits (false positives), an increase in the number of

somatic SNVs, and similar levels of A-to-I RNA edit calls. The improved variant calls were

partially facilitated by an improved repeat sensitivity. In conclusion RNASequel is an important

innovation for RNA-seq alignments and will be useful for RNA-seq experiments where accuracy

is paramount such as the identification of SNVs and RNA edits.

The original published version of RNASequel had a few issues, particularly in its disk space

usage and its empirical fragment size estimation algorithm, which could miss valid alignments

towards the tail of the fragment size distribution. This limited the number of samples that could

be processed quickly and caused RNASequel alignments to miss some pair alignments. I

mitigated both of the aforementioned issues for my second data chapter where the new version of

RNASequel was used to profile A-to-I RNA editing in C. elegans.

RNASequel is a useful tool to refine RNA-seq alignments and with future extensions

RNASequel could be used to improve the alignments of circular RNAs, fusions and trans-

splicing events (222, 239, 240). The splice-junction database implemented by RNASequel could

be modified to realign reads across any of the aforementioned events to take advantage of

RNASequel’s high sensitivity and low false positive rate. Furthermore, as more organisms

without complete reference genomes are profiled with RNA-seq, taking advantage of the output

of de novo assemblers will be essential. There will be a need to accurately map short reads

from the assembled RNA-seq data to the assembled reference genome (241, 242). RNASequel

can be modified to work with the assembled contigs in combination with the available reference

genome data to maximize alignment accuracy.

The biological role of RNA editing within an organism has not been fully explored. One of the

reasons for this is that ADAR knockouts in higher order metazoa are lethal; however, other

organisms such as C. elegans and D. melanogaster remain viable (44, 243). The nematode C.

elegans encodes two ADAR genes: adr-2 encodes the catalytically active enzyme and adr-1 the

catalytically inactive enzyme (44). ADAR1 proteins are thought to modulate the editing activity

of ADAR2 and play a regulatory role in small RNA processing. Double knockouts of the adr

genes cause an up-regulation of small RNA expression (53). The caveat to using C. elegans as a

model system is that it may not have evolved the same dependence on A-to-I RNA editing as higher

order metazoa. ADAR2 dependent RNA editing in C. elegans has been shown to occur in hyper-

edited clusters that localize to non-coding genetic elements such as introns, 3’-UTRs and

intergenic regions (70). These elements are common sources of structured RNAs due to inverted

repeats and transposable element insertions.

This association with repeat elements increases the complexity when trying to identify RNA

edits using RNA-seq. Repeat sensitivity is essential to accurately map the relatively short (<100

bp each read) RNA-seq pairs. Recent studies have globally profiled RNA editing in C. elegans

by using RNA-seq and complex in-house alignment and hyper-editing detection (53, 68, 69).

Combining the calls from these papers is suboptimal because of their different analysis

pipelines and a unified method and profile is essential to construct a comprehensive map of

ADAR2 dependent RNA editing in C. elegans.

The final objective of my thesis has been to construct an accurate and comprehensive map of A-

to-I RNA editing sites in C. elegans. To accomplish this I utilized the improved version of

RNASequel and a sensitive hyper-editing cluster identification pipeline. Using both of these

software innovations I have pushed the detection of ADAR2 editing sites to its near-limit. For

cluster edits I require a minimum coverage of 3 unique reads. The minimum coverage could be

reduced, but the false positive rate would increase substantially. The 91 C. elegans RNA-seq

samples processed in this study include adr-2 knockouts, poly(A) selected and total RNA

wildtype adr-2 samples. After merging the edits discovered in all of these samples I identified

197,890 A-to-I edit sites within 10,941 hyper-edited clusters. I verified previous observations

that A-to-I editing returns to background level in the adr-2 knockout samples and the edits are

strongly associated with non-coding genetic elements and repeat DNA. To verify the efficacy of

my pipeline I compared the overlap between my edit calls and those by Zhao et al. (69) and

Whipple et al. (68) and found a ~80% and ~74% concordance respectively. I also identified

~155,000 additional A-to-I hyper-editing sites.

Despite the comprehensiveness of the A-to-I edit sites constructed in this study I suspect that

there are still more undiscovered sites based on three observations. The first is that when

ambiguously mapped reads are included in the dataset the number of edit sites nearly doubles to

409,008 edits in 15,271 clusters. This suggests that a proportion of the sites mapped to by

ambiguously mapped reads may contain additional editing sites. It cannot be asserted that all of

the sites are indeed expressed and edited at some point in the worm life cycle. The second

observation is that there are inverted repeat structures within long introns that may also be targets

for ADAR2 but may have been missed in the sequencing data (Figure 3.11). The third

observation, based on saturation analysis, is that the number of new hyper-edited sites that I

identified in this study increased with the addition of new samples (Figure 3.12).

Delineating the full extent of A-to-I editing in an organism’s genome is a challenging endeavor

and is limited by three major factors: the read length and throughput of RNA-seq experiments;

the rarity of edited transcripts (for example pre-mRNAs); and the fact that not every ADAR

dsRNA substrate may be expressed in a given cell type, tissue or condition. Furthermore, the

heterogeneity of whole organism derived samples such as C. elegans worms may obscure

biologically relevant signals in rare cell types or rarely expressed genes. These issues could be

mitigated by three primary methods: 1) improved sample purification methods; 2) improved

sequencing technologies; and 3) targeted sequencing approaches. Selective sample isolation such as

flow sorting for C. elegans samples would aid in understanding of editing within different worm

tissues such as neuronal or germline tissues (244). Single cell sequencing could further resolution

in the understanding of editing at the cellular level by permitting the quantification of editing

within a tissue on a single cell basis (245). Improved sequencing technologies can resolve repeat

and low coverage issues by increasing sequencing throughput and/or read length. Finally,

targeted sequencing can help select for genetic elements with low coverage such as nascent

transcripts, determine the cellular localization of edited transcripts, and identify the transcripts

associated with inosine binding proteins. A combination of all three aforementioned methods

could be used to increase resolution further.

Sample heterogeneity is a confounding factor when analyzing whole tissues or organisms for

RNA editing (245-247). A typical RNA-seq experiment is a snapshot of the cell population at a

given time. This may lead to scientific discoveries being biased towards the most common cell

types and may obscure other biologically relevant signals (245-247). A notable example of this is

cancer stem cells being obscured by other tumor cells (246, 248). There is currently a push to

develop commercial kits for single-cell RNA-seq, and I believe this will be the future of RNA

sequencing. However, even with new Illumina sequencers the cost of deeply sequencing single

cells is prohibitive, and the majority of experiments aim for low sequencing depth

(10-20M reads per cell) for gene expression studies. This depth is not suitable for RNA editing

identification since the majority of inosine containing transcripts are rare. Furthermore, single

cell experiments tend to have an increased end bias due to poly(A) amplification compared to

traditional RNA-seq experiments (245). As sequencing throughput and read length increase and

single-cell RNA-seq kits improve, this technology will become essential to understand RNA

regulation in the context of splicing, gene expression and RNA editing in a population of single

cells. Single cell RNA-seq does pose a challenge when mapping read pairs, since there can be

thousands of samples. The upgraded version of RNASequel would be more than capable of

processing this many samples, since it does not produce temporary files and works directly with raw

sequencing reads.

There are two primary developments in sequencing technology that will improve RNA editing

detection. The first is increased sequencing depth, which will increase the read depth for rare transcripts such as

pre-mRNAs. This can be achieved using the Illumina platform by utilizing additional

sequencing lanes, but as of now the cost is still prohibitive. As the cost of sequencing decreases

it will become more common to sequence libraries to high depth. This will increase coverage

across intronic sequences, for example, which may lead to the identification of more edit sites

and sites with a low frequency. The second sequencing development that will aid in the

identification of RNA editing is single molecule sequencing. The sequencers capable of high-

throughput single molecule sequencing are still in their infancy but they will be useful to resolve

ambiguous read mappings and the extent of RNA editing in single transcripts. Two notable

single molecule sequencing technologies are the Pacific Biosciences SMRT sequencer (249),

capable of generating an average read length of 10–15 kbp, and the Oxford Nanopore

Technologies sequencer (250), which can generate ~5 kbp read lengths. Both of these technologies

currently have high error rates and lower throughput than existing sequencing technologies, but as

they improve they could be useful for profiling RNA edits. Selecting for transcripts that are

likely to be edited could mitigate the lower throughput of these instruments.
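
The relationship between read depth and the chance of observing a low-frequency edit, discussed above, can be illustrated with a simple binomial model. This is a rough sketch under stated assumptions and not the detection criterion used by my pipeline: reads covering a site are treated as independent draws, sequencing and alignment errors are ignored, and the minimum number of edited reads required to call a site (`min_edited_reads`) is a hypothetical threshold.

```python
from math import comb


def prob_detect(coverage, edit_fraction, min_edited_reads=2):
    """Probability of seeing at least min_edited_reads edited reads at a site.

    coverage: number of reads covering the site.
    edit_fraction: fraction of transcripts edited at the site (0 to 1).
    min_edited_reads: hypothetical calling threshold (an assumption).
    """
    if coverage < 0 or min_edited_reads < 1:
        raise ValueError("coverage must be >= 0 and min_edited_reads >= 1")
    if not 0.0 <= edit_fraction <= 1.0:
        raise ValueError("edit_fraction must be between 0 and 1")
    # Sum the binomial probabilities of observing fewer edited reads than
    # the threshold, then take the complement.
    upper = min(min_edited_reads, coverage + 1)
    p_below = sum(
        comb(coverage, i)
        * edit_fraction ** i
        * (1.0 - edit_fraction) ** (coverage - i)
        for i in range(upper)
    )
    return max(0.0, 1.0 - p_below)


# Illustrative values only: a site edited in 5% of transcripts.
for depth in (10, 50, 100, 500):
    print(depth, round(prob_detect(depth, 0.05), 3))
```

Under this model the probability of detection rises steeply with coverage, which is the intuition behind deeper sequencing recovering more sites with a low editing frequency.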

The genetic elements targeted by ADAR proteins in C. elegans are generally rare, with the

majority derived from intronic and intergenic sequences. Although sequencing technologies

have yielded unprecedented throughput, the cost of deeply sequencing a large number of RNA-seq

samples from differing tissues or whole worms would be restrictive. Therefore, it would be ideal

if the regions likely to be edited could be directly targeted for sequencing. I foresee two useful

sets of techniques to select for edited transcripts. The first involves RNA immunoprecipitation

followed by high-throughput sequencing, where RNAs bound by dsRBPs are precipitated (251).

RNA with structured regions can also be directly precipitated using dsRNA specific antibodies

(68, 210, 252). The second set of methods involves selecting for transcripts with specific

properties such as intron lariats or nascent transcripts (pre-mRNAs). These methods would not

only reduce the sequencing cost but may also aid in elucidating the biological context of A-to-I

RNA editing by determining which proteins appear to target edited transcripts and their cellular

localization.

Immunoprecipitation of structured RNAs associated with dsRBPs could be used to explore

which proteins bind inosine-containing RNA and the possibility that some dsRBPs may play

a regulatory role by competing with ADARs for dsRNA targets. In C. elegans, adr-1

immunoprecipitation with RNA-seq was used to sequence the targets of ADAR1 in vivo (62).

This method could be extended and combined with deeper sequencing and paired-end reads to

isolate structured RNAs that may also have A-to-I edits. The one caveat of these methods is that

if ADAR2 editing disrupts the dsRNA structure, the RNAs may no longer be bound by these

proteins. Other proteins known to bind dsRNAs could also be used for precipitation, such as

Staufen, NONO, RNAi components, or Vigilins (57, 58, 68, 253).

Transcripts with 3’-UTR A-to-I edits have been shown to co-localize with nuclear paraspeckles

by NONO binding (57, 254). However, a cursory literature search has not revealed evidence of

nuclear paraspeckles in C. elegans. Homologues to some of the proteins required for paraspeckle

formation such as NONO (nono-1 on WormBase) are present but have not been characterized. It

would be interesting to verify if C. elegans NONO binds inosine containing dsRNA and if it

leads to the nuclear retention of edited dsRNAs.

The Vigilin proteins have been shown to have high affinity for A-to-I containing RNAs and

to localize to heterochromatin (58). This is consistent with the observation that hyper-edited

RNA transcripts are commonly associated with heterochromatin in C. elegans. Knockouts of the

C. elegans Vigilin homolog increased chromosomal nondisjunction (191). It would be interesting

to identify RNA targets bound by the C. elegans Vigilin homolog and to explore whether ADAR

knockouts have an effect on Vigilin binding or heterochromatin deposition.

Previous studies profiling RNA editing in C. elegans have taken advantage of the monoclonal J2

anti-dsRNA antibody, which is capable of binding dsRNA helices of at least 40 bp in length (68,

210, 252). The study by Saldi et al. used single-end reads, making it difficult to resolve

ambiguously mapped reads, and the study by Whipple et al. had extensive rRNA contamination

(Figure 3.1). J2 precipitation combined with improved rRNA depletion and paired-end

sequencing could provide a method to select for long structured RNAs and profile their A-to-I

editing.

RNase R, a 3’→5’ exoribonuclease, is capable of digesting ssRNA but does not digest most

dsRNA, intron lariats or circRNAs (225). RNase R digestion in combination with RNA-seq has been

used to identify circRNAs in both humans and C. elegans. Since the majority of edits (53.3%)

identified in the C. elegans samples processed in this study were intronic, RNase R digestion

may select for intron lariats containing edits. This could permit the deep profiling of intronic

edits, which could be combined with alternative splicing analysis to determine whether RNA

editing regulates splicing in C. elegans. This method could also be used to further explore

evidence that structured RNAs and RNA editing may play a role in circular RNA biogenesis

(71, 222).

ADAR RNA editing has previously been shown to be co-transcriptional in Drosophila (17, 19,

37). Moreover, the majority of the edits I observed were intronic and associated with transposons

and/or inverted repeats. Therefore, it would be worthwhile to sequence nascent RNA transcripts

to isolate intronic sequences prior to splicing (19). This method could test whether RNA editing

occurs co-transcriptionally in C. elegans.

Long non-coding RNAs (lncRNAs) are a recently discovered class of non-coding transcripts with a length

of at least 200 nt that have been associated with development and disease (13). Many of these

lncRNA transcripts contain transposable elements such as endogenous retroviral elements or

Alu elements (255). The extensive secondary structure of lncRNA transcripts could make them targets for

RNA editing. The best example of this is the C. elegans lncRNA rncs-1, which is a small RNA

pathway antagonist and a target for extensive A-to-I editing (256). I think it would be worthwhile to further explore how

many lncRNA genes undergo editing and whether there are dsRNA binding proteins that inhibit

this to protect the lncRNA from editing or induced structural changes. Another possibility is that

the lncRNA requires protein binding to fold properly. LncRNA editing has the potential to have

a profound impact on cellular and disease development.

The complete biological role of A-to-I RNA editing has yet to be elucidated; however, as

sequencing technologies and library preparation methods improve, the biology will become

clearer. Tools to analyze and interpret these data will be essential, and my thesis project has

contributed RNASequel and a sensitive pipeline for the identification of RNA edits that will be

useful in the analysis of A-to-I RNA editing as new datasets are generated and new

methods are devised.

References

1. Watson,J.D. and Crick,F.H. (1974) Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. JD Watson and FHC Crick. Published in Nature, number 4356 April 25, 1953. Nature.

2. Sanger,F. (1988) Sequences, sequences, and sequences. Annu. Rev. Biochem.

3. Hutchison,C.A. (2007) DNA sequencing: bench to bedside and beyond. Nucleic Acids Research, 35, 6227–6237.

4. Kircher,M. and Kelso,J. (2010) High-throughput DNA sequencing - concepts and limitations. BioEssays, 32, 524–536.

5. Mardis,E.R. (2012) PERSPECTIVE. Nature, 470, 198–203.

6. Metzker,M.L. (2009) Sequencing technologies — the next generation. Nature Reviews Genetics, 11, 31–46.

7. Stein,L.D. (2010) The case for cloud computing in genome informatics. Genome biology, 11, 207.

8. Lander,E.S., Linton,L.M., Birren,B., Nusbaum,C., Zody,M.C., Baldwin,J., Devon,K., Dewar,K., Doyle,M., FitzHugh,W., et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

9. The Human Genome Project Completion: Frequently Asked Questions. National Human Genome Research Institute.

10. Wang,Z., Gerstein,M. and Snyder,M. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10, 57–63.

11. Zhao,S., Fung-Leung,W.-P., Bittner,A., Ngo,K. and Liu,X. (2014) Comparison of RNA-Seq and Microarray in Transcriptome Profiling of Activated T Cells. PLoS ONE, 9, e78644–13.

12. Ramaswami,G., Lin,W., Piskol,R., Tan,M.H., Davis,C. and Li,J.B. (2012) Accurate identification of human Alu and non-Alu RNA editing sites. Nature Methods, 9, 1–5.

13. Gibb,E.A., Brown,C.J. and Lam,W.L. (2011) The functional role of long non-coding RNA in human carcinomas. Mol Cancer, 10, 38.

14. Ponting,C.P., Oliver,P.L. and Reik,W. (2009) Evolution and Functions of Long Noncoding RNAs. Cell, 136, 629–641.

15. Lee,T.I. and Young,R.A. (2013) Transcriptional Regulation and Its Misregulation in Disease. Cell, 152, 1237–1251.

16. Halbeisen,R.E., Galgano,A., Scherrer,T. and Gerber,A.P. (2007) Post-transcriptional gene regulation: From genome-wide studies to principles. Cell. Mol. Life Sci., 65, 798–813.

17. Bentley,D.L. (2014) Coupling mRNA processing with transcription in time and space. Nature Reviews Genetics, 15, 163–175.

18. Maniatis,T. and Reed,R. (2002) An extensive network of coupling among gene expression machines. Nature, 416, 499–506.

19. Rodriguez,J., Menet,J.S. and Rosbash,M. (2012) Nascent-Seq Indicates Widespread Cotranscriptional RNA Editing in Drosophila. Mol Cell, 47, 27–37.

20. Kornblihtt,A.R., la Mata,de,M., Fededa,J.P., Munoz,M.J. and Nogues,G. (2004) Multiple links between transcription and splicing. RNA (New York, N.Y.), 10, 1489–1498.

21. Ryman,K., Fong,N., Bratt,E., Bentley,D.L. and Ohman,M. (2007) The C-terminal domain of RNA Pol II helps ensure that editing precedes splicing of the GluR-B transcript. RNA (New York, N.Y.), 13, 1071–1078.

22. Berget,S.M., Moore,C. and Sharp,P.A. (1977) Spliced segments at the 5'terminus of adenovirus 2 late mRNA. In.

23. Chow,L.T., Gelinas,R.E., Broker,T.R. and Roberts,R.J. (1977) An amazing sequence arrangement at the 5' ends of adenovirus 2 messenger RNA. Cell, 12, 1–8.

24. Sharp,P.A. (2005) The discovery of split genes and RNA splicing. Trends in biochemical sciences, 30, 279–281.

25. Chen,M. and Manley,J.L. (2009) Mechanisms of alternative splicing regulation: insights from molecular and genomics approaches. Nature reviews. Molecular cell biology, 10, 741–754.

26. William Roy,S. and Gilbert,W. (2006) The evolution of spliceosomal introns: patterns, puzzles and progress. Nature Reviews Genetics, 7, 211–221.

27. Reed,R. (1989) The organization of 3' splice-site sequences in mammalian introns. Genes & Development, 3, 2113–2123.

28. Irimia,M. and Blencowe,B.J. (2012) Alternative splicing: decoding an expansive regulatory layer. Current Opinion in Cell Biology, 24, 323–332.

29. Ghigna,C., Valacca,C. and Biamonti,G. (2008) Alternative splicing and tumor progression. Curr. Genomics, 9, 556–570.

30. Hallegger,M., Llorian,M. and Smith,C.W.J. (2010) Alternative splicing: global insights. FEBS J, 277, 856–866.

31. Knoop,V. (2010) When you can’t trust the DNA: RNA editing changes transcript sequences. Cell. Mol. Life Sci., 68, 567–586.

32. Keegan,L.P., Gallo,A. and O'Connell,M.A. (2001) The many roles of an RNA editor. Nature Reviews Genetics, 2, 869–878.

33. Moris,A., Murray,S. and Cardinaud,S. (2014) AID and APOBECs span the gap between innate and adaptive immunity. Front Microbiol, 5, 534.

34. Bass,B.L. (2002) RNA editing by adenosine deaminases that act on RNA. Annu. Rev. Biochem., 71, 817–846.

35. Vendeix,F.A.P., Munoz,A.M. and Agris,P.F. (2009) Free energy calculation of modified base-pair formation in explicit solvent: A predictive model. RNA (New York, N.Y.), 15, 2278–2287.

36. Zinshteyn,B. and Nishikura,K. (2009) Adenosine-to-inosine RNA editing. Wiley interdisciplinary reviews Systems biology and medicine, 1, 202–209.

37. Laurencikiene,J., Källman,A.M., Fong,N., Bentley,D.L. and Ohman,M. (2006) RNA editing and alternative splicing: the importance of co-transcriptional coordination. EMBO reports, 7, 303–307.

38. Valente,L. and Nishikura,K. (2005) ADAR gene family and A-to-I RNA editing: diverse roles in posttranscriptional gene regulation. Progress in nucleic acid research and molecular biology, 79, 299–338.

39. Grice,L.F. and Degnan,B.M. (2015) The origin of the ADAR gene family and animal RNA editing. BMC Evol Biol, 15, 4.

40. Kuttan,A. and Bass,B.L. (2012) Mechanistic insights into editing-site specificity of ADARs. Proc Natl Acad Sci USA, 109, E3295–E3304.

41. Wang,Q. (2000) Requirement of the RNA Editing Deaminase ADAR1 Gene for Embryonic Erythropoiesis. Science, 290, 1765–1768.

42. Higuchi,M., Maas,S., Single,F.N., Hartner,J., Rozov,A., Burnashev,N., Feldmeyer,D., Sprengel,R. and Seeburg,P.H. (2000) Point mutation in an AMPA receptor gene rescues lethality in mice deficient in the RNA-editing enzyme ADAR2. Nature, 406, 78–81.

43. Jepson,J.E.C. and Reenan,R.A. (2009) Adenosine-to-Inosine Genetic Recoding Is Required in the Adult Stage Nervous System for Coordinated Behavior in Drosophila. The Journal of biological chemistry, 284, 31391–31400.

44. Tonkin,L.A., Saccomanno,L., Morse,D.P., Brodigan,T., Krause,M. and Bass,B.L. (2002) RNA editing by ADARs is important for normal behavior in Caenorhabditis elegans. The EMBO Journal, 21, 6025–6035.

45. Yang,J.H., Sklar,P., Axel,R. and Maniatis,T. (1997) Purification and characterization of a human RNA adenosine deaminase for glutamate receptor B pre-mRNA editing. Proceedings of the National Academy of Sciences, 94, 4354–4359.

46. Melcher,T., Maas,S., Herb,A., Sprengel,R., Seeburg,P.H. and Higuchi,M. (1996) A mammalian RNA editing enzyme. Nature, 379, 460–464.

47. Horsch,M., Seeburg,P.H., Adler,T., Aguilar-Pimentel,J.A., Becker,L., Calzada-Wack,J., Garrett,L., Götz,A., Hans,W., Higuchi,M., et al. (2011) Requirement of the RNA-editing enzyme ADAR2 for normal physiology in mice. The Journal of biological chemistry, 286, 18614–18622.

48. Hideyama,T., Yamashita,T., Aizawa,H., Tsuji,S., Kakita,A., Takahashi,H. and Kwak,S. (2012) Profound downregulation of the RNA editing enzyme ADAR2 in ALS spinal motor neurons. Neurobiol. Dis., 45, 1121–1128.

49. Kim,D.D.Y. (2004) Widespread RNA Editing of Embedded Alu Elements in the Human Transcriptome. Genome Research, 14, 1719–1725.

50. Athanasiadis,A., Rich,A. and Maas,S. (2004) Widespread A-to-I RNA editing of Alu-containing mRNAs in the human transcriptome. PLoS Biology, 2, e391.

51. Bass,B.L. and Weintraub,H. (1987) A developmentally regulated activity that unwinds RNA duplexes. Cell, 48, 607–613.

52. Bass,B.L. and Weintraub,H. (1988) An unwinding activity that covalently modifies its double-stranded RNA substrate. Cell, 55, 1089–1098.

53. Wu,D., Lamm,A.T. and Fire,A.Z. (2011) Competition between ADAR and RNAi pathways for an extensive class of RNA targets. Nature Structural & Molecular Biology, 18, 1094–1101.

54. Warf,M.B., Shepherd,B.A., Johnson,W.E. and Bass,B.L. (2012) Effects of ADARs on small RNA processing pathways in C. elegans. Genome Research, 22, 1488–1498.

55. Rueter,S.M., Dawson,T.R. and Emeson,R.B. (1999) Regulation of alternative splicing by RNA editing. Nature, 399, 75–80.

56. Savva,Y.A., Jepson,J.E.C., Chang,Y.-J., Whitaker,R., Jones,B.C., St Laurent,G., Tackett,M.R., Kapranov,P., Jiang,N., Du,G., et al. (2013) RNA editing regulates transposon-mediated heterochromatic gene silencing. Nature Communications, 4, 2745.

57. Zhang,Z. and Carmichael,G.G. (2001) The fate of dsRNA in the nucleus: a p54(nrb)-containing complex mediates the nuclear retention of promiscuously A-to-I edited RNAs. Cell, 106, 465–475.

58. Wang,Q., Zhang,Z., Blackwell,K. and Carmichael,G.G. (2005) Vigilins bind to promiscuously A-to-I-edited RNAs and are involved in the formation of heterochromatin. Curr. Biol., 15, 384–391.

59. Prasanth,K.V., Prasanth,S.G., Xuan,Z., Hearn,S., Freier,S.M., Bennett,C.F., Zhang,M.Q. and Spector,D.L. (2005) Regulating gene expression through RNA nuclear retention. Cell, 123, 249–263.

60. Zheng,H., Fu,T.B., Lazinski,D. and Taylor,J. (1992) Editing on the genomic RNA of human hepatitis delta virus. J. Virol., 66, 4693–4697.

61. Luciano,D.J., Mirsky,H., Vendetti,N.J. and Maas,S. (2004) RNA editing of a miRNA precursor. RNA (New York, N.Y.), 10, 1174–1177.

62. Washburn,M.C., Kakaradov,B., Sundararaman,B., Wheeler,E., Hoon,S., Yeo,G.W. and Hundley,H.A. (2014) The dsRBP and Inactive Editor ADR-1 Utilizes dsRNA Binding to Regulate A-to-I RNA Editing across the C. elegans Transcriptome. CellReports, 6, 599–607.

63. Palladino,M.J., Keegan,L.P., O'Connell,M.A. and Reenan,R.A. (2000) A-to-I pre-mRNA editing in Drosophila is primarily involved in adult nervous system function and integrity. Cell, 102, 437–449.

64. XuFeng,R., Boyer,M.J., Shen,H., Li,Y., Yu,H., Gao,Y., Yang,Q., Wang,Q. and Cheng,T. (2009) ADAR1 is required for hematopoietic progenitor cell survival via RNA editing. Proc Natl Acad Sci USA, 106, 17763–17768.

65. Hoogstrate,S.W., Volkers,R.J., Sterken,M.G., Kammenga,J.E. and Snoek,L.B. (2014) Nematode endogenous small RNA pathways. Worm, 3, e28234–11.

66. Tonkin,L.A. and Bass,B.L. (2003) Mutations in RNAi rescue aberrant chemotaxis of ADAR mutants. Science (New York, N.Y.), 302, 1725–1725.

67. Liddicoat,B.J., Piskol,R., Chalk,A.M., Ramaswami,G., Higuchi,M., Hartner,J.C., Li,J.B., Seeburg,P.H. and Walkley,C.R. (2015) RNA editing by ADAR1 prevents MDA5 sensing of endogenous dsRNA as nonself. Science, 349, 1115–1120.

68. Whipple,J.M., Youssef,O.A., Aruscavage,P.J., Nix,D.A., Hong,C., Johnson,W.E. and Bass,B.L. (2015) Genome-wide profiling of the C. elegans dsRNAome. RNA (New York, N.Y.), 21, 786–800.

69. Zhao,H.Q., Zhang,P., Gao,H., He,X., Dou,Y., Huang,A.Y., Liu,X.M., Ye,A.Y., Dong,M.Q. and Wei,L. (2014) Profiling the RNA editomes of wild-type C. elegans and ADAR mutants. Genome Research, 10.1101/gr.176107.114.

70. Hundley,H.A. and Bass,B.L. (2010) ADAR editing in double-stranded UTRs and other noncoding RNA sequences. Trends in biochemical sciences, 35, 377–383.

71. Ivanov,A., Memczak,S., Wyler,E., Torti,F., Porath,H.T., Orejuela,M.R., Piechotta,M., Levanon,E.Y., Landthaler,M., Dieterich,C., et al. (2014) Analysis of Intron Sequences Reveals Hallmarks of Circular RNA Biogenesis in Animals. CellReports, 10, 1–9.

72. Morse,D.P., Aruscavage,P.J. and Bass,B.L. (2002) RNA hairpins in noncoding regions of human brain and Caenorhabditis elegans mRNA are edited by adenosine deaminases that act on RNA. Proc Natl Acad Sci USA, 99, 7906–7911.

73. Holley,R.W., Everett,G.A., Madison,J.T. and Zamir,A. (1965) Nucleotide Sequences in the Yeast Alanine Transfer Ribonucleic Acid. The Journal of biological chemistry, 240, 2122–2128.

74. Sanger,F., Brownlee,G.G. and Barrell,B.G. (1965) A two-dimensional fractionation procedure for radioactive nucleotides. J. Mol. Biol., 13, 373–IN4.

75. Min Jou,W., Haegeman,G., Ysebaert,M. and Fiers,W. (1972) Nucleotide sequence of the gene coding for the bacteriophage MS2 coat protein. Nature, 237, 82–88.

76. Fiers,W., Contreras,R., Duerinck,F., Haegeman,G., Iserentant,D., Merregaert,J., Min Jou,W., Molemans,F., Raeymaekers,A., Van den Berghe,A., et al. (1976) Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene. Nature, 260, 500–507.

77. Wu,R. (1994) Development of the primer-extension approach: a key role in DNA sequencing. Trends in biochemical sciences, 19, 429–433.

78. Wu,R. (1970) Nucleotide sequence analysis of DNA. I. Partial sequence of the cohesive ends of bacteriophage lambda and 186 DNA. J. Mol. Biol., 51, 501–521.

79. Wu,R. and Taylor,E. (1971) Nucleotide sequence analysis of DNA. II. Complete nucleotide sequence of the cohesive ends of bacteriophage lambda DNA. J. Mol. Biol., 57, 491–511.

80. Sanger,F. and Coulson,A.R. (1975) A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J. Mol. Biol., 94, 441–448.

81. Sanger,F., Air,G.M., Barrell,B.G., Brown,N.L., Coulson,A.R., Fiddes,C.A., Hutchison,C.A., Slocombe,P.M. and Smith,M. (1977) Nucleotide sequence of bacteriophage phi X174 DNA. Nature, 265, 687–695.

82. Maxam,A.M. and Gilbert,W. (1977) A new method for sequencing DNA. Proceedings of the National Academy of Sciences, 74, 560–564.

83. Peattie,D.A. (1979) Direct chemical method for sequencing RNA. In.

84. Sanger,F., Nicklen,S. and Coulson,A.R. (1977) DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74, 5463–5467.

85. Sanger,F. and Coulson,A.R. (1978) The use of thin acrylamide gels for DNA sequencing. FEBS Lett., 87, 107–110.

86. Smith,L.M., Sanders,J.Z., Kaiser,R.J., Hughes,P., Dodd,C., Connell,C.R., Heiner,C., Kent,S.B. and Hood,L.E. (1986) Fluorescence detection in automated DNA sequence analysis. Nature, 321, 674–679.

87. Prober,J.M., Trainor,G.L., Dam,R.J., Hobbs,F.W., Robertson,C.W., Zagursky,R.J., Cocuzza,A.J., Jensen,M.A. and Baumeister,K. (1987) A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science, 238, 336–341.

88. Zagursky,R.J. and Berman,M.L. (1984) Cloning vectors that yield high levels of single-stranded DNA for rapid DNA sequencing. Gene, 27, 183–191.

89. Sinville,R. and Soper,S.A. (2007) High resolution DNA separations using microchip electrophoresis. J. Sep. Sci., 30, 1714–1728.

90. Karger,B.L. and Guttman,A. (2009) DNA sequencing by Capillary Electrophoresis. Electrophoresis, 30 Suppl 1, S196–202.

91. Swerdlow,H., Wu,S.L., Harke,H. and Dovichi,N.J. (1990) Capillary gel electrophoresis for DNA sequencing. Laser-induced fluorescence detection with the sheath flow cuvette. J. Chromatogr., 516, 61–67.

92. Swerdlow,H. and Gesteland,R. (1990) Capillary gel electrophoresis for rapid, high resolution DNA sequencing. Nucleic Acids Research, 18, 1415–1419.

93. Luckey,J.A., Drossman,H., Kostichka,A.J., Mead,D.A., D'Cunha,J., Norris,T.B. and Smith,L.M. (1990) High speed DNA sequencing by capillary electrophoresis. Nucleic Acids Research, 18, 4417–4421.

94. Pariat,Y.F., Berka,J., Heiger,D.N., Schmitt,T., Vilenchik,M., Cohen,A.S., Foret,F. and Karger,B.L. (1993) Separation of DNA fragments by capillary electrophoresis using replaceable linear polyacrylamide matrices. J Chromatogr A, 652, 57–66.

95. Huang,X.C., Quesada,M.A. and Mathies,R.A. (1992) DNA sequencing using capillary array electrophoresis. Anal. Chem., 64, 2149–2154.

96. Saiki,R.K., Gelfand,D.H., Stoffel,S., Scharf,S.J., Higuchi,R., Horn,G.T., Mullis,K.B. and Erlich,H.A. (1988) Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science, 239, 487–491.

97. Chien,A., Edgar,D.B. and Trela,J.M. (1976) Deoxyribonucleic acid polymerase from the extreme thermophile Thermus aquaticus. J. Bacteriol., 127, 1550–1557.

98. Murray,V. (1989) Improved double-stranded DNA sequencing using the linear polymerase chain reaction. Nucleic Acids Research, 17, 8889.

99. Ronaghi,M. (2001) Pyrosequencing sheds light on DNA sequencing. Genome Research, 11, 3–11.

100. Ronaghi,M., Karamohamed,S., Pettersson,B., Uhlén,M. and Nyrén,P. (1996) Real-time DNA sequencing using detection of pyrophosphate release. Anal. Biochem., 242, 84–89.

101. Bentley,D.R., Balasubramanian,S., Swerdlow,H.P., Smith,G.P., Milton,J., Brown,C.G., Hall,K.P., Evers,D.J., Barnes,C.L., Bignell,H.R., et al. (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456, 53–59.

102. Johnson,D.S., Mortazavi,A., Myers,R.M. and Wold,B. (2007) Genome-wide mapping of in vivo protein-DNA interactions. Science (New York, N.Y.), 316, 1497–1502.

103. Fedurco,M. (2006) BTA, a novel reagent for DNA attachment on glass and efficient generation of solid-phase amplified DNA colonies. Nucleic Acids Research, 34, e22–e22.

104. Southern,E.M., Maskos,U. and Elder,J.K. (1992) Analyzing and comparing nucleic acid sequences by hybridization to arrays of oligonucleotides: evaluation using experimental models. Genomics, 13, 1008–1017.

105. Adessi,C., Matton,G., Ayala,G., Turcatti,G., Mermod,J.J., Mayer,P. and Kawashima,E. (2000) Solid phase DNA amplification: characterisation of primer attachment and amplification mechanisms. Nucleic Acids Research, 28, E87.

106. Turcatti,G., Romieu,A., Fedurco,M. and Tairi,A.P. (2007) A new class of cleavable fluorescent nucleotides: synthesis and optimization as reversible terminators for DNA sequencing by synthesis. Nucleic Acids Research, 36, e25–e25.

107. Schirmer,M., Ijaz,U.Z., D'Amore,R., Hall,N., Sloan,W.T. and Quince,C. (2015) Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform. Nucleic Acids Research, 43, e37.

108. Fuller,C.W., Middendorf,L.R., Benner,S.A., Church,G.M., Harris,T., Huang,X., Jovanovich,S.B., Nelson,J.R., Schloss,J.A., Schwartz,D.C., et al. (2009) The challenges of sequencing by synthesis. Nature Biotechnology, 27, 1013–1023.

109. Kircher,M., Stenzel,U. and Kelso,J. (2009) Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome biology, 10, R83.

110. Aird,D., Ross,M.G., Chen,W.-S., Danielsson,M., Fennell,T., Russ,C., Jaffe,D.B., Nusbaum,C. and Gnirke,A. (2011) Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome biology, 12, R18.

111. Benjamini,Y. and Speed,T.P. (2012) Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research, 40, e72.

112. Thomson,J.M. (2006) Extensive post-transcriptional regulation of microRNAs and its implications for cancer. Genes & Development, 20, 2202–2207.

113. Shi,Y. (2012) Alternative polyadenylation: new insights from global analyses. RNA (New York, N.Y.), 18, 2105–2117.

114. Sleator,R.D. (2010) An overview of the current status of eukaryote gene prediction strategies. Gene, 461, 1–4.

115. Trapnell,C., Williams,B.A., Pertea,G., Mortazavi,A., Kwan,G., van Baren,M.J., Salzberg,S.L., Wold,B.J. and Pachter,L. (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28, 511–515.

116. Wang,E.T., Sandberg,R., Luo,S., Khrebtukova,I., Zhang,L., Mayr,C., Kingsmore,S.F., Schroth,G.P. and Burge,C.B. (2008) Alternative isoform regulation in human tissue transcriptomes. Nature, 456, 470–476.

117. Temin,H.M. and Mizutani,S. (1970) RNA-dependent DNA polymerase in virions of Rous sarcoma virus. Nature, 226, 1211–1213.

118. Sim,G.K., Kafatos,F.C., Jones,C.W., Koehler,M.D., Efstratiadis,A. and Maniatis,T. (1979) Use of a cDNA library for studies on evolution and developmental expression of the chorion multigene families. Cell, 18, 1303–1316.

119. Cocquet,J., Chong,A., Zhang,G. and Veitia,R.A. (2006) Reverse transcriptase template switching and false alternative transcripts. Genomics, 88, 127–131.

120. Houseley,J. and Tollervey,D. (2010) Apparent Non-Canonical Trans-Splicing Is Generated by Reverse Transcriptase In Vitro. PLoS ONE, 5, e12271.

121. Menéndez-Arias,L. (2002) Molecular basis of fidelity of DNA synthesis and nucleotide specificity of retroviral reverse transcriptases. Progress in nucleic acid research and molecular biology, 71, 91–147.

122. Champoux,J.J. and Schultz,S.J. (2009) Ribonuclease H: properties, substrate specificity and roles in retroviral reverse transcription. FEBS Journal, 276, 1506–1516.

123. Velculescu,V.E., Zhang,L., Vogelstein,B. and Kinzler,K.W. (1995) Serial analysis of gene expression. Science, 270, 484–487.

124. Sutcliffe,J.G., Milner,R.J. and Bloom,F.E. (1982) Common 82-nucleotide sequence unique to brain RNA. In.

125. Putney,S.D., Herlihy,W.C. and Schimmel,P. (1983) A new troponin T and cDNA clones for 13 different muscle proteins, found by shotgun sequencing. Nature, 302, 718–721.

126. Brosseau,J.-P., Lucier,J.-F.C., Lapointe,E., Durand,M., Gendron,D., Gervais-Bird,J., Tremblay,K., Perreault,J.-P. and Elela,S.A. (2010) High-throughput quantification of splicing isoforms. RNA (New York, N.Y.), 16, 442–449.

127. Adams,M.D., Kelley,J.M., Gocayne,J.D., Dubnick,M., Polymeropoulos,M.H., Xiao,H., Merril,C.R., Wu,A., Olde,B. and Moreno,R.F. (1991) Complementary DNA sequencing: expressed sequence tags and human genome project. Science, 252, 1651–1656.

128. Sakharkar,M.K., Chow,V.T.K. and Kangueane,P. (2004) Distributions of exons and introns in the human genome. In Silico Biol. (Gedrukt), 4, 387–393.

129. Parkinson,J. (2009) Expressed Sequence Tags (ESTs).

130. Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) dbEST--database for "expressed sequence tags". Nature genetics, 4, 332–333.

131. Harbers,M. and Carninci,P. (2005) Tag-based approaches for transcriptome research and genome annotation. Nature Methods, 2, 495–502.

132. Lister,R., O'Malley,R.C., Tonti-Filippini,J., Gregory,B.D., Berry,C.C., Millar,A.H. and Ecker,J.R. (2008) Highly Integrated Single-Base Resolution Maps of the Epigenome in Arabidopsis. Cell, 133, 523–536.

133. Mortazavi,A., Williams,B.A., McCue,K., Schaeffer,L. and Wold,B. (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5, 621–628.

134. Nagalakshmi,U., Wang,Z., Waern,K., Shou,C., Raha,D., Gerstein,M. and Snyder,M. (2008) The transcriptional landscape of the yeast genome defined by RNA sequencing. Science (New York, N.Y.), 320, 1344–1349.

135. Trapnell,C., Hendrickson,D.G., Sauvageau,M., Goff,L., Rinn,J.L. and Pachter,L. (2012) Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology, 31, 46–53.

136. Hodgkinson,A., Idaghdour,Y., Gbeha,E., Grenier,J.-C., Hip-Ki,E., Bruat,V., Goulet,J.-P., de Malliard,T. and Awadalla,P. (2014) High-resolution genomic analysis of human mitochondrial RNA sequence variation. Science (New York, N.Y.), 344, 413–415.

137. Guttman,M., Garber,M., Levin,J.Z., Donaghey,J., Robinson,J., Adiconis,X., Fan,L., Koziol,M.J., Gnirke,A., Nusbaum,C., et al. (2010) Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotechnology, 28, 503–510.

138. Pan,Q., Shai,O., Lee,L.J., Frey,B.J. and Blencowe,B.J. (2008) Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature genetics, 40, 1413–1415.

139. Pickrell,J.K., Marioni,J.C., Pai,A.A., Degner,J.F., Engelhardt,B.E., Nkadori,E., Veyrieras,J.-B., Stephens,M., Gilad,Y. and Pritchard,J.K. (2010) Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature, 464, 768–772.

140. Levin,J.Z., Yassour,M., Adiconis,X., Nusbaum,C., Thompson,D.-A., Friedman,N., Gnirke,A. and Regev,A. (2010) Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nature Methods, 7, 709–715.

141. van Dijk,E.L., Jaszczyszyn,Y. and Thermes,C. (2014) Library preparation methods for next-generation sequencing: tone down the bias. Exp. Cell Res., 322, 12–20.

142. Peng,Z., Cheng,Y., Tan,B.C.-M., Kang,L., Tian,Z., Zhu,Y., Zhang,W., Liang,Y., Hu,X., Tan,X., et al. (2012) Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. Nature Biotechnology, 30, 1–10.

143. He,S., Wurtzel,O., Singh,K., Froula,J.L., Yilmaz,S., Tringe,S.G., Wang,Z., Chen,F., Lindquist,E.A., Sorek,R., et al. (2010) Validation of two ribosomal RNA removal methods for microbial metatranscriptomics. Nature Methods, 7, 807–812.

144. Yi,H., Cho,Y.-J., Won,S., Lee,J.-E., Jin Yu,H., Kim,S., Schroth,G.P., Luo,S. and Chun,J. (2011) Duplex-specific nuclease efficiently removes rRNA for prokaryotic RNA-seq. Nucleic Acids Research, 39, e140.

145. Hansen,K.D., Brenner,S.E. and Dudoit,S. (2010) Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research, 38, e131–e131.

146. van Gurp,T.P., McIntyre,L.M. and Verhoeven,K.J.F. (2013) Consistent Errors in First Strand cDNA Due to Random Hexamer Mispriming. PLoS ONE, 8, e85583.

147. Roberts,A., Trapnell,C., Donaghey,J., Rinn,J.L. and Pachter,L. (2011) Improving RNA-Seq expression estimates by correcting for fragment bias. Genome biology, 12, R22.

148. Ameur,A., Zaghlool,A., Halvardson,J., Wetterbom,A., Gyllensten,U., Cavelier,L. and Feuk,L. (2011) Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain. Nature Structural & Molecular Biology, 18, 1435–1440.

149. Haas,B.J., Chin,M., Nusbaum,C., Birren,B.W. and Livny,J. (2012) How deep is deep enough for RNA-Seq profiling of bacterial transcriptomes? BMC Genomics, 13, 734.

150. Toung,J.M., Lahens,N., Hogenesch,J.B. and Grant,G. (2014) Detection Theory in Identification of RNA-DNA Sequence Differences Using RNA-Sequencing. PLoS ONE, 9, e112040.

151. Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197.

152. Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.

153. Gotoh,O. (1982) An improved algorithm for matching biological sequences. J. Mol. Biol., 162, 705–708.

154. Farrar,M. (2007) Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics, 23, 156–161.

155. Zhao,M., Lee,W.-P., Garrison,E.P. and Marth,G.T. (2013) SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications. PLoS ONE, 8, e82138.

156. Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nature Methods, 9, 357–359.

157. Myers,E.W. and Miller,W. (1988) Optimal alignments in linear space. Comput. Appl. Biosci., 4, 11–17.

158. Zhang,Z., Schwartz,S., Wagner,L. and Miller,W. (2000) A greedy algorithm for aligning DNA sequences. Journal of computational biology : a journal of computational molecular cell biology, 7, 203–214.

159. Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85, 2444–2448.

160. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

161. Kent,W.J. (2002) BLAT---The BLAST-Like Alignment Tool. Genome Research, 12, 656–664.

162. Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760.

163. Li,H. and Homer,N. (2010) A survey of sequence alignment algorithms for next-generation sequencing. Briefings in bioinformatics, 11, 473–483.

164. Ma,B., Tromp,J. and Li,M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics (Oxford, England), 18, 440–445.

165. Li,H., Ruan,J. and Durbin,R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18, 1851–1858.

166. Li,R., Li,Y., Kristiansen,K. and Wang,J. (2008) SOAP: short oligonucleotide alignment program. Bioinformatics, 24, 713–714.

167. Homer,N., Merriman,B. and Nelson,S.F. (2009) BFAST: an alignment tool for large scale genome resequencing. PLoS ONE, 4, e7767.

168. Rumble,S.M., Lacroute,P., Dalca,A.V., Fiume,M., Sidow,A. and Brudno,M. (2009) SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Comput Biol, 5, e1000386.

169. Weiner,P. (1973) Linear pattern matching algorithms. Switching and Automata Theory, 19, 331–353.

170. Gusfield,D. (1997) Algorithms on Strings, Trees and Sequences Cambridge University Press.

171. Delcher,A.L., Kasif,S., Fleischmann,R.D., Peterson,J., White,O. and Salzberg,S.L. (1999) Alignment of whole genomes. Nucleic Acids Research, 27, 2369–2376.

172. Shrestha,A.M.S., Frith,M.C. and Horton,P. (2014) A bioinformatician's guide to the forefront of suffix array construction algorithms. Briefings in bioinformatics, 15, 138–154.

173. Manber,U. and Myers,G. (1989) Suffix Arrays.

174. Abouelhoda,M.I., Kurtz,S. and Ohlebusch,E. (2004) Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2, 53–86.

175. Ferragina,P. and Manzini,G. (2000) Opportunistic data structures with applications. In. IEEE Comput. Soc, pp. 390–398.

176. Burrows,M. and Wheeler,D.J. (1994) A Block-sorting Lossless Data Compression Algorithm.

177. Lam,T.W., Sung,W.K., Tam,S.L., Wong,C.K. and Yiu,S.M. (2008) Compressed indexing and local alignment of DNA. Bioinformatics (Oxford, England), 24, 791–797.

178. Langmead,B., Trapnell,C., Pop,M. and Salzberg,S.L. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10, R25.

179. Li,R., Yu,C., Li,Y., Lam,T.-W., Yiu,S.-M., Kristiansen,K. and Wang,J. (2009) SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics (Oxford, England), 25, 1966–1967.

180. Li,H. (2014) Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics (Oxford, England), 30, 2843–2851.

181. Hastings,M.L. and Krainer,A.R. (2001) Pre-mRNA splicing in the new millennium. Current Opinion in Cell Biology, 13, 302–309.

182. Garber,M., Grabherr,M.G., Guttman,M. and Trapnell,C. (2011) Computational methods for transcriptome annotation and quantification using RNA-seq. Nature Methods, 8, 469–477.

183. Alamancos,G.P., Agirre,E. and Eyras,E. (2014) Methods to Study Splicing from High-Throughput RNA Sequencing Data. In Methods in Molecular Biology, Methods in Molecular Biology. Humana Press, Totowa, NJ, Vol. 1126, pp. 357–397.

184. Engström,P.G., Sipos,B., Alioto,T., Behr,J., Bohnert,R., Campagna,D., Davis,C.A., Dobin,A., Gingeras,T.R., Goldman,N., et al. (2013) Systematic evaluation of spliced alignment programs for RNA-seq data. Nature Methods, 10.1038/nmeth.2722.

185. Trapnell,C., Pachter,L. and Salzberg,S.L. (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25, 1105–1111.

186. Wang,K., Singh,D., Zeng,Z., Coleman,S.J., Huang,Y., Savich,G.L., He,X., Mieczkowski,P., Grimm,S.A., Perou,C.M., et al. (2010) MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Research, 38, e178–e178.

187. Kim,D., Pertea,G., Trapnell,C., Pimentel,H., Kelley,R. and Salzberg,S.L. (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome biology, 14, R36.

188. Wu,T.D. and Nacu,S. (2010) Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics (Oxford, England), 26, 873–881.

189. Dobin,A., Davis,C.A., Schlesinger,F., Drenkow,J., Zaleski,C., Jha,S., Batut,P., Chaisson,M. and Gingeras,T.R. (2012) STAR: ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England), 29, 15–21.

190. Stein,L.D., Bao,Z., Blasiar,D., Blumenthal,T., Brent,M.R., Chen,N., Chinwalla,A., Clarke,L., Clee,C., Coghlan,A., et al. (2003) The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biology, 1, E45.

191. Sijen,T. and Plasterk,R.H.A. (2003) Transposon silencing in the Caenorhabditis elegans germ line by natural RNAi. Nature, 426, 310–314.

192. C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282, 2012–2018.

193. Piskol,R., Peng,Z., Wang,J. and Li,J.B. (2013) Lack of evidence for existence of noncanonical RNA editing. Nature Biotechnology, 31, 19–20.

194. Li,J.B., Levanon,E.Y., Yoon,J.-K., Aach,J., Xie,B., Leproust,E., Zhang,K., Gao,Y. and Church,G.M. (2009) Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing. Science (New York, N.Y.), 324, 1210–1213.

195. Li,M., Wang,I.X., Li,Y., Bruzel,A., Richards,A.L., Toung,J.M. and Cheung,V.G. (2011) Widespread RNA and DNA Sequence Differences in the Human Transcriptome. Science (New York, N.Y.), 10.1126/science.1207018.

196. Sakurai,M., Yano,T., Kawabata,H., Ueda,H. and Suzuki,T. (2010) Inosine cyanoethylation identifies A-to-I RNA editing sites in the human transcriptome. Nature Chemical Biology, 6, 733–740.

197. Wilson,G.W. and Stein,L.D. (2015) RNASequel: accurate and repeat tolerant realignment of RNA-seq reads. Nucleic Acids Research, 10.1093/nar/gkv594.

198. Li,H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997.

199. Au,K.F., Jiang,H., Lin,L., Xing,Y. and Wong,W.H. (2010) Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Research, 38, 1–9.

200. Grant,G.R., Farkas,M.H., Pizarro,A., Lahens,N., Schug,J., Brunk,B., Stoeckert,C.J., Hogenesch,J.B. and Pierce,E.A. (2011) Comparative Analysis of RNA-Seq Alignment Algorithms and the RNA-Seq Unified Mapper (RUM). Bioinformatics, 10.1093/bioinformatics/btr427.

201. Odelberg,S.J., Weiss,R.B., Hata,A. and White,R. (1995) Template-switching during DNA synthesis by Thermus aquaticus DNA polymerase I. Nucleic Acids Research, 23, 2049–2057.

202. Djebali,S., Davis,C.A., Merkel,A., Dobin,A., Lassmann,T., Mortazavi,A., Tanzer,A., Lagarde,J., Lin,W., Schlesinger,F., et al. (2013) Landscape of transcription in human cells. Nature, 488, 101–108.

203. Gott,J.M. and Emeson,R.B. (2000) Functions and mechanisms of RNA editing. Annu. Rev. Genet., 34, 499–531.

204. Nishikura,K. (2010) Functions and Regulation of RNA Editing by ADAR Deaminases. Annu. Rev. Biochem., 79, 321–349.

205. Karolchik,D., Barber,G.P., Casper,J., Clawson,H., Cline,M.S., Diekhans,M., Dreszer,T.R., Fujita,P.A., Guruvadoo,L., Haeussler,M., et al. (2013) The UCSC Genome Browser database: 2014 update. Nucleic Acids Research, 42, D764–D770.

206. Wang,J., Wang,W., Li,R., Li,Y., Tian,G., Goodman,L., Fan,W., Zhang,J., Li,J., Zhang,J., et al. (2008) The diploid genome sequence of an Asian individual. Nature, 456, 60–65.

207. The 1000 Genomes Project Consortium (2013) An integrated map of genetic variation from 1,092 human genomes. Nature, 490, 56–65.

208. Benson,G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27, 573–580.

209. Rice,P., Longden,I. and Bleasby,A. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends in Genetics, 16, 276–277.

210. Saldi,T.K., Ash,P.E., Wilson,G., Gonzales,P., Garrido-Lecca,A., Roberts,C.M., Dostal,V., Gendron,T.F., Stein,L.D., Blumenthal,T., et al. (2014) TDP-1, the Caenorhabditis elegans ortholog of TDP-43, limits the accumulation of double-stranded RNA. The EMBO Journal, 10.15252/embj.201488740.

211. Gerstein,M.B., Lu,Z.J., Van Nostrand,E.L., Cheng,C., Arshinoff,B.I., Liu,T., Yip,K.Y., Robilotto,R., Rechtsteiner,A., Ikegami,K., et al. (2010) Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project. Science (New York, N.Y.), 330, 1775–1787.

212. Hastings,K.E.M. (2005) SL trans-splicing: easy come or easy go? Trends in Genetics, 21, 240–247.

213. Vendeix,F.A.P., Munoz,A.M. and Agris,P.F. (2009) Free energy calculation of modified base-pair formation in explicit solvent: A predictive model. RNA (New York, N.Y.), 15, 2278–2287.

214. Garrigues,J.M., Sidoli,S., Garcia,B.A. and Strome,S. (2015) Defining heterochromatin in C. elegans through genome-wide analysis of the heterochromatin protein 1 homolog HPL-2. Genome Research, 25, 76–88.

215. Liu,T., Rechtsteiner,A., Egelhofer,T.A., Vielle,A., Latorre,I., Cheung,M.-S., Ercan,S., Ikegami,K., Jensen,M., Kolasinska-Zwierz,P., et al. (2011) Broad chromosomal domains of histone modification patterns in C. elegans. Genome Research, 21, 227–236.

216. Mignone,F., Gissi,C., Liuni,S. and Pesole,G. (2002) Untranslated regions of mRNAs. Genome biology, 3, REVIEWS0004.

217. Flomen,R. and Makoff,A. (2011) Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients: involvement of a cryptic polyadenylation site. Neurosci. Lett., 497, 139–143.

218. Hogg,M., Paro,S., Keegan,L.P. and O'Connell,M.A. (2011) RNA Editing by Mammalian ADARs 1st ed. Elsevier Inc.

219. Vastenhouw,N.L., Fischer,S.E.J., Robert,V.J.P., Thijssen,K.L., Fraser,A.G., Kamath,R.S., Ahringer,J. and Plasterk,R.H.A. (2003) A genome-wide screen identifies 27 genes involved in transposon silencing in C. elegans. Curr. Biol., 13, 1311–1316.

220. Pothof,J., van Haaften,G., Thijssen,K., Kamath,R.S., Fraser,A.G., Ahringer,J., Plasterk,R.H.A. and Tijsterman,M. (2003) Identification of genes that protect the C. elegans genome against mutations by genome-wide RNAi. Genes & Development, 17, 443–448.

221. Emmons,S.W. and Yesner,L. (1984) High-frequency excision of transposable element Tc1 in the nematode Caenorhabditis elegans is limited to somatic cells. Cell, 36, 599–605.

222. Jeck,W.R., Sorrentino,J.A., Wang,K., Slevin,M.K., Burd,C.E., Liu,J., Marzluff,W.F. and Sharpless,N.E. (2013) Circular RNAs are abundant, conserved, and associated with ALU repeats. RNA (New York, N.Y.), 19, 141–157.

223. Lev-Maor,G., Ram,O., Kim,E., Sela,N., Goren,A., Levanon,E.Y. and Ast,G. (2008) Intronic Alus influence alternative splicing. PLoS genetics, 4, e1000204.

224. Keren,H., Lev-Maor,G. and Ast,G. (2010) Alternative splicing and evolution: diversification, exon definition and function. Nature Reviews Genetics, 11, 345–355.

225. Suzuki,H. (2006) Characterization of RNase R-digested cellular RNA source that consists of lariat and circular RNAs from pre-mRNA splicing. Nucleic Acids Research, 34, e63–e63.

226. He,L. and Hannon,G.J. (2004) MicroRNAs: small RNAs with a big role in gene regulation. Nature Reviews Genetics, 5, 522–531.

227. Chang,H., Lim,J., Ha,M. and Kim,V.N. (2014) TAIL-seq: genome-wide determination of poly(A) tail length and 3' end modifications. Mol Cell, 53, 1044–1052.

228. Stapleton,M., Carlson,J.W. and Celniker,S.E. (2006) RNA editing in Drosophila melanogaster: New targets and functional consequences. RNA (New York, N.Y.), 12, 1922–1932.

229. Levanon,E.Y. (2005) Evolutionarily conserved human targets of adenosine to inosine RNA editing. Nucleic Acids Research, 33, 1162–1168.

230. Clutterbuck,D.R., Leroy,A., O'Connell,M.A. and Semple,C.A.M. (2005) A bioinformatic screen for novel A-I RNA editing sites reveals recoding editing in BC10. Bioinformatics (Oxford, England), 21, 2590–2595.

231. Guérin,T.M., Palladino,F. and Robert,V.J. (2014) Transgenerational functions of small RNA pathways in controlling gene expression in C. elegans. Epigenetics, 9, 37–44.

232. Holoch,D. and Moazed,D. (2015) RNA-mediated epigenetic regulation of gene expression. Nature Reviews Genetics, 16, 71–84.

233. Vitali,P. and Scadden,A. (2010) Double-stranded RNAs containing multiple IU pairs are sufficient to suppress interferon induction and apoptosis. Nature Structural & Molecular Biology, 17, 1043–1050.

234. Ermolaeva,M.A. and Schumacher,B. (2014) Insights from the worm: The C. elegans model for innate immunity. Seminars in immunology, 26, 303–309.

235. Harris,T.W., Antoshechkin,I., Bieri,T., Blasiar,D., Chan,J., Chen,W.J., De La Cruz,N., Davis,P., Duesbury,M., Fang,R., et al. (2010) WormBase: a comprehensive resource for nematode research. Nucleic Acids Research, 38, D463–7.

236. Wang,K., Li,M. and Hakonarson,H. (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research, 38, e164–e164.

237. Finn,R.D., Bateman,A., Clements,J., Coggill,P., Eberhardt,R.Y., Eddy,S.R., Heger,A., Hetherington,K., Holm,L., Mistry,J., et al. (2014) Pfam: the protein families database. Nucleic Acids Research, 42, D222–30.

238. Park,E., Williams,B., Wold,B.J. and Mortazavi,A. (2012) RNA editing in the human ENCODE RNA-seq data. Genome Research, 22, 1626–1633.

239. Maher,C.A., Kumar-Sinha,C., Cao,X., Kalyana-Sundaram,S., Han,B., Jing,X., Sam,L., Barrette,T., Palanisamy,N. and Chinnaiyan,A.M. (2009) Transcriptome sequencing to detect gene fusions in cancer. Nature, 458, 97–101.

240. Wu,C.S., Yu,C.Y., Chuang,C.Y., Hsiao,M., Kao,C.F., Kuo,H.C. and Chuang,T.J. (2014) Integrative transcriptome sequencing identifies trans-splicing events with important roles in human embryonic stem cell pluripotency. Genome Research, 24, 25–36.

241. Haas,B.J., Papanicolaou,A., Yassour,M., Grabherr,M., Blood,P.D., Bowden,J., Couger,M.B., Eccles,D., Li,B., Lieber,M., et al. (2013) De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols, 8, 1494–1512.

242. Grabherr,M.G., Haas,B.J., Yassour,M., Levin,J.Z., Thompson,D.A., Amit,I., Adiconis,X., Fan,L., Raychowdhury,R., Zeng,Q., et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology, 29, 644–652.

243. Paro,S., Li,X., O'Connell,M.A. and Keegan,L.P. (2012) Regulation and functions of ADAR in Drosophila. Curr. Top. Microbiol. Immunol., 353, 221–236.

244. Spencer,W.C., McWhirter,R., Miller,T., Strasbourger,P., Thompson,O., Hillier,L.W., Waterston,R.H. and Miller,D.M. (2014) Isolation of specific neurons from C. elegans larvae for gene expression profiling. PLoS ONE, 9, e112102.

245. Shapiro,E., Biezuner,T. and Linnarsson,S. (2013) Single-cell sequencing-based technologies will revolutionize whole-organism science. Nature Reviews Genetics, 14, 618–630.

246. Stegle,O., Teichmann,S.A. and Marioni,J.C. (2015) Computational and analytical challenges in single-cell transcriptomics. Nature Reviews Genetics, 16, 133–145.

247. Kolodziejczyk,A.A., Kim,J.K., Svensson,V., Marioni,J.C. and Teichmann,S.A. (2015) The Technology and Biology of Single-Cell RNA Sequencing. Mol Cell, 58, 610–620.

248. Navin,N.E. (2014) Cancer genomics: one cell at a time. Genome biology, 15, 452.

249. Eid,J., Fehr,A., Gray,J., Luong,K., Lyle,J., Otto,G., Peluso,P., Rank,D., Baybayan,P., Bettman,B., et al. (2009) Real-time DNA sequencing from single polymerase molecules. Science (New York, N.Y.), 323, 133–138.

250. Clarke,J., Wu,H.-C., Jayasinghe,L., Patel,A., Reid,S. and Bayley,H. (2009) Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol, 4, 265–270.

251. Milek,M., Wyler,E. and Landthaler,M. (2012) Transcriptome-wide analysis of protein-RNA interactions using high-throughput sequencing. Seminars in Cell and Developmental Biology, 23, 206–212.

252. Schönborn,J., Oberstrass,J., Breyel,E., Tittgen,J., Schumacher,J. and Lukacs,N. (1991) Monoclonal antibodies to double-stranded RNA as probes of RNA structure in crude nucleic acid extracts. Nucleic Acids Research, 19, 2993–3000.

253. LeGendre,J.B., Campbell,Z.T., Kroll-Conner,P., Anderson,P., Kimble,J. and Wickens,M. (2013) RNA Targets and Specificity of Staufen, a Double-stranded RNA-binding Protein in Caenorhabditis elegans. The Journal of biological chemistry, 288, 2532–2545.

254. Bond,C.S. and Fox,A.H. (2009) Paraspeckles: nuclear bodies built on long noncoding RNA. J. Cell Biol., 186, 637–644.

255. Kelley,D. and Rinn,J. (2012) Transposable elements reveal a stem cell-specific class of long noncoding RNAs. Genome biology, 13, R107.

256. Hellwig,S. and Bass,B.L. (2008) A starvation-induced noncoding RNA modulates expression of Dicer-regulated genes. Proc Natl Acad Sci USA, 105, 12897–12902.
