accurate identification of adenosine deamination · 2016-08-04
Accurate Identification of Adenosine Deamination
by
Gavin Walter Wilson
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Molecular Genetics University of Toronto
© Copyright by Gavin Walter Wilson 2016
Accurate Identification of Adenosine Deamination with RNA-seq
Gavin Walter Wilson
Doctor of Philosophy
Molecular Genetics University of Toronto
2016
Abstract
The eukaryotic transcriptome is further diversified by post-transcriptional processing, including
alternative splicing and RNA editing. The latter includes the modification of adenosine to inosine
(A-to-I) within structured transcripts. Uniquely, inosine has similar base-pairing properties to
guanine, which can have downstream consequences to RNA secondary structure or RNA-RNA
interactions. While RNA editing events were typically characterized on a gene-by-gene basis,
advancements in high throughput RNA-sequencing technologies have allowed A-to-I editing to
be investigated on a global scale. However, accurate identification of A-to-I edits on a
transcriptome-wide scale is complicated by artifacts introduced by reverse transcription,
sequencing, and computational alignment, all of which can lead to false positive signals. RNA-
seq read alignment for the purposes of RNA editing calls can be affected by the accuracy of
spliced, gapped, multi-mapped, and mismatch alignments. To address these challenges, I
developed RNASequel, a software package that runs as a post-processing step in conjunction
with an RNA-seq aligner. I benchmarked the accuracy of RNASequel using a combination of
human-derived simulated and biological datasets and demonstrated a clear improvement in all
four of the aforementioned accuracy metrics compared to current RNA-seq alignment tools.
Next, I utilized RNASequel to identify clusters of A-to-I hyper-editing in 91 C.
elegans samples using a novel algorithm designed to mitigate common sources of false positive
calls that are difficult to mitigate during read alignment. This resulted in the most comprehensive
map of RNA editing in C. elegans to date with 197,890 sites within 10,941 clusters. I then
explored the localization of the clusters to genetic features and heterochromatin. Collectively,
these data show the extensive editing events in the worms, while concurrently demonstrating the
utility of RNASequel.
Acknowledgments
I would like to express my sincere thanks to my supervisor, Dr. Lincoln Stein; without his expertise in bioinformatics, the work presented in this thesis would not have been possible. I would also like to thank my committee members: Dr. Ben Blencowe, Dr. Michael Brudno and Dr. Quaid Morris. Their boundless insight, feedback and much-needed pressure to wrap up my projects were essential for the completion of my doctoral studies. I would like to thank my friend and colleague Ewan “RNA has G:U base-pairs” Gibb. The quality of my scientific writing would not be where it is today without his thoughtful and thorough advice and suggestions. My friendship with Ewan has resulted in many of the fondest memories from my graduate school experience in my masters and doctoral degrees. Many thanks to my colleagues and friends: Nardin Samuel, Faiyaz Notta, Marc Perry, Shirley Tam, and Quang Trinh. Their friendship and constant scientific dialog have been one of the highlights of my doctoral experience. Finally, I would like to thank Nick Provart and Fritz Roth for their helpful comments that have improved the quality of my thesis.
Every day is a new day. It is better to be lucky. But I would rather be exact. Then when luck comes you are ready.
Ernest Hemingway – The Old Man and the Sea
Table of Contents
Acknowledgments
Table of Contents
List of Tables
List of Figures
Abbreviations
1 Background
1.1 The Dynamic Eukaryotic Transcriptome
1.2 A-to-I editing in Caenorhabditis elegans
1.3 Nucleic Acid Sequencing
1.4 High Throughput Sequencing
1.5 Illumina Sequencing Artifacts
1.6 RNA Sequencing
1.7 High-Throughput RNA Sequencing
1.7.1 RNA-seq Library Preparation
1.8 RNA-seq Library Preparation Challenges
1.9 Sequence Alignment Algorithms
1.10 RNA-seq Read Alignment
1.10.1 Segmentation Approaches
1.10.2 Seed and Extend Approaches
1.11 Current Challenges Mapping RNA-seq Pairs
1.11.1 Identifying RNA editing with RNA-seq
1.11.2 Identifying RNA edits without sequencing
1.12 Thesis Objectives
2 Accurate RNA-seq Realignment with RNASequel
2.1 Results:
2.1.1 Developing an Accurate RNA-seq Realignment Tool
2.1.2 RNASequel realignment leads to improved alignment accuracy
2.1.3 Realignment to a splice junction database improves spliced read accuracy
2.1.4 RNASequel realignment improves alignments with insertions and deletions
2.1.5 RNASequel realignment increases mismatch tolerance and accuracy
2.1.6 RNASequel execution speed and memory requirements
2.1.7 RNASequel realignment improves alignment characteristics on biological datasets
2.1.8 RNASequel realignment generates more robust RNA editing calls
2.2 Discussion:
2.3 Methods:
2.3.1 Reference genome and annotations
2.3.2 Biological Datasets:
2.3.3 Alignment Protocols:
2.3.4 RNASequel Realignment
2.3.5 Splice Junction Definitions and Alignment Scoring
2.3.6 Splice Junction Discovery and Splice Junction Index Generation
2.3.7 Contiguous and Spliced Read Alignment
2.3.8 Estimating the Empirical Fragment Size Distribution
2.3.9 Resolving Read Pair Alignments
2.3.10 Simulated Dataset Benchmarking
2.3.11 Identifying Putative Adenosine to Inosine RNA editing events
3 Identifying RNA Hyper-editing in C. elegans
3.1 Background
3.2 Results
3.2.1 Improvements to the RNASequel Aligner
3.2.2 RNA-seq sample processing and alignment
3.2.3 Accurate and sensitive identification of hyper-editing
3.2.4 Comparison with other studies
3.2.5 Clusters are enriched for non-coding elements
3.2.6 Clustered A-to-I edit replication and properties
3.2.7 A Global map of A-to-I editing
3.2.8 Intronic edits are depleted near splice-sites
3.2.9 Intergenic edits and antisense transcripts
3.2.10 3’-UTR clusters and poly(A) Sites
3.2.11 Identifying putative A-to-I dependent amino acid changes
3.3 Discussion
3.4 Methods
3.4.1 C. elegans gene annotations and reference sequences
3.4.2 Samples
3.4.3 RNA-seq preprocessing and alignment
3.4.4 Whole Genome Alignment and Variant Calling
3.4.5 Identifying potential A-to-I editing events
3.4.6 Annotating edits and clusters
3.4.7 Chromosomal maps
3.4.8 Detection recurrent A-to-I editing events within splice sites, polyadenylation signals, and coding regions
3.5 Appendix
4 Discussion
References
List of Tables
Table 3.1 Mapping Rates.
Table 3.2 A-to-I clustered edit recurrence rates.
Table 3.3. Recurrent A-to-I edits that overlap splice sites
Table 3.4. Recurrent A-to-I edits that overlap annotated poly(A) signals
Table 3.5.1 C. elegans samples Processed in this study
Table 3.5.2 Clustered A-to-I edits (Supplementary File: Wilson_Gavin_W_201606_PhD_worm_edits.txt)
Table 3.5.3 Extended 3’-UTRs (Supplementary File: Wilson_Gavin_W_201606_PhD_utr_extensions.xlsx)
List of Figures
Figure 1.1 Timeline of sequencing technology developments.
Figure 1.2 RNA-seq Library Preparation
Figure 1.3 RNA-seq Alignment Overview.
Figure 1.4 RNA-seq Alignment Strategies.
Figure 2.1 RNA-sequel realignment schematic.
Figure 2.2 Simulated dataset alignment rates.
Figure 2.3 Example simulated dataset alignment
Figure 2.4 Example simulated dataset alignment
Figure 2.5 Spliced read alignment rates for the simulated datasets
Figure 2.6 The number of correct splice junctions identified in each read stratified by the total number of true splice junctions for the simulated datasets.
Figure 2.7 Alignment characteristics for the first simulated dataset
Figure 2.8 Alignment characteristics for the second simulated dataset
Figure 2.9 Positional biases of false positives, false negatives, and true positives for splice junctions and gaps for the first simulated dataset.
Figure 2.10 Positional biases of false positives, false negatives, and true positives for splice junctions and gaps for the second simulated dataset
Figure 2.11 Positional biases of false positive, false negative and true positive mismatches for both simulated datasets.
Figure 2.12 Alignment rates for biological datasets with matched somatic SNP calls.
Figure 2.13 Alignment rates for 25 additional biological datasets.
Figure 2.14 Application of the RNASequel fragment size estimation and verification algorithm to the alignments produced by Tophat and STAR.
Figure 2.15 YH SNV and edit call comparisons.
Figure 2.16 GM12878-1 SNV and edit call comparisons.
Figure 2.17 GM12878-2 SNV and edit call comparisons.
Figure 2.18 Comparing SNV and edit call rates for STAR Ann. with two passes and STAR Ann. + RNASequel for the 25 other biological datasets.
Figure 2.19 Comparing the differences in edit calls after removing likely false positive alignments for 25 other biological datasets.
Figure 2.20 Identifying alignment issues that cause false positive variant calls for YH, GM12878-1 and GM12878-2.
Figure 2.21 Identifying alignment issues that cause false positive variant calls for the other 25 biological datasets.
Figure 3.1 Alignment rates for the C. elegans samples processed in this study
Figure 3.2 Summary variant identification and filtering steps.
Figure 3.3 Comparison of variant call rates for clustered and singleton edits.
Figure 3.4 Number of clustered A-to-I and non-A-to-I versus the number of uniquely mapped reads.
Figure 3.5 A-to-I edit call comparison with other studies.
Figure 3.6 A-to-I editing association with genetic elements.
Figure 3.7 A-to-I hyper-edit recurrence stratified by the overlapping genetic element and recurrence rate.
Figure 3.8 Properties of clusters by the dominant base and repeat type of the edits contained in the cluster.
Figure 3.9 Global A-to-I cluster localization.
Figure 3.10 Chromosomal distribution of clustered edits.
Figure 3.11 Global Pearson correlations between chromatin marks, A-to-I edits, and genetic features.
Figure 3.12 A-to-I editing events may have been missed within introns.
Figure 3.13 Saturation analysis of A-to-I hyper-edits.
Figure 3.14 Properties of edit clusters and repeat elements within introns.
Figure 3.15 Properties of intergenic edits antisense to annotated genetic elements.
Figure 3.16 Localization of 3’-UTR edits with respect to poly(A) sites.
Abbreviations
ssRNA single-stranded RNA
dsRNA double-stranded RNA
RT Reverse Transcriptase
PCR Polymerase Chain Reaction
bp Base Pair
nt Nucleotide
cDNA Complementary DNA
rRNA Ribosomal RNA
RNA-seq High-throughput whole transcriptome sequencing
A, C, G, T, U, I Adenosine, Cytosine, Guanine, Thymine, Uracil, Inosine
dNTP deoxyribonucleoside tri-phosphate
rNTP ribonucleoside tri-phosphate
ddNTP di-deoxyribonucleoside tri-phosphate
PAGE polyacrylamide gel electrophoresis
SW, NW Smith-Waterman, Needleman-Wunsch
BWT Burrows Wheeler Transform
SNP, SNV Single Nucleotide Polymorphism, Single Nucleotide Variant
RBP, dsRBP RNA binding protein, double-stranded RNA binding protein
lncRNA Long non-coding RNA
1 Background
Nucleic acids are the blueprints of biological life on earth; they encode the regulatory and coding
potential of an organism’s genome. Since the discovery of the structures of DNA and RNA
molecules there has been an intense scientific effort to develop technologies to sequence DNA
and RNA molecules (1). Sequencing technologies have evolved from being low-throughput and
labor intensive to high-throughput and automated (Figure 1.1) (2-4). This has been an
incremental process where previous technological advances are integrated or improved to create
new sequencing technologies. These innovations have progressed the capabilities of sequencing
from a single transfer-RNA to individual genes and finally to whole viral, bacterial and
eukaryote genomes and transcriptomes. Within the last decade the increase in throughput and
decrease in cost has been staggering (5-7). With the rise of high-throughput sequencing
technologies such as Illumina, the cost of sequencing a human genome is below ten thousand
dollars. This has permitted population scale genome sequencing consisting of thousands of
genomes. Sequencing technologies are quickly reaching a point where it is possible to sequence
a whole human genome for less than one thousand dollars. This is in stark contrast to the first
draft human genome sequence published in 2001, which was estimated to cost nearly three billion
dollars (8, 9). This has led to a new analysis bottleneck due to the need to store, process, map
and analyze the tremendous amount of sequencing data. The cost of the analysis is quickly
approaching the cost of the sequencing (7). Novel algorithms, tools and optimizations have been and
continue to be required to meet the aforementioned challenges.
While historically individual RNA species were sequenced, current technologies have facilitated
sequencing RNA on a global scale (10). Whole transcriptome sequencing has been crucial to our
understanding of the cellular transcriptome by increasing sensitivity compared to microarrays,
serial analysis of gene expression, and expressed sequence tag sequencing (10, 11). This has led
to the identification of novel or infrequent regulatory events and transcripts such as RNA editing
and long non-coding RNA (12-14). However, this sensitivity has revealed a new challenge to
separate biologically relevant transcriptional events from spurious events (biological noise).
Figure 1.1. Timeline of sequencing technology developments. (A) Major milestones in the development of sequencing technologies are indicated along the timeline. (B) High-throughput sequencer read lengths; machines are listed in the legend. For machines that produce paired-end reads, the length of a single read is indicated. (C) Number of reads produced. (D) Throughput in gigabases (number of reads × read length / 10^9); for machines that produce paired-end reads, the read length was doubled. Note that a logarithmic y-axis is used for panels (C) and (D). Data retrieved from Nederbragt, Lex (2012): developments in NGS. Figshare: http://dx.doi.org/10.6084/m9.figshare.100940. Note that only a single data point is shown for the Sanger 3730xl and the Megabase 4500.
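The throughput calculation described for panel (D) can be written out as a short sketch. The function name and the example run (300 million read pairs of 100 nt) are hypothetical and chosen only to illustrate the arithmetic; they are not values taken from the figure.

```python
def throughput_gigabases(num_reads: int, read_length: int, paired_end: bool = False) -> float:
    """Throughput in gigabases = number of reads x read length / 10^9.

    For paired-end machines the read length is doubled, as stated in the caption.
    """
    effective_length = read_length * 2 if paired_end else read_length
    return num_reads * effective_length / 1e9


# Hypothetical run: 300 million read pairs, 100 nt per read.
print(throughput_gigabases(300_000_000, 100, paired_end=True))  # 60.0
```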
1.1 The Dynamic Eukaryotic Transcriptome
One of the fundamental biological processes within a cell is the transcription of gene products.
This process is tightly regulated and catalyzed by the RNA polymerase family of proteins.
Transcriptional regulation is carried out at multiple levels including chromatin state and
transcription factor binding (15). Perturbations in the regulation of a gene or genes can lead to
developmental disorders and cancer (15). After transcription there is another set of regulatory
layers, the so-called post-transcriptional regulatory mechanisms (16, 17). Some of these
mechanisms occur co-transcriptionally while the nascent RNA is being transcribed (17-21),
while others occur post-transcriptionally and include: 5’-end capping, 3’-polyadenylation, RNA
splicing, RNA editing, RNA interference, and nonsense mediated decay (16, 17).
Transcripts were originally thought to have their function encoded within a single contiguous
sequence. However, this assumption was challenged by the observation that eukaryotic nuclear
transcripts were much longer than the corresponding cytoplasmic transcripts (22-24). This led to
the identification of intron sequences that “split” the functional nucleotides of a transcript. The
introns are excised from the pre-mRNA transcript by the spliceosome complex co- and post-
transcriptionally (19). Two key sequence motifs are present at the 5’- and 3’-ends of the
intron sequence; these are called the 5’- and 3’-splice sites, respectively (25-27). Combinations
of different splice sites can be used to generate different transcript isoforms in a process known
as alternative splicing (25, 28). Another important sequence feature of introns is the branch point
site that is used to form an intron lariat by ligating the 5’-end of the intron to a conserved
adenosine within the branch point site (26, 27). Finally, conserved sequence motifs within the
exonic and intronic sequences are bound by trans-acting splicing regulatory factors that can
promote or repress the inclusion of an exon in the final transcript (28). Perturbations in any of the
conserved motifs or the expression of splicing regulatory factors can lead to developmental
defects or promote disease development such as cancer (29, 30).
RNA editing is a post-transcriptional modification of RNA molecules, in which specific
nucleotides are deaminated (31, 32). Two types of nucleotide deamination have been identified:
adenosine to inosine (A-to-I) and cytidine to uridine (C-to-U) deamination (33, 34). Inosine
preferentially base pairs with cytosine and has less energetically favorable base pairs with uracil
and adenosine (35). Uridine preferentially base pairs with adenosine and has a less energetically
favorable base-pair with guanosine (35). A-to-I and C-to-U editing can have a number of
downstream effects on the modified RNA molecule since the deaminated base has altered base
pairing. The effects of editing include: (i) consequences to base pairing and stability, (ii) changes
in translated amino acid sequences due to changes in amino acid codons within the RNA
transcript, and (iii) effects on alternative splicing (34, 36-38). C-to-U deamination is an
integral part of host viral defense against dsRNA substrates (33).
A-to-I editing is the dominant form of editing in metazoans, and it is catalyzed by adenosine
deaminases that act on RNA (ADARs), which target double stranded RNA (dsRNA) (31, 32, 34,
38). ADAR proteins are only found in metazoan species (39). ADAR proteins do not exhibit
sequence specificity, but do show flanking sequence preferences (40). ADARs have two
primary modes of RNA editing: selective editing, which targets dsRNAs with specific structures
to promote the editing of specific bases, and promiscuous editing, where long stem-loop structures
are edited randomly (34). This editing can occur co-transcriptionally, suggesting a potential
regulatory role for post-transcriptional regulatory pathways such as alternative splicing and
localization due to A-to-I edits altering the secondary structure and sequence motifs within the
targeted transcript (19, 21, 38). ADARs play a critical role in development and the central
nervous system (CNS) (34, 38). ADAR knockouts in mice result in a lethal phenotype, while
knockouts in Drosophila melanogaster and Caenorhabditis elegans remain viable (41-44).
Selective A-to-I editing is critical for a glutamine to arginine substitution in the glutamate
receptor-2 (GluR2) protein, which alters Ca2+ permeability through the AMPA receptor (45,
46). The depletion of the glutamine to arginine substitution in humans contributes to the
development of sporadic amyotrophic lateral sclerosis (ALS) and the lethal phenotype in mice
(42, 47, 48).
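How a single A-to-I edit can change an encoded amino acid can be sketched in a few lines. The example below assumes a CAG glutamine codon purely for illustration, uses only the two standard genetic code entries it needs, and relies on the fact that inosine pairs preferentially with cytosine and is therefore read as guanosine after reverse transcription and sequencing. The function names are hypothetical.

```python
# Minimal codon table: only the two standard-code entries used in this illustration.
CODON_TABLE = {"CAG": "Gln", "CGG": "Arg"}


def edit_codon(codon: str, position: int) -> str:
    """Deaminate the adenosine at `position` (0-based) to inosine (I)."""
    if codon[position] != "A":
        raise ValueError("A-to-I editing only acts on adenosine")
    return codon[:position] + "I" + codon[position + 1:]


def sequenced_codon(rna_codon: str) -> str:
    """Return the codon as it is read out: inosine is interpreted as guanosine."""
    return rna_codon.replace("I", "G")


unedited = "CAG"                      # illustrative glutamine codon (assumption)
edited = edit_codon(unedited, 1)      # "CIG"
read_as = sequenced_codon(edited)     # "CGG"
print(CODON_TABLE[unedited], "->", CODON_TABLE[read_as])  # Gln -> Arg
```

The same substitution rule is why A-to-I edits appear as A-to-G mismatches when RNA-seq reads are compared with the reference genome.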
The extent of A-to-I editing and its role in cellular function have not been fully explored.
However, the majority of RNA editing is associated with transposable elements capable of
folding into dsRNA structures such as Alu elements in humans (49, 50). Since ADAR proteins
can relax dsRNA structures they could affect the binding of any dsRNA specific protein (51, 52).
Furthermore, inosine could affect the binding of proteins recognizing conserved motifs that are
edited by ADAR proteins. One of the best examples of this is the competition between ADAR
proteins and the RNAi pathway for dsRNA substrates that was observed in C. elegans (53, 54).
Another example is the auto-regulation of ADAR2 in rats by altering its splicing patterns through
the modification of an AA dinucleotide splice site to AI, which acts as a canonical splice acceptor
site (55). There are other examples of A-to-I editing affecting RNA stability, RNA translation,
RNA localization, miRNA, antiviral protection, and heterochromatic gene silencing (17, 34, 38,
56-61).
1.2 A-to-I editing in Caenorhabditis elegans
C. elegans has two ADAR genes (adr-1 and adr-2) with only adr-2 being catalytically active. D.
melanogaster encodes a single ADAR protein. The C. elegans adr-1 gene has a truncation of its
catalytic domain and plays a regulatory role for RNA editing (44, 62). Unlike mammals, ADAR
knockouts in C. elegans remain viable but have a reduced lifespan, chemotaxis defects and
transgene silencing. In Drosophila, knockout of the sole ADAR gene leads to viable flies with
normal lifespans but they exhibit strong behavioral defects (63). Finally, in murine models both
ADAR proteins are required for viability; ADAR1 is required for erythropoiesis and ADAR2 is
required for proper function of the AMPA receptor (42, 64).
ADAR proteins appear to act co-transcriptionally implying that editing on the pre-mRNA
molecule can compete with other dsRNA-binding proteins. In support of this, ADAR proteins
have been found to alter small RNA expression (53, 54). Small RNAs are non-coding RNA
molecules with a size of 21-26 bp. There are multiple classes of small RNAs in C. elegans and
these include: microRNAs, endogenous small interfering RNAs, and piwi-interacting RNAs
(65). These small RNA molecules participate in multiple regulatory pathways including
translation and gene expression (65). Knockouts of RNAi pathway components lead to the
suppression of ADAR knockout phenotypic defects (66). Furthermore, adr-1, adr-2 double
knockouts have perturbed small-RNA expression suggesting that both ADAR proteins can
compete with RNAi for dsRNA substrates (53, 54). This competition may be an early
mechanism to distinguish exogenous and endogenous sources of dsRNA. In this model,
endogenous dsRNAs undergo RNA editing, which inhibits their processing by the RNAi
machinery. This does appear to be the case in higher order eukaryotes, where knockouts of
dsRNA sensing pathways suppress ADAR1 knockout phenotypes in mice (67). ADAR
proteins also commonly target intronic dsRNA structures (53, 68-70). RNA editing or ADAR
binding may contribute to the regulation of circular RNA biogenesis or alternative splicing (71).
A-to-I editing in C. elegans clusters into regions of hyper-editing to produce transcripts with 30
or more edits (44, 53, 69, 70). The clusters of hyper-editing are typically found within non-
coding DNA elements including: introns, 3’-UTRs, and intergenic sequences (53, 68, 69). These
regions are associated with sources of dsRNA including inverted repeats and transposons (34,
53, 72). Finally, hyper-edited regions tend to be localized to the arms of the autosomal
chromosomes (68).
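The idea of edits grouping into hyper-edited clusters can be illustrated with a simple distance-based grouping of candidate edit positions. This is only a sketch and is not the cluster identification algorithm developed in this thesis; the function name and the `max_gap` and `min_edits` thresholds are hypothetical values chosen for the example.

```python
from typing import List


def cluster_edits(positions: List[int], max_gap: int = 50, min_edits: int = 3) -> List[List[int]]:
    """Group candidate edit positions on one chromosome into clusters.

    Consecutive sorted positions stay in the same cluster while they are at most
    `max_gap` bases apart; groups with fewer than `min_edits` sites are discarded.
    Both thresholds are illustrative assumptions.
    """
    clusters: List[List[int]] = []
    current: List[int] = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            # Gap too large: close the current group and start a new one.
            if len(current) >= min_edits:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_edits:
        clusters.append(current)
    return clusters


sites = [100, 110, 125, 140, 900, 2000, 2010, 2030]
print(cluster_edits(sites))  # [[100, 110, 125, 140], [2000, 2010, 2030]]
```

In this toy input the isolated site at position 900 is dropped as a singleton, while the two dense groups are retained.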
1.3 Nucleic Acid Sequencing
Nucleic acid sequencing has been an important area of research since the discovery of the
structure of DNA and RNA in 1962 (1, 73). The seminal methods developed during the early
days of nucleic acid sequencing have been used extensively in the current generation of
sequencers. The primary goals driving nucleic acid sequencing have been to reduce costs, decrease
sequencing time, and increase sequencing read throughput and read lengths.
The first gene to be completely sequenced was the 77 nt yeast alanine transfer-RNA (tRNA-Ala) in 1965 by Holley
et al. (73). They utilized complete and partial digestions with a set of ribonucleases with known
specificity combined with ion-exchange chromatography to resolve the fragment sizes. This was
improved by using two-dimensional polyacrylamide gel electrophoresis of degradation products
(2-4 nts) labeled with the radioactive isotope 32P (74). This method facilitated the sequencing of
the first RNA gene, the ~460 nts RNA bacteriophage MS2 coat protein, and later the entire 3,569
nt MS2 RNA genome (75, 76). These mechanical sequencing methods were laborious and did
not scale well to larger genes and genomes.
The fundamental innovation that led to DNA sequencing was primer extension, which used
sequencing by extension and combined three important observations (77): 1) that
deoxyoligonucleotide sequences (primers) can be annealed to template DNA to prime synthesis
with DNA polymerase (78, 79); 2) radiolabelled dNTPs can be used in the primer extension
reaction to extend a specific primer (78); 3) primer extension can be terminated by not using all
four of the dNTPs (78). The next major development took advantage of primer extension to
develop the “plus and minus” DNA sequencing method that used sequencing by extension to
sequence the 5,386 nt bacteriophage ΦX174 genome (first DNA genome sequenced) (80, 81).
In parallel to Sanger and Coulson, Maxam and Gilbert developed a simplified chemical
sequencing method capable of sequencing dsDNA or ssDNA in 1977 (82). The method consisted
of treatment of DNA 32P radiolabelled at one of its 5’-ends. The radiolabelled fragments were
treated with one of four different reagents that cleaved DNA at A+G nucleotides, G nucleotides,
C nucleotides, or C+T nucleotides. The fragments were then resolved in a PAGE gel to sequence
more than 100-200 bp per reaction. This method was easier to use compared to the “plus and
minus” method since it required only four reactions rather than eight and it could use both
dsDNA and ssDNA as template. Similar chemical sequencing methods were developed and
applied to sequencing 3’-radiolabelled RNA (83).
The Sanger sequencing method was developed in 1977 and represents one of the most important
innovations in nucleic acid sequencing (84). Sanger sequencing uses sequencing by synthesis
and chain-terminating dideoxynucleotides to sequence each of the standard dNTPs
independently. Each reaction consisted of the standard dNTPs, ssDNA template, a 5’-32P labelled
primer, and a lower concentration of one of the four ddNTPs. The ddNTPs lack a 3’-hydroxyl
preventing the formation of phosphodiester bonds between nucleotides, causing DNA
polymerase to terminate extension at the ddNTP. The four reactions were originally resolved on
a PAGE slab gel; issues resolving ssDNA sequences with secondary structure were later
rectified by using thin denaturing polyacrylamide sequencing gels (85). The next
adaptation came by replacing the need for radiolabelled primers by labelling each of the four
standard ddNTPs with a different coloured fluorescent dye (86, 87). Dye-terminator based DNA
sequencing consisted of a mixture of the standard dNTPs and all four of the dye-labelled
ddNTPs. The sample could then be resolved in a single sequencing lane rather than four. The
sequences could be read with optics and automated software algorithms that greatly sped up the
sequencing process and increased sample throughput. These methods were further improved with
recombinant DNA polymerases and improved dye technologies. One downside of Sanger
sequencing was that it required ssDNA template, which was resolved by the development of
ssDNA bacteriophage based cloning vectors (88). As the quality of sequencing reagents
increased, the limiting factor became the sequencing gel, which had a maximum read length of up
to ~700 bp. Furthermore, increasing the number of samples in a sequencing gel (up to 96) caused
difficulty with sample loading, lane tracking and the potential for overlapping bands between
samples (89).
There was a need to separate each sample into its own miniature gel, which was accomplished by
using polyacrylamide-filled capillaries, which eventually increased the read length to 1,000 bp
(90-93). The next major increase in throughput was the ability to replace the gel within a
capillary by using non-crosslinked polyacrylamide to sequence another sample and to arrange the
capillaries into arrays with 96 to 384 capillaries per sequencing machine (94, 95). The
sequencing reaction steps and loading of the capillaries could be automated permitting increased
throughput. Automated sequencing processes could read the sequence and produce base
qualities for each sequenced nucleotide to indicate confidence in the base. The discovery of
thermostable polymerases and the development of PCR permitted the chain-termination reaction
to be subject to repeated cycles of denaturation, primer annealing, and extension (96-98). Cycle
sequencing reduced the amount of primer wasted during sequencing, increased the yield of the
terminated sequencing products and eliminated the requirement for ssDNA templates.
1.4 High Throughput Sequencing
The huge cost and effort involved with sequencing the draft human genome continued the push
for more sequencing throughput and reduced sequencing costs. This led to the development of
three novel cyclic sequencing methods that took advantage of innovative optical, microfluidic
and biochemical sequencing technologies: 1) 454 pyrosequencing (454); 2) Solexa / Illumina
sequencing by synthesis; 3) SOLiD sequencing by ligation. These methods heralded a revolution
in DNA sequencing and were capable of producing hundreds of megabases to terabases of data
in a single experiment. Common to all of these methods is the need to fragment a DNA library
into pieces that are sized according to the sequence read length and process. The complete
sequencing experiment can be completed in a week or two, cost significantly less than Sanger
sequencing, and produce base quality scores similar to automated Sanger sequencing methods.
The 454 sequencer was the first commercially available high-throughput sequencing system. The
system involves the fragmentation of a DNA library, adapter ligation and then the binding of
individual fragments to microbeads (99, 100). The ssDNA substrates are clonally amplified on
each bead using emulsion PCR. The beads are mixed with sequencing primer and DNA
polymerase to prepare them for nucleotide extension. The beads are deposited into picoliter-sized
wells on a fabricated flow cell. Each flow cell contains 1.6 million wells that have enough space
for a single bead. The beads in each well are sequenced by pyrosequencing using ATP
sulfurylase, luciferase, and apyrase (100). The ATP sulfurylase catalyzes the conversion of
pyrophosphate to ATP in the presence of adenosine 5’-phosphosulfate, and luciferase uses the ATP
to convert luciferin into a product that emits visible light. Each of the four standard dNTPs is
then washed across the slide in turn and the light is captured using a specialized camera.
Unincorporated dNTPs and leftover ATP are degraded by the apyrase. The dNTP wash can be repeated
up to one hundred times
producing 1,000,000 reads up to 500 bp in length. A typical sequencing experiment can produce
up to 700 megabases.
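The flow-by-flow readout described above can be illustrated with a minimal Python sketch. This is an idealized model and not the 454 base-calling software: the flow order TACG, the function names, and the assumption that the light signal is exactly equal to the number of bases incorporated in a flow are all simplifications introduced here for illustration.

```python
def flowgram(template, flow_order="TACG", n_cycles=4):
    """Idealized pyrosequencing readout.

    `template` is the sequence being synthesized (5'->3'). For each
    nucleotide flow, the signal is the number of consecutive bases that
    can be incorporated, so homopolymer runs give proportionally
    stronger signals. Returns a list of (nucleotide, signal) tuples.
    """
    pos = 0
    flows = []
    for _ in range(n_cycles):
        for base in flow_order:
            run = 0
            # Incorporate every consecutive copy of the flowed base.
            while pos < len(template) and template[pos] == base:
                run += 1
                pos += 1
            flows.append((base, run))
    return flows


def call_bases(flows):
    """Convert (nucleotide, signal) flows back into a base sequence."""
    return "".join(base * signal for base, signal in flows)
```

In this model a flow with no incorporation yields a signal of zero, which is why many flows are needed to read a few hundred bases. Real signals are noisy, so long homopolymer runs are harder to call than this sketch suggests.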
Illumina sequencing utilizes sequencing by synthesis with reversible termination to sequence
ssDNA fragments (101). There are two modes of Illumina sequencing: single-end and paired-end
sequencing. For single-end sequencing only one end of the template fragment is sequenced,
while for paired-end both ends of the template are sequenced in two steps. Paired-end
sequencing is the most common method of Illumina sequencing for genome and RNA
sequencing, while single-end sequencing is common for methods with small fragment templates
such as small-RNA-seq and ChIP-seq (10, 102).
Illumina libraries are prepared by ligating sequencing adapters to randomly generated and size
selected DNA fragments (101). The adapter-ligated fragments are amplified using adapter
specific PCR primers and a low number of cycles. The amplified fragments are size-selected
using beads or agarose gel and the fragment sizes are selected to minimize read overlap when
sequencing both ends of the fragment. Instead of clonally amplifying the DNA on beads using
emulsion PCR, the fragments are hybridized onto a lane within a flow cell (101, 103, 104). Each
lane of the flow cell is coated with oligonucleotides that are complementary to the sequencing
adapter. The library fragments are hybridized to the lane taking into consideration the density of
ssDNA molecules that can be sequenced within the lane. The hybridized ssDNA molecules are
amplified using bridge amplification to create clusters of ~1000 molecules (101, 103, 105). The
clustering step is essential for the visualization of the sequencing reaction. The clusters are then
linearized to ssDNA, a sequencing primer is annealed and recombinant DNA polymerase is
added to the flow cell. Illumina sequencing uses reversibly terminated dNTPs that are 3’-labelled
with a fluorescent dye (101, 106). Each of the four dNTPs is labelled with a different
colour and the flow cell is flooded with all four of the dNTPs at once. DNA polymerase
incorporates the dNTPs and the flow cell is imaged to detect the colour emitted by each cluster.
The reversible terminator and fluorescent label are then cleaved off to free the 3’-OH group for the
next round of extension. A key innovation with Illumina sequencing is that clusters can be
regenerated, which permits the sequencing of the opposite end of the DNA fragment to produce
paired-end reads.
Currently, the Illumina HiSeq 4000 can sequence two flow cells each consisting of eight lanes.
Each lane produces ~300 million 2 x 150 bp paired-end reads in 3.5 days. The MiSeq is a
miniature version of Illumina’s HiSeq that can produce 25 million 2x300 bp paired-end reads in
a single lane flow cell. The cost of sequencing reagents and simplicity of the sequencing reaction
has made Illumina sequencing the method of choice compared to 454 and SOLiD sequencing.
1.5 Illumina Sequencing Artifacts
The huge throughput offered by Illumina sequencing is not without issues. The most common
issues with Illumina sequencers are errors that lead to false positive mismatches within the
sequenced read and GC content biases (4, 107-109). False positive mismatches occur due to two
sequencing issues, the first being cross-talk and the second phasing. Cross-talk errors occur due
to spectral overlap between the fluorescently labelled nucleotides. These cross-talk issues
manifest as A-to-C and G-to-T mismatches due to their similar emission spectra. Phasing occurs
due to the loss of synchrony of the sequencing reactions in a cluster due to issues with the
Illumina reversible-sequencing chemistry. Phasing can be subdivided into pre-phasing and
phasing errors. Phasing occurs due to the incomplete removal of a reversibly terminated
fluorescent base in a proportion of the molecules in a cluster, which causes some of the
sequencing reactions to lag behind. Pre-phasing occurs when sequencing of a molecule in a
cluster advances further than the rest of the molecules through the incorporation of
nucleotides without proper terminators. These mismatch errors are usually marked as bases with
low-base quality scores and tend to get progressively worse towards the 3’-end of the sequenced
molecule. This is reflected as a general degradation of base quality scores towards the end of the
read. Finally, the PCR amplification step during library preparation leads to a lower abundance
of GC-rich and GC-poor fragments (110, 111). While Illumina has mitigated these issues over
time, they have not been completely resolved.
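The progressive loss of synchrony within a cluster can be illustrated with a small Python sketch. It assumes a deliberately simple model in which, at every cycle, each molecule independently falls one base behind with probability `phasing`, jumps one base ahead with probability `prephasing`, or otherwise stays in step. The rates and function names are hypothetical values chosen for illustration and are not measured Illumina parameters.

```python
def offset_distribution(n_cycles, phasing=0.005, prephasing=0.003):
    """Fraction of molecules at each offset from the expected position.

    Offset 0 means in phase, negative offsets lag behind (phasing) and
    positive offsets run ahead (pre-phasing).
    """
    dist = {0: 1.0}
    in_step = 1.0 - phasing - prephasing
    for _ in range(n_cycles):
        nxt = {}
        for offset, frac in dist.items():
            for step, prob in ((-1, phasing), (0, in_step), (1, prephasing)):
                nxt[offset + step] = nxt.get(offset + step, 0.0) + frac * prob
        dist = nxt
    return dist


def in_phase_fraction(n_cycles, phasing=0.005, prephasing=0.003):
    """Fraction of the cluster still emitting the correct base."""
    return offset_distribution(n_cycles, phasing, prephasing).get(0, 0.0)
```

Under these assumed rates the in-phase fraction falls steadily with cycle number, which is consistent with the degradation of base quality towards the 3’-end of the read described above.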
1.6 RNA Sequencing
One of the difficulties with the understanding of gene structure and function relationships within
the eukaryotic transcriptome is the high level of transcript post-processing, which includes RNA
editing, alternative splicing, and polyadenylation (16, 112, 113). Prediction of these events is
difficult using genome sequencing information alone (114). There is a need to determine and
quantify what RNAs are present within a sample and the extent of their post-transcriptional
modifications on a global scale. In general the ideal RNA sequencing experiment would: 1)
quantify the overall expression of a gene, 2) infer the RNA isoforms encoded within a gene,
3) quantify the expression of each individual isoform, and 4) identify expressed variants, RNA
editing, and other post-transcriptional modifications. Sequencing technology is a viable method
to determine all of the aspects of RNA. By quantitatively sequencing the total RNA content of a
cell it is possible to calculate gene expression, isoform structures and isoform expression (115,
116). The majority of sequencing innovation is focused on sequencing DNA; however, the
discovery of reverse transcriptase (RT) in 1970, which acts to reverse transcribe RNA into
cDNA, provided a means to apply DNA sequencing technology to RNA (117, 118).
Using RT to generate cDNA introduces issues compared to standard DNA sequencing that need
to be considered: 1) RT’s lower fidelity, 2) the presence of an RNase H domain, and 3) RT
template switching effects (119-122). The lower fidelity of RT is attributed to the lack of a
proofreading activity, which leads to a higher error rate compared to DNA polymerases. The
RNase H domain can degrade RNA templates and first-strand DNA synthesis products during
long incubations. The RNase H domain can also limit full-length cDNA synthesis. Recombinant
RT proteins have improved fidelity and carry an inactive RNase H domain (122). RT template
switching occurs when the RT enzyme switches templates either inter- or intra-molecularly and
continues reverse transcription (119, 120). This can lead to the identification of false-positive
fusions or splicing events. The effect and rate of reverse transcriptase template switching have not
been fully explored; however, it has been shown to cause false positive trans-splicing, gene
fusion and cis-splicing events (119, 120). Collectively, the discovery of RT and the
advancements in DNA sequencing have provided a foundation for the development of novel
methods that permitted the identification and / or the quantification of post-transcriptional
regulation. Other important sequencing techniques include: expressed sequence tag (EST)
sequencing, serial analysis of gene expression (SAGE), high-throughput quantitative RT-PCR
based RNA isoform quantification, and RNA-seq (118, 123-127). These methods have permitted
scientists to evaluate the dynamics of gene expression and post-transcriptional regulation using
DNA sequencing technologies.
The eukaryotic transcriptome is complex due to its dynamic nature, which includes tissue and
cell type specific expression, alternative mRNA isoforms, overlapping transcripts and antisense
transcripts (15). The accurate identification of isoform structures within eukaryotic organisms is
challenging due to the presence of introns, which can range in size from a few hundred base pairs to
upwards of 500,000 base pairs (128). Alternative splicing confounds this because a gene can
have multiple isoforms that cannot be easily inferred from the genome sequence. The earliest
methods for sequencing the transcribed portion of an organism’s genome took advantage of the
poly(A) tails present on the majority of mRNAs. The key steps involved reverse transcription
with poly(T) primers that would anneal to the 3’-poly(A) tail of an RNA transcript and
synthesize a complementary DNA sequence (118, 124, 125, 127, 129). The cDNA sequences are
then cloned into vectors and transformed into the appropriate organism for replication. Individual
clones can be sequenced using Sanger sequencing to produce reads up to 1kb in length. This
method is known as cDNA cloning with expressed sequence tag (EST) sequencing and was, and
continues to be, used to profile alternative splicing, allele specific expression, gene structure, and
RNA editing. As of this writing the NCBI’s dbEST has over 74 million EST sequences from a
variety of eukaryotic species (130). EST sequencing has been an essential complement to
genome sequencing. Genome sequencing studies generally combine both genome and EST
sequencing to better annotate and predict gene structures. EST sequencing has three major
shortcomings: first, it is at best semi-quantitative; second, the full transcript is not sequenced;
and third, some cDNA sequences are recalcitrant to cloning.
The lack of quantitative information from EST sequencing limits the technique’s ability to
support the analysis of gene expression and post-transcriptional regulation. Methods have been
developed to quantify the RNA content of a cell, such as microarrays, but they are dependent on
gene annotations, which change over time as new genes and isoforms are discovered. Microarrays
are further limited by the number of probes that can be printed on a slide (~2 million). A
quantitative RNA method based on sequencing solves this issue since the sequences can be
remapped as genome annotations improve. The first method to quantify gene expression on a
global scale with sequencing was the serial analysis of gene expression (SAGE) method (123,
131). SAGE consists of reverse transcribing an RNA sample using poly(T) primers, fragmenting
the cDNA, and isolating short (14-20 bp) tags. The tags are concatenated together into
concatemers that are 800 bp - 1kbp in length. The concatemers are cloned and sequenced using
Sanger sequencing. Finally, the tags can be mapped back to the genome and the relative
abundance of the tags can be used to infer the abundance of the gene. SAGE provides a method
to analyze global gene expression; however, there are some disadvantages: generating the
concatemer tags is a laborious process, short tags can have ambiguity in their mapping
back to the genome, the resolution of SAGE is low without sequencing many concatemer clones,
and isoform structures cannot be inferred or quantified. Variations of the SAGE tag-sequencing
method have been developed; these include cap analysis of gene expression (CAGE) and
longSAGE, which uses 21 bp tags (131).
1.7 High-throughput RNA Sequencing
High-throughput sequencing technologies such as Illumina sequencing allow the global profiling
of the RNA content of a cell at single nucleotide resolution (10, 132-134). This whole
transcriptome sequencing approach - referred to as RNA-seq - is quantitative for gene and
isoform abundance estimates (10, 133). The single nucleotide resolution can be utilized for the
de novo identification of gene and isoform structure, splice junctions, RNA editing sites, RNA
methylation, and allele specific expression (12, 115, 116, 135, 136). RNA-seq has led to
fundamental discoveries in biology such as the identification of novel RNA classes including
long non-coding RNA and circular RNAs, and the extent of genes alternatively spliced in the
human transcriptome (116, 137-139).
The preparation of an RNA-seq library consists of four steps: 1) RNA purification and size
selection, 2) random fragmentation, 3) cDNA synthesis, and 4) sequencing adapter ligation. Some of
these steps may be combined, the order of steps changed or the protocol modified, and additional
RNA selection and depletion steps may be used. The library preparation can also be modified to
preserve strand information for each fragment (140). The method used to generate the library
should be tailored to the experimental goals of the study and taken into consideration for the
downstream analysis of the sequencing data. The RNA-seq library is commonly sequenced as a
paired-end library with the Illumina sequencer.
1.7.1 RNA-seq Library Preparation
The first step when preparing an RNA-seq library is to determine whether poly(A)-selected,
poly(A)-depleted or total RNA is to be sequenced (141, 142). Poly(A) selection uses a poly(T)
column or magnetic beads to select RNAs with poly(A) tails. This eliminates the majority of
non-poly(A) RNA, including rRNA. However, some non-poly(A) transcripts with A-rich regions may
also be selected. Poly(A)-depletion consists of collecting the non-hybridized RNA after poly(A)-
selection, while total RNA does not use any enrichment methods. For both poly(A)-depleted and
total RNA samples, ribosomal RNAs are extremely abundant and must be depleted. Common
methods include duplex-specific nuclease treatment, subtractive hybridization and magnetic bead
based rRNA depletion (133, 143, 144). Total RNA and poly(A)-depleted RNA samples include a
greater abundance of pre-mRNAs, repeat elements, erroneously spliced transcripts and transcripts
that are targets for RNA editing. The next step can either be random fragmentation or cDNA
synthesis depending on the protocol. After double stranded cDNA fragments are generated,
sequencing adapters are ligated to the fragments, which are then amplified and sequenced following
the standard Illumina paired-end DNA sequencing protocol. Strand information for the RNA can be
preserved with modifications to existing protocols, but these require an additional layer of
complexity and create an additional source of errors and bias (140). There have been various commercial
“kits” developed by different vendors to encapsulate all of the library steps into a simplified
protocol. These kits have been optimized for different RNA input levels from single cells to
tissue samples, strand information preservation, and for poly(A)-selection or total RNA
sequencing.
The most commonly used library preparation method is the Illumina TruSeq protocol. The
TruSeq protocol uses two rounds of poly(A) selection followed by incubation at 94°C with
divalent cations to fragment the RNA and random hexamers to prepare it for first-strand cDNA
synthesis (Figure 1.2). Reverse transcriptase is used to generate the first cDNA strand. Second
strand cDNA is synthesized using DNA polymerase with RNase H to degrade the RNA template.
The fragments are then blunted and tailed with a single 3’-A and the Illumina sequencing Y-
adapters are ligated onto the fragments. The fragments are then sequenced following Illumina’s
paired-end sequencing protocol, which includes a linear PCR amplification step.
1.8 RNA-seq Library Preparation Challenges
In addition to the sequencing issues and biases associated with the Illumina sequencing platform
there are a number of issues unique to the preparation of RNA-seq libraries that can confound
down-stream analysis and / or obscure biologically relevant signals (141). These issues are
primarily due to biases during library enrichment, fragmentation and reverse transcription (141,
145, 146). Some of these biases can be addressed computationally, while others are more difficult to
correct for and should be considered before making any biological conclusions.
Poly(A)-selection can lead to a fragment bias towards the 3’-end of transcripts and lower read
coverage at the 5’-ends of transcripts (10). The end-bias can affect gene-expression
quantification, variant calling and splice junction identification (147). This is particularly
problematic for transcripts with low expression. Biological noise is more pervasive in deeply
sequenced total RNA and poly(A) depleted samples, which may contain additional transcripts that
are erroneously expressed or spliced (148, 149). These libraries also have greater abundances of
intronic sequences due to incompletely spliced pre-mRNAs, and repeat elements such as
transposons. These additional sequences may not be relevant for the experiment and reduce the
sensitivity for transcripts with low expression levels. The library may also contain
contaminating genomic DNA if it is not properly DNase treated; this can lead to the
false positive coverage of regions that are not normally expressed (149).
Figure 1.2. RNA-seq library preparation schematic. For a description of the steps see section 1.7.1.
Panels: poly(A) selection and fragmentation; first strand cDNA synthesis; second strand cDNA
synthesis; adapter ligation; linear amplification; PCR amplification; paired-end sequencing
(Read 1, Read 2, the portion of the fragment that is not sequenced, and the fragment size).
Library fragmentation leads to a read-bias towards longer transcripts since a longer transcript will
have more potential fragments than a smaller transcript (145, 147). This leads to a long transcript
having more sequenced pairs than a smaller transcript with similar expression levels.
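The length effect can be made concrete with a short Python sketch. It assumes, as a simplification, that the expected number of fragments from a transcript is proportional to its length multiplied by its copy number, and it shows a per-kilobase normalization as one common way to compensate. The function names and example values are hypothetical.

```python
def expected_fragment_share(transcripts):
    """Expected share of sequenced fragments per transcript.

    `transcripts` maps a name to (length_nt, copy_number). Fragment
    yield is assumed to scale with length * copy_number.
    """
    weights = {name: length * copies
               for name, (length, copies) in transcripts.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def per_kilobase_share(fragment_counts, lengths):
    """Relative abundance after dividing counts by transcript length (kb)."""
    rates = {name: fragment_counts[name] / (lengths[name] / 1000.0)
             for name in fragment_counts}
    total = sum(rates.values())
    return {name: r / total for name, r in rates.items()}
```

For example, a 4,000 nt transcript and a 1,000 nt transcript present at the same copy number would be expected to contribute roughly 80% and 20% of the fragments, respectively, and the per-kilobase normalization returns both to an equal share.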
Reverse transcriptase is not error correcting and has a higher error rate than high-fidelity DNA
polymerases (121). These are true mismatch errors that are not due to spectral overlap during the
sequencing reaction, so they may have high base qualities. This can lead to an increase in false
positive mismatch calls compared to DNA libraries. This is particularly problematic for the
identification of RNA editing and variants. RT can switch templates intra- or inter-molecularly
during reverse transcription of structured regions leading to the generation of chimeric transcripts
(119, 120). These transcripts can cause the identification of false positive splicing events and
gene fusions. Random hexamer first-strand synthesis priming can lead to biases in the nucleotide
composition at the beginning of sequencing reads (145, 146, 150). The annealing of random
hexamer primers with mismatches can lead to mismatches in the sequencing reads compared to
the RNA fragment.
1.9 Sequence Alignment Algorithms
As sequencing technology produces longer and more numerous sequencing reads it has become
increasingly challenging to align or map these sequences to a reference genome, or assemble
them into contiguous sequences. The accurate mapping or alignment of the sequencing reads to a
database is essential for the downstream analysis of the sequencing data. One of the seminal
developments for sequence alignment algorithms was the development of the “global”
Needleman-Wunsch and “local” Smith-Waterman dynamic programming based alignment
algorithms (151, 152). These algorithms permit the alignment of two different sequences with
mismatches and gaps (insertions and deletions) at a reasonable computational complexity. The key
difference between the NW and SW algorithms is that the NW algorithm aligns the two
sequences completely from end to end (global alignment), while the SW algorithm can produce
smaller internal alignments (local alignment). Both of these algorithms were further refined by
the development of the affine gap penalty modification (153). The affine gap penalty uses
separate scores for opening and extending an insertion or deletion; this is more biologically
relevant since a longer gap can be generated by a single mutational event. Vectorizing the
algorithms using state-of-the-art CPU instructions further increased the performance of the SW
and NW algorithms (154-156). These methods can also be sped up using heuristics to reduce the
alignment space to the most relevant regions of the aligned sequences (157, 158). For example,
reducing the alignment space by limiting the number and size of insertions or deletions can
increase performance with a small reduction in sensitivity.
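The difference between global and local alignment can be sketched in a few lines of Python. This is a minimal, score-only illustration that uses a linear gap penalty; the affine gap refinement and vectorization discussed above are omitted for brevity. The scoring values and the function name are assumptions chosen for the example.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Dynamic programming alignment score with a linear gap penalty.

    local=False: Needleman-Wunsch style, both sequences aligned end to end.
    local=True:  Smith-Waterman style, cell scores are floored at zero and
                 the best cell anywhere in the matrix is returned.
    Only two rows of the matrix are kept in memory.
    """
    cols = len(b) + 1
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                score = max(score, 0)
                best = max(best, score)
            cur[j] = score
        prev = cur
    return best if local else prev[-1]
```

The only changes needed to move from the global to the local variant are the zero initialization, the floor at zero, and reporting the best cell rather than the final cell. Both variants fill a matrix whose size is the product of the two sequence lengths.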
The next challenge for sequence alignment was to map reads to a database of known sequences
or chromosomes. The SW and NW algorithms would spend a significant amount of
computational time in regions that may not contain a relevant alignment, which became
infeasible for large databases or numerous sequencing reads. An algorithm was needed to
identify regions and sequences that are likely to contain a valid alignment rather than
brute-forcing the alignment across the whole database. Many tools have been developed to solve this
problem with various speed and sensitivity tradeoffs; these tools include FASTA (159),
BLAST (160), BLAT (161), and BWA (162). These tools offered a substantial speed improvement
for database searching compared to the SW or NW algorithms.
Early heuristic methods were developed to reduce the search space for sequence alignments by
populating a hash table of k-grams derived from a sequence database (159, 163). The k-gram
table uses a collection of k-mers of length k and a step size s that indicates the distance between
the starts of retained k-mers; for example, with s = 2, every other k-mer would be retained. The hash table can be
constructed in linear time, has a customizable memory footprint and offers constant lookup
speed. A query sequence can be used to search the hash table by looking up the positions of all
the k-grams across the query sequence. Heuristics can be used to choose which regions should be
investigated for alignment. The k-size can influence mismatch tolerance, sensitivity and
performance of the algorithm. Smaller k values will be more sensitive, however, there will be
more potential regions of interest. The step size can be used to reduce the memory footprint and
increase performance of the hash table with a reduction in sensitivity. The typical memory
footprint for a DNA hash table using a two-bit encoding for the DNA alphabet is approximately
$4(4^{k} + N/s)$ bytes, where N is the number of nucleotides in the database. One of the
advantages of hash tables is that they can be constructed on machines with limited memory. One
disadvantage of hash tables for searches with low sequence similarity is that the k-size needs to
be small. To mitigate this, a variant hash table utilizing “spaced seeds” that are generated using
non-contiguous k-grams from a defined set of patterns is used to increase sensitivity without
reducing the k-mer size (164).
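The k-gram table and its lookup can be sketched in Python as follows. This is an illustrative toy and not the data structure of any particular aligner: it stores k-mers as strings in a dictionary rather than in a two-bit encoded array, and the function names are hypothetical. The memory estimate assumes 4-byte entries, one bucket for each of the 4^k possible k-mers and one stored position for every s-th base.

```python
from collections import defaultdict


def build_kmer_index(reference, k, s=1):
    """Map each k-mer sampled every `s` bases to its reference positions."""
    index = defaultdict(list)
    for pos in range(0, len(reference) - k + 1, s):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_offsets(index, query, k):
    """Count seed hits per implied reference start position.

    Every k-mer of the query is looked up; a hit at reference position r
    for query position q votes for the query starting at r - q.
    """
    votes = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for rpos in index.get(query[qpos:qpos + k], ()):
            votes[rpos - qpos] += 1
    return dict(votes)


def approx_index_bytes(n, k, s):
    """Rough footprint under the stated 4-byte entry assumption."""
    return 4 * (4 ** k + n // s)
```

Start positions that collect several votes are the regions that would be passed to a full dynamic programming alignment. Increasing `s` shrinks the index but removes some of the seeds that a query could otherwise hit.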
FASTA(159) was the first alignment tool to use hashing, heuristics and the SW algorithm to
generate gapped alignments. BLAST(160) was the first tool to use a hash table of the query
sequence to scan a large database for potential alignments. BLAST performance was further
increased with a modified version of the SW algorithm called X-drop that extends regions of
interest in each direction and stops if the alignment score drops below a threshold (158). BLAST
can be used to search very large databases such as GenBank.
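The stopping rule behind X-drop can be illustrated with a simplified Python sketch. For clarity it extends a seed to the right without gaps, whereas the method described above operates on a gapped dynamic programming alignment. The scoring values, the threshold and the function name are assumptions made for this example.

```python
def xdrop_extend_right(query, target, q_start, t_start,
                       x_drop=3, match=1, mismatch=-2):
    """Ungapped rightward extension with an X-drop stopping rule.

    Extension stops once the running score has fallen more than `x_drop`
    below the best score seen so far. Returns (best_score, best_length).
    """
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and t_start + i < len(target):
        score += match if query[q_start + i] == target[t_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # the extension is unlikely to recover
    return best, best_len
```

A single mismatch inside an otherwise good match is tolerated because the score recovers before the drop exceeds the threshold, while a run of mismatches ends the extension early and saves computation.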
A similar tool to BLAST, the BLAT(161) program, uses a hash table of all non-overlapping k-
grams in a database sequence and is capable of mapping cDNA sequences across introns. BLAT
maps reads across introns by stitching together anchors derived from seeds and takes advantage
of conserved splice-site information. BLAT is useful for the alignment of sequences derived
from cDNA libraries.
PatternHunter(164) was the first alignment tool to use a spaced-seed based hash table approach.
This offered an improved sensitivity and speed compared to traditional hash table approaches.
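The idea of a spaced seed can be shown with a small Python sketch. The pattern used here, 110101, is a short hypothetical example and is not the pattern used by PatternHunter. Positions marked 1 must match, and positions marked 0 are ignored.

```python
def spaced_key(seq, pos, pattern):
    """Characters of seq at the '1' positions of pattern, starting at pos."""
    return "".join(seq[pos + i] for i, bit in enumerate(pattern) if bit == "1")


def seed_matches(a, b, pos_a, pos_b, pattern):
    """True if a and b agree at every required position of the pattern."""
    return spaced_key(a, pos_a, pattern) == spaced_key(b, pos_b, pattern)
```

With this pattern the sequences ACGTAC and ACTTGC produce the same key even though they differ at two positions, whereas a contiguous seed of the same span would not match. Hashing the spaced key instead of the contiguous k-mer is what gives the tolerance to mismatches at the ignored positions.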
Tools optimized for long reads >200 bps were not optimal for the alignment of the short (35 –
150 bp) paired-end reads generated by high-throughput sequencers such as Illumina (163).
Moreover, existing tools were too slow to map the millions of reads generated by high-
throughput sequencers. Tools optimized for high-throughput read alignment generally map each
read from a pair independently and join the hits together using the fragment size distribution.
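Joining independently mapped mates can be sketched as follows. This Python example makes several simplifying assumptions that are not taken from any specific aligner: both reads map to the same reference sequence, read 1 is on the forward strand, read 2 is on the reverse strand and is reported by its leftmost coordinate, and the acceptable fragment size range is supplied by the caller.

```python
def pair_hits(hits1, hits2, min_frag, max_frag, read_len):
    """Return concordant (pos1, pos2, fragment_size) combinations.

    hits1: leftmost positions of forward-strand hits for read 1.
    hits2: leftmost positions of reverse-strand hits for read 2.
    The fragment spans from the start of read 1 to the end of read 2.
    """
    pairs = []
    for p1 in hits1:
        for p2 in hits2:
            frag = p2 + read_len - p1
            if min_frag <= frag <= max_frag:
                pairs.append((p1, p2, frag))
    return pairs
```

When one mate has several candidate locations, the fragment size constraint often leaves a single consistent combination, which is one reason paired-end reads are easier to place than single-end reads.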
Two common approaches are used to construct an index (usually a hash table) to map a database
of reads to a database of reference sequences: 1) indexing the reads and scanning the genome, or 2)
indexing the genome and scanning the reads (160, 163). The latter has become the dominant
indexing method since it is faster to index the genome once and use it for each experiment rather
than index the reads for each experiment.
The first short read sequencing aligners used hash tables. Eland (Cox, unpublished) and MAQ
(165), two of the first short read aligners, indexed the reads and scanned the genome using a seed
and extend approach in a method similar to BLAST. MAQ permitted a configurable number of
mismatches per read while Eland was limited to two mismatches. As the number of reads and
their lengths increased it became infeasible to index all of the reads. The next generation of short
read aligners indexed the genome rather than the reads. These tools included SOAP (166) and
BFAST (167), which used hash tables and a seed and extend approach. To further increase
sensitivity for mismatches and gaps, short read alignment tools that used spaced seeds were
developed. These included SHRiMP (168), which uses a vectorized version of the SW
algorithm. Among the disadvantages of these hash based alignment tools is that maximal
performance requires a large memory footprint. Furthermore, having a constant k-gram size
limits sensitivity. For example, MAQ requires ~14GB of memory for the human genome and
SHRiMP with its multiple spaced-seed indexes requires ~48GB of memory. Clearly, a new
memory-efficient and sensitive index structure was needed.
Around 1999, bioinformaticians began exploring other full-text index options. These included an
index structure called the suffix tree, which consists of all the suffixes of the database sequence
stored in a tree (169, 170). The suffix tree can be used to find the occurrences of a query of any
length in linear time. The suffix tree also permits occurrence searching with mismatches. This is in
contrast to hash tables, which can only match strings of a specific k-size. Suffix trees were
previously used for whole genome comparisons (e.g. MUMmer (171)); however, the memory
requirements were too large (typically ~12.5 bytes per base) (172). The next advancement came
by replacing the suffix tree with a suffix array, which is an array of all the suffixes in the
database sorted in lexicographic order (173, 174). Suffix arrays only require 4 bytes per base, but
lose the ability to solve some of the complex string matching problems compared to suffix trees.
Exact matches can be looked up using a suffix array in linear time with the proper auxiliary data
structures. Additional data structures can be used to supplement a suffix array to solve the same
set of problems as a suffix tree (174). Suffix arrays still required too much memory; roughly
48GB for the human genome, which was much more memory than most computers had in 2008;
therefore, an even smaller index was needed.
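As an illustration of the suffix array, the following Python sketch builds one by naive sorting and looks up exact matches with two binary searches. This is a simplification made for readability: the construction is far slower than the algorithms used in practice, and without the auxiliary data structures mentioned above each lookup costs O(m log n) string comparisons rather than linear time. The function names and the toy text are assumptions.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary search the suffix array for every suffix that starts with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # leftmost suffix whose first m characters are >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # leftmost suffix whose first m characters are > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

text = "GATTACAGATTA"
sa = build_suffix_array(text)
positions = find_occurrences(text, sa, "ATTA")
```

Unlike a hash table with a fixed k-size, the same array answers queries of any length, which is the property the text above highlights.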
The next major advancement came with the invention of the Burrows-Wheeler Transform and
the FM-index (175, 176). The BWT is a transformation of the reference sequence that can be
derived from the suffix array. The BWT is utilized by the FM-index with additional auxiliary
data structures including: a down-sampled suffix array, an occurrence data structure and a
constant sized character count table (162, 163, 177, 178). The FM-index permits linear time
pattern matching with low memory requirements. For a reference of $N$ bases, the BWT can be
stored in $N/4$ bytes for 2-bit encoded DNA sequences. The sampled suffix array typically uses
$4N/2^k$ bytes, where $k$ defines the sampling rate of the suffix array. A smaller $k$ increases
performance at the expense of memory. The occurrence data structure can take up between $N/2$
and $N$ bytes of memory. Sampling more sparsely does incur a performance cost since additional
processing is required to find the
positions of a given occurrence. The FM-index can be used to find occurrences of strings with
mismatches but the performance is reduced as the number of mismatches increases beyond two
(178).
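The exact-match search performed by an FM-index can be sketched in Python as follows. This is a teaching sketch under simplifying assumptions: the suffix array and the occurrence table are stored in full rather than down-sampled, the BWT is derived from a naively sorted suffix array, and all names are hypothetical. The pattern is consumed from right to left, and each step narrows a half-open range of suffix array rows using the character count table C and the occurrence table.

```python
def build_fm_index(text):
    """Build the suffix array, BWT, C table and occurrence table ('$' is appended)."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix
    alphabet = sorted(set(text))
    # C[c]: number of characters in text that are lexicographically smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i]: occurrences of c in bwt[:i] (unsampled here for clarity)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, bwt, C, occ

def backward_search(pattern, sa, C, occ):
    """Sorted text positions of exact matches, consuming the pattern right to left."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

sa, bwt, C, occ = build_fm_index("GATTACAGATTA")
hits = backward_search("ATTA", sa, C, occ)
```

In a memory-efficient implementation only a subset of the suffix array is kept, and the position of a hit is recovered by stepping through the BWT until a sampled row is reached, which is the source of the performance cost of sparser sampling.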
The first aligner to use an FM-index was BWT-SW (163, 177), which was a proof of concept that
an FM-index in conjunction with the SW algorithm can be used to generate local alignments with
mismatches and gaps. This implementation was slower than BLAST and BLAT, but was seminal
in demonstrating the utility of the FM-index for DNA alignment. BWT-SW led to the
development of multiple alignment tools that use the FM-index including SOAP2 (179), Bowtie
(178) and BWA (162). SOAP2 splits the read into three seeds and identifies their positions using
an FM-index. SOAP2 is limited to two mismatches or short gaps. Bowtie generates gapless
global alignments using a seed-and-extend strategy with an FM-index; this permits very fast
execution. To find seeds, Bowtie uses a base-quality-aware backtracking algorithm that permits
a maximum of three mismatches per seed. The seeds are then extended using the FM-index to
produce global alignments, and Bowtie stops when a valid alignment is found. Bowtie2 (156) is a
refinement to Bowtie that uses a vectorized SW or NW algorithm to generate local or global
alignments, respectively; this permits Bowtie2 to map reads with more mismatches and across
gaps. Finally, BWA uses a backtracking algorithm similar to Bowtie but supports gaps and
higher numbers of mismatches at the expense of performance. BWA-mem (180) is a refinement
to BWA that uses the FM-index to find seeds and a SW and X-drop like algorithm to extend the
seeds and generate alignments. These alignment tools tend to require less memory, increase
performance and have similar or better sensitivity compared to hash table based alignment tools.
1.10 RNA-seq Read Alignment
Paired-end RNA-seq read alignment is difficult compared to contiguous read alignment due to
the non-contiguous nature of mRNA transcripts (34, 38, 181). Critically, RNA-seq aligners must
be able to identify short exonic alignments in regions that can be interspersed with introns that
can reach hundreds of kilobases in length, with the longest surpassing 500kb
(Figure 1.4) (182). High-throughput sequencing pairs are generally short (<150 bp each) and
derived from longer DNA fragments, making it challenging to identify exonic alignments due to
potentially very short exonic overlaps on either side of the splice junction. Mapping across splice
junctions is further challenged by the general deterioration of base quality (i.e. higher mismatch
rate) at 3’-read ends, which can lead to the identification of false positive splice junctions and
variant calls (108).
RNA-seq alignment tools are generally divided into three main categories: 1) exon-first, 2) seed-
and-extend and 3) hybrid tools (182, 183). Exon-first tools map reads to the genome
contiguously using a traditional high-throughput sequencer aligner and then the unmapped reads
are processed for splice junctions. These methods rely on global alignments and may miss
spliced alignments with short exonic overlaps and alignments that have a suboptimal contiguous
alignment. Seed-and-extend methods map reads in a method similar to BLAT where anchors are
stitched together using a DNA index. Seed-and-extend approaches perform best when identifying
novel splice junctions (184). Hybrid methods combine an exon-first alignment with a seed-and-
extend method. An important difference between RNA-seq aligners is their capability to identify
non-canonical (i.e. non-GT-AG) splice junctions. More than thirty tools and
methods have been developed to map reads across annotated splice junctions, discover novel splice
junctions de novo and map reads across combinations of annotated and novel splice junctions
(183, 184).
Figure 1.3. RNA-seq alignment overview. A paired-end read must be mappable across introns. An example read pair that maps across an intron in two places is illustrated. Examples of insertions, deletions, mismatches and repeat alignments that must be supported are illustrated. (Figure labels: Gene, Transcript, Paired-end Sequencing, Read 1, Read 2, Fragment Size, Alignment, Repeat Alignment, Insertion, Deletion, Mismatches.)
RNA-seq aligners must not only be able to map reads across splice junctions, but must also
detect whether read pairs map concordantly and in the correct orientation (183, 184). Generally,
for read-pairs to have a concordant alignment they should map to the same chromosome and map
to opposite strands. Pairs that do not have this orientation may have arisen due to library
preparation artefacts, genome rearrangements or false positive alignments. Most tools map read
pairs independently and subsequently their alignments are combined to generate concordantly
mapped pairs. The majority of RNA-seq alignment tools consider a pair concordantly mapped if
its reads are in the correct orientation and within a user-defined distance, which typically defaults to
500kb (183, 184). This long paired alignment distance threshold can lead to the false
identification of read pairs that map between genes or transcriptional units.
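The distance-based concordance rule described above can be written down in a few lines of Python. This is a sketch rather than the logic of any particular aligner: the Alignment record, the inward-facing forward/reverse orientation test and the function name are assumptions, and only the 500kb default comes from the text.

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    chrom: str
    start: int   # 0-based leftmost genomic position
    end: int     # exclusive end position
    strand: str  # "+" or "-"

def is_concordant(a, b, max_distance=500_000):
    """Same chromosome, opposite strands, facing inward, within max_distance."""
    if a.chrom != b.chrom or a.strand == b.strand:
        return False
    fwd, rev = (a, b) if a.strand == "+" else (b, a)
    if fwd.start > rev.start:  # reads point away from each other
        return False
    return rev.end - fwd.start <= max_distance

read1 = Alignment("chrI", 1_000, 1_100, "+")
near_mate = Alignment("chrI", 1_250, 1_350, "-")
far_mate = Alignment("chrI", 401_000, 401_100, "-")  # about 400kb downstream

near_ok = is_concordant(read1, near_mate)
far_ok = is_concordant(read1, far_mate)                       # accepted at 500kb
far_strict = is_concordant(read1, far_mate, max_distance=10_000)
```

The second pair shows the weakness of a liberal threshold: a mate placed roughly 400kb away, possibly in a different gene, is still reported as concordant.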
The first RNA-seq alignment methods were designed for single-end reads and used databases of
annotated splice junctions (116, 138). Modern RNA-seq aligners use one or more of the
following methods to map reads across splice junctions: 1) the use of a splice-junction database,
2) segmentation based alignments, and/or 3) seed-and-extend algorithms (Figure 1.2) (183, 184).
The first method tends to give the highest specificity and sensitivity if all of the splice junctions
are known, however, if the splice junction database is incomplete the sensitivity and specificity
can be reduced. The latter two methods are capable of the de novo identification of splice
junctions while the former is only capable of identifying novel splice junctions involving known
splice donors and acceptors. De novo methods increase sensitivity by identifying splice junctions
missing from databases of annotated splice junctions. These tools usually limit the maximum
size of a novel intron to 500kb. Utilizing gene annotations generally produces better quality
spliced alignments, since the number of splice junctions that must be discovered is reduced
(184). The alignment tools with the highest sensitivity and specificity are hybrid methods that
integrate existing splice-junction annotations with the identification of de novo splice junctions
(184).
Splice-junction databases are the first and most accurate method to map RNA-seq reads across
annotated splice junctions (116, 138). This method consists of generating a database of synthetic
sequences where each entry consists of the concatenation of the exonic sequences upstream and
downstream of the splice junction. The length of the sequence flanking the splice junction is
usually set to the length of the sequencing read. The reads can then be mapped to this splice
junction database using a traditional contiguous read aligner such as Bowtie or BWA (162, 178).
Contiguous alignment tools have been engineered to sensitively map alignments with
mismatches and gaps but do not have the ability to align reads across splice junctions. This
method produces the most accurate spliced read alignments since accurate and error-tolerant
contiguous read aligners can be used for alignment. This prevents the identification of false
positive splice sites that may be generated due to poor mismatch and gap tolerance during
alignment. Candidate novel splice junctions can be detected by populating the database with
every combination of annotated 5’- and 3’- splice sites derived from the same strand (138).
Figure 1.2. RNA-seq Alignment Strategies. The three most commonly used RNA-seq alignment strategies are illustrated. Alignments using a splice-junction database (left column) where a set of synthetic sequences representing the spliced-mRNA are used for alignment. Alignments where an indexing strategy such as a hash table or suffix array are used to find exonic alignments that are “stitched” together to form spliced alignments (center column). Split read alignment algorithms (right column) where the read sequence is split into N pieces and each piece is independently mapped to a splice junction database and/or the genome reference. (Panels A to C. Panel labels: Junction Database (Align to Database, Resolve); Seed and Extend (Seed Alignment, Align Using Seeds); Split Read (Read Segment Alignment, Align Segments, Extend Segments, Join Segments).)
The combinations can be limited to genes or to a specified maximum intron size. This method does
not permit the de novo detection of splice junctions using unknown splice sites but may aid in the
identification of alternative splicing events. There are several drawbacks to using splice-junction
databases. The first drawback is that as sequencing reads get longer there is a greater probability
that they will span more than one splice junction and these reads will be incorrectly mapped.
Another issue is that if a splice junction is missing from the database a contiguous aligner may
generate a false positive alignment by forcing an alignment across a different splice junction by
incorporating incorrect mismatches and gaps. Finally, reads that cross novel splice junctions may
not be mapped at all leading to false negatives.
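The splice-junction database strategy can be sketched in Python as follows. The sketch assumes 0-based coordinates on a single chromosome, a junction defined as (exclusive end of the upstream exon, first base of the downstream exon), and a contiguous hit reported as an offset and length on the synthetic sequence; the function names and the toy gene are hypothetical. The first function concatenates the exonic flanks, and the second resolves a hit on the synthetic sequence back to genomic coordinates.

```python
def build_junction_db(genome, junctions, flank):
    """For each (donor_end, acceptor_start) junction, join `flank` bases from each side."""
    db = {}
    for donor_end, acceptor_start in junctions:
        left_start = max(0, donor_end - flank)
        left = genome[left_start:donor_end]
        right = genome[acceptor_start:acceptor_start + flank]
        db[(donor_end, acceptor_start)] = (left + right, left_start, len(left))
    return db

def resolve_to_genome(db, junction, offset, length):
    """Convert a contiguous hit on a synthetic junction sequence into genomic blocks."""
    _, left_start, left_len = db[junction]
    donor_end, acceptor_start = junction
    if offset + length <= left_len:   # hit lies entirely in the upstream exon
        return [(left_start + offset, left_start + offset + length)]
    if offset >= left_len:            # hit lies entirely in the downstream exon
        start = acceptor_start + (offset - left_len)
        return [(start, start + length)]
    right_len = offset + length - left_len
    return [(left_start + offset, donor_end),
            (acceptor_start, acceptor_start + right_len)]

# Toy gene (hypothetical): two exons separated by a GT...AG intron.
exon1, intron, exon2 = "ATGGCCAAGT", "GTAAGTCCCCCTTTTCAG", "GATCCTGGAA"
genome = exon1 + intron + exon2
junction = (len(exon1), len(exon1) + len(intron))
db = build_junction_db(genome, [junction], flank=6)

read = "AAGTGATC"                        # spans the exon 1 / exon 2 boundary
offset = db[junction][0].find(read)      # str.find stands in for a contiguous aligner
blocks = resolve_to_genome(db, junction, offset, len(read))
```

With a flank shorter than the read length, or a read that crosses a second junction, the read no longer fits on a single synthetic entry, which corresponds to the first drawback listed above.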
1.10.1 Segmentation Approaches
Segmentation based RNA-seq alignment approaches were the first alignment tools that were
widely accessible and capable of functioning on commodity hardware. Tophat (185) was the first
popular exon-first method; it can align read pairs, use gene annotations and identify de
novo splice junctions (185). Tophat uses a multi-step approach in which the reads are first
mapped contiguously to the genome with Bowtie (178) and unmapped reads are retained.
Regions with coverage are then assembled into islands, assessed for potential GT-AG splice
sites, joined together and filtered to remove false positives. The retained splice junctions are used
to form a database of synthetic splice junctions. Next, the unmapped reads are mapped
against the splice-junction database using Bowtie and resolved back to genomic coordinates.
Individual reads from a pair are processed using the Tophat method independently and are then
assessed for concordant alignments using a specified insert size distribution. The biggest
disadvantage of using coverage islands to find splice junctions is that there is no support for
non-canonical splice junctions. In addition, false positives are common since GT-AG sequences can
occur without being functional splice junctions. Finally, Tophat has no support for insertions or
deletions, which can lead to additional false positive splice junction predictions. However, the
two-step alignment process of mapping reads to a splice-junction database after a phase of novel
junction discovery is important and is commonly used today by RNA-seq alignment tools. The
method of mapping read-pairs independently and then assessing whether their alignments are
concordant is also common practice.
Segmentation-based RNA-seq alignment tools rely heavily on contiguous high-throughput
aligners. Novel splice junctions are identified de novo by splitting unmapped reads into multiple
segments and then mapping them independently to the genome sequence. The split read
alignments are used to infer splice junctions based on patterns of their alignment. Two popular
segmentation based RNA-seq aligners are MapSplice (186) and a revised version of Tophat
named Tophat2 (115, 185, 187). The primary differences between Tophat2 and MapSplice are
the usage of gene annotations and the patterns of segmented-reads they can utilize to identify
novel splice junctions (183). These segmentation-based algorithms are advantageous in that they
are less challenging to implement since they depend on existing contiguous alignment tools for
the majority of the alignments.
Tophat2 (187) uses a three-step approach to map RNA-seq reads. The first step is only used
when gene annotations are available: Tophat2 uses Bowtie2 to map reads to a modified version
of a splice-junction database that includes complete cDNA sequences of all annotated transcripts.
The second step consists of mapping the unmapped reads to the genome sequence. Finally, the
third step collects the remaining unmapped reads, segments them into 25 bp non-overlapping
segments and then independently maps the segments to the genome. Only two cases of segment
alignments are considered: an internal segment (i.e. a segment that is not the first or
last segment) is unmapped, or a pair of consecutive segments does not map contiguously to the
genome. In both cases, novel splice junctions are identified by
searching for splice sites near the segments and joining them. A
splice junction database is generated with the discovered splice junctions and the unmapped
segments are mapped to this database and stitched together with the other segments. This
algorithm has several drawbacks. Tophat2 can miss novel exons shorter than the segment length,
as well as splice junctions with suboptimal segment alignments or segments with multiple splice
junctions.
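The first of the two segment cases, an unmapped internal segment flanked by mapped segments, can be sketched in Python. This is an illustration of the idea and not Tophat2's implementation: exact string search stands in for the contiguous aligner, only GT-AG introns are accepted, a segment size of 8 is used instead of 25 so the example stays readable, and all names and sequences are hypothetical.

```python
def split_into_segments(read, size):
    """Cut a read into consecutive non-overlapping segments of `size` bases."""
    return [read[i:i + size] for i in range(0, len(read), size)]

def junctions_from_unmapped_segment(genome, left_pos, segment, right_pos):
    """Place a GT-AG junction inside an unmapped internal segment.

    left_pos and right_pos are the genomic starts of the mapped segments before
    and after it; all segments are assumed to have the same length."""
    k = len(segment)
    left_end = left_pos + k           # first genomic base after the upstream segment
    found = []
    for j in range(k + 1):            # j bases of the segment stay with the upstream exon
        donor = left_end + j
        acceptor = right_pos - (k - j)
        if acceptor - donor < 2:
            continue
        if (genome[left_end:donor] == segment[:j]
                and genome[acceptor:right_pos] == segment[j:]
                and genome[donor:donor + 2] == "GT"
                and genome[acceptor - 2:acceptor] == "AG"):
            found.append((donor, acceptor))
    return found

exon1 = "ATGGCCAAGTTCGAGCTGAC"
intron = "GTAAGTCCCCCTTTTTTCCCCCATTTTCAG"
exon2 = "GATCCTGGAACGTCATGCAA"
genome = exon1 + intron + exon2
read = exon1[8:] + exon2[:12]         # 24 bp read spanning the junction

segments = split_into_segments(read, 8)
positions = [genome.find(s) for s in segments]   # -1 marks an unmapped segment
junctions = junctions_from_unmapped_segment(genome, positions[0], segments[1], positions[2])
```

In this toy case the middle segment can also be split one base further downstream with both halves still matching the genome, so it is the splice site motif that selects the reported junction. The sketch also shows why an exon shorter than the segment length cannot be recovered: no segment would map inside it.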
MapSplice (186) uses a hybrid segmentation and seed-and-extend algorithm without gene
annotations to map reads across splice junctions. All of the reads are split into 25 bp non-
overlapping segments. The segments are mapped to the genome sequence and used as anchors
for a seed-and-extend approach to identify novel splice junctions. MapSplice is more sensitive at
splice junction discovery than Tophat2 since it does not rely solely on segment alignments.
Nonetheless, the majority of splice junctions are annotated and MapSplice does not take
advantage of gene annotations, which reduces its accuracy compared to Tophat2 with
annotations.
1.10.2 Seed and Extend Approaches
Seed-and-extend based RNA-seq alignment tools have extended the ideas of BLAT (161) and
increased their performance. The two most commonly used seed-and-extend based RNA-seq
aligners are GSNAP (188) and STAR (189). These two tools are both capable of mapping
paired-end reads and are sensitive to mismatches and splice junctions, however, they differ
greatly in their implementation, performance and gapped alignment capabilities.
GSNAP (188) was one of the first seed-and-extend based RNA-seq alignment tools and utilizes a
hash table with a k-size of 12 and step size of 3 by default. GSNAP does not have to load the
entire index in memory but instead uses disk-based memory-mapping leading to a very small
memory footprint. However, using disk based memory-mapped indexes does lead to a massive
performance decrease compared to tools that use in-memory indexes. GSNAP uses a hash table
to find candidate regions and merges the alignments to generate spliced alignments across GT-AG,
GC-AG and AT-AC splice junctions. GSNAP is capable of mapping reads with a single gap that
can be longer than those supported by most RNA-seq aligners with the exception of Tophat2.
STAR (189) is, as of writing, a state-of-the-art RNA-seq alignment tool that is both very fast
and sensitive. STAR uses a suffix array as its index, which permits very fast searches. Its main
drawback is that it requires more than 40GB of memory for a human sized genome. STAR uses
the suffix array to find Maximal Mappable Prefixes (MMP), to iteratively find sets of candidate
alignment regions. Unlike most RNA-seq alignment tools, STAR processes both reads from a
pair at the same time when generating the regions. The regions are stitched together to produce
alignments that can identify non-canonical splice junctions, multiple mismatches and a single
gap. STAR is also the fastest RNA-seq aligner and can map more than 300 million read-pairs
per hour with 6 threads.
Neither GSNAP nor STAR executes full SW or NW alignments, so both may miss optimal
alignments. These tools tend to be sensitive; however, they both emit high numbers of false
positive splice junction predictions and aligned mismatches (184).
1.11 Current challenges mapping RNA-seq pairs
Common RNA-seq alignment artefacts in combination with library preparation biases can
obscure biologically important signals. Alignment issues are caused by three primary
mechanisms: the interdependence between mismatch, gap and splice junction alignments on
accuracy; repeat alignment sensitivity; and resolving whether paired-end reads map
concordantly. The interdependence between gaps, splice junctions and mismatches is complex,
and there is a fine balance between preferring an alignment with mismatches, gaps, splice
junctions, or combinations of the above. For example, consider a contiguous alignment with a
mismatch or gap versus a splice junction with neither gaps nor mismatches. The latter could be a
false positive splice junction and the former could correspond to false positive SNVs. A false
negative splice junction prediction can lead to alignments with gaps and mismatches within an
intron sequence rather than a splice. These issues tend to be dependent on how the alignment tool
scores a splice junction, gap and mismatch. Alignment tools such as STAR (189) and Tophat2
(187) do not penalize GT-AG splice junctions. This can lead to an increased false positive rate
for splice junctions rather than a contiguous alignment with mismatches. A low gap penalty can
also lead to incorrect alignments; for example, a gap could be used to prefer a GT-AG splice
junction and a gap versus a non-canonical splice junction.
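The dependence on the scoring scheme can be made concrete with a small Python sketch. The scoring function and every penalty value below are hypothetical and are not the parameters of STAR, Tophat2 or any other tool; the point is only that the same read can be resolved differently depending on whether a canonical splice junction carries a penalty.

```python
def alignment_score(matches, mismatches=0, gaps=0, canonical_junctions=0,
                    noncanonical_junctions=0, *, mismatch_penalty=2, gap_penalty=3,
                    canonical_penalty=0, noncanonical_penalty=4):
    """Additive score: one point per matched base minus penalties (all values assumed)."""
    return (matches
            - mismatch_penalty * mismatches
            - gap_penalty * gaps
            - canonical_penalty * canonical_junctions
            - noncanonical_penalty * noncanonical_junctions)

# Two competing placements of the same 100 bp read:
contiguous = dict(matches=99, mismatches=1)          # unspliced, one mismatch
spliced = dict(matches=100, canonical_junctions=1)   # perfect, one GT-AG junction

def preferred(canonical_penalty):
    """Name of the higher scoring placement under a given GT-AG junction penalty."""
    s_contig = alignment_score(**contiguous, canonical_penalty=canonical_penalty)
    s_spliced = alignment_score(**spliced, canonical_penalty=canonical_penalty)
    return "spliced" if s_spliced > s_contig else "contiguous"

choice_free_junction = preferred(canonical_penalty=0)
choice_penalized_junction = preferred(canonical_penalty=4)
```

With an unpenalized GT-AG junction the spliced placement always beats a contiguous placement containing a single mismatch, which is the mechanism behind the increased false positive rate described above.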
Eukaryotic genomes have multiple classes of repeat elements including retrotransposons, DNA
transposons, miniature inverted-repeat transposable elements, and paralogous genes. For
example, 12% of the C. elegans genome and 45% of the human genome are composed of
transposable elements (8, 190-192). Some of the transposon classes are extensively repeated
throughout the genome; for example, Alu elements represent ~10% of the human genome and
occur more than $1 \times 10^{6}$ times. The C. elegans genome contains 3,327 Ce000087 insertions
comprising ~1.32% of the genome (190). Accurately mapping reads derived from these repeat
elements is difficult but important for the identification of RNA editing. Incorrect handling of
repeat alignments, such as missed alignments to paralogous regions or repeat elements, has
led to the false positive identification of RNA edits (193). RNA-seq tools also tend to have low
repeat tolerances, preferring performance to sensitivity. Finally, as previously mentioned, many
RNA-seq alignment tools use a simple method to determine if read-pairs map concordantly,
which may lead to false positive read-pair alignments due to sequencing library preparation
artefacts or repeat regions.
1.11.1 Identifying RNA editing with RNA-seq
Adenosine deamination and cytidine deamination can be identified by aligning RNA-seq data to
the DNA sequence from the same individual. For example, A-to-I editing appears as an A-to-G
or T-to-C mismatch depending on the strand of the targeted RNA (194). However, the
differentiation between real RNA editing events and false positives is challenging. Mitigation of
common sequencing errors, alignment artifacts, and the removal of somatic DNA mismatches
are essential for the accurate identification of RNA edits (12). For example, the initial
identification of putative non-canonical RNA editing has more recently been demonstrated to
arise from false positives derived from sequencing and alignment artifacts (193, 195). Mismatch
tolerant alignments are also essential since RNA editing can occur in clusters of hyper-editing
with 30 A-to-I edits in a single 2x100 bp read (69).
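The strand dependence of the A-to-I signal can be sketched in Python. The function below is a hypothetical illustration and not the calling procedure used in this thesis: it takes an ungapped alignment of a read to the forward genomic strand and reports the positions whose mismatch type is compatible with A-to-I editing, namely A-to-G for transcripts on the forward strand and T-to-C for transcripts on the reverse strand. It applies none of the filters for sequencing errors, alignment artifacts or DNA variants discussed above.

```python
def candidate_a_to_i_sites(ref, read, ref_start, transcript_strand):
    """Genomic positions whose mismatch type is compatible with A-to-I editing.

    ref and read are equal-length, ungapped, forward-strand sequences;
    transcript_strand is "+" or "-"."""
    ref_base, read_base = ("A", "G") if transcript_strand == "+" else ("T", "C")
    return [ref_start + i
            for i, (r, q) in enumerate(zip(ref, read))
            if r == ref_base and q == read_base]

# Forward-strand transcript: three A-to-G mismatches and one unrelated C-to-T.
plus_sites = candidate_a_to_i_sites("ACGATTAGCA", "GTGGTTAGCG", 100, "+")

# Reverse-strand transcript: the same edits appear as T-to-C on the forward strand.
minus_sites = candidate_a_to_i_sites("TTGCAT", "CTGCAC", 500, "-")
```

A read from a hyper-edited cluster would carry many such mismatches at once, which is why a mismatch-tolerant aligner is needed before this kind of comparison can be made at all.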
1.11.2 Identifying RNA edits without sequencing
The presence of inosine bases in an RNA molecule can be verified using a sequencing-based
method termed inosine chemical erasure (ICE) (196). Inosine bases are chemically modified with
acrylonitrile to produce N1-cyanoethylinosine. The cyanoethylated inosine residues lead to the
arrest of reverse transcriptase extension during cDNA synthesis. A typical experiment involves
reverse transcription with specific primers using a normal RNA sample and an acrylonitrile
treated sample. The cDNA is subsequently PCR amplified and both reactions are sequenced
using Sanger sequencing. If inosine is present in the targeted RNA the sequencing data should
detect A-to-G mismatches in the normal sample and an absence of A-to-G mismatches in the
acrylonitrile treated sample. The ICE method provides a complementary method to eliminate
false positive mismatches due to sequencing errors.
1.12 Thesis Objectives
My thesis is concerned with the development of methods for the accurate alignment of RNA-seq
pairs and the identification of SNVs and RNA editing. Reflecting these goals, my work has
three major objectives: 1) designing an accurate RNA-seq alignment system, 2) benchmarking
the system against current RNA-seq aligners and 3) utilizing this tool to identify RNA editing in
the nematode C. elegans. In Chapter Two of my thesis I focus on the development and
benchmarking of RNASequel, an RNA-seq alignment post-processing tool. The third chapter of
my thesis involves improving RNASequel's performance and using it to accurately profile RNA
hyper-editing in C. elegans.
In Chapter Two, I describe the development and implementation of RNASequel, a software
package that runs as a post-processing step in conjunction with an RNA-seq aligner and
systematically corrects common alignment artifacts. Its key innovations are a two-pass splice
junction alignment system that includes de novo splice junctions and the use of an empirically
determined estimate of the fragment size distribution when resolving read pairs. I demonstrate
that RNASequel produces improved alignments when used in conjunction with STAR (189) or
Tophat2 (187) using two simulated human datasets (184). In addition, I show that RNASequel
improves the identification of adenosine to inosine RNA editing sites on human-derived
biological datasets. The strength of this software lies in applications requiring the accurate
identification of variants in RNA sequencing data, the discovery of RNA editing sites and the
analysis of alternative splicing.
In Chapter Three, I report improvements to RNASequel and demonstrate the utility of the
program by identifying the most comprehensive map of C. elegans RNA editing to date. The
original version of RNASequel required large amounts of temporary disk space,
inhibiting its usefulness for the analysis of multiple datasets in parallel. In this chapter I describe
a modified version of RNASequel designed to eliminate its temporary disk space requirements;
this leads to a performance increase. The primary modification is that RNASequel uses BWA-
mem (180) as a software library rather than requiring four individual alignments per sample. I
used this new version to profile RNA hyper-editing in 91 C. elegans RNA-seq datasets. I used
this map of A-to-I editing to verify that edits are generally associated with non-coding RNA,
repeat elements and heterochromatin.
2 Accurate RNA-seq Realignment with RNASequel
This work was published in Nucleic Acids Research (197).
The current generation of RNA-seq paired-end aligners suffers from shortcomings that obscure
biologically important signals, or which give rise to false signals. For example, the initial
identification of putative non-canonical RNA editing has more recently been demonstrated to
arise from false positives derived from sequencing and alignment artifacts (193).
A typical RNA-seq experiment consists of sequencing both ends of a cDNA fragment to generate
two reads (a read pair) separated by a variable length of sequence. The accurate alignment of
these read pairs is essential to the downstream analysis of an RNA-seq experiment, but RNA-seq
read alignment is challenging due to the non-contiguous nature of mRNA transcripts (181).
Critically, RNA-seq aligners must be able to identify exonic alignments in regions that can be
interspersed with introns that can reach hundreds of kilobases in length (182). To solve this issue
paired-end RNA-seq alignment methods typically apply a distance cutoff to exclude discordantly
mapped pairs. However, these cutoffs tend to be arbitrary and very liberal. For example, many
algorithms consider mapped pairs to be concordant up to a maximum distance of 500kb, which is
sufficiently high to catch the rare very long intron, but is also prone to misclassifying as concordant the
more common case of discordant reads that are mapped incorrectly.
To facilitate the mapping of spliced reads while attempting to minimize common systematic
errors, various RNA-seq alignment methodologies have been developed. These methods include
tools that are dependent on, and optimized for, a specific short read alignment tool such as
Bowtie or BWA (156, 162, 178, 185, 186, 198-200). Other tools implement their own alignment
algorithms that may not be as accurate as traditional short read alignment tools, or which are less
tolerant to gaps and mismatches (188, 189). RNA-seq alignment methods also differ in their
usage of pre-existing splice junction databases. Most methods perform better when a splice
junction database is provided, but this hinders the identification of novel splice junctions, and
may not be feasible for less well-characterized species (184, 200). In addition, few splice
junction aware RNA-seq aligners are able to recognize and handle transcripts that span more
than one splice junction or contain a novel combination of existing junctions.
Other common artifacts that lead to issues with spliced alignments include 1) the identification of
false positive splice junction alignments due to short alignment overlaps on one side of the splice
junction, which is compounded by the reduction of base quality at read ends; 2) false positive
splice junctions due to reverse transcriptase and PCR template switching and splicing noise; 3)
splice junctions that are missed because the read has been incorrectly aligned to an intron
sequence rather than across a splice junction (119, 120, 184, 200, 201). These artifacts contribute
to false positives for calling insertions, deletions, splice junctions and mismatches. For example,
many false positive sites in predicted RNA edits tend to be located near splice sites due to
incorrectly spliced alignments (12, 193). These are compounded by issues relating to library
preparation such as errors generated by reverse transcription and random hexamer priming (146).
In general, RNA-seq aligners have a low default tolerance for insertions, deletions and
mismatches, which together increase the number of unmapped bases (soft clipping) at read ends
and miss alignments to regions with a high mismatch rate. Finally, poor repeat tolerance can also
lead to false positive mismatch calls by aligning a read pair to one paralogous gene while
missing the alignment to another.
One common method to compensate for spliced alignment artifacts is to execute a two-pass
alignment scheme (185, 200). A two-pass alignment consists of two steps: 1) the alignment of the
reads to known splice junctions and the reference genome for the identification of novel splice
junctions; 2) the generation of a new index file including all, or a subset of, high confidence
novel splice junctions, against which the reads are realigned. This can drastically improve the
spliced alignment of reads with short exonic overlaps.
To address the common causes of systematic artifacts in RNA-seq library preparation,
sequencing and alignment I have constructed an RNA-seq realignment program called
RNASequel. RNASequel utilizes the spliced-read output of any read mapper and de novo splice
junction identification algorithm to perform an error-tolerant realignment (Figure 2.1). It takes
advantage of an empirically determined fragment size distribution and annotated and novel splice
junctions to predict if a read pair maps concordantly. I have tested RNASequel against
STAR (189) and Tophat2 (187) for de novo splice junction prediction using real and simulated
datasets, and find increases in sensitivity and decreases in false positive predictions. I also show
that RNASequel has improved repeat alignment sensitivity that improves the identification of
potential single nucleotide variants and RNA editing sites.
RNASequel is implemented in C++ and is available under the GNU Public License from:
(https://github.com/GWW/RNASequel).
Figure 2.1. RNA-sequel realignment schematic. A spliced read aligner is used to identify sample specific novel splice junctions that are used to generate a splice junction index. Read 1 and read 2 from each read pair are independently mapped to the genome and splice junction index using a contiguous read aligner. Low quality alignments are removed, the genomic and splice junction alignments are merged and the read pairs are resolved using an empirically determined fragment size distribution. (Flowchart steps: Spliced Read Aligner; Filter Splice Junctions, with Gene Annotations; Generate Splice Junction Database; Read 1 Genome Alignment, Read 1 Spliced Alignment, Read 2 Genome Alignment, Read 2 Spliced Alignment; Merge Read 1 Alignments, Merge Read 2 Alignments; Estimate Fragment Size Distribution; Resolve and Output Read Pairs.)
2.1 Results:
2.1.1 Developing an Accurate RNA-seq Realignment Tool.
I have developed RNASequel, an accurate and error-tolerant paired-end RNA-seq realignment
tool, which functions as a post-processing step attached to an RNA-seq alignment algorithm. My
implementation allows the user to utilize his or her preferred aligner, and future-proofs the tool:
it can be used to improve the accuracy of any current or future RNA-seq alignment software that
emits its results in standard BAM format. The tool refines the splice junction predictions prior to
realignment by removing junctions that experience has shown are likely to be false positives, for
example junctions found only at the ends of reads or junctions found with repeat alignments. To
improve paired-end alignment accuracy, the reads from each pair are independently mapped to
the genome sequence (genomic index) and a database of splice junctions (splice junction index)
(Figure 2.1). An advantage of aligning the reads independently to the genome and splice
junction index is reduced indexing time and disk space usage, since indexing the
reference sequence can take a long time and require gigabytes of disk space, while indexing the
RNASequel-generated splice junction database is comparatively fast and produces indexes of
~100 MB. The four resulting alignments (read 1 and read 2, each against the genomic and splice
junction indexes) can be performed in parallel using a computational cluster. The
genomic and splice junction database alignments for each read are merged, and alignments are
discarded based on user-configured filtering parameters that set the
required repeat tolerance. Lastly, I refine paired-end read analysis by
validating that each potential read pair alignment falls within an empirically determined fragment
size distribution. This is in contrast to most spliced alignment methods that consider a read pair
concordant if it aligns within a preset distance. The advantage of this method is that it improves
the detection of concordant read pairs and repeat-mapped pairs.
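The fragment size check described above can be sketched as follows. This is a minimal illustration rather than RNASequel's actual implementation: the percentile cutoffs and function names are assumptions chosen for the example, and the fragment sizes are taken as given (in practice they would be computed after subtracting any introns spanned by the pair).

```python
from typing import List, Tuple


def percentile(sorted_vals: List[int], q: float) -> int:
    """Return the q-th percentile (0-100) of a pre-sorted list (nearest rank)."""
    if not sorted_vals:
        raise ValueError("no fragment sizes supplied")
    idx = int(round(q / 100.0 * (len(sorted_vals) - 1)))
    return sorted_vals[idx]


def empirical_range(fragment_sizes: List[int],
                    lower_q: float = 0.5,
                    upper_q: float = 99.5) -> Tuple[int, int]:
    """Estimate an accepted fragment size range from confidently mapped pairs.

    lower_q and upper_q are hypothetical cutoffs, not values from the thesis.
    """
    ordered = sorted(fragment_sizes)
    return percentile(ordered, lower_q), percentile(ordered, upper_q)


def is_concordant(fragment_size: int, size_range: Tuple[int, int]) -> bool:
    """Treat a pair as concordant if its fragment size lies inside the range."""
    low, high = size_range
    return low <= fragment_size <= high


# Toy fragment sizes from confidently placed pairs.
observed = [180, 190, 195, 200, 200, 205, 210, 215, 220, 230]
size_range = empirical_range(observed, lower_q=10, upper_q=90)
print(size_range, is_concordant(205, size_range), is_concordant(5000, size_range))
```

Unlike a fixed maximum distance, the accepted range here follows the library that was actually sequenced, which is the property the text relies on.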
2.1.2 RNASequel realignment leads to improved alignment accuracy.
To benchmark RNASequel realignment I tested two different de novo splice junction prediction
tools, Tophat2 and STAR, with gene model annotations (Tophat2 Ann. and STAR Ann.) and
without annotations (Tophat2 and STAR). The novel splice junctions identified by each of
these tools were used for realignment with RNASequel. I also compared RNASequel realignment
against STAR with two passes where the splice junctions predicted in the first pass are used to
generate a new index for a second pass (STAR Two Pass and STAR Ann. Two Pass). Finally, to
benchmark RNASequel without de novo splice junctions, RNASequel was used with gene
annotations alone in a single pass alignment (Annotation Only). I chose Tophat2 because of its
popularity as one of the first RNA-seq alignment tools and STAR for its use within the
ENCODE project, its high accuracy and its fast alignment rate (184).
To determine the alignment characteristics of Tophat2, STAR and RNASequel I utilized two
simulated datasets that were previously used in an RNA-seq alignment benchmarking study
(184). Engström et al. generated the simulated alignments for both datasets using the
Benchmarker for Evaluating the Effectiveness of RNA-Seq Software
(BEERS) (200). This software was designed to avoid gene model biases and generate read-pairs
with a normal fragment size distribution, novel splice junctions, mismatches, and gaps. BEERS
combined gene annotations from 11 sources to generate alternatively spliced gene isoforms
including intron retention events. An empirical distribution of gene expression scores from a
biological dataset was used to determine the underlying expression of each of the genes. Read-pairs
were then randomly generated for each gene to mimic the randomly selected expression
level and to have a fragment size that matches a normal distribution. Mismatches, insertions, and
deletions were randomly introduced into the read-pairs at user-defined rates, and additional
modeling of base-call errors and quality scores was performed by simNGS. Finally, one of the 11
annotation databases (Ensembl for this work) can be used to provide alignment tools with known
splice junctions.
Each dataset has roughly 3.7 x 10^7 2x76 bp read pairs (184). The second of the two simulated
datasets was generated with a higher mismatch rate (~3x more), gap rate (~5x more) and novel splice
junction rate (1.5x more).
The simulated datasets were generated to closely match a biological experiment; however, they
fall short for the following two reasons: 1) the simulated alignment may not be the optimal
alignment due to a simulated read mapping to other genomic locations, and 2) the simulated
alignments do not include all of the artifacts in an RNA-seq experiment such as template
switching. Despite these shortcomings simulated datasets are important when benchmarking
RNA-seq tools because they provide a ground truth permitting the identification of correct and
incorrect alignments with combinations of splice junctions, mismatches and gaps.
Overall, RNASequel increased the number of reads that perfectly recapitulated the simulated
alignment; this was especially the case for the second simulated dataset (Figure 2.2A/B). For the
first simulated dataset RNASequel alignments produced the highest number of perfect
alignments (~90% versus 80-87% for the other methods), and with the second simulated dataset
RNASequel identified 12-20% more perfect alignments. The performance of the algorithms with
and without gene annotations was similar for both simulated datasets. Finally, Tophat2 had the
fewest partial alignments and the most singleton alignments, likely due
to one read in the pair having more mismatches than Tophat2’s default cutoff. For both simulated
datasets RNASequel realignment demonstrated increased repeat sensitivity: the number of
correct alignments to repetitive elements was typically ~4x higher for the first simulated dataset
and ~2x higher for the second dataset. This improved alignment accuracy is also reflected in
example regions from both simulated datasets (Figure 2.3, 2.4).
Figure 2.2. Simulated dataset alignment rates. Alignment rates as percentages of the total number of pairs for the first (A) and second (B) simulated datasets with the indicated alignment methods. For a description of the alignment types see the benchmarking methods description.
Figure 2.3. (A) Alignment view of chr1: 155105667 – 155110998 bp for simulated dataset 1 comparing STAR Ann. with STAR Ann. plus RNASequel. The color of each alignment indicates how the alignment compared to the true alignment as indicated by the legend. Read pairs that were perfectly aligned by both tools are not shown. (B) The summary for all of the alignments in the indicated region.
Figure 2.4. (A) Alignment view of chr20: 61439314 - 61475112 bp for simulated dataset 2 comparing STAR Ann. with STAR Ann. plus RNASequel. The color of each alignment indicates how the alignment compared to the true alignment as indicated by the legend. Read pairs that were perfectly aligned by both tools are not shown. (B) The summary for all of the alignments in the indicated region.
2.1.3 Realignment to a splice junction database improves spliced read accuracy
A major challenge for de novo splice junction identification is that a single pass alignment
scheme may incorrectly align reads with short exonic alignments because the true splice junction
has not been discovered. To mitigate this issue I applied a filtering scheme to identify and
remove false positives that occurred due to repetitive region mappings, splice junctions occurring
exclusively in the ends of a read and/or non-canonical splice motifs. To maximize my ability to
align reads across multiple splice junctions I supplemented the sample-specific splice junction index
with groups of novel and annotated splice junctions that could be spanned by a single sequencing
read. For both simulated datasets, realignment with RNASequel or STAR with two passes
increased the number of perfectly mapped spliced reads by 2-10% (Figure 2.5). When gene
annotations were present the number of perfect alignments increased by 4-10%. This was
particularly evident for reads that spanned multiple splice junctions, which demonstrates the
utility of my splice index alignment (Figure 2.6). RNASequel realignment had the lowest
number of incorrect spliced alignments and the highest number of perfect alignments compared
to STAR. The rate of incorrect alignments was higher when using Tophat2 for de novo splice
junction predictions. This may be due to Tophat2’s higher false negative rate. The importance of
including de novo splice junctions for alignment is highlighted by examining RNASequel using
only gene annotations, which had the highest number of incorrect spliced reads. The improvement in
perfect spliced reads was more pronounced for the second simulated dataset, where the number of
perfect alignments was increased by ~10% and the number of failed alignments decreased by 5%
without annotations and 2% with annotations for RNASequel realignment versus STAR with two
passes. Overall, RNASequel realignment had the highest precision for both annotated and novel
splice junctions (Figure 2.7A/B, 2.8 A/B). For annotated splice junctions RNASequel
realignment had the highest recall for both simulated datasets and comparable precision. The
increase was small for the first simulated dataset, but 7-30% higher for the second simulated
dataset. As expected, the recall and precision were highest when gene model annotations were
supplied.
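The junction filtering step described in this section can be illustrated with a small sketch. It is not the RNASequel filter itself: the record fields, the set of motifs treated as canonical, the `min_overhang` threshold and the way the three criteria are combined are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

# Assumption: motifs treated as canonical for this illustration.
CANONICAL_MOTIFS = {"GT-AG", "GC-AG", "AT-AC"}


@dataclass
class JunctionEvidence:
    name: str
    motif: str          # donor-acceptor dinucleotides, e.g. "GT-AG"
    unique_reads: int   # supporting reads with a unique alignment
    repeat_reads: int   # supporting reads with multi-mapped alignments
    max_overhang: int   # largest short-side exonic overlap among supporting reads


def keep_junction(j: JunctionEvidence, min_overhang: int = 8) -> bool:
    """Hypothetical rule: drop a junction if any single criterion flags it."""
    only_repeats = j.unique_reads == 0 and j.repeat_reads > 0
    only_read_ends = j.max_overhang < min_overhang
    non_canonical = j.motif not in CANONICAL_MOTIFS
    return not (only_repeats or only_read_ends or non_canonical)


def filter_junctions(juncs: List[JunctionEvidence]) -> List[JunctionEvidence]:
    """Return the junctions that survive the filter."""
    return [j for j in juncs if keep_junction(j)]


candidates = [
    JunctionEvidence("good", "GT-AG", unique_reads=12, repeat_reads=1, max_overhang=30),
    JunctionEvidence("repeat_only", "GT-AG", unique_reads=0, repeat_reads=5, max_overhang=30),
    JunctionEvidence("end_only", "GT-AG", unique_reads=4, repeat_reads=0, max_overhang=3),
    JunctionEvidence("odd_motif", "CT-AC", unique_reads=4, repeat_reads=0, max_overhang=30),
]
print([j.name for j in filter_junctions(candidates)])
```

A stricter filter of this kind is expected to raise precision at some cost in recall for novel junctions, which is the tradeoff reported in this section.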
Figure 2.5. Spliced read alignment rates for the first simulated dataset (A) and the second simulated dataset (B). Perfect spliced alignments have all of the correct splice junctions, partial alignments have at least one correct splice junction and no incorrect splice junctions and failed alignments are unmapped reads or reads with an incorrect splice junction.
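The three categories defined in the caption above can be written as a small classifier. This is a sketch of the definitions only, with junctions represented as hypothetical (chromosome, intron start, intron end) tuples. A mapped read that recovers none of its true junctions is not covered by the caption; treating it as failed is an assumption made here.

```python
from typing import Optional, Set, Tuple

Junction = Tuple[str, int, int]  # (chromosome, intron start, intron end)


def classify_spliced_read(true_juncs: Set[Junction],
                          aligned_juncs: Optional[Set[Junction]]) -> str:
    """Classify one simulated spliced read against its known junctions.

    aligned_juncs is None when the read is unmapped.
    """
    if aligned_juncs is None:
        return "failed"            # unmapped read
    if aligned_juncs - true_juncs:
        return "failed"            # at least one incorrect junction
    if aligned_juncs == true_juncs:
        return "perfect"           # every true junction recovered
    if aligned_juncs & true_juncs:
        return "partial"           # some correct junctions, none incorrect
    return "failed"                # assumption: mapped but no junction recovered


truth = {("chr1", 100, 200), ("chr1", 300, 400)}
print(classify_spliced_read(truth, {("chr1", 100, 200)}))
```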
Figure 2.6. The number of correct splice junctions identified in each read stratified by the total number of true splice junctions for the first simulated dataset (A) and the second simulated dataset (B). Colored bars indicate the number of correctly identified junctions.
Figure 2.7. Alignment characteristics for the first simulated dataset. The recall and precision as a percentage of the number of correctly aligned reads for annotated junctions (A), novel junctions (B), insertions (C), and deletions (D). The alignment algorithms used are indicated according to the legend and the arrows indicate the improvement by RNASequel and are colored according to the legend. (E) Receiver-operator curve (ROC) demonstrating the relationship of correctly called sequence variants (Y axis) to the number of falsely-called variants (X axis) for each read pair across each of the alignment methods. Note that the X-axis scale is false positive variant calls per 100,000 reads.
Figure 2.8. Alignment characteristics for the second simulated dataset. The recall and precision as a percentage of the number of correctly aligned reads for annotated junctions (A), novel junctions (B), insertions (C), and deletions (D). The alignment algorithms used are indicated according to the legend and the arrows indicate the improvement by RNASequel and are colored according to the legend. (E) Receiver-operator curve (ROC) demonstrating the relationship of correctly called sequence variants (Y axis) to the number of falsely-called variants (X axis) for each read pair across each of the alignment methods. Note that the X-axis scale is false positive variant calls per 100,000 reads.
For the identification of novel splice junctions, RNASequel had a slightly lower recall rate due to
my filtering scheme, but a ~3-5% higher precision than STAR for the first simulated dataset. The
slight decrease in recall and the increase in precision demonstrates the tradeoff when applying a
filtering scheme to novel splice junctions prior to realignment. For the second simulated dataset,
RNASequel realignment increased the recall by 6-23% and the precision by 2-4%. I examined
the false negative splice junction alignments, and observed that a substantial fraction of them (23-60%) were
within 15 bp of the 3’ end of the read sequence. These may have been missed due to the
simulated read quality degradation near the 3’ ends (Figure 2.9, Figure 2.10).
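The recall and precision tradeoff discussed above follows directly from the standard definitions. The sketch below computes both over sets of distinct junctions, which is a simplification: the figures in this chapter report the metrics per aligned read. The junction tuples and the toy numbers are invented for illustration.

```python
from typing import Dict, Set, Tuple

Junction = Tuple[str, int, int]  # (chromosome, intron start, intron end)


def junction_precision_recall(truth: Set[Junction],
                              called: Set[Junction]) -> Dict[str, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = len(truth & called)
    fp = len(called - truth)
    fn = len(truth - called)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}


true_set = {("chr1", 10, 20), ("chr1", 30, 40), ("chr1", 50, 60), ("chr1", 70, 80)}
false_calls = {("chr1", 11, 21), ("chr1", 31, 45)}

# Unfiltered: every true junction plus two false ones.
unfiltered = true_set | false_calls
# Filtered: both false junctions removed, at the cost of one true junction.
filtered = true_set - {("chr1", 70, 80)}

print(junction_precision_recall(true_set, unfiltered))
print(junction_precision_recall(true_set, filtered))
```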
In summary, by generating a splice junction database and mapping the reads with an accurate
error tolerant realignment I have increased the splice junction accuracy, especially in the case of
datasets with high error rates.
Figure 2.9. The proportion of false positives (red), false negatives (blue) and true positives (green) by their position across each read sequence for junctions (first column), insertions (second column) and deletions (third column) from the first simulated dataset.
Figure 2.10. The proportion of false positives (red), false negatives (blue) and true positives (green) by their position across each read sequence for junctions (first column), insertions (second column) and deletions (third column) from the second simulated dataset.
2.1.4 RNASequel realignment improves alignments with insertions and deletions
Gapped alignments are a challenge for RNA-seq alignment. For example, a higher gap tolerance
threshold can result in additional false positive splice junction predictions by inserting a gap to
bridge an alignment to an incorrect splice junction. Furthermore, false positive gaps can be
inserted within an alignment that incorrectly aligns to an intronic sequence. To overcome this I
have combined RNASequel’s accurate splice junction indexing strategy with a gap tolerant
alignment using BWA-mem followed by trimming of alignments that map to intron sequences.
Using this approach, RNASequel increased the gap recall by ~20% compared to STAR and
Tophat2 (Figure 2.7C/D, Figure 2.8 C/D). The insertion precision was comparable between all
of the methods used while the deletion precision after RNASequel realignment was ~20-25%
higher compared to STAR. For each of the alignment algorithms the false negatives for
insertions and deletions tended to occur in the first and last 10 bp of each read where aligners are
more likely to soft clip the alignment rather than insert a gap (Figure 2.9, Figure 2.10).
Intriguingly, STAR alignments produced a higher percentage of false positive deletions through
the middle of the read compared to Tophat2 and RNASequel realignment. Tophat2 had a slightly
higher false positive rate near the read ends due to using an underlying global rather than local
alignment algorithm.
The effect of RNASequel’s increased gap tolerance is to reduce read artifacts such as
mismatches and incorrect splice junction calls due to incorrect gapped alignment.
2.1.5 RNASequel realignment increases mismatch tolerance and accuracy
High mismatch tolerance for RNA-seq alignment can lead to an increase in sensitivity, but it can
also lead to more false positive splice junction alignments or alignments that should be spliced
but are contiguously aligned into an intron sequence. The RNASequel splice junction filtering
step helps reduce some of these false positives, and trimming alignments that overlap
splice sites near the read ends removes many more. The simulated datasets are
dominated by alignments with low numbers of mismatches. To assess the performance of the
tools and RNASequel on read pairs with high and low levels of mismatches, I plotted the number
of true positive and false positive mismatches stratified by the true number of mismatches in
each read pair (Figure 2.7E, Figure 2.8E). RNASequel realignment had the highest mismatch
recall and precision compared to the other tools. Tophat2 had the lowest mismatch accuracy due
to a low mismatch tolerance by default. As observed in the splice junction and gap tests, the
majority of the false negative and false positive mismatches were near the ends of reads,
particularly the 3-prime end of the read (Figure 2.11). This is due to the higher number of
mismatches near the 3-prime end of the read from the simulated read quality degradation. It
should be noted that I could have improved the other tools’ accuracy by hand-optimizing their
alignment parameters, but I felt that the default parameters represented a typical laboratory use
case. Furthermore, adjusting the alignment tools’ mismatch parameters may lead to undesirable
alignment artifacts, for example, a higher false positive spliced read alignment rate.
Figure 2.11. The proportion of false positive (red), false negative (blue) and true positive (green) mismatches by their position across each read sequence for the first simulated dataset (first column) and the second simulated dataset (second column).
2.1.6 RNASequel execution speed and memory requirements.
RNASequel realignment is reasonably fast. The splice junction index generation takes less than
15 minutes with a single thread of execution and less than 1 GB of memory. The reference and
splice junction alignment steps are dependent on the chosen alignment tool; for BWA-mem this
takes 2-3 hours per 100M reads with 16 threads for the reference alignment and 1 hour per 100M
reads for the splice junction index alignment. BWA-mem uses 40GB of memory for both
alignment types. The merge step processes ~35M pairs per hour with 8 threads and uses 20GB of
memory. It should be noted that all four of the BWA-mem alignments could be parallelized on a
computational cluster decreasing the RNASequel processing time substantially. As a comparison
STAR processes roughly 50M pairs per hour with 8 threads and ~60GB of memory. Tophat2
processes roughly 8M pairs per hour with 8 threads and <20GB of memory.
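As a rough worked example, the throughput figures above can be combined into a serial wall-clock estimate for a library of 100M pairs. The sketch assumes that the BWA-mem rates are quoted per individual read (so 100M pairs is 200M reads), excludes index generation and the upstream spliced aligner that supplies the junctions, and ignores the differing thread counts, so the numbers are only indicative. The function names are hypothetical.

```python
def rnasequel_serial_hours(pairs_millions: float,
                           ref_hours_per_100m_reads: float) -> float:
    """Serial estimate for the BWA-mem alignments plus the merge step."""
    reads_millions = 2.0 * pairs_millions         # assumption: rates are per read
    reference = reads_millions / 100.0 * ref_hours_per_100m_reads
    junction = reads_millions / 100.0 * 1.0       # 1 hour per 100M reads
    merge = pairs_millions / 35.0                 # ~35M pairs per hour
    return reference + junction + merge


def single_tool_hours(pairs_millions: float, pairs_millions_per_hour: float) -> float:
    """Estimate for a tool with a quoted pairs-per-hour throughput."""
    return pairs_millions / pairs_millions_per_hour


pairs = 100.0
low = rnasequel_serial_hours(pairs, 2.0)
high = rnasequel_serial_hours(pairs, 3.0)
print(f"RNASequel steps (serial): {low:.1f}-{high:.1f} h")
print(f"STAR: {single_tool_hours(pairs, 50.0):.1f} h")
print(f"Tophat2: {single_tool_hours(pairs, 8.0):.1f} h")
```

Because the four BWA-mem alignments are independent, running them in parallel would reduce the alignment portion of this estimate to roughly the duration of the slowest single alignment.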
2.1.7 RNASequel realignment improves alignment characteristics on biological datasets.
Simulated datasets do not capture all of the potential sources of errors present in a biological
RNA-seq library. For example, there may be reads derived from spurious transcripts in non-
coding regions of the genome such as pseudogenes. There are also other sequencing errors
unique to a biological dataset such as reverse transcriptase template switching (119, 120). To
compare the alignment accuracy of RNASequel to Tophat2 and STAR, I applied my program to
three biological datasets, one derived from a lymphoblastoid cell line (YH) and two replicates
derived from a lymphoblastoid cell line GM12878 (142, 202). The YH RNA-seq sample used a
library that was poly(A) and ribosomal RNA depleted and was deeply sequenced to a depth of
~400M 2x90 bp pairs. The GM12878 samples were sequenced to a depth of ~100M 2x75 bp
poly(A) selected pairs. For all three samples RNASequel realignment led to the concordant
mapping of more read pairs. For the YH sample, realignment with RNASequel led
to the concordant mapping of ~90% of the read pairs while Tophat2 mapped ~60% and STAR
mapped ~84% (Figure 2.12A). For GM12878-1 the paired alignment rates were ~80% for STAR
and RNASequel while Tophat2 mapped ~48% (Figure 2.12B). Finally, for the GM12878-2
sample RNASequel mapped ~80% of the pairs, while STAR mapped ~70% and Tophat2 mapped
~45% (Figure 2.12C). In all three of the cases RNASequel identified 0.3% to 6% more pairs as
repeat mapping compared to STAR and Tophat2. For the YH dataset STAR with two passes
mapped a similar number of repeat pairs to RNASequel while mapping 2-3 times fewer for the
GM12878-1 dataset. To further investigate the read mapping improvements conferred by
RNASequel I compared STAR Ann. plus RNASequel to STAR Ann. with two passes with an
additional 25 poly(A) RNA-seq samples from the ENCODE project (Figure 2.13). On average
RNASequel mapped 2.75% more pairs and identified an average of 5.5% more repeat mapped
pairs.
Figure 2.12. Alignment rates for the YH (A), GM12878-1 (B) and GM12878-2 (C) RNA-seq datasets. Pairs were considered aligned if they were indicated as properly mapped by the alignment algorithm and discordant otherwise. Pairs where only one of the reads from the pair was mapped are indicated as singletons.
RNASequel realignment attempts to predict whether a read pair is concordant using the
empirically determined fragment size distribution, splice junction predictions and gene
annotations. To compare the effect of this on paired alignment I used my fragment size
determination algorithm on the alignments produced by STAR and Tophat2 to predict whether
the paired alignments have a valid fragment size using the junctions predicted by the tool and
gene annotations. I found that ~1-2% of the pairs uniquely mapped by STAR and Tophat2 had a
fragment size outside of the empirical range determined by my algorithm (Figure 2.14). It
should be noted that Tophat2 does take advantage of a user-provided fragment size mean and
standard deviation. These numbers were also similar for repeat alignments where all or a subset
of the alignments had a fragment size that was not within the empirically determined
distribution. These represent a small proportion of the alignments and include cases where the
fragment size fell in the tail of the fragment size distribution, cases with missing splice junction
annotations, and false positive alignments. For STAR, ~60-80% of these unique pairs and ~20-40% of
these repeat pairs fall within 50 bp of my confidence interval. However, these alignments can
contribute to artifacts in downstream analysis, especially when identifying variant or RNA
editing calls.
Figure 2.13. Comparing paired-end read alignment rates for 25 ENCODE RNA-seq datasets with STAR Ann. with two passes and STAR Ann. with RNASequel. The types of alignments are indicated by the legend.
Figure 2.14. Application of the RNASequel fragment size estimation and verification algorithm to the alignments produced by Tophat2 and STAR. The percentage of the total number of read-pairs in the YH (A), GM12878-1 (B) and GM12878-2 (C) RNA-seq datasets is indicated for four categories: unique alignments where the alignment did not fall within the empirically estimated range (Unique -> Unmapped), repeat alignments where all of the alignments failed to fall within the estimated range (Repeat -> Unmapped), repeat alignments where all but one of the alignments did not fall within the range (Repeat -> Unique), and repeat alignments where a subset of the alignments did not fall within the range (Repeat -> Repeat).
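The four categories in the caption above amount to counting how many of a pair's reported alignments pass the fragment size check. A minimal sketch, assuming one boolean per reported alignment; the function name and the two labels for pairs whose status does not change are not from the thesis.

```python
from typing import List


def reclassify_pair(fragment_ok: List[bool]) -> str:
    """Map a pair's per-alignment fragment size results to a category.

    fragment_ok holds one entry per reported alignment of the read pair:
    True if that alignment's fragment size is within the empirical range.
    """
    n_total = len(fragment_ok)
    if n_total == 0:
        raise ValueError("pair has no reported alignments")
    n_pass = sum(fragment_ok)
    if n_total == 1:
        # Label for the unchanged case is an assumption.
        return "Unique (unchanged)" if n_pass == 1 else "Unique -> Unmapped"
    if n_pass == 0:
        return "Repeat -> Unmapped"
    if n_pass == 1:
        return "Repeat -> Unique"
    if n_pass < n_total:
        return "Repeat -> Repeat"
    return "Repeat (unchanged)"   # label is an assumption


print(reclassify_pair([True, False, False]))
```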
2.1.8 RNASequel realignment generates more robust RNA editing calls.
In vertebrates, the ADAR family of enzymes is responsible for the conversion of adenosine to
inosine (A-to-I) in RNA (203). This type of RNA editing is thought to be used as a regulatory
mechanism (204). In RNA-sequencing, A-to-I edits manifest either as A-to-G or T-to-C
substitutions depending on the strand of the transcript. The identification of RNA editing sites
using RNA-seq is difficult due to a number of sequencing and alignment artifacts. To
demonstrate the degree to which RNASequel realignment improves RNA editing calls I
compared my realignment algorithm with Tophat2 and STAR with and without gene model
annotations. The potential nucleotide changes were then filtered to remove common sources of
false positives including alignments to tandem repeats and changes biased to the ends of reads. I
removed somatic polymorphisms (if available) or common polymorphisms in dbSNP (if genome
annotations were not available). Prior to filtering the YH and GM12878 datasets, RNASequel
realignment yielded comparable numbers (+/-1-3%) of A-to-I changes as STAR; relative to Tophat2,
RNASequel yielded 20% more A-to-I changes for the YH dataset and 8-10% fewer changes for
the GM12878 datasets (Figure 2.15, 2.16, 2.17). For non-A-to-I changes RNASequel yielded
4-11% fewer changes compared to STAR and 23-40% fewer compared to Tophat2. I also
compared the total SNV calls between STAR Ann. with RNASequel and STAR Ann. with two
passes for 25 additional ENCODE RNA-seq samples. I found an average decrease in the number
of A-to-I calls by 0.52% and a decrease in non-A-to-I calls by 3.7% (Figure 2.18, 2.19). These
results suggest that RNASequel realignment leads to fewer potential false positives prior to
filtering than STAR and Tophat2. These results are also consistent with my simulated dataset
results that demonstrated the reduction in false positive mismatch calls facilitated by RNASequel
realignment compared to Tophat2 and STAR.
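Since A-to-I edits appear as A-to-G or T-to-C substitutions depending on the strand of the transcript, separating A-to-I candidates from other changes reduces to a strand-aware lookup. The sketch below assumes that bases are reported on the forward strand of the reference genome and that the transcript strand is given as "+", "-" or unknown; the function name is hypothetical.

```python
from typing import Optional


def is_a_to_i_candidate(ref_base: str, alt_base: str,
                        transcript_strand: Optional[str]) -> bool:
    """Return True if a genome-strand substitution is consistent with A-to-I.

    ref_base and alt_base are given on the forward strand of the reference.
    """
    change = (ref_base.upper(), alt_base.upper())
    if transcript_strand == "+":
        return change == ("A", "G")
    if transcript_strand == "-":
        return change == ("T", "C")
    # Unknown strand: accept either orientation.
    return change in {("A", "G"), ("T", "C")}


print(is_a_to_i_candidate("T", "C", "-"))
```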
After filtering potential false positives I observed that RNASequel and STAR found similar
somatic SNV calls (~1% difference) (Figure 2.15, 2.16, 2.17). For Tophat2 alignments
RNASequel realignment yielded 20-40% more somatic SNV and dbSNP calls. I also observed an
average 3.1% reduction in dbSNP calls for ENCODE samples (Figure 2.18, 2.19). For A-to-I
calls I observed a comparable number of changes between STAR and RNASequel for the YH
dataset (~0.1-1% increase after realignment) and ~4-10% fewer changes for the GM12878
datasets, and for Tophat2 alignments I found 2-3 times as many A-to-I calls. For non-A-to-I
changes I observed a 15-25% decrease in the number of calls after RNASequel realignment
compared to STAR and 1.4-3 times as many compared to Tophat2. For the 25 ENCODE
datasets I found an average of 7.3% fewer A-to-I changes and 10.4% fewer non-A-to-I changes.
Combined, the simulated and biological results suggest that RNASequel realignment
yields fewer false positive SNV calls compared to STAR because it
reduces the number of non-A-to-I changes. Furthermore, for the YH dataset I found more
somatic SNVs, suggesting a lower false negative rate compared to STAR and Tophat2.
Tophat2 uses a global alignment algorithm and low mismatch tolerance that leads to a higher
false negative rate for reads with more than 2 mismatches and a higher false positive rate at the
read end for reads with few mismatches. In conclusion, RNASequel realignment demonstrates a
general reduction in the number of false positives with minimal effect on the false negative rate.
Figure 2.15. YH SNV and edit call comparisons. (A) The number of unfiltered changes with 10x coverage and at least 10% alternative allele frequency for A-to-I (Total A-to-I) and non-A-to-I (Total Non-A-to-I), the total number of calls that overlapped a genomic SNV (Genome), the number of retained A-to-I changes and non-A-to-I changes. (B) The difference in the number of calls for alignments with STAR compared to alignments with RNASequel and the difference after repeats and pairs with incorrect fragment sizes were removed (labeled with clean suffix). For STAR with two passes, the alignment rate for RNASequel with STAR as a single pass is used for comparison.
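The coverage and allele frequency thresholds quoted in the caption can be expressed as a simple predicate. This sketch assumes that "10x coverage" means at least 10 reads covering the site and that the alternative allele frequency is the alternative read count divided by the depth; the function and parameter names are hypothetical.

```python
def passes_call_thresholds(depth: int, alt_count: int,
                           min_depth: int = 10,
                           min_alt_fraction: float = 0.10) -> bool:
    """Check the depth and alternative allele frequency cutoffs for one site."""
    if alt_count < 0 or alt_count > depth:
        raise ValueError("alt_count must be between 0 and depth")
    if depth < min_depth:
        return False
    return alt_count / depth >= min_alt_fraction


print(passes_call_thresholds(depth=40, alt_count=4))
```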
Figure 2.16. GM12878-1 SNV and edit call comparisons. (A) The number of unfiltered changes with 10x coverage and at least 10% alternative allele frequency for A-to-I (Total A-to-I) and non-A-to-I (Total Non-A-to-I), the total number of calls that overlapped a genomic SNV (Genome), the number of retained A-to-I changes and non-A-to-I changes. (B) The difference in the number of calls for alignments with STAR compared to alignments with RNASequel and the difference after repeats and pairs with incorrect fragment sizes were removed (labeled with clean suffix). For STAR with two passes, the alignment rate for RNASequel with STAR as a single pass is used for comparison.
Figure 2.17. GM12878-2 SNV and edit call comparisons. (A) The number of unfiltered changes with 10x coverage and at least 10% alternative allele frequency for A-to-I (Total A-to-I) and non-A-to-I (Total Non-A-to-I), the total number of calls that overlapped a genomic SNV (Genome), the number of retained A-to-I changes and non-A-to-I changes. (B) The difference in the number of calls for alignments with STAR compared to alignments with RNASequel and the difference after repeats and pairs with incorrect fragment sizes were removed (labeled with clean suffix). For STAR with two passes, the alignment rate for RNASequel with STAR as a single pass is used for comparison.
Figure 2.18. Comparing SNV and edit call rates for STAR Ann. with two passes and STAR Ann. + RNASequel. The number of unfiltered changes with 10x coverage and at least 10% alternative allele frequency for A-to-I (Total A-to-I) and non-A-to-I (Total Non-A-to-I), the total number of calls that overlapped a dbSNP entry (dbSNP), the number of retained A-to-I changes and non-A-to-I changes for the 25 ENCODE RNA-seq samples.
Finally, to explore the features of RNASequel realignment that led to improved SNV and RNA
editing calls I assessed the impact of RNASequel’s improved repeat sensitivity and fragment size
estimation algorithms. To assess the impact of repeat mapped reads on calling of RNA editing
sites, I collected the union of reads that mapped across any of the variant sites by either the base
alignment program or the alignment program with RNASequel. Read pairs that were multi-
mapped by one tool and uniquely mapped by the other were discarded and the edit sites were
called and filtered again. To assess the impact of my fragment size determination algorithm on
identifying concordant read pairs I removed uniquely mapped reads that did not have a valid
fragment size as determined by my algorithm. I found that within the union of SNV supporting
alignments RNASequel marked 4-10% of the reads as multi-mapping, compared to STAR and
Tophat2, which marked ~1% as multi-mapping (Figure 2.20). STAR with two passes had 2.7%
multi-mapped read rate for the YH sample. I also identified more alignments marked as singleton
Difference After RNASequel
Figure 2.19. Comparing the differences in SNV and edit call rates for STAR Ann. with two passes and STAR Ann. + RNASequel before (All Pairs) and the difference after repeats and pairs with incorrect fragment sizes were removed (Cleaned Pairs). The difference in the number of unfiltered changes with 10x coverage and at least 10% alternative allele frequency for A-to-I (Total A-to-I) and non-A-to-I (Total Non-A-to-I), the total number of calls that overlapped a genomic SNV (Genome), dbSNP entry (dbSNP), the number of retained A-to-I changes and non-A-to-I changes for the 25 ENCODE RNA-seq samples.
63
compared to STAR (1-10% versus 0%) and fewer than Tophat2 (1-9% versus 17-35%). For the
25 ENCODE samples I observed an average of 13% multi-mapped reads with RNASequel
versus 1.3% with STAR with two passes (Figure 2.21). RNASequel realignments mapped more
pairs where 0.6-4% of the reads were unmapped compared to STAR and Tophat2 where 4-25%
of the pairs were unmapped. RNASequel also mapped more of the alignments than STAR for the
25 ENCODE datasets 1% versus 4%. A portion of the alignments identified by STAR as
concordant pairs were marked as discordant pairs by RNASequel (0.1-0.8% of the alignments).
Tophat2 marked the highest proportion of reads as discordant but this was also the case for the
simulated and full set of reads for the biological datasets. Finally, my fragment size estimation
algorithm identified ~1% of the reads mapped as unique by STAR or Tophat2 as being
discordant. After removing the alignments that were marked as unique by STAR and reads
marked as discordant with my fragment size determination algorithm the difference in the
number of calls was lessened or increased in favor of RNASequel (Figure 2.15, 2.16, 2.17,
2.19). For example, the number of non-A-to-I edits identified by STAR is reduced after
removing reads that were marked as repeat mapping by either STAR or RNASequel.
Collectively, these results imply that the improvements in alignment characteristics,
particularly increased repeat sensitivity and improved identification of concordantly mapped read
pairs, lead to an improved alignment for the purposes of calling SNVs and RNA edits.
Figure 2.20 panel labels: A, B, C. Axis labels: Repeat Pair (%), Discordant Pair (%), Fragment Fail (%), Singleton (%), Repeat Singleton (%), Unmapped (%).
Figure 2.20. Comparison of the alignment type between the union of all reads that support a genomic SNV, dbSNP entry, retained A-to-I change, or retained non-A-to-I change for YH (A) GM12878-1 (B) and GM12878-2 (C). The bar on the left indicates the percentage of alignment types for the labeled tool, the bar on the right indicates the alignment rate for the tool with RNASequel realignment. For STAR with two passes, the alignment rate for RNASequel with STAR as a single pass is used for comparison.
Figure 2.21 axis labels: Repeat Pair (%), Discordant Pair (%), Fragment Fail (%), Singleton (%), Repeat Singleton (%), Unmapped (%).
Figure 2.21. Comparison of the alignment type between the union of all reads that support a genomic SNV, dbSNP entry, retained A-to-I change, or retained non-A-to-I change for the 25 ENCODE RNA-seq datasets. The type of alignment being measured is indicated on the y-axis and each bar represents the percentage of the union of all reads.
2.2 Discussion:
By systematically mitigating common artifacts that occur during RNA-seq library preparation
and alignment, RNASequel increases the recall of splice junctions, gaps and mismatches while
decreasing the false discovery rate. When applied to the challenging problem of RNA editing
identification, the RNASequel post-processing method reduces the number of apparent false
positives without adversely affecting sensitivity. I have found that using RNASequel in
combination with STAR provides the best accuracy metrics. Crucially, I show that despite my
higher error tolerance, I identify fewer non-canonical edits compared to STAR on a biological
dataset. This implies that many potential RNA editing calls are due to systematic alignment
errors that can be mitigated with RNASequel realignment thereby strengthening the
interpretation of biological datasets. STAR is also preferred because it has better performance
characteristics than Tophat2. RNASequel realignment is agnostic to the underlying aligners used
for splice junction prediction and contiguous read alignment, which makes it an adaptable RNA-seq
alignment tool that can take advantage of new alignment methods. I am currently
investigating methods to improve the performance and disk space usage of RNASequel by
calling the underlying contiguous aligner as a library. I am also investigating methods to capture
aligned pairs that fall within the tail of the fragment size distribution to increase the number of
concordantly mapped pairs. The improvements facilitated by RNASequel realignment are useful
for the analysis of alternative splicing, gene and isoform expression, sequence variant calling and
RNA editing.
2.3 Methods:
2.3.1 Reference genome and annotations
I downloaded human genome build GRCh37 reference sequences and annotations from the
UCSC Genome Browser (genome.ucsc.edu) and created a gene annotation GTF file
(http://www.ensembl.org/info/website/upload/gff.html) from the knownGene and
knownIsoform tables (205). Reference sequences and annotations for chromosomes 1-22, X, Y
and M were used for GTF and fasta sequence construction.
2.3.2 Biological Datasets:
The lymphoblastoid derived ribominus and poly(A) depleted RNA-seq datasets were
downloaded from SRA043767 at the NCBI short read archive (142). The individual lanes were
merged together, yielding a total of 421,836,549 2x90 base pair reads. A GFF list of genomic
single nucleotide variants for the individual from whom the cell line was derived was downloaded from
(206): http://yh.genomics.org.cn/download.jsp
The GFF was lifted over to hg19 using the UCSC LiftOver tool and the hg18 to hg19 LiftOver
chain (205).
ENCODE long poly(A) paired-end RNA-seq datasets (202) for 27 samples were downloaded
from:
http://hgdownload.cse.ucsc.edu/goldenPath/mm9/encodeDCC/wgEncodeCshlLongRnaSeq/
Each replicate was analyzed independently.
GM12878 genome raw sequencing pairs were downloaded from the 1000 genomes project (207)
using SRA accession ERX000170 and mapped to the human reference genome using BWA mem
and the following parameters: “-k 15 -a -B 2 -M”. The alignment was piled up and any change
with at least 10x coverage and a 10% alternative allele frequency was retained.
2.3.3 Alignment Protocols:
I used STAR (189) version 2.3e and Tophat2 (187) version 2.10 with Bowtie2 (156) 2.1.0 for
spliced alignment with their default parameters with the following exceptions: for STAR
alignments with annotations the UCSC “genes.gtf” was provided with the “--sjdbGTFfile”
parameter. For two pass STAR alignments the “SJ.out.tab” file produced by the first pass was
provided to the genome generation step with the “--sjdbFileChrStartEnd” parameter.
For Tophat2 alignments the insert size was specified as “-r 78 --mate-std-dev 30” for the
simulated datasets and “-r 20 --mate-std-dev 40” for the lymphoblastoid dataset and
“-r 59 --mate-std-dev 40” for the GM12878 datasets. For annotated Tophat2 alignments the genes.gtf file
was provided using the “-G” parameter.
2.3.4 RNASequel Realignment
To generate a splice junction index the following command was used:
rnasequel transcriptome -g genes.gtf -r genome.fa -n {read size} -b denovo_alignment.bam -o tx
The transcriptome index step produces two files: a text file containing the junction locations relative to
the genome sequence and a fasta file containing the junction sequences. The read size is supplied to
the index generation step with the -n option. For alignments that used de novo splice junctions without
annotations the -g option was not included, and for annotations alone the -b option was
not included.
The individual reads from each pair were then mapped using BWA mem version 0.7.8 (198)
with the commands indicated below. It should be noted that these commands could be run in
parallel in a computational cluster.
BWA-mem indexing:
bwa index genome.fa
bwa index tx.fa
Reference genome alignment:
bwa mem -L 2,2 -k 15 -a -t 8 -B 2 {index} {reads1 or 2} | samtools view -bS - > {ref 1 or 2.bam}
Splice junction index alignment:
bwa mem -L 2,2 -c 20000 -M -k 15 -a -t 8 -B 2 tx.fa {read 1 or 2} | samtools view -bS -F 4 - > {juncs 1 or 2.bam}
Merging the pairs:
rnasequel merge -r genome.fa -g genes.gtf -j tx.txt ref1.bam -o align.bam
ref2.bam juncs1.bam juncs2.bam
2.3.5 Splice Junction Definitions and Alignment Scoring
I defined a canonical splice junction as any splice junction with the following motifs: GT-AG,
GC-AG, GC-TG, GC-AA, GC-GG, GT-TG, GT-AA, AT-AC, AT-AA, and AT-AG. The strand
of a splice junction was inferred based on gene annotations and the aforementioned splicing
motifs. Alignments were scored using the following scoring penalties: gap open = -8, gap
extension = -1, splice junction = -4, match = 3, mismatch = -3. For spliced alignments an extra
alignment penalty was added for each splice junction. A penalty of -3 was applied for GT-AG
splice junctions, -6 for other canonical splice motifs and -9 for all other splice motifs. To reduce
the chances of choosing an alignment with a long intron over an alignment with a shorter intron
and a lower score I applied an additional penalty for splice junctions with introns over a pre-
defined length (arbitrarily set at 64kb by default). For these long introns I applied a penalty of
−(log2(isize)−12) , where isize is the size of the intron.
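The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RNASequel implementation: the function names are hypothetical, a gap of length L is assumed to cost the gap open penalty plus L extension penalties, the base splice junction penalty and the motif penalty are assumed to be additive, and 64kb is taken to mean 64 x 1024 bp.

```python
import math

# Motifs listed in the text as canonical.
CANONICAL_MOTIFS = {"GT-AG", "GC-AG", "GC-TG", "GC-AA", "GC-GG",
                    "GT-TG", "GT-AA", "AT-AC", "AT-AA", "AT-AG"}

MATCH, MISMATCH = 3, -3
GAP_OPEN, GAP_EXTEND = -8, -1
JUNCTION = -4
LONG_INTRON = 64 * 1024  # assumption: "64kb" interpreted as 65,536 bp


def junction_penalty(motif, intron_size):
    """Penalty for one splice junction (assumed additive components)."""
    penalty = JUNCTION
    if motif == "GT-AG":
        penalty += -3
    elif motif in CANONICAL_MOTIFS:
        penalty += -6
    else:
        penalty += -9
    if intron_size > LONG_INTRON:
        # Long-intron term from the text: -(log2(isize) - 12)
        penalty -= math.log2(intron_size) - 12
    return penalty


def alignment_score(matches, mismatches, gap_lengths, junctions):
    """Score an alignment from its match/mismatch counts, gaps and junctions.

    gap_lengths: list of gap lengths in bp.
    junctions: list of (motif, intron_size) tuples.
    """
    score = matches * MATCH + mismatches * MISMATCH
    for length in gap_lengths:
        score += GAP_OPEN + GAP_EXTEND * length  # assumed gap cost model
    for motif, intron_size in junctions:
        score += junction_penalty(motif, intron_size)
    return score
```

Under these assumptions a 90 bp read with two mismatches, one single-base gap and one GT-AG junction is scored by calling alignment_score(88, 2, [1], [("GT-AG", 500)]).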
2.3.6 Splice Junction Discovery and Splice Junction Index Generation
The splice junction databases combined reference annotations (if available) and novel splice
junction predictions from Tophat2 or STAR (if used). Only the novel splice junctions meeting
the following criteria were retained (used for analysis): 1) The splice junction must be observed
at least 8 bp away from the ends of at least one read; 2) there are at least 2 different alignment
positions mapping across the pair; 3) the predicted intron size is at least 21 bp and no more than
500 kb in length. For each novel junction I added to the database N base pairs of flanking
sequence on each side of the junction, where N should be chosen based on the size of a
sequencing read; in my case I used 76 or 90. To handle cases in which a read could span
multiple splice junctions, I supplemented my index by including multiple splice junctions on the
same annotated strand if a sequence of length N could span one or more downstream junctions.
Splice junctions with an ambiguous strand were considered on both strands. Finally, redundant
sets of spanning splice junctions were removed to minimize the database size. The splice
junction index can then be used with any contiguous read mapper.
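The three retention criteria for novel splice junctions can be expressed as a small filter. The sketch below is illustrative only: the JunctionSupport record and its field layout are hypothetical, 500 kb is taken as 500,000 bp, and "different alignment positions" is interpreted as distinct alignment start positions among the supporting reads.

```python
from dataclasses import dataclass


@dataclass
class JunctionSupport:
    intron_size: int
    # One entry per supporting read: (alignment start position,
    # shortest distance in bp from the junction to either read end).
    reads: list


def keep_novel_junction(junction, min_overhang=8, min_positions=2,
                        min_intron=21, max_intron=500_000):
    """Return True if a novel junction passes the three criteria in the text."""
    # Criterion 3: intron size between 21 bp and 500 kb.
    if not (min_intron <= junction.intron_size <= max_intron):
        return False
    # Criterion 1: at least one read places the junction >= 8 bp from its ends.
    if not any(overhang >= min_overhang for _, overhang in junction.reads):
        return False
    # Criterion 2: at least two distinct alignment positions.
    distinct_starts = {start for start, _ in junction.reads}
    return len(distinct_starts) >= min_positions
```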
2.3.7 Contiguous and Spliced Read Alignment
For mapping reads to the GRCh37 reference genome (contiguous alignments) and splice junction
indexes, I chose BWA-mem version 0.7.8 for its speed and accuracy. However, any read mapper
can be used. Read 1 and read 2 from each pair were independently mapped to the reference
genome and the splice junction index. For each splice junction alignment, I resolved the
alignments back to the genomic co-ordinates and removed contiguous alignments. To avoid
alignment artifacts that occur due to reads improperly aligning to intronic sequences, alignments
were trimmed if they overlapped a splice site within 6 base pairs of the end of the alignment. For
each alignment the score was calculated as described above and I defined the minimum
alignment score to be 2×(!"#$%&' !"#$#); any alignments with a score less than this were
discarded. The retained alignments for read 1 and read 2 were then paired by identifying every
potential alignment combination that matched the following criteria: 1) the alignments were on
the same chromosome; 2) the alignments were in the correct orientation; and 3) the distance
between the read pair was less than 1Mb.
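The pairing criteria can be illustrated with a short sketch that enumerates every combination of read 1 and read 2 alignments. The Alignment record is hypothetical, and two details are assumptions rather than statements from the text: "correct orientation" is taken to mean that the mates lie on opposite strands with the forward-strand mate upstream, and the pair distance is measured as the outer span of the two alignments.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Alignment:
    chrom: str
    start: int   # 0-based leftmost genomic position
    end: int     # exclusive rightmost genomic position
    strand: str  # "+" or "-"


MAX_PAIR_DISTANCE = 1_000_000  # 1 Mb, as stated in the text


def is_candidate_pair(a, b, max_distance=MAX_PAIR_DISTANCE):
    """Check the three pairing criteria for two alignments."""
    if a.chrom != b.chrom:        # 1) same chromosome
        return False
    if a.strand == b.strand:      # 2) orientation (assumed: opposite strands)
        return False
    fwd, rev = (a, b) if a.strand == "+" else (b, a)
    if fwd.start > rev.start:     # assumed: forward mate lies upstream
        return False
    return rev.end - fwd.start < max_distance  # 3) distance below 1 Mb


def candidate_pairs(read1_alignments, read2_alignments):
    """Enumerate every read 1 / read 2 combination that passes the criteria."""
    return [(a, b) for a, b in product(read1_alignments, read2_alignments)
            if is_candidate_pair(a, b)]
```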
2.3.8 Estimating the Empirical Fragment Size Distribution
As noted earlier, the current generation of RNA-seq aligners use an arbitrary cutoff to remove
read pairs that map too far away from each other. RNASequel uses two different methods to
solve this problem in a more disciplined manner. Only read-pairs that mapped uniquely after
discarding alignments that had a score less than (the highest alignment score) – 12 were used for
fragment size estimation. I used a score difference of twelve, which equates to four mismatches
with my default mismatch penalty of three. This number can be adjusted if an increased repeat
sensitivity is desired. In the case in which a gene annotation file is available for the organism
under study, I estimated the expected fragment size distribution from the annotated gene model
introns. For organisms with gene annotations I identified pairs that mapped to long exons (>250
bp) that should be larger than the insert size of the library or pairs that mapped to single isoform
genes (115). In the case in which gene annotations were unavailable, I used a maximum distance
criterion of 1,500 bp between the read pairs. In both cases I set a size cutoff of 1,500 bp and
required at least 100,000 fragment size observations. Both methods for estimating the fragment
size distribution may lead to rare cases where an intron is retained and the fragment size is
overestimated. Moreover, there is the possibility that a small fraction of the read-pairs are
aligned incorrectly with long fragment sizes. To compensate, the empirical distribution was
normalized and a confidence interval retaining the smallest 99% of the observations was applied.
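As an illustration of the estimation step, the sketch below builds an empirical fragment size distribution from a list of observed sizes and keeps the smallest 99% of observations. The function name and return values are hypothetical, and the choice not to renormalize after applying the cutoff is an assumption made here for simplicity.

```python
from collections import Counter


def fragment_size_distribution(sizes, max_size=1500,
                               min_observations=100_000,
                               keep_fraction=0.99):
    """Return (density, cutoff) for the observed fragment sizes.

    density maps fragment size -> fraction of observations; sizes above
    the cutoff (the smallest size covering keep_fraction of observations)
    are dropped.
    """
    sizes = [s for s in sizes if 0 < s <= max_size]
    if len(sizes) < min_observations:
        raise ValueError("too few fragment size observations")
    counts = Counter(sizes)
    total = len(sizes)
    cumulative = 0
    cutoff = max(counts)
    for size in sorted(counts):
        cumulative += counts[size]
        if cumulative / total >= keep_fraction:
            cutoff = size
            break
    density = {s: c / total for s, c in counts.items() if s <= cutoff}
    return density, cutoff
```

For small test inputs the min_observations argument can be lowered; the default reflects the 100,000 observations required in the text.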
2.3.9 Resolving Read Pair Alignments
To identify potential concordant read pairs I examined all of the potential combinations between
the alignments for read 1 and read 2 that were correctly oriented, mapped to the same
chromosome, and were less than 1 Mbp apart. For each of these potential pairs every potential
fragment size, using different combinations of splice junctions between the pairs, was compared
to the empirically determined fragment size distribution. Each potential fragment was then
assigned a score of 10 x | (normalized fragment distribution score) / (maximum fragment
distribution score) | + (read 1 alignment score) + (read 2 alignment score). The highest scoring
pair was marked as primary; any pair with a score difference of less than twelve was marked as
secondary and the remaining alignments were discarded. The score difference when calling
repeat alignments should be carefully chosen based on the desired repeat tolerance, for my
purpose I found that twelve, which is equivalent to a difference of 4 mismatches was reasonable.
If no valid pairs were found using the fragment size distribution and the potential read pair was
uniquely aligned it was outputted and marked as discordant. Furthermore, I implemented two
different fallback methods depending on whether or not gene annotations were provided. Both of
these methods are optional and deactivated by default. In the case where gene annotations were
provided if both pairs mapped within the same annotated gene and were less than a user-defined
distance apart they were considered concordant. If no gene annotations were provided I
considered a pair concordant if the distance between the pair was at least a user-defined distance
apart. For alignments where there were no valid alignments for one of the reads in a pair I
reduced the score difference threshold to six, since I am only examining a single read rather than
both reads in a pair. The highest scoring singleton alignment was marked as primary and the
remaining alignments were marked as secondary.
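The pair scoring and the primary/secondary decision can be sketched as follows. This is a simplified illustration, not the RNASequel code: the function and argument names are hypothetical, each candidate is assumed to already carry its two read alignment scores and the density of its fragment size under the empirical distribution, and the discordant and fallback cases described above are omitted.

```python
def resolve_pairs(candidates, max_density, score_window=12):
    """Pick the primary pair and the secondary pairs.

    candidates: list of (name, read1_score, read2_score, fragment_density).
    max_density: maximum value of the empirical fragment size distribution.
    Returns (primary_name, [secondary_names]); pairs scoring score_window
    or more below the best pair are discarded.
    """
    if not candidates:
        return None, []
    scored = []
    for name, score1, score2, density in candidates:
        # 10 x |normalized score / maximum score| + read 1 score + read 2 score
        total = 10 * abs(density / max_density) + score1 + score2
        scored.append((total, name))
    scored.sort(reverse=True)
    best_score, primary = scored[0]
    secondary = [name for total, name in scored[1:]
                 if best_score - total < score_window]
    return primary, secondary
```

With the default window of twelve, a second pair that differs from the best pair by three mismatches (nine score units) would be reported as secondary, while one that differs by four or more mismatches would be dropped.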
2.3.10 Simulated Dataset Benchmarking
The simulated datasets were previously used to benchmark the accuracy of RNA-seq alignment
programs (184). The datasets were downloaded from ArrayExpress using the accession number
E-MTAB-1728 (184) and alignments that mapped to “random” and “NA” chromosomes were
removed. To simplify the comparison of alignment pipeline outputs to the “ground truth” of the
simulated datasets, I removed read pairs if either read had an edit distance of 25 or more. I left-
shifted gaps, trimmed spliced alignments with less than 8 base pairs of exonic overlap at the read
ends and converted spliced alignments into deletions for predicted introns with a length less than
21 bp. For repeat mapped alignments I considered only the primary alignment. An alignment was
considered perfect if the paired alignment exactly matched the true alignment. Partial alignments
overlapped the true alignment but may have been soft clipped (unmapped sequence at the 5’- or
3’- end of a read) or included alternate insertions or deletions. Singleton alignments were
classified as paired-alignments in which either read 1 or 2 was unmapped. For spliced read
alignment comparisons I counted a junction as correct only if the junction was present in the true
alignment. A spliced alignment was considered partially correct if it contained at least one of the
correct junctions and no incorrect junctions (this also encompasses the case in which the
alignment contained all of the correct junctions but some of them were lost due to soft clipping).
Finally, alignments that were mapped but did not meet the criteria for a perfect or partial
alignment were marked as failed.
2.3.11 Identifying Putative Adenosine to Inosine RNA editing events
The reads from the poly(A)-depleted YH lymphoblastoid cell line were mapped with the same
alignment algorithm combinations as the benchmarking datasets. The alignments were retained if
they had no more than two aligned ambiguous bases and no more than 10 soft clipped bases at
either end. The retained alignments were then searched for potential edits using the following
criteria to discard low quality calls: 1) positions mapping to tandem repeats using trf (208) or low
complexity and simple regions according to RepeatMasker were discarded, 2) for positions
overlapping an inverted repeat annotated by einverted (209) or a repeat element identified by
RepeatMasker I used a less stringent coverage criteria and required at least 10x coverage and a
10% alternative allele frequency, for positions with no repeat overlap I required 16x coverage
and a 20% alternative allele frequency, 3) at least one of the reads supporting the alternative base
had to carry it outside of the first and last 8 base pairs of the read, 4) potential changes for which
more than 90% of the supporting reads contained an insertion or deletion were removed, 5)
potential sites where more than 70% of the supporting alignments contained different kinds of
mismatches were discarded. After removing low quality calls, I also discarded changes found in
the UCSC Genome Browser “Common SNP” track, which is derived from dbSNP v141, if no
genome sequence was available. For the GM12878 and YH datasets SNPs that were called from
genome sequencing data were discarded (see Supplementary Methods) (206, 207).
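Criteria 2 to 4 above lend themselves to a compact filter. The sketch below is a hedged illustration with hypothetical names: it assumes the alternative allele frequency is the number of supporting reads divided by the coverage, uses 0-based read offsets, and leaves out the tandem repeat, mismatch diversity and SNP filters (criteria 1 and 5 and the final SNP step).

```python
from dataclasses import dataclass


@dataclass
class SupportingRead:
    offset: int      # 0-based position of the alternative base in the read
    length: int      # read length in bp
    has_indel: bool  # True if the alignment contains an insertion or deletion


def keep_candidate(coverage, reads, in_repeat,
                   end_margin=8, max_indel_fraction=0.90):
    """Apply the coverage, read-end and indel criteria to one candidate site."""
    # Criterion 2: relaxed thresholds inside inverted repeats / repeat elements.
    min_cov, min_frac = (10, 0.10) if in_repeat else (16, 0.20)
    if coverage < min_cov or len(reads) / coverage < min_frac:
        return False
    # Criterion 3: at least one supporting base outside the first/last 8 bp.
    if not any(end_margin <= r.offset < r.length - end_margin for r in reads):
        return False
    # Criterion 4: discard if more than 90% of supporting reads carry an indel.
    indel_fraction = sum(r.has_indel for r in reads) / len(reads)
    return indel_fraction <= max_indel_fraction
```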
3 Identifying RNA Hyper-Editing in C. elegans
3.1 Background
Recently, there have been three studies that have profiled RNA hyper-editing on a global scale in
C. elegans with RNA-seq (68, 69, 210). These studies have used poly(A)-selected, total RNA
samples and/or immunoprecipitated RNA. These aforementioned studies used differing edit
calling methods and these methods are difficult to replicate without the original source code.
These issues prevent the construction of a comprehensive and consistent database of RNA
editing sites in C. elegans. To mitigate this, I have mapped and identified A-to-I edits and hyper-
edited regions in 91 C. elegans paired-end RNA-seq samples derived from the aforementioned
studies and additional modENCODE RNA-seq samples (Table 1) (68, 69, 210, 211). These
libraries include 10 adr-2(lf) strains that lack A-to-I editing activity as negative controls.
I have engineered an updated version of RNASequel designed for fast sample processing and
high-mismatch tolerance to align the RNA-seq reads. I have also designed a sensitive RNA
editing identification pipeline to generate, to my knowledge, the most comprehensive map of
adr-2 dependent RNA editing. Consistent with previous reports, my expanded map of A-to-I
edits is strongly associated with non-coding RNA, inverted repeats, transposons and
heterochromatin. Finally, I compiled a list of potential edits within coding exons that lead to
amino acid changes. This expanded and refined reference database of ADAR2-dependent edits
will provide a critical resource for future studies on the biological role of adenosine deamination
in C. elegans.
3.2 Results
To comprehensively detect C. elegans RNA hyper-edited clusters I downloaded 91 RNA-seq
datasets from various stages and library-selections (Table 3.5.1). Included in the samples are 10
negative controls where ADAR2 (adr2(lf)) is catalytically inactive. It is expected that these
samples will not have an enrichment of A-to-I edits and that any A-to-I edits observed will be
false positives. To efficiently map the RNA-seq samples I built a new version of RNASequel that
is faster, more mismatch tolerant and does not produce temporary alignment files. This version
facilitated the quick processing and accurate alignment of the samples. Next, I developed an edit
identification pipeline capable of the sensitive detection of hyper-edited clusters, while
maintaining a low rate of non-canonical changes. To validate this pipeline, I demonstrate an
enrichment of edit calls in wildtype samples, a depletion of edits in adr2(lf) samples, and a high
overlap of edit calls with previous studies. Finally, I explored the association of the discovered
hyper-editing regions with repeat elements, heterochromatin, introns and 3’-UTRs.
3.2.1 Improvements to the RNASequel Aligner
To improve the performance and disk space usage of RNASequel for the analysis of multiple
samples and the detection of hyper-editing I made several modifications to the program:
1. The first modification altered the splice-junction indexing generation tool to work with
piped output, which eliminated the need for the temporary alignment file generated by
STAR resulting in run-time performance increases of the novel splice junction detection
step.
2. To eliminate potential false negative mismatch calls caused by false positive spliced alignments
that avoid mismatches, the stringency of the splice-junction
filtering step was also modified. Because the majority of C. elegans introns are smaller
than human introns, I set the maximum intron size to 32kb and tightened the filtering
parameters to require at least 5 unique pairs outside the first and last aligned 15 base pairs
of the read in order for the novel splice junction to be retained.
3. During the alignment step I replaced the four BWA-mem alignments with direct library
calls to BWA mem, eliminating intermediate alignment bam files. I also implemented a
pre-filter step using BWA-mem that discards pairs mapping to rRNA sequences. This
eliminated the four temporary bam files created by the BWA-mem alignment and the
need to run four alignments in parallel and increased performance. To permit alignments
with multiple-mismatches the score filter previously developed with RNASequel was
removed.
4. To resolve issues with fragment sizes falling near the right-tail of the distribution, the
fragment size distribution was smoothed by averaging the distribution within a sliding
window with 5 bp at each side. The original 0.99 confidence interval cutoff was extended
until the observation count was below: 0.05 x |the observation count at the 0.99
confidence interval cutoff|.
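The fourth modification, smoothing the fragment size distribution and extending the cutoff into the right tail, can be sketched as below. The function names are hypothetical, and two details are assumptions: the sliding window is truncated at the edges of the distribution, and the extended cutoff is taken to be the last size whose smoothed count is still at or above 5% of the count at the original 0.99 cutoff.

```python
def smooth_counts(counts, flank=5):
    """Average counts within a window of `flank` bp on each side.

    counts[i] is the number of fragments observed with size i.
    """
    smoothed = []
    for i in range(len(counts)):
        window = counts[max(0, i - flank): i + flank + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def extend_cutoff(smoothed, cutoff, fraction=0.05):
    """Extend the cutoff until the smoothed count falls below
    fraction x (count at the original cutoff)."""
    threshold = fraction * smoothed[cutoff]
    size = cutoff
    while size + 1 < len(smoothed) and smoothed[size + 1] >= threshold:
        size += 1
    return size
```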
3.2.2 RNA-seq sample processing and alignment
Accurately mapping RNA-seq pairs derived from C. elegans samples is paramount for accurate
A-to-I edit calling. One aspect of nematode biology that complicates RNA-seq alignment is the
presence of trans-spliced leader sequences at the 5’-end of many C. elegans mRNA transcripts
(212). These trans-spliced leaders cannot be mapped as splice junctions since they are derived
from independent gene products that may not even be on the same chromosome. These
sequences can be misaligned upstream of the spliced-leader acceptor site causing false positive
mismatch calls. To mitigate this I trimmed the spliced leader sequences off the ends of read-pairs
for each sample.
The 91 C. elegans RNA-seq samples were mapped to the C. elegans genome and the E. coli
OP50 genome with STAR and the improved version of RNASequel (Figure 3.1, Table 3.1). In
general the proportion of uniquely mapped pairs was highest for the poly(A) samples. The total
RNA samples had more repeat mapped pairs, rRNA contamination and unmapped pairs
compared to the poly(A) samples.
Table 3.1. Mapping Rates
Sample         | Pairs      | Unique | Repeat | Singletons | Repeat Singletons | rRNA   | Unmapped
Poly(A) mean   | 32,572,076 | 83.37% | 2.51%  | 1.13%      | 0.07%             | 9.09%  | 3.83%
Poly(A) STD    | 30,007,517 | 14.82% | 1.34%  | 0.77%      | 0.1%              | 14.63% | 4.79%
Total RNA mean | 51,625,071 | 45.61% | 4.93%  | 1.79%      | 0.65%             | 34.25% | 12.78%
Total RNA STD  | 50,422,695 | 28.69% | 4.56%  | 1.17%      | 0.91%             | 28.08% | 12.71%
adr2(lf) mean  | 70,928,625 | 86.06% | 4.25%  | 0.12%      | 0.03%             | 8.47%  | 1.06%
adr2(lf) STD   | 28,275,978 | 24.38% | 0.96%  | 0.2%       | 0.06%             | 23.52% | 0.98%
Figure 3.1. Alignment rates for the samples processed in this study. The samples are grouped by whether they are poly(A) selected, total RNA, or adr2(lf) mutants. The color bar indicates the developmental stage of the sample, the first horizontal bar chart indicates the rates for each alignment type, and the second horizontal bar chart indicates the number of clustered A-to-I and non-A-to-I variants.
Figure 3.1 sample groups: Poly(A) N = 44; Total RNA N = 37; adr2(lf) N = 10.
3.2.3 Accurate and sensitive identification of hyper-editing
One of the challenges associated with identifying hyper-edited regions in C. elegans is that they
are commonly associated with low abundance transcripts in RNA-seq experiments; for example,
transposons and introns (53, 68, 69, 210). Therefore, a pipeline to filter and identify RNA editing
in regions of low coverage is essential. The downside of reducing coverage requirements is that
sequencing errors may be misinterpreted as edit calls. To mitigate this issue I have developed a
hyper-editing identification pipeline that filters common sources of false positives and clusters the
edits into hyper-edited regions. Common sources of false positives and how they were mitigated
are indicated:
1. Sequencing errors that are reflected by low base quality scores were discarded by requiring bases to have a base-quality score of at least 25.
2. False positives due to the degradation of read quality towards the 3’-end of sequencing reads and incorrectly mapped reads associated with splice junctions and spliced leaders. These errors generally occur near the read ends and I eliminated many of them by discarding variants with alignment support that were exclusively in the first and last 5 bp of the read.
3. SNVs identified from genome sequencing data were discarded.
4. Error prone alignments across tandem repeat elements can lead to false positive variants. Removing potential variant calls that overlap tandem repeats mitigated these issues.
The alignments for all 91 samples were individually piled-up and variants with at least 5x
coverage and a frequency of at least 5% were retained for processing with my filtering pipeline
(Figure 3.2). The majority of variants were discarded in the base quality and read-end filtering
steps. After removing common false positive calls I applied a clustering step to select for variants
of the same type that occur within close proximity (< 100 bp). Variants that did not cluster were
retained as singletons. Variants that fell into clusters were labeled as clustered and counted
individually. To validate these calls I checked for significant enrichment for A-to-I changes from
the adr2(wt) samples compared to the adr2(lf) samples. I observed a significant enrichment for all of the
comparisons and a higher significance for clustered edits (p < 0.0001) compared to singleton
edits (p < 0.05 and p < 0.001) (Figure 3.3A). This verifies previous observations that ADAR2 is
the sole source of A-to-I editing in C. elegans and demonstrates that my pipeline does find
enrichment in A-to-I changes.
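The clustering step can be illustrated with a simple single-linkage pass over sorted variants. This is a sketch under assumptions, not the pipeline itself: the tuple layout and function name are hypothetical, a cluster is taken to be two or more variants of the same type on the same chromosome where each is less than 100 bp from its nearest neighbour in the cluster, and strand is ignored.

```python
def cluster_variants(variants, max_gap=100):
    """Split variants into clusters and singletons.

    variants: iterable of (chrom, pos, change) tuples, e.g. ("I", 100, "A>G").
    Returns (clusters, singletons), where clusters is a list of lists.
    """
    clusters, singletons = [], []

    def flush(run):
        # A run of two or more neighbouring variants forms a cluster.
        if len(run) > 1:
            clusters.append(run)
        elif run:
            singletons.append(run[0])

    # Sort so that variants of the same chromosome and type are adjacent.
    ordered = sorted(variants, key=lambda v: (v[0], v[2], v[1]))
    run = []
    for v in ordered:
        same_group = bool(run) and run[-1][0] == v[0] and run[-1][2] == v[2]
        if same_group and v[1] - run[-1][1] < max_gap:
            run.append(v)
        else:
            flush(run)
            run = [v]
    flush(run)
    return clusters, singletons
```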
Figure 3.2. Summary of variant identification and filtering steps. For each sample the number of edits discarded for the base quality, read-end, genome and tandem repeat steps are indicated as well as the number of clustered and singleton edits retained. The changes are split into A-to-I and non-A-to-I changes. Note the A-to-I counts are divided by 10^3 and the non-A-to-I changes are divided by 10^4.
Figure 3.2 sample groups: Poly(A) N = 44; Total RNA N = 37; adr2(lf) N = 10.
I compared the numbers of A-to-I and non-canonical variant calls for clustered edits and edits
that occurred as singletons. I observed that the majority of the clustered A-to-I changes for
adr2(wt) were enriched compared to the adr2(lf) samples (Figure 3.3 B,C). For singleton edits,
the percentage of A-to-I changes was markedly lower than clustered edits (median < 40%
compared to a median >90% for clustered edits), which suggests a higher false positive rate.
There is the possibility that a proportion of these mismatches are due to wobble base-pairing
with inosine that leads to A-to-T or T-to-A mismatches (213). For example, the “N2e-DMM402-
N2eall_L1-V” sample had 87,049 singleton A-to-I variants and 116,604 singleton non-canonical
variants compared to 21,605 clustered A-to-I changes and 281 clustered non-canonical changes.
Furthermore, the number of singleton A-to-I changes was strongly correlated with the number of
non-canonical changes, while clustered edits were weakly correlated (R2 = 0.86 and 0.36
respectively), which suggests that singleton edits may be associated with the overall sequencing
error rate for the sample. Based on these observations I have chosen to focus on clustered edits
which have a lower calculated false positive rate and are more likely to have a direct effect on
RBP binding and / or RNA secondary structure.
Typically, an increase in the number of uniquely mapped read pairs led to an increase in the
number of detected variants (poly(A) R2 = 0.56, total RNA edits R2=0.26, and other changes
R2=0.02 and 0.04 respectively) (Figure 3.4). The total RNA samples had significantly more
clustered edits per unique pair than the poly(A) samples, 1.79 x 10^-4 +/- 7.14 x 10^-5 versus 5.40 x
10^-4 +/- 5.88 x 10^-4 average edits per unique pair (P = 8.83 x 10^-8; two-tailed Mann-Whitney U).
Figure 3.3. Comparison of variant call rates for clustered and singleton edits. (A) Boxplots showing the percent of potential A-to-I changes for clustered and singleton variant calls . P-values were calculated using a two-tailed Mann-Whitney U test (*P < 0.05, ***P < 0.001, ****P < 0.0001) . Scatter plots comparing the number of A-to-I and non-A-to-I changes for clustered (B) and singleton (C) calls. Note that the axis in B and C are scaled in thousands.
Figure 3.3 in-plot annotations: R2 = 0.83; R2 = 0.36.
Interestingly, the most edits within a sample are observed in the “N2 4” total RNA sample which
had 53,227 identified editing sites. The adr2(lf) samples had the lowest average edits per unique
pair at 1.41 x 10^-6 +/- 1.57 x 10^-6.
The clustered A-to-I changes from all of the adr2(wt) samples were merged yielding 197,890 A-
to-I edit sites within 10,941 clusters (Table 3.5.2 Supplementary File:
Wilson_Gavin_W_201606_PhD_worm_edits.txt). To investigate the contribution of repeat
alignments to potential A-to-I edit calls I used my filtering pipeline on alignments including
uniquely mapped pairs and primary repeats (i.e. the highest scoring alignment for each multi-
mapped pair), and all alignments regardless of their repeat status. This led to a substantial
increase in the number of A-to-I changes: 273,582 in 12,598 clusters for primary repeats and
Figure 3.4. Number of clustered A-to-I and non-A-to-I versus the number of uniquely mapped reads. The figures in the 2nd column are zoomed in versions of the figures in the first column. Sample types are indicated in the legend.
Figure 3.4 legends: Poly(A) R2 = 0.56, Total RNA R2 = 0.26, adr2(lf) Poly(A) R2 = 0.92; PolyA R2 = 0.20, Total RNA R2 = 0.03, adr2(lf) Poly(A) R2 = 0.68.
409,008 edits in 15,271 clusters for all repeats. Collectively, these data suggest the number of edits may be
underestimated due to ambiguous read mappings.
3.2.4 Comparison with other studies
To validate my reported edited sites I compared the calls identified in this
study for uniquely mapped reads, primary alignments and repeat alignments with the calls
reported by previous studies (Figure 3.5). I
found a 78-82% overlap with the Zhao et al. study and a 36-74% overlap with the sites
identified by Whipple et al. (68, 69). The lower overlap with the Whipple et al. dataset may be
due to differential handling of overlapping read pairs. My alignment pipeline discards paired-
alignments with an alignment length less than the read size, while Whipple et al. included them in
their study. The differences between my calls and the other studies may also be due to their different
alignment and edit calling protocols. For example, Zhao et al. call edits using a single
supporting read while I require at least three. Nonetheless, I identify ~155,000 additional editing
sites derived from uniquely mapped reads, substantially more than the aforementioned two
studies, due to the additional samples processed in this study and my attempts to maximize
sensitivity.
3.2.5 Clusters are enriched for non-coding elements
Early reports have suggested the majority of ADAR2-dependent RNA editing clusters occur in
non-coding sequences (53, 68, 69). These regions are defined as 3’-UTRs, introns, lncRNAs, and
transposable elements. My work has identified two issues that may confound correct editing calls
in C. elegans, largely stemming from incorrect genome annotation. First, I have determined by
RNA-seq analysis that the majority of C. elegans 3’-UTRs are consistently longer than reported.
Second, a number of the hyper-edited clusters are predicted to occur within unannotated introns.
Figure 3.5. A-to-I clustered edit call comparison with other studies. Overlap between the clustered edit calls in this study (red), Zhao et al. (green) and Whipple et al. (blue). The size of the circles and overlap are proportional to the number of edits. Edits are shown for (A) unique alignments, (B) primary repeat alignments, i.e. the highest scoring alignment for each pair, and (C) repeat alignments.
These introns can be within known genes or within intergenic regions. To address the UTR
length issue, I extended the annotated transcripts using RNA-seq coverage (see methods), while
the unannotated intron issue was managed by constructing a list of novel introns that were
supported by at least 5 unique read pairs in at least 5 samples. This resulted in the identification
of 10,568 novel splice junctions and the extension by at least 10 bp of 16,101 3’-UTRs, correcting
earlier annotations in WormBase (Table 3.5.3 Supplementary File:
Wilson_Gavin_W_201606_PhD_utr_extensions.xlsx). The clustered edit calls were then
compared to the novel 3’-UTR positions, introns and gene annotations based on the inferred
strand of the edit to identify the type of base modified.
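The novel intron filter described above can be sketched as follows. This is a minimal illustration and not the thesis implementation: the function name, the record layout and the toy junctions are hypothetical, and it assumes the criterion means that at least 5 samples must each contain at least 5 unique read pairs spanning the junction.

```python
from collections import defaultdict

MIN_PAIRS = 5    # unique read pairs required within a single sample
MIN_SAMPLES = 5  # number of samples that must reach MIN_PAIRS


def supported_introns(junction_counts, min_pairs=MIN_PAIRS, min_samples=MIN_SAMPLES):
    """Return junctions supported by >= min_pairs pairs in >= min_samples samples.

    junction_counts: iterable of (sample, junction, unique_pair_count) records,
    where junction is a hashable key such as (chromosome, start, end).
    """
    samples_per_junction = defaultdict(set)
    for sample, junction, pairs in junction_counts:
        if pairs >= min_pairs:
            samples_per_junction[junction].add(sample)
    return {j for j, s in samples_per_junction.items() if len(s) >= min_samples}


# Toy records (hypothetical coordinates).
records = []
records += [(f"s{i}", ("I", 1000, 1500), 5) for i in range(5)]   # passes both thresholds
records += [(f"s{i}", ("II", 200, 900), 10) for i in range(4)]   # too few samples
records += [(f"s{i}", ("X", 50, 400), 4) for i in range(6)]      # too few pairs per sample

kept = supported_introns(records)
print(kept)
```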
To explore the genetic elements and repeat types associated with clustered edits, I further
stratified the types of elements edited by their overlap with inverted repeats and transposons. It
should be noted that exonic overlap indicates overlap with coding exons while ncExonic
indicates overlap with non-coding exons. Next, I calculated the rate of clustered editing for each
type using a novel strategy to estimate rates of editing for each base / repeat type and also
expressed the data as a percentage (see methods) (Figure 3.6). Higher rates of edits (1-3 orders
of magnitude greater) were generally observed for elements associated with inverted repeats and
transposons. The majority of edits observed (~51.65%) were associated with intronic sequences,
followed by intergenic sequences (~32.68%), 3’-UTRs (6.11%), and cistronic sequences (2.84%).
Finally, 4.86% of clustered edits were within annotated exons and may be associated with non-
coding isoforms or transcripts misclassified as coding.
3.2.6 Clustered A-to-I edit replication and properties
The majority of editing sites (64.7%) did not recur in another sample and only ~2.5% recurred in
10 or more samples (Table 3.2). Intronic, cistronic, and 3’-UTR derived edits had the highest
rates of recurrence, while exonic and mixed derived edits had the lowest. The low rate of
recurrence may be due to the low frequency of edits and the low abundance of their RNA
molecules. This is further confounded by the wide variety of samples used in this study and the
heterogeneous nature of whole worm samples. In general, edit sites with a high edit frequency in
at least one sample recurred more often than sites with a low edit frequency (Figure 3.7A,C).
A similar result was observed for the read coverage across a given edit: when the edit occurred
with high coverage in at least one sample it recurred more often (Figure 3.7B,C).
Figure 3.6. A-to-I editing association with genetic elements. (A) Box plots of the rate of editing within the specified base and colored repeat element type (see Methods for rate calculation). IR = Inverted Repeat, Tpn = Transposon. Means are indicated with stars and medians are indicated with horizontal lines. Samples with 0 observations for the base / repeat type combination are not included. (B) The percent of the total number of samples included for each rate type. (C) The percent of the total number of edits represented by each combination.
Table 3.2. A-to-I clustered edit recurrence rates
Genetic Element   1 Sample   2-4 Samples   5-9 Samples   10-19 Samples   20+ Samples
All               59.05%     29.30%        7.68%         2.62%           1.35%
Exonic            86.37%     11.31%        1.56%         0.38%           0.38%
ncExon            62.72%     24.97%        6.86%         2.43%           3.02%
5'-UTR            68.72%     20.30%        5.77%         3.35%           1.86%
3'-UTR            54.57%     23.76%        9.14%         6.95%           5.58%
Intronic          51.18%     35.28%        9.73%         2.88%           0.92%
Cistronic         39.24%     27.11%        12.50%        9.72%           11.42%
Mixed             86.66%     11.05%        1.87%         0.29%           0.29%
Intergenic        69.50%     24.15%        4.77%         1.14%           0.44%
To determine the extent of hyper-editing within clusters I counted the number of edits per read
pair for each sample. The median number of edits per read pair was ~5 for each cluster
type (Figure 3.8A,B). The most edits observed within a read pair was 29 and this was only
observed 6 times in 4 different clusters. For example, there were three pairs with 29 edits within
the 3’-UTR of gmn-1; this region has a 686 bp cluster with 165 unique edit sites. The majority of
clusters had between 5 and 64 edits per cluster (Figure 3.8B,C).
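Counting edits per read pair, as described above, amounts to grouping edit observations by read pair and summarising the group sizes. The sketch below is illustrative only; the identifiers, the input layout and the toy observations are hypothetical rather than taken from the thesis software.

```python
from collections import defaultdict
from statistics import median


def edits_per_read_pair(observations):
    """Count distinct edited sites per read pair.

    observations: iterable of (read_pair_id, edit_site) tuples.
    """
    sites_by_pair = defaultdict(set)
    for pair_id, site in observations:
        sites_by_pair[pair_id].add(site)
    return {pair_id: len(sites) for pair_id, sites in sites_by_pair.items()}


# Toy observations (hypothetical): three read pairs with 5, 3 and 7 edits.
observations = (
    [("pair1", pos) for pos in range(100, 105)]
    + [("pair2", pos) for pos in range(200, 203)]
    + [("pair3", pos) for pos in range(300, 307)]
)

per_pair = edits_per_read_pair(observations)
median_edits = median(per_pair.values())
print(per_pair, median_edits)
```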
Figure 3.7. A-to-I hyper-edit recurrence stratified by the overlapping genetic element and recurrence rate. (A) The maximum edit frequency observed for the samples supporting the edit. (B) The maximum log2 coverage. (C) The number of edits in each group for the previous plots. Recurrence rates are indicated in the legend.
3.2.7 A global map of A-to-I editing
Figure 3.8. Properties of clusters by the dominant base and repeat type of the edits contained in the cluster. IR = Inverted Repeat, Tpn = Transposon. (A) The average number of edits per read pair summed across all samples for each cluster. (B) The percent of clusters classified to each base and repeat combination. (C) The log2 (total number of edits per cluster) for each base and repeat combination. The total number of clusters for each base type is indicated at the bottom.
Previous studies have suggested that clustered edits are associated with heterochromatin due to
their localization to the arms of the autosomal chromosomes (68). To more rigorously test this
association I created chromosomal maps showing their localization in relation to repetitive
elements, heterochromatin marks including H3K9 methylation and HPL-2 binding, gene density
and intron lengths (Figure 3.9, 3.10A, 3.11) (214, 215). In general, the majority of intronic and
intergenic edits were localized to the arms of autosomal chromosomes, which are correlated with
inverted repeats, transposons, long introns, and heterochromatin marks and anticorrelated with
gene density. The opposite was observed for 3’-UTR clusters, which were more uniformly
distributed across the chromosomes and are enriched for gene density and depleted for long
introns, inverted repeats and transposons. The correlations between clustered edits and other
genomic features were lower than expected; for example, the correlation between intronic edit
density and inverted repeat density was only 0.40. This is due to edits being relatively sparse and
to the fact that not every predicted inverted repeat may be transcribed or properly folded into
dsRNA. To test if the localization of each clustered edit type was different from the uniform
distribution I used the Kolmogorov-Smirnov (K-S) test (Figure 3.10B). The p-values for all of the
tests were less than 10^-10, and the K-S test statistic was generally highest for intronic edits,
suggesting they have the strongest difference from the uniform distribution.
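To make the comparison against the uniform distribution concrete, the one-sample K-S statistic can be computed directly from sorted edit positions. The sketch below is a hedged illustration with hypothetical function names and toy coordinates; it computes only the test statistic, not the p-value, and is not the code used to produce Figure 3.10B.

```python
def ks_statistic_uniform(positions, chrom_length):
    """One-sample K-S statistic of positions against Uniform(0, chrom_length).

    The statistic is the largest vertical gap between the empirical CDF of
    the positions and the straight-line CDF of the uniform distribution.
    """
    xs = sorted(positions)
    n = len(xs)
    if n == 0:
        raise ValueError("at least one position is required")
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = x / chrom_length  # uniform CDF evaluated at x
        # Check the gap just after and just before the step at x.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


# Toy chromosome of length 1000 (hypothetical coordinates).
evenly_spread = [125, 375, 625, 875]   # close to uniform
on_one_arm = [10, 20, 30, 40]          # concentrated on one arm

d_even = ks_statistic_uniform(evenly_spread, 1000)
d_arm = ks_statistic_uniform(on_one_arm, 1000)
print(d_even, d_arm)
```

A larger statistic for the arm-restricted positions mirrors the interpretation in the text: the further the cumulative distribution of edits departs from the diagonal, the stronger the deviation from uniformity.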
Figure 3.9. Global A-to-I cluster localization. Chromosomal maps (one panel each for chromosomes I, II, III, IV, V and X) illustrating edit cluster density (black lines) and the cumulative distribution of edits (red line). The maximum number of clusters in a 2kb bucket is indicated to the right of the boxes. Z-scores are shown for repeat, gene and intron densities and for heterochromatin markers, with blue indicating positive scores and red negative scores. For more detail see the methods.
Figure 3.10. Chromosomal distribution of clustered edits. (A) The cumulative density of each edit across each C. elegans chromosome. For comparison, a uniform distribution is included. The type of the edit is colored as per the legend. The heatmaps below the chromosomal maps are the normalized z-scores for HPL-2 from Figure 3.9. (B) The Kolmogorov-Smirnov (K-S) test statistic for each edit type compared to the uniform distribution. The lower the K-S test statistic, the more likely it is that the two samples have the same distribution.
Despite generating the most comprehensive set of A-to-I RNA edits in worms I suspect that this
map is not complete. To test the completeness of the map I focused on introns
containing potential sources of dsRNA. I examined the proportion of introns with edits and
inverted repeat structures stratified by their size (Figure 3.12). I found that a substantial fraction
of introns with repeat structures (40-90%) do not have any called edits. This may be due to the
introns not being captured for sequencing, poor mappability, or that some inverted repeats may
not be targets for RNA editing. However, this still suggests that there may be some intron edits
that were missed in this analysis.
Figure 3.11. Global Pearson correlations between chromatin marks, A-to-I edits, and genetic features. Only buckets with at least one edit cluster were included in the correlations. For edit clusters the edits were binned into 5kbp bins and the z-scores were calculated based on the number of clusters. For chromatin marks and genomic features the normalized z-scores were calculated as per the methods. The color scale shows the Pearson correlation from -1 to +1; rows and columns are grouped into Clustered Edits, Genomic Features, and Chromatin Marks.
Figure 3.12. A-to-I editing events may have been missed within introns. The proportion of introns containing inverted repeats (IR Only), transposons (Tpn Only), both transposons and inverted repeats (Tpn and IR), transposons or inverted repeats (Tpn or IR), and A-to-I hyper-editing clusters (Edit Cluster). The percentage of all the introns at the specified size is indicated.
To further explore the comprehensiveness of my hyper-editing site database I examined the
saturation of the number of new sites identified with the addition of each wildtype adr-2 sample.
The results are visualized with a rarefaction curve (Figure 3.13A). I observed that the number of
sites continued to increase but the number of new sites identified with each additional sample
declined. I speculated that this is in part due to samples with reduced sequencing depth and
found that to be the case (Figure 3.13B). The ten samples with the highest number of uniquely
mapped pairs represented 43.55% of uniquely mapped reads from all of the samples and 59.06%
of the clustered A-to-I edits. Nevertheless, new sites continue to be identified with
additional samples. Collectively, the intronic inverted repeats, multi-mapped reads, and the
saturation analysis suggest that despite the large number of sites identified in this study the map
of A-to-I edits in C. elegans is not complete.
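The saturation analysis can be illustrated with a small rarefaction sketch: samples are added one at a time and the cumulative number of distinct sites is recorded. The names and toy site sets below are hypothetical, and the sketch assumes the samples are already sorted in the order used for the curve (from highest to lowest number of mapped pairs, as in Figure 3.13).

```python
def rarefaction_curve(sites_per_sample):
    """Cumulative number of distinct sites as samples are added in order."""
    seen = set()
    curve = []
    for sites in sites_per_sample:
        seen.update(sites)
        curve.append(len(seen))
    return curve


# Toy site sets (hypothetical), ordered from deepest to shallowest sample.
samples = [
    {1, 2, 3, 4, 5, 6},
    {4, 5, 6, 7, 8},
    {1, 7, 8, 9},
    {2, 9},
]

curve = rarefaction_curve(samples)
# Number of previously unseen sites contributed by each sample.
new_sites = [curve[0]] + [b - a for a, b in zip(curve, curve[1:])]
print(curve, new_sites)
```

A curve that keeps rising while the per-sample gain shrinks corresponds to the behaviour described above: the catalogue is still growing, but each additional sample contributes fewer new sites.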
3.2.8 Intronic edits are depleted near splice-sites
In order to determine if intronic edits could disrupt splice sites, I profiled the positions of
intronic edits relative to the splice sites of the edited introns. I stratified the intronic edits and
examined the length of introns that contained clustered edits. Intron size was determined by
calculating the smallest possible intron using both annotated and novel splice junctions. I found
that the majority of intronic edit clusters occur within introns over 512 bp in length (Figure
3.14A). This is expected since larger introns may have more dsRNA substrates, including
transposon insertions, which are required for ADAR2 editing. Given the promiscuous nature of
ADAR2 editing, there is a distinct possibility that RNA editing events at conserved splice-site
residues could directly impact intron splicing. The most conserved regions include the 5’- and
3’-splice sites at the intron/exon boundaries and the branch point site and the polypyrimidine
tract near the 3’-end of the intron (25, 28). There is an example of this occurring in rats, where
ADAR2 autoregulates itself by editing one of its own 3’-splice sites (55). There are additional
regulatory motifs that can be found across the intron, such as intronic splicing silencers and
enhancers (25, 28).
Figure 3.13. Saturation analysis of A-to-I hyper-edits. Samples are sorted from highest to lowest numbers of repeat mapped pairs. (A) The percentage of unique A-to-I sites identified as the number of samples is increased for edits supported by uniquely mapped pairs, primary repeat pairs (the highest scoring alignment for an alignment with multiple mappings), and all of the mapped pairs. (B) The number of uniquely mapped pairs versus the percentage of the A-to-I edits identified.
To investigate the possibility of ADAR2 editing in worms affecting splice sites, I examined
where clustered edits and repeat elements occur in relation to intronic splice-site signals (Figure
3.14B,C,D). A depletion of edits and repeat elements near 5’- and 3’-splice sites was generally
observed, with less than 5% of the edit clusters falling within 50 bp of a splice site. Repeat
elements were also depleted, with a rate of less than 10% near splice sites. The depletion of edits
near splice sites suggests that ADAR2-dependent editing in C. elegans is selected against near
splice sites. However, this does not exclude the possibility that edits could affect the rate of
splicing by disrupting RNA secondary structures or splicing regulatory motifs present within the
center of an intron.
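The proximity measure used above can be sketched as the distance from an edit to the nearer end of its intron. The code below is an illustration under stated assumptions: introns are zero-based half-open intervals, strand is ignored (so 5’ and 3’ ends are not distinguished), "within 50 bp" is taken to mean a distance of at most 50, and all names and coordinates are hypothetical.

```python
def distance_to_nearest_splice_site(position, intron_start, intron_end):
    """Distance (bp) from a position to the closer end of the intron.

    The intron is the zero-based half-open interval [intron_start, intron_end).
    """
    if not intron_start <= position < intron_end:
        raise ValueError("position lies outside the intron")
    to_start = position - intron_start
    to_end = (intron_end - 1) - position
    return min(to_start, to_end)


def fraction_near_splice_site(positions, intron_start, intron_end, window=50):
    """Fraction of positions within `window` bp of either intron end."""
    near = sum(
        distance_to_nearest_splice_site(p, intron_start, intron_end) <= window
        for p in positions
    )
    return near / len(positions)


# Toy intron spanning [1000, 3000) with four edits (hypothetical coordinates).
edit_positions = [1010, 1500, 2000, 2990]
fraction = fraction_near_splice_site(edit_positions, 1000, 3000)
print(fraction)
```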
To determine if there are any recurrent edits that overlap splice-site signals, I constructed a high
confidence set of edits that overlapped splice-site signals. I searched my database of singleton
and clustered A-to-I edits for variants that overlapped splice sites and recurred in at least 5
samples (see Methods 3.4.8). I found 3 potential A-to-I changes that overlapped splice sites; 2 of
these were singleton edits and 1 was clustered (Table 3.3). Intriguingly, all of the edits were
identified in canonical GT-AG splice junctions where the adenosine in the 3’-splice acceptor is
targeted for editing. It is conceivable that edits within a 3’-splice site could abolish its activity.
Therefore, I checked to see if there were detectable counts for the splice junctions in the wildtype
and adr-2(lf) samples. For the edited splice site on chromosome II there were no detectable
counts for the splice junction in any of the samples, which suggests that it may be an annotation
error. For the splice site edit on chromosome X, the number of reads mapping across the splice
junction was greater than 10 in all of the samples, including the adr-2(lf) samples. Finally,
for the splice junction on chromosome V, there were high (>50) counts detected in four of the
flow sorted neuron samples from modENCODE. These samples do not have matched adr-2(lf)
samples, so there is a chance that the splice site may still be functional in these samples. It would
be of interest to further study the splicing of this intron with qRT-PCR.
Table 3.3. Recurrent A-to-I edits that overlap splice sites
Ref   Position (a)   Edit   Junction (a)                Gene             Poly(A) Samples   Poly(A) Stages      Total RNA Samples   Total RNA Called
II    12795452       T>C    12795450, 1279599, GTAG     cpt-1            6                 Embryo, L1, L4      8                   Embryo, L1, L2, L4
V     11081533       A>G    11072729, 11081535, GTAG    act-1            1                 Embryo              7                   Embryo, L1, L2
X     13155479       T>C    13155477, 13155580, GTAG    WBGene00008601   3                 Embryo, L1, Dauer   4                   L1
a Position is zero-based
b The splice site based on the strand of the transcript, the edited position is bold and underlined.
Figure 3.14. Properties of edit clusters and repeat elements within introns. (A) The length distribution for all introns including novel introns identified in this study and the sizes of introns containing at least one edit cluster. (B) The frequency of clustered edits and repeat elements across length-normalized introns. (C,D) The cumulative frequency for the distance (bp) of a clustered edit or repeat element to the nearest 5’- (C) or 3’- (D) splice site. For repeat elements that overlapped the splice site a distance of -1 was assigned.
3.2.9 Intergenic edits and antisense transcripts
Since the majority of the RNA-seq libraries processed in this study were not strand-specific I
was unable to calculate rates of antisense transcription. This leaves the possibility that some of
the observed edits are antisense to known genetic features. Using the base annotation scheme I
developed it is possible to quantify how many of the intergenic clustered edits observed are
antisense to another base type. I found that 51.67% of the intergenic edits (37,498 out of 72,574)
are antisense to an annotated base (Figure 3.15). Most commonly, these edits are antisense to
introns associated with inverted repeats and transposons (~20%). The remaining edits were
antisense to 3’-UTRs (6.39%), coding exons (9.77%), and cistrons (2.36%).
3.2.10 3’-UTR clusters and poly(A) sites
The 3’-UTRs of coding mRNAs are highly structured non-coding RNA sequences that have
previously been shown to be targets of ADAR-dependent RNA editing (70, 216). There is the
possibility that editing could affect alternative polyadenylation, where a proximal or distal
poly(A) site could be preferentially used for 3’-UTR cleavage due to A-to-I editing (113). For
example, A-to-I editing within the 3’-UTR of the human gene EAAT2 can directly impact the
poly(A) signals by activating a cryptic poly(A) site (217). Furthermore, edits that occur
downstream of poly(A) sites may not be included in the polyadenylated transcript because the
downstream sequence would be cleaved off prior to addition of adenosine nucleotides.
Figure 3.15. Properties of intergenic edits antisense to annotated genetic elements. The element type and total percent of the intergenic edits represented by that type are indicated at the bottom. The repeat element associations are indicated as per the legend.
In this study I have identified 13,346 clustered edits within 3’-UTRs, and 10,484 (78.03%)
overlapped a 3’-UTR with at least one poly(A) site. These edits fell within 567 3’-UTRs that had
at least one clustered edit and at least one poly(A) site. On average 20% of the edits within a
3’-UTR occurred upstream of the first poly(A) site, an additional 20% occurred between the
first and last poly(A) sites, and 60% occurred downstream of the last poly(A) site. To
profile the position of edits in more detail I examined the position of edits with respect to
poly(A) sites for 3’-UTRs with 1 to 5 poly(A) sites (Figure 3.16A). I observed that in each case
the majority of the edits occurred after the last poly(A) site. This trend continued for all of the
3’-UTRs with poly(A) sites and edits (Figure 3.16B,C). The length of UTR sequence after the last
poly(A) site tends to be longer than the length of 3’-UTR before the first poly(A) site and
between the first and last poly(A) sites. These data support the notion that the majority of 3’-UTR
edits may not be present in the polyadenylated transcript.
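The before / between / after breakdown above can be sketched as a simple classification of each edit relative to the first and last poly(A) site of its 3’-UTR. This is an illustrative sketch only: offsets are assumed to be measured from the 3’-UTR start along the transcript strand, an edit that coincides with a poly(A) site is counted as "between", and the function name and toy offsets are hypothetical.

```python
def classify_edits(edit_offsets, polya_offsets):
    """Count edits before the first, between, and after the last poly(A) site."""
    if not polya_offsets:
        raise ValueError("at least one poly(A) site is required")
    first, last = min(polya_offsets), max(polya_offsets)
    counts = {"before": 0, "between": 0, "after": 0}
    for offset in edit_offsets:
        if offset < first:
            counts["before"] += 1
        elif offset > last:
            counts["after"] += 1
        else:
            counts["between"] += 1
    return counts


# Toy 3'-UTR with two poly(A) sites and ten edits (hypothetical offsets).
polya_sites = [100, 300]
edits = [10, 50, 150, 250, 320, 400, 450, 500, 610, 700]

counts = classify_edits(edits, polya_sites)
percentages = {k: 100.0 * v / len(edits) for k, v in counts.items()}
print(counts, percentages)
```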
Finally, to profile whether edits could directly modify the poly(A) site hexamer I looked for edits
that overlapped annotated poly(A) sites. I found 128 poly(A) sites with at least one edit in the
hexamer sequence. Some of the poly(A) sites identified had more than one A-to-I edit: 24 of the
sites (18.8%) had two edits, 6 sites had 3 edits (~5%), and 2 sites had 4 edits (1.5%). To build a
high confidence set of edits that could affect poly(A) signals, I searched my database of singleton
and clustered A-to-I edits for variants that recurred in at least 5 samples (see Methods 3.4.8).
After selecting for highly replicated edits I retained 3 out of the 128 clustered edits and found 0
singleton edits (Table 3.4). Interestingly, these edits tended to recur more frequently in the total
RNA samples, which suggests that the edits may ablate the poly(A) signal, preventing detection
in poly(A)-selected samples. It is possible that editing within these poly(A) sites may affect
polyadenylation and 3’-UTR cleavage.
Table 3.4. Recurrent A-to-I edits that overlap annotated poly(A) signals
Ref   Position (a)   Edit   Poly(A) signal   Gene             Poly(A) Samples   Poly(A) Stages   Total RNA Samples   Total RNA Called
I     11953624       A>G    11953626         WBGene00011060   1                 Embryo           5                   Embryo, L1, L2
III   3317763        A>G    3317760          ral-1            0                 -                7                   Embryo, L1, L2
IV    13380403       A>G    13380404         mau-8            3                 L1, Adult        2                   L1
a Position is zero-based
3.2.11 Identifying putative A-to-I dependent amino acid changes
In humans, mice and flies, RNA editing dependent recoding of amino acids has been identified
as a developmental modulator of receptor activity (218). Currently, it is not known if amino acid
recoding occurs in C. elegans. To explore if this type of editing occurs, I designed a stringent
filtering scheme to identify recurrent edits and applied it to both singleton and clustered edit calls
(See Methods 3.4.8). I identified 76 A-to-I edits that lead to a nonsynonymous amino acid
substitution. Of these, 15 are clustered edits and 61 are singleton edits (Table 3.5.2). To
determine if these amino acid changes affect the translated protein in a functional way I looked
for changes within predicted PFAM domains. The majority of these edits (49 / 76) did not
overlap a predicted PFAM domain and the most frequently edited domain type was collagen with
three different edits. The most common recoding events are: serine to proline (12), leucine to
proline (9), and aspartic acid to glycine (6). The two most common changes both lead to
incorporation of the conformationally rigid proline amino acid (21 / 76 edits) into the peptide
chain, which could affect the functional properties of the protein. However, 15 of these do not
overlap predicted protein domains and the remaining 6 mapped to proteins that did not affect the
known phenotypes associated with adr-2 knockouts. Further investigation into the effects of
these amino acid recoding events on the encoded proteins’ enzymatic function would be
worthwhile if these edits validate using a targeted DNA and RNA sequencing approach.
Figure 3.16. Localization of 3’-UTR edits with respect to poly(A) sites. UTRs were only retained if they contained at least one clustered edit and at least one annotated poly(A) site. (A) The number of edits before the first poly(A) site and after subsequent poly(A) sites stratified by the number of poly(A) sites present in the 3’-UTR. (B) Boxplots of the distance between the start of the 3’-UTR and the first poly(A) site (Before), the distance between the first and last poly(A) sites (Between), and the distance between the last poly(A) site and the end of the 3’-UTR (After). (C) Volcano plots of the distribution of the percent of the edits within each UTR before the first poly(A) site (before), between the first and last poly(A) sites (between), and after the last poly(A) site (after). The median is indicated with a bar and the mean is indicated with an asterisk.
3.3 Discussion
ADAR-dependent hyper-editing has been previously profiled in C. elegans on a global scale
using RNA-seq. However, these studies were limited in scope, using a small number of samples
and inconsistent alignment and edit calling algorithms. To more robustly characterize
hyper-editing in C. elegans, I have constructed a consistent and accurate pipeline using my
RNA-seq realignment program RNASequel and a novel clustering algorithm. Using this algorithm
I have analyzed 91 C. elegans RNA-seq samples and identified 197,890 A-to-I edit sites within 10,941
hyper-edited clusters, generating the most comprehensive database of ADAR2-dependent RNA
editing in worms to date. My results validated previous reports that the majority of clustered
edits are associated with non-coding sequences and sources of structured RNAs including long
introns, intergenic regions and 3’-UTRs. These regions are commonly associated with inverted
repeat and transposable elements.
One of the prevailing challenges with determining the role of A-to-I editing in C. elegans is
determining which edits have a functional role. There is the possibility that a proportion of the
edits are a consequence of the transcription of structured RNAs such as transposable elements
and inverted repeats and do not have a functional role. The observation that ADAR knockouts in
C. elegans do not affect transposon silencing provides further evidence that this may be the case
(219, 220). There remains the possibility that ADARs do play a role in transposon silencing within
somatic tissue, where transposable elements are active (221). The majority of the edits I observed
are associated with transposons and inverted repeats (74.62%). These edits are most common
within introns (51.65%) and intergenic regions (32.68%), which are locations that are likely to be
phenotypically neutral for transposon insertions.
The edits I observed within introns could affect splicing efficiency; however, I observed a
depletion of somatic transposon insertions and edits near the exon-intron boundaries where the
conserved splice-site signals occur. Moreover, I identified only three recurrent edits that may
affect a splice site. This does not rule out the possibility that editing could affect intron splicing
in some cases, for example, through the biogenesis of circular RNAs or the efficiency of intron
splicing due to the presence of dsRNA structures. Circular RNAs occur when a pre-mRNA
transcript is spliced back onto itself (71, 222). Studies have shown that introns associated with
circular transcript biogenesis are enriched for RNA structures, complementary sequences, and RNA
edits. Furthermore, these studies have shown that ADAR1 in humans is essential for circular
RNA biogenesis (71). Transposon mediated dsRNA structures within introns have been
demonstrated to promote the transition of neighboring exons from constitutively spliced to
alternatively spliced in humans (223, 224). There are examples of ADAR-dependent edits within
these structures in humans but their effect on splicing has not been explored (223). Finally, it is
unknown whether intronic dsRNA structures and edits have an effect on C. elegans intron
splicing. In the future, it would be worthwhile to perform deep RNA-seq on triplicates derived
from similar C. elegans stages with all single and double knockouts of adr-1 and adr-2. These
could be used to test if ADAR-dependent RNA editing has an effect on the splicing of introns
targeted for editing. This could provide insight into whether ADAR-dependent RNA editing
within intronic dsRNA structures plays a role in transcript splicing or is a side effect of the
dsRNA structures present in the intron without an effect on splicing. Furthermore, an RNase that
digests linear and unstructured RNAs could be used to enrich samples for circular RNAs and
probe whether ADAR proteins have an effect on their biogenesis (225).
I observed that ~6.11% of the clustered edits were within 3’-UTRs. Edits within 3’-UTRs could
affect alternative polyadenylation, translation, RNA localization, or turnover (70). ADAR-
dependent editing has been shown to be co-transcriptional and I have observed that 3’-UTR
editing can occur proximal to, between, and distal to polyadenylation sites (17). Therefore, it is
possible that editing could play a role in the selection of the polyadenylation sites either by
relaxing secondary structures or altering RNA-protein binding prior to polyadenylation. An
altered 3’-UTR length could alter the regulatory signals present in the UTR and lead to
additional or reduced miRNA or RNA binding protein sites (216, 226). There is an example of
this occurring in the EAAT2 pre-mRNA where A-to-I editing activates a cryptic polyadenylation
site (217). To resolve the role of A-to-I editing on alternative polyadenylation, I would perform
TAIL-seq on wildtype, adr-1, adr-2, and double knockouts to quantitatively compare 3’-UTR
use and possibly identify which edits are associated with differential polyadenylation sites (227).
TAIL-seq would also permit me to identify cryptic poly(A) sites that may be activated by RNA
editing.
I searched for potential clustered and singleton edits that could cause amino acid changes and
identified 76 potential events. This is unexpected since these events have not been previously
identified in C. elegans EST sequencing, which is the method that led to the identification of the
AMPA receptor editing in humans and flies (228-230). I suspect it would be worthwhile to
validate if these changes can be identified in C. elegans EST sequencing data. Furthermore,
targeted sequencing of the edited transcript and genome sequence would be essential to
determine if these are in fact real editing events. MS/MS sequencing of the peptides could then
confirm that the amino acid change is incorporated into the final protein product. Finally,
functional screens should be performed on the most compelling targets. If any of these edits are
validated it would provide a much clearer view of when A-to-I editing-induced amino acid
recoding evolved.
On a global scale, edits within intronic and intergenic regions appear to be enriched for
heterochromatin marks including H3K9 methylation and HPL-2 binding. Heterochromatin is
most prevalent on the arms of the autosomal chromosomes. Whether RNA editing participates in
heterochromatin deposition or is a consequence of a higher transposon and inverted repeat
density on the autosomal arms is unknown. Previous studies have implicated small RNAs in
heterochromatin deposition and adr knockouts have a marked dysregulation of small RNA
processing (53, 54, 231, 232). There is evidence of ADAR-dependent RNA editing regulating
transposon mediated heterochromatic gene silencing in D. melanogaster through the editing of a
long structured RNA derived from the Hoppel transposon (56). This gene was verified as an ADAR
target and deletion of this transposon altered heterochromatic gene silencing. It would be
interesting to perform ChIP-seq for HPL-2 and H3K9 marks in wildtype and adr knockout
backgrounds to look for alterations in heterochromatin deposition.
Recent evidence in mice has demonstrated that RNA editing is important for the discrimination
of endogenous and exogenous sources of dsRNA (67). In this model, dsRNA derived from
exogenous sources is sensed by MDA5, leading to activation of the cytosolic dsRNA-sensing
pathways and the interferon response (67, 233). Conversely, endogenous sources of RNA that
harbor A-to-I edits are not sensed by MDA5. Furthermore, mice with embryonic lethal
knockouts of ADAR1 were rescued by the inactivation of MDA5. It is possible that RNA editing
in C. elegans has a similar role through the role of ADARs in suppressing the processing of
A-to-I edited transcripts by the RNAi pathway. The cytosolic RNAi machinery in C. elegans is
essential for anti-viral defense (53, 234).
This comprehensive and rigorous detection and analysis of A-to-I editing sites will be useful for
further studies investigating the functional role of ADAR proteins in C. elegans. In the future, I
hope to prepare and submit this work as an original publication to make the list of A-to-I edit
sites and clusters available to the scientific community.
3.4 Methods
3.4.1 C. elegans gene annotations and reference sequences.
The C. elegans reference, gene annotations, and cistron (operon) annotations were downloaded
from WormBase (235) release WS245. The reference sequences were supplemented with the E.
coli OP50 genome sequence to remove potential bacterial contamination due to the growth
medium.
3.4.2 Samples.
The RNA-seq samples processed in this study are listed in Table 3.5.1.
3.4.3 RNA-seq preprocessing and alignment.
Spliced leader sequences (SL1 and SL2) were downloaded from WormBase WS245 and
trimmed from the ends of read pairs using an in-house program. For the RNA-seq datasets
downloaded from Whipple et al. we also trimmed Illumina universal sequencing adapters. We
used STAR (189) version 2.3e to identify novel splice junctions with the default
parameters, supplying a GTF file through the “--sjdbGTFfile” parameter. The output from STAR
was piped into the RNASequel transcriptome generation tool using the following command:
STAR --genomeDir star-index --readFilesIn r1.fastq.gz r2.fastq.gz --readFilesCommand zcat \
  --outSAMstrandField intronMotif --genomeLoad NoSharedMemory --outFileNamePrefix star --outStd SAM \
  | samtools view -bSu - \
  | rnasequel transcriptome -g genes.gtf -n ${read_size} -r genome.fa -b - -o tx \
      --skip-ambiguous --max-intron 32000 --skip MtDNA,OP50
Splice junctions identified in the MtDNA and E. coli OP50 genome were discarded, the
maximum intron size was limited to 32,000 bp, and non-canonical splice junctions were
discarded.
BWA mem indexes were then generated for each sample using the following commands:
bwa index tx.fa
bwa index genome.fa
Finally, RNASequel was used to remap the reads to the splice-junction index and reference
index:
rnasequel merge --filter rRNA.fa -r genome.fa -g genes.gtf -f tx.txt -o align r1.fastq.gz r2.fastq.gz bwa-index/genome.fa ./tx.fa
3.4.4 Whole Genome Alignment and Variant Calling.
The read pairs were aligned to the C. elegans genome using BWA mem:
bwa mem -M -a -t 8 -B 2 genome.fa r1.fastq.gz r2.fastq.gz | samtools view -bS - > pairs.bam
The alignments were then sorted using samtools. A minimum base quality of 5 was required for a
base to be counted at a position. Positions with at least 10x coverage, an alternative allele frequency >
0.10 and an average base quality of at least 25 for bases supporting the alternative allele were retained.
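The position filter described above can be sketched as follows. This is a minimal illustration, not the in-house program itself: the `PileupPosition` record and its field names are hypothetical, and the sketch assumes that bases below the minimum base quality of 5 have already been excluded from the counts.

```python
from dataclasses import dataclass


@dataclass
class PileupPosition:
    """Hypothetical per-position summary of a genomic pileup."""
    coverage: int            # bases with quality >= 5 covering the position
    alt_count: int           # bases supporting the alternative allele
    alt_mean_quality: float  # mean base quality of the alt-supporting bases


def passes_genomic_filter(pos, min_cov=10, min_alt_freq=0.10, min_alt_qual=25.0):
    """Return True if the position meets the thresholds stated in the text."""
    if pos.coverage < min_cov:
        return False
    alt_freq = pos.alt_count / pos.coverage
    # The text specifies a strict '>' for allele frequency and
    # 'at least' for the mean base quality.
    return alt_freq > min_alt_freq and pos.alt_mean_quality >= min_alt_qual
```

For example, `passes_genomic_filter(PileupPosition(20, 5, 30.0))` is retained, while a position with exactly 10% alternative allele frequency is not.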
3.4.5 Identifying potential A-to-I editing events.
The RNA-seq alignments were piled up using an in-house program. Alignments were
retained if they had no more than two ambiguous mapping locations. Positions with a single alternate allele,
an allele frequency greater than 0.05, a minimum base quality of 5 and at least 2x coverage were
retained. The retained positions were then searched for potential edits using the following
criteria to discard low quality calls: 1) at least one uniquely mapped pair supporting the change
(to eliminate alignment artifacts due to singleton read alignments), 2) an average base quality of
at least 25 for alignments supporting the alternative allele, 3) no overlap with tandem repeats
identified by trf (208) or with low complexity and simple regions according to RepeatMasker,
4) at least one of the reads supporting the alternative base placing it outside of the first and last 5 base
pairs of the read, and 5) at least 10x coverage and an alternative allele frequency of less than
10% at the same position in the genome sequencing data. Variants of the same type were
clustered into regions connected by at least 1x coverage or by a read pair spanning the region. Clusters were
retained if they had at least 5 variants, an average distance between the variants of less than 150 bp
and at least 3 different uniquely mapped read pairs supporting the clustered variants. For
clusters with an average distance between variants of 150 bp or more, the variant with the
longest distance to its neighbor at either end of the cluster was trimmed off until the average distance was less than
150 bp (retained) or the number of edits fell below 5 (discarded). After all of the samples were
individually processed, overlapping clusters from the adr-2(lf) and adr-2(wt) samples were merged
separately.
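One reading of the cluster trimming step can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: it assumes that trimming applies when the mean distance between adjacent variants is 150 bp or more, that the "distance" of an end variant is the gap to its nearest neighbor, and that ties are broken by dropping the leftmost variant. The function name `trim_cluster` is hypothetical.

```python
def trim_cluster(positions, max_mean_gap=150, min_variants=5):
    """Trim end variants until the mean gap is below max_mean_gap.

    Returns the retained positions, or None if the cluster falls
    below min_variants and is discarded.
    """
    pts = sorted(positions)
    while len(pts) >= min_variants:
        # Distances between adjacent variants in the cluster.
        gaps = [b - a for a, b in zip(pts, pts[1:])]
        if sum(gaps) / len(gaps) < max_mean_gap:
            return pts  # dense enough: retain the cluster
        # Drop whichever end variant lies farther from its neighbor
        # (assumption: ties drop the leftmost variant).
        if gaps[0] >= gaps[-1]:
            pts = pts[1:]
        else:
            pts = pts[:-1]
    return None  # fewer than min_variants remain: discard
```

For example, `trim_cluster([0, 10, 20, 30, 40, 1000])` drops the isolated variant at 1000 and retains the remaining five, whereas five variants spaced 500 bp apart are discarded.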
3.4.6 Annotating edits and clusters.
Clusters and edits were annotated for their inverted repeat, transposon, gene and cistron (operon)
overlap. We included a merged set of novel splice junctions from all of the samples to annotate
edit clusters within novel introns. We extended the 3’-UTR of all transcripts that did not overlap
an operon until coverage reached 0 or the extension overlapped an annotated exon. The base
types of an edit position were inferred using the strand of the edit and gene annotations. If an edit
position overlapped more than one feature, the feature was marked as ambiguous. For base-type
and repeat-type rates, A’s on the plus strand, T’s on the minus strand, and both A’s and T’s at
ambiguous positions were counted.
3.4.7 Chromosomal maps
The number of edited clusters of each base type was counted in 2kb bins and the cumulative
frequency of clustered edits with the same base type was also calculated.
Inverted repeat and transposon density were calculated by collapsing their respective annotations
and calculating the proportion of bases overlapped by each type per 10kb window. Introns
were identified by merging genome annotations and novel junctions that had at least 5 unique
pairs mapping across them in 10% of the samples and the shortest intron was then identified by
collapsing the annotated and novel splice junctions. Gene density was calculated using all of the
annotated protein coding genes and binned into 10kb buckets. The bucket values were Box-Cox
transformed and then z-score normalized.
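The binning and normalization of gene density can be illustrated with a short sketch. Several details here are assumptions that the text does not specify: genes are represented by a single zero-based coordinate, the Box-Cox parameter `lam` is supplied by the caller (in practice it is often estimated by maximum likelihood), a shift of 1 is added so that empty buckets remain positive, and the population standard deviation is used for the z-scores.

```python
import math
from statistics import mean, pstdev


def bin_counts(positions, chrom_length, bin_size=10_000):
    """Count zero-based positions falling in each fixed-size bucket."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for p in positions:
        counts[p // bin_size] += 1
    return counts


def box_cox(values, lam, shift=1.0):
    """Box-Cox transform; 'shift' keeps zero counts positive (assumption)."""
    out = []
    for v in values:
        x = v + shift
        out.append(math.log(x) if lam == 0 else (x ** lam - 1) / lam)
    return out


def z_scores(values):
    """Normalize to zero mean and unit (population) standard deviation."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]
```

A typical use would chain the three steps, for example `z_scores(box_cox(bin_counts(gene_starts, chrom_length), lam=0.5))`, where `gene_starts`, `chrom_length` and the value of `lam` are placeholders.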
Normalized ChIP-chip data was downloaded from GEO with the accessions GSE58764 for HPL-
2 (214) and GSE26186 for embryonic H3K9 (215). The probe positions were converted to
WS245 using the WormBase remap script. ChIP-chip data was normalized to z-scores and the
median z-score was calculated for each 2kb window.
3.4.8 Detection of recurrent A-to-I editing events within splice sites, polyadenylation signals, and coding regions
Clustered and singleton A-to-I edits that overlapped either coding regions, splice sites, or
polyadenylation signals were retained for further analysis if they matched the following criteria:
1) at least 10X coverage and no overlap with transposable or inverted repeat elements, 2) at least
10X coverage in five of the adr-2(lf) samples with no evidence of the same variant, and 3)
replicated in at least five wildtype adr-2 samples.
To select for edits that may regulate splice site selection, I searched for edits that overlapped the
first and last two base-pairs of annotated and de novo discovered introns on the correct strand
(see Methods 3.4.6 for de novo intron discovery).
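The splice site overlap test can be sketched as a simple coordinate check. The function below is illustrative only: the name `overlaps_splice_site` is hypothetical, and it assumes zero-based, half-open intron coordinates `[intron_start, intron_end)` and strands encoded as "+" or "-".

```python
def overlaps_splice_site(edit_pos, edit_strand,
                         intron_start, intron_end, intron_strand,
                         n_bases=2):
    """True if the edit lies in the first or last n_bases of the intron
    and is on the same strand as the intron."""
    if edit_strand != intron_strand:
        return False
    in_first = intron_start <= edit_pos < intron_start + n_bases
    in_last = intron_end - n_bases <= edit_pos < intron_end
    return in_first or in_last
```

With an intron spanning `[100, 200)` on the plus strand, edits at positions 100, 101, 198 and 199 overlap a splice site, while positions 102 and 200 do not.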
For edits that overlapped polyadenylation signals I required the edit to overlap the six base pair
polyadenylation signal annotated in WormBase WS245.
To further filter edits within coding regions, I annotated putative amino acid changes with Annovar
version 2015dec14 and only nonsynonymous changes were retained (236). I required the edit to
be replicated in at least one poly(A)-enriched sample. Finally, to annotate the domains where
potential amino acid changes occur, I downloaded the protein peptide sequences from
WormBase WS245 and predicted their domains using PFAM build 29.0 (December 2015) (237).
3.5 Appendix
Table 3.5.1. C. elegans Samples Processed in this Study
Sample Full Name Study Stage Library
BB2-adult1 adr-1(gpv6) SRX335728 Adult Poly(A)
BB4-adult1 adr-1(gv6); adr-2(gv42) SRX335736 Adult Poly(A)
BB3-adult1 adr-2(gv42) SRX335732 Adult Poly(A)
N2e-YA_RZ-1 N2 1 SRS269392 Adult Total RNA
N2-adult1 N2 2 SRX335723 Adult Poly(A)
N2-adult8 N2 3 SRX335724 Adult Poly(A)
N2e-Ad_gonad-1-RZLI N2 Gonad SRS344182 Adult Total RNA
IP-N2-J2 N2; J2 SRR1581229 Adult Total RNA
IP-dcr-1-J2 IP dcr-1(XX); J2 SRR1581228 Adult Total RNA
N2e-DEntryDAF2-1-1 daf-2(e1370) Entry 1 SRS269109 Dauer Poly(A)
N2e-DEntryDAF2-4-1 daf-2(e1370) Entry 2 SRS269110 Dauer Poly(A)
N2e-DauerDAF2-2-1 daf-2(e1370) Entry 3 SRS269389 Dauer Poly(A)
N2e-DExitDAF2-3-1 daf-2(e1370) Exit 1 SRS269108 Dauer Poly(A)
N2e-DExitDAF2-6-1 daf-2(e1370) Exit 2 SRS269111 Dauer Poly(A)
N2e-DauerDAF2-5-1 daf-2(e1370) Exit 3 SRS269391 Dauer Poly(A)
IP-dcr-1-embryo-dcr IP dcr-1(XX); dcr SRR1581224 Embryo Total RNA
IP-rde-4-embryo-rde IP rde-4(XX); rde SRR1581225 Embryo Total RNA
N2e-4cell_EE_RZ-56 4 Cell SRS311761 Embryo Total RNA
BB2-embryo adr-1(gpv6) SRX335725 Embryo Poly(A)
BB4-embryo adr-1(gv6); adr-2(gv42) SRX335733 Embryo Poly(A)
BB3-embryo adr-2(gv42) SRX335729 Embryo Poly(A)
N2e-E2-E8_sorted E cells E2-E8 SRS311762 Embryo Total RNA
N2e-DMM260_N2eref_EE Early Embryo Reference SRS344159 Embryo Total RNA
N2e-EE_DSN-51 Early embryos SRS311763 Embryo Total RNA
N2e-EE_RZ-54 Early embryos SRS311907 Embryo Total RNA
IP-N2-embryo-dcr IP N2; dcr SRR1581226 Embryo Total RNA
IP-N2-embryo-rde IP N2; rde SRR1581227 Embryo Total RNA
N2-embryo N2 SRX335693 Embryo Poly(A)
N2e-EE_50-0 N2 50-0 SRS258165 Embryo Poly(A)
N2e-EE_50-150 N2 50-150 SRS266274 Embryo Poly(A)
N2e-EE_50-180 N2 50-180 SRS266269 Embryo Poly(A)
N2e-EE_50-210 N2 50-210 SRS266275 Embryo Poly(A)
N2e-EE_50-240 N2 50-240 SRS266270 Embryo Poly(A)
N2e-EE_50-270 N2 50-270 SRS266276 Embryo Poly(A)
N2e-EE_50-30 N2 50-30 SRS258085 Embryo Poly(A)
N2e-EE_50-300 N2 50-300 SRS266880 Embryo Poly(A)
N2e-EE_50-330 N2 50-330 SRS266265 Embryo Poly(A)
N2e-EE_50-360 N2 50-360 SRS266266 Embryo Poly(A)
N2e-EE_50-390 N2 50-390 SRS266267 Embryo Poly(A)
N2e-EE_50-420 N2 50-420 SRS266268 Embryo Poly(A)
N2e-EE_50-450 N2 50-450 SRS266277 Embryo Poly(A)
N2e-EE_50-480 N2 50-480 SRS266271 Embryo Poly(A)
N2e-EE_50-510 N2 50-510 SRS266278 Embryo Poly(A)
N2e-EE_50-540 N2 50-540 SRS266272 Embryo Poly(A)
N2e-EE_50-570 N2 50-570 SRS266261 Embryo Poly(A)
N2e-EE_50-600 N2 50-600 SRS266262 Embryo Poly(A)
N2e-EE_50-630 N2 50-630 SRS266263 Embryo Poly(A)
N2e-EE_50-660 N2 50-660 SRS266264 Embryo Poly(A)
N2e-EE_50-690 N2 50-690 SRS266273 Embryo Poly(A)
N2e-EE_50-720 N2 50-720 SRS266279 Embryo Poly(A)
N2e-EE_50-90 N2 50-90 SRS258166 Embryo Poly(A)
N2e-EE_50-120 N2 50-120 SRS242382 Embryo Poly(A)
N2e-EE_50-60 N2 50-60 SRS242229 Embryo Poly(A)
N2e-DMM239_Z1Z4_Em Z1/Z4 SRS344160 Embryo Total RNA
BB2-L1 adr-1(gpv6) SRX335726 L1 Poly(A)
BB4-L1 adr-1(gv6); adr-2(gv42) SRX335734 L1 Poly(A)
BB3-L1 adr-2(gv42) SRX335730 L1 Poly(A)
N2e-DMM383_all-nrn_L1 All Neuron 1 SRS344161 L1 Total RNA
N2e-DMM381_all-nrn_L1 All Neuron 2 SRS344162 L1 Total RNA
N2e-DMM383-all-nrn_L1-V All Neurons 3 SRS311746 L1 Total RNA
N2e-DMM391-all-nrn_L1-V All Neurons 4 SRS311749 L1 Total RNA
N2e-DMM387-NSML_NSMR-nrn_L1-V NSML and NSMR Neurons 1 SRS311748 L1 Total RNA
N2e-DMM386-NSML_NSMR-nrn_L1 NSML and NSMR Neurons 2 SRS308486 L1 Total RNA
N2e-DMM386-NSML_NSMR-nrn_L1-DSN NSML and NSMR Neurons 3 SRS308486 L1 Total RNA
N2e-DMM402_N2eall_L1-DSN N2 1 SRS311687 L1 Total RNA
N2e-DMM401_N2eall_L1-DSN N2 2 SRS311684 L1 Total RNA
N2e-DMM401-N2eall_L1-V N2 3 SRS311747 L1 Total RNA
N2e-DMM402-N2eall_L1-V N2 4 SRS311750 L1 Total RNA
N2-L1 N2 5 SRX335720 L1 Poly(A)
N2e-DMM389_NSM_L1 NSM neurons SRS344181 L1 Total RNA
N2e-DMM408_Amot_nrn_L2-DSN A motor neuron 1 SRS311688 L2 Total RNA
N2e-DMM414_Amot_nrn_L2-DSN A motor neuron 2 SRS311689 L2 Total RNA
N2e-DMM415_Amot_nrn_L2-DSN A motor neuron 3 SRS311690 L2 Total RNA
N2e-L2_DSN-50 N2 1 SRS344178 L2 Total RNA
N2-L2L3 N2 2 SRX335721 L2 Poly(A)
N2e-L2_RZ-53 N2 3 SRS311908 L2 Total RNA
N2e-pharyngeal N2 pharyngeal muscle SRS242498 L2 Poly(A)
BB2-L4 adr-1(gpv6) SRX335727 L4 Poly(A)
BB4-L4 adr-1(gv6); adr-2(gv42) SRX335735 L4 Poly(A)
adar2-r1 adr-2(gv42) SRS706525 L4 Total RNA
BB3-L4 adr-2(gv42) 1 SRX335731 L4 Poly(A)
adar2-polyA adr-2(gv42) 2 L4 Poly(A)
wt-polyA N2 1 SRX707276 L4 Poly(A)
N2-L4 N2 2 SRX335722 L4 Poly(A)
wt-r1 N2 3 SRS706527 L4 Total RNA
wt-r2 N2 4 SRS706528 L4 Total RNA
wt-r3 N2 5 SRS706526 L4 Total RNA
tdp-polyA tdp-1(ok803) 1 SRX707279 L4 Poly(A)
tdp-R1 tdp-1(ok803) 2 SRS706529 L4 Total RNA
tdp-R2 tdp-1(ok803) 3 SRS706530 L4 Total RNA
tdp-R3 tdp-1(ok803) 4 SRS706531 L4 Total RNA
Table 3.5.2 A-to-I Edits predicted to affect amino acid changes within transcripts
Ref Position (Zero-based) Edit Poly(A) Samples Poly(A) Stages Total RNA Samples Total RNA Stages Substitution Gene Transcripts PFAM Domain
I 109754 A>G 2 Embryo 3 L1, Adult I64T rab-11.1 F53G12.1.1, F53G12.1.2 Ras
I 1701680 T>C 2 Embryo, Adult 3 Embryo, Adult S73P lsm-6 Y71G12B.14 LSM
I 1990465 A>G 4 Embryo, L1, L4 1 Adult L395P gap-3 Y20F4.3b -
I 1990465 A>G 4 Embryo, L1, L4 1 Adult L726P gap-3 Y20F4.3a -
I 4214436 A>G 11 Embryo, L1, L4, Adult, Dauer 1 Adult S295P WBGene00015976 C18E3.9a -
I 4214436 A>G 11 Embryo, L1, L4, Adult, Dauer 1 Adult S299P WBGene00015976 C18E3.9b -
I 8462478 T>C 1 Dauer 8 Embryo, L1, L2 S265P unc-120 D1081.2 -
I 9256992 T>C 14 Embryo, Dauer 0 T637A WBGene00009500 F36H2.3d -
I 9256992 T>C 14 Embryo, Dauer 0 T707A WBGene00009500 F36H2.3f -
I 9256992 T>C 14 Embryo, Dauer 0 T708A WBGene00009500 F36H2.3c -
I 9256992 T>C 14 Embryo, Dauer 0 T840A WBGene00009500 F36H2.3a -
I 9256992 T>C 14 Embryo, Dauer 0 T848A WBGene00009500 F36H2.3b -
I 9257111 A>G 17 Embryo, Dauer 2 L2, L4 V597A WBGene00009500 F36H2.3d Sushi
I 9257111 A>G 17 Embryo, Dauer 2 L2, L4 V667A WBGene00009500 F36H2.3f Sushi
I 9257111 A>G 17 Embryo, Dauer 2 L2, L4 V668A WBGene00009500 F36H2.3c Sushi
I 9257111 A>G 17 Embryo, Dauer 2 L2, L4 V800A WBGene00009500 F36H2.3a Sushi
I 9257111 A>G 17 Embryo, Dauer 2 L2, L4 V808A WBGene00009500 F36H2.3b Sushi
I 14829894 A>G 5 Embryo, Dauer 0 Y351H eva-1 F32A7.3b -
I 14829894 A>G 5 Embryo, Dauer 0 Y456H eva-1 F32A7.3a -
I 15020092 T>C 2 Embryo, Dauer 3 Embryo, L2 D640G dog-1 F33H2.1.1, F33H2.1.2 -
II 33340 T>C 7 Embryo, Dauer 0 S264P fbxc-54 F23F1.3 -
II 2443950 A>G 6 Embryo 0 D120G vab-19 T22D2.1 -
II 3899258 A>G 4 Embryo 1 Embryo E80G WBGene00018750 F53C3.6b -
II 3899258 A>G 4 Embryo 1 Embryo E179G WBGene00018750 F53C3.6a -
II 5675381 A>G 2 Embryo 4 Embryo, L1, L2, L4 V763A WBGene00016117 C25H3.8 -
II 5675849 A>G 3 Embryo 6 Embryo, L1, L2, L4 L709S WBGene00016117 C25H3.8 -
II 7153223 A>G 2 Dauer 4 L4, Adult D67G fust-1 C27H5.3.1, C27H5.3.2 -
II 7496628 A>G 5 Embryo 0 K40E spi-1 R10H1.4 -
II 8485988 A>G 11 Embryo, L2, Dauer 0 N18D gbh-2 M05D6.7 -
II 9056616 T>C 5 L1, L2, L4, Adult 0 K55E rpl-32 T24B8.1b.1, T24B8.1b.2, T24B8.1b.3 Ribosomal_L32e
II 9056616 T>C 5 L1, L2, L4, Adult 0 K108E rpl-32 T24B8.1a.2, T24B8.1a.3, T24B8.1a.1 Ribosomal_L32e
II 10158354 T>C 10 Embryo, Dauer 1 L1 E506G daf-19 F33H1.1c -
II 10158354 T>C 10 Embryo, Dauer 1 L1 E522G daf-19 F33H1.1e -
II 10158354 T>C 10 Embryo, Dauer 1 L1 E545G daf-19 F33H1.1d -
II 10158354 T>C 10 Embryo, Dauer 1 L1 E664G daf-19 F33H1.1a -
II 10158354 T>C 10 Embryo, Dauer 1 L1 E689G daf-19 F33H1.1b -
II 11096953 T>C 5 Embryo, L1, L2, Adult 0 K65E rpl-41 C09H10.2 Ribosomal_L44
II 11211721 T>C 13 Embryo, L4, Adult, Dauer 4 L1, L2, Adult I995T srap-1 T06D8.1c -
II 11211721 T>C 13 Embryo, L4, Adult, Dauer 4 L1, L2, Adult I1199T srap-1 T06D8.1a, T06D8.1b -
II 12075627 A>G 2 Embryo, Adult 4 Embryo, L1 S347P cnt-1 Y17G7B.15b -
II 12075627 A>G 2 Embryo, Adult 4 Embryo, L1 S431P cnt-1 Y17G7B.15a -
II 12896646 T>C 5 L1, L4, Adult 0 S226P clec-63 F35C5.6 VWA
II 13390908 A>G 4 Embryo, Dauer 1 L1 D117G WBGene00012995 Y48C3A.14a Toprim
II 13539119 T>C 4 Embryo, Dauer 1 Embryo K396E WBGene00013001 Y48E1B.2a -
III 940955 T>C 4 Embryo 1 Embryo N292D fbxa-59 T12B5.8 -
III 1872247 A>G 22 Embryo 6 Embryo, L2, Adult H260R WBGene00021444 Y39A3CR.3b -
III 1872247 A>G 22 Embryo 6 Embryo, L2, Adult H656R WBGene00021444 Y39A3CR.3a -
III 2698418 A>G 4 Embryo, L2 2 Embryo, Adult L87P WBGene00022174 Y71H2AM.9 -
III 4476646 T>C 1 Embryo 4 Embryo, L2, Adult Y203H rps-1 F56F3.5 Ribosomal_S3Ae
III 5277404 A>G 7 Embryo, Dauer 0 H417R toc-1 ZC395.3a.2, ZC395.3a.1 -
III 5277404 A>G 7 Embryo, Dauer 0 H435R toc-1 ZC395.3b -
III 5499876 T>C 5 Embryo, Dauer 0 X166Q WBGene00019838 R02F2.9 -
III 5783791 A>G 12 Embryo, Dauer 4 Embryo, L2 L136P WBGene00021345 Y37B11A.3 -
III 7985044 T>C 24 Embryo 5 Embryo, L2, Adult I635T clp-1 C06G4.2b.1, C06G4.2b.2 Calpain_III
III 7985044 T>C 24 Embryo 5 Embryo, L2, Adult I660T clp-1 C06G4.2d Calpain_III
III 7985044 T>C 24 Embryo 5 Embryo, L2, Adult I681T clp-1 C06G4.2a Calpain_III
III 8914116 A>G 4 Embryo, Dauer 2 L4, Adult N97D trxr-2 ZK637.10.2, ZK637.10.1 Pyr_redox_2
III 8973641 T>C 6 Embryo, Dauer 1 Adult I363T WBGene00011144 R08D7.4a.2, R08D7.4a.1 -
III 13780290 A>G 5 Embryo, L2 1 L4 K22E pot-3 3R5.1a, 3R5.1b POT1PC
III 13780291 A>G 3 Embryo, L2 3 L1, L2, L4 K22R pot-3 3R5.1a, 3R5.1b POT1PC
IV 256130 T>C 5 Embryo 1 Adult N243D WBGene00019825 R02D3.8 -
IV 992351 A>G 1 Dauer 7 Embryo, L1, L2, L4 T21A WBGene00021924 Y55F3AM.6b, Y55F3AM.6a zf-CCCH_3
IV 1300749 T>C 3 Embryo 6 L1, L2, L4 I531V WBGene00018776 F53H1.1d, F53H1.1c -
IV 1300749 T>C 3 Embryo 6 L1, L2, L4 I668V WBGene00018776 F53H1.1a, F53H1.1b -
IV 1963819 A>G 11 Embryo 2 Embryo, L2 I53V ulp-3 Y48A5A.2.2, Y48A5A.2.1 Peptidase_C48
IV 4308765 T>C 4 Embryo, Dauer 1 Embryo S227P set-9 F15E6.1 -
IV 4313266 A>G 5 Embryo 1 Embryo Q1100R set-9 F15E6.1 -
IV 4316338 A>G 2 Embryo, Dauer 3 L1 D1588G set-9 F15E6.1 -
IV 7026563 A>G 2 Embryo, L1 3 L1 D17G WBGene00015551 C06G3.5b A_deaminase
IV 7026563 A>G 2 Embryo, L1 3 L1 D59G WBGene00015551 C06G3.5a A_deaminase
IV 7781944 A>G 3 Embryo, Dauer 2 Embryo L124P sec-10 C33H5.9 Sec10
IV 10839352 A>G 3 Embryo 4 Embryo, L1, L2, L4 I555T WBGene00011720 T11G6.5b.1, T11G6.5b.2 -
IV 10839352 A>G 3 Embryo 4 Embryo, L1, L2, L4 I612T WBGene00011720 T11G6.5a -
IV 11509905 T>C 4 Embryo, Adult 3 L1 N89S WBGene00010848 M04B2.6 -
IV 12090044 A>G 4 Embryo, Dauer 1 L2 L465P mig-32 F11A10.3a -
IV 12142249 T>C 4 Embryo, Dauer 1 L1 I288V WBGene00007090 B0001.5 -
IV 12646747 A>G 4 Embryo, Dauer 1 L4 Y527C vha-7 C26H9A.1a V_ATPase_I
IV 12646747 A>G 4 Embryo, Dauer 1 L4 Y771C vha-7 C26H9A.1b V_ATPase_I
IV 12773897 A>G 1 L4 4 L1, L2 S5G unc-31 ZK897.1b, ZK897.1h, ZK897.1i, ZK897.1o, ZK897.1k, ZK897.1l, ZK897.1r, ZK897.1p, ZK897.1q, ZK897.1j, ZK897.1m, ZK897.1c, ZK897.1a, ZK897.1n -
V 6360107 A>G 2 L1 8 L1, L2 N199S WBGene00017430 F13H6.1a -
V 6360107 A>G 2 L1 8 L1, L2 N203S WBGene00017430 F13H6.1b.1, F13H6.1b.2 -
V 6502324 T>C 1 L1 6 L1, L2 L440P WBGene00020909 W01A11.1 -
V 8050237 A>G 7 Embryo 0 L186P col-43 ZC513.8 Collagen
V 8050247 T>C 7 Embryo 0 T183A col-43 ZC513.8 Collagen
V 8050382 T>C 8 Embryo, L2 1 L4 T138A col-43 ZC513.8 -
V 8050399 A>G 8 Embryo, L2 0 V132A col-43 ZC513.8 -
V 8268568 T>C 5 Embryo 1 Adult T108A asp-6 F21F8.7 Asp
V 9778852 A>G 2 Embryo, Dauer 3 L2, Adult S192P WBGene00009769 F46B6.4 -
V 11237607 A>G 6 Embryo, Dauer 0 S349P dvc-1 T19B10.6.2, T19B10.6.3, T19B10.6.1 -
V 11746239 A>G 5 Dauer 0 N1211S WBGene00011436 T04F3.1b -
V 11746239 A>G 5 Dauer 0 N1278S WBGene00011436 T04F3.1a -
V 12042436 A>G 8 Embryo, L2 0 I235V WBGene00008645 F10C2.4 DNA_pol_B_exo1
V 12866164 A>G 16 Embryo, Dauer 4 L2, L4, Adult Y51C WBGene00011591 T07F10.5.2, T07F10.5.1 -
V 13198692 A>G 1 L1 4 L1 V192A col-159 F57B1.3 Collagen
V 15197445 T>C 4 Embryo, Dauer 1 Embryo L856P plc-2 Y75B12B.6 -
V 18796242 A>G 2 L4, Dauer 3 L1 S376P nhr-170 C54E10.5 Hormone_recep
V 18796248 A>G 2 L4, Dauer 4 L1, L4 Y374H nhr-170 C54E10.5 Hormone_recep
V 18796287 A>G 2 L4, Dauer 3 L1 Y361H nhr-170 C54E10.5 Hormone_recep
V 18796290 A>G 3 L4, Dauer 2 L1 Y360H nhr-170 C54E10.5 Hormone_recep
X 4715660 T>C 4 Embryo, Dauer 1 L1 L118P rpl-25.1 F55D10.2 Ribosomal_L23
X 5462827 A>G 4 Embryo, L2, Dauer 1 Embryo N249D ddr-1 C25F6.4 -
X 6047311 A>G 4 Embryo 2 Embryo, L4 S113P pak-1 C09B8.7c -
X 6047311 A>G 4 Embryo 2 Embryo, L4 S159P pak-1 C09B8.7b -
X 6047311 A>G 4 Embryo 2 Embryo, L4 S162P pak-1 C09B8.7a -
X 14072009 A>G 22 Embryo 5 Embryo, L2, Adult E78G WBGene00007904 C33G3.4 -
X 14352031 T>C 1 Embryo 4 Embryo, L1 S393P wdr-5.2 K04G11.4 -
X 17459884 A>G 2 Embryo, Dauer 5 L1 I99V mlc-1 C36E6.3 -
4 Discussion
The advent of nucleic acid sequencing has led to an exponential increase in the information
available regarding the DNA and RNA content of a biological sample. DNA sequencing
technologies have progressed from sequencing single genes to gigabase-sized genomes. In tandem
with these improvements, the technology has been adapted to sequence
the RNA content of a cell by taking advantage of reverse transcriptases, permitting
scientists to sequence the expressed portion of a sample’s genome. RNA sequencing technology
evolved from EST and SAGE techniques to whole transcriptome sequencing using Illumina and
other high-throughput sequencers (10, 130, 131). RNA-seq permits quantitative transcriptome
profiling at single base-pair resolution, and experiments based on this technology have revealed
the complex and dynamic nature of an organism’s transcriptome. For example, RNA-seq has
been used to identify and quantify RNA editing and alternative splicing on a global scale (12,
116, 138, 238). However, due to the relatively short read lengths of Illumina sequencers (<150
bp), the downstream analysis of RNA-seq experiments has been difficult. This is especially the
case when analyzing post-transcriptional events such as RNA editing and alternative splicing.
My first thesis objective was concerned with the development of an accurate RNA-seq alignment
program. Variant and repeat tolerant RNA-seq alignment is a challenging problem that can
impact the downstream interpretation of the data. RNA-seq alignment requires the alignment of
short exonic segments across introns that can span hundreds of kilobases. To date, the majority
of RNA-seq alignment tools have been concerned with the sensitive detection of splice junctions
with little emphasis on managing reads aligning to repeats and variants. Ideally, an RNA-seq
alignment tool must be capable of repeat tolerant and accurate gapped, mismatch and spliced-read
alignment. The accuracies of mismatch, gapped and spliced alignments are
interdependent. For instance, a read incorrectly mapped into intronic sequence rather than
across a splice junction can lead to false positive mismatches, and an incorrect non-canonical splice
junction may be chosen where a gap should have been inserted. Repeat alignment sensitivity is critical for the
accurate identification of mismatches such as RNA edits. Alignments incorrectly marked as
mapping to a single location could have additional mappings to similar sequences elsewhere.
These issues are also common when identifying RNA edits since they commonly occur in
transposable and repeat elements (34). One study that suffered from this issue used RNA-seq to
identify non-canonical RNA editing events (193). It was found that many of the non-canonical
events were due to missed alignments to paralogous genes and incorrectly spliced read
alignments.
There has been a lack of freely available RNA-seq alignment algorithms dedicated to the
detection of SNVs and RNA edits. To address this I developed a novel post-processing strategy
called RNASequel that mitigates the aforementioned RNA-seq alignment issues. The primary
innovations implemented by RNASequel have led to an accurate and dynamic RNA-seq
alignment methodology. RNASequel is designed to be durable by not depending on any specific
tool for de novo splice junction detection and contiguous read alignment. Four primary features
have led to RNASequel’s highly accurate alignment: 1) a splice junction database that
integrates novel splice junctions and is capable of handling reads that map across multiple splice
junctions, 2) independent mapping of the reads from each pair to maximize repeat sensitivity, 3)
the use of highly accurate contiguous read aligners, and 4) an algorithm engineered to empirically
determine the fragment size distribution and to identify repeat and concordant paired
alignments.
To validate and benchmark the accuracy of the improvements facilitated by RNASequel versus
traditional RNA-seq alignment tools I compared RNASequel to two popular RNA-seq alignment
algorithms (Tophat2 and STAR). I utilized two simulated human RNA-seq datasets and 28
human-derived biological datasets. For the two simulated datasets I demonstrated that
RNASequel post-processing leads to a marked increase in the accuracy of mismatch, gap and
spliced alignments. I further showed that RNASequel is more sensitive to repeat alignments and
that my fragment size estimation algorithm aids in the choice of the primary alignment. For the
biological datasets I demonstrated that RNASequel realignment leads to increased mapping rates, a
reduction in the number of non-canonical edits (false positives), an increase in the number of
somatic SNVs, and similar levels of A-to-I RNA edit calls. The improved variant calls were
partially facilitated by improved repeat sensitivity. In conclusion, RNASequel is an important
innovation for RNA-seq alignment and will be useful for RNA-seq experiments where accuracy
is paramount, such as the identification of SNVs and RNA edits.
The original published version of RNASequel had a few issues, particularly in its disk space
usage and its empirical fragment size estimation algorithm, which could miss valid alignments
towards the tail of the fragment size distribution. These issues limited the number of samples that could
be processed quickly and caused RNASequel to miss some pair alignments. I
addressed both issues for my second data chapter, where the new version of
RNASequel was used to profile A-to-I RNA editing in C. elegans.
RNASequel is a useful tool to refine RNA-seq alignments and with future extensions
RNASequel could be used to improve the alignments of circular RNAs, fusions and trans-splicing
events (222, 239, 240). The splice-junction database implemented by RNASequel could
be modified to realign reads across any of the aforementioned events to take advantage of
RNASequel’s high sensitivity and low false positive rate. Furthermore, as more organisms
without complete reference genomes are profiled with RNA-seq, taking advantage of the output
of de novo assemblers will be essential. There will be a need to accurately map short reads
from the assembled RNA-seq data to the assembled reference genome (241, 242). RNASequel
can be modified to work with the assembled contigs in combination with the available reference
genome data to maximize alignment accuracy.
The biological role of RNA editing within an organism has not been fully explored. One of the
reasons for this is that ADAR knockouts in higher order metazoa are lethal; however, other
organisms such as C. elegans and D. melanogaster remain viable (44, 243). The nematode C.
elegans encodes two ADAR genes: adr-2 encodes the catalytically active enzyme and adr-1 the
catalytically inactive enzyme (44). ADAR1 proteins are thought to modulate the editing activity
of ADAR2 and play a regulatory role in small RNA processing. Double knockouts of the adr
genes cause an up-regulation of small RNA expression (53). The caveat to using C. elegans as a
model system is that it may not have evolved the same dependence on A-to-I RNA editing as higher
order metazoa. ADAR2 dependent RNA editing in C. elegans has been shown to occur in hyper-edited
clusters that localize to non-coding genetic elements such as introns, 3’-UTRs and
intergenic regions (70). These elements are common sources of structured RNAs due to inverted
repeats and transposable element insertions.
This association with repeat elements increases the complexity of identifying RNA
edits using RNA-seq. Repeat sensitivity is essential to accurately map the relatively short (<100
bp per read) RNA-seq pairs. Recent studies have globally profiled RNA editing in C. elegans
using RNA-seq and complex in-house alignment and hyper-editing detection pipelines (53, 68, 69).
Combining the calls from these papers is suboptimal because of their different analysis
pipelines; a unified method and profile is essential to construct a comprehensive map of
ADAR2 dependent RNA editing in C. elegans.
The final objective of my thesis has been to construct an accurate and comprehensive map of A-
to-I RNA editing sites in C. elegans. To accomplish this I utilized the improved version of
RNASequel and a sensitive hyper-editing cluster identification pipeline. Using both of these
software innovations I have pushed the detection of ADAR2 editing sites to its near-limit. For
cluster edits I require a minimum coverage of 3 unique reads. The minimum coverage could be
reduced, but the false positive rate would increase substantially. The 91 C. elegans RNA-seq
samples processed in this study include adr-2 knockouts, poly(A) selected and total RNA
wildtype adr-2 samples. After merging the edits discovered in all of these samples I identified
197,890 A-to-I edit sites within 10,941 hyper-edited clusters. I verified previous observations
that A-to-I editing returns to background level in the adr-2 knockout samples and the edits are
strongly associated with non-coding genetic elements and repeat DNA. To verify the efficacy of
my pipeline I compared the overlap between my edit calls and those by Zhao et al. (69) and
Whipple et al. (68) and found ~80% and ~74% concordance, respectively. I also identified
~155,000 additional A-to-I hyper-editing sites.
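A concordance comparison of this kind reduces to set operations over edit sites. The sketch below is illustrative: it assumes each site is keyed by a hypothetical `(chromosome, position, strand)` tuple in a shared coordinate system, and it assumes concordance is measured as the fraction of previously published sites that are recovered, which is one plausible definition that the text does not spell out.

```python
def concordance(published_sites, my_sites):
    """Fraction of published sites also present in my_sites."""
    published = set(published_sites)
    if not published:
        return 0.0
    return len(published & set(my_sites)) / len(published)


def novel_sites(published_site_sets, my_sites):
    """Sites in my_sites that appear in none of the published sets."""
    known = set().union(*published_site_sets) if published_site_sets else set()
    return set(my_sites) - known
```

Both functions expect the site tuples to have been converted to the same genome release beforehand, since the published calls may use different WormBase versions.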
Despite the comprehensiveness of the set of A-to-I edit sites constructed in this study, I suspect that
there are still more undiscovered sites based on three observations. The first is that when
ambiguously mapped reads are included in the dataset the number of edit sites more than doubles to
409,008 edits in 15,271 clusters. This suggests that a proportion of the sites mapped to by
ambiguously mapped reads may contain additional editing sites, although it cannot be asserted that all of
these sites are indeed expressed and edited at some point in the worm life cycle. The second
observation is that there are inverted repeat structures within long introns that may also be targets
for ADAR2 but may have been missed in the sequencing data (Figure 3.11). The third
observation, based on saturation analysis, is that the number of new hyper-edited sites that I
identified in this study continued to increase with the addition of new samples (Figure 3.12).
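A saturation analysis of this kind can be sketched as a cumulative count of distinct sites as samples are added. This is a minimal illustration, assuming each sample is represented by a collection of hashable site keys; the order in which samples are added is an analysis choice that the text does not specify, and averaging over several random orderings is a common refinement.

```python
def saturation_curve(samples):
    """Cumulative number of distinct sites after adding each sample in order."""
    seen = set()
    curve = []
    for sites in samples:
        seen.update(sites)
        curve.append(len(seen))
    return curve
```

A curve that is still rising at the last sample, rather than flattening, suggests that additional samples would continue to reveal new sites.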
Delineating the full extent of A-to-I editing in an organism’s genome is a challenging endeavor
and is limited by three major factors: the read length and throughput of RNA-seq experiments;
the rarity of edited transcripts (for example pre-mRNAs); and the fact that not every ADAR
dsRNA substrate may be expressed in a given cell type, tissue or condition. Furthermore, the
heterogeneity of whole organism derived samples such as C. elegans worms may obscure
biologically relevant signals in rare cell types or rarely expressed genes. These issues could be
mitigated by three primary methods: 1) improved sample purification methods, 2) improved
sequencing technologies, and 3) targeted sequencing approaches. Selective sample isolation such as
flow sorting for C. elegans samples would aid in the understanding of editing within different worm
tissues such as neuronal or germline tissues (244). Single cell sequencing could further increase resolution
in the understanding of editing at the cellular level by permitting the quantification of editing
within a tissue on a single cell basis (245). Improved sequencing technologies can resolve repeat
and low coverage issues by increasing sequencing throughput and/or read length. Finally,
targeted sequencing can help select for genetic elements with low coverage such as nascent
transcripts, determine the cellular localization of edited transcripts, and identify the transcripts
associated with inosine binding proteins. A combination of all three aforementioned methods
could be used to increase resolution further.
Sample heterogeneity is a confounding factor when analyzing whole tissues or organisms for
RNA editing (245-247). A typical RNA-seq experiment is a snapshot of the cell population at a
given time. This may lead to scientific discoveries being biased towards the most common cell
types and may obscure other biologically relevant signals (245-247). A notable example of this is
cancer stem cells being obscured by other tumor cells (246, 248). There is currently a push to
develop commercial kits for single-cell RNA-seq, and I believe this will be the future of RNA
sequencing. However, even with new Illumina sequencers, deeply sequencing single
cells is prohibitively expensive, and the majority of experiments aim for low sequencing depth
(10-20M reads per cell) for gene expression studies. This depth is not suitable for RNA editing
identification since the majority of inosine-containing transcripts are rare. Furthermore, single
cell experiments tend to have an increased end bias due to poly(A) amplification compared to
traditional RNA-seq experiments (245). As sequencing throughput and read length increase and
single-cell RNA kits improve, this technology will become essential to understand RNA
regulation in the context of splicing, gene expression and RNA editing in a population of single
cells. Single cell RNA-seq does pose a challenge when mapping read pairs since there can be
thousands of samples. The upgraded version of RNASequel would be more than capable of
processing this many samples since it does not produce temporary files and functions with raw
sequencing reads.
There are two primary developments in sequencing technology that will improve RNA editing
detection. The first is increased sequencing depth, which will increase the read depth for rare transcripts such as
pre-mRNAs. This can be achieved using the Illumina platform by utilizing additional
sequencing lanes, but as of now the cost is still prohibitive. As the cost of sequencing decreases
it will become more common to sequence libraries to high depth. This will increase coverage
across intronic sequences, for example, which may lead to the identification of more edit sites
and sites with a low frequency. The second sequencing development that will aid in the
identification of RNA editing is single molecule sequencing. The sequencers capable of high-
throughput single molecule sequencing are still in their infancy but they will be useful to resolve
ambiguous read mappings and the extent of RNA editing in single transcripts. Two notable
single molecule sequencing technologies are the Pacific Biosciences SMRT sequencer (249),
capable of generating an average read length of 10–15 kbp, and the Oxford Nanopore Technologies
sequencer (250), which can generate ~5 kbp read lengths. Both of these technologies
currently have high error rates and lower throughput than existing sequencing technologies, but as
they improve they could be useful for profiling RNA edits. Selecting for transcripts that are
likely to be edited could mitigate the lower throughput of these instruments.
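The effect of sequencing depth on the detection of low-frequency edits can be illustrated with a simple binomial model. The sketch below assumes that each read covering a site independently carries the edit with probability equal to the editing frequency, ignores sequencing and alignment errors, and uses a hypothetical threshold `min_edited_reads` for calling a site; none of these choices are taken from the pipeline described in this thesis.

```python
from math import comb


def prob_detect(depth, edit_fraction, min_edited_reads=2):
    """Probability of observing at least `min_edited_reads` edited reads
    at a site covered by `depth` reads, under a binomial model in which
    each read is edited independently with probability `edit_fraction`.
    """
    # Probability of seeing fewer edited reads than the calling threshold.
    p_fewer = sum(
        comb(depth, i) * edit_fraction**i * (1 - edit_fraction) ** (depth - i)
        for i in range(min_edited_reads)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Illustrative values only: a site edited in 5% of transcripts.
    for depth in (5, 20, 100, 500):
        print(depth, round(prob_detect(depth, 0.05), 4))
```

Under these assumptions the detection probability rises steadily with read depth, which is the reasoning behind the expectation that deeper sequencing will recover more sites in rare transcripts and more sites edited at a low frequency.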
The genetic elements targeted by ADAR proteins in C. elegans are generally rare, with the
majority derived from intronic and intergenic sequences. Although sequencing technologies
have yielded unprecedented throughput, the cost of deeply sequencing a large number of RNA-seq
samples from differing tissues or whole worms would be restrictive. Therefore, it would be ideal
if the regions likely to be edited could be directly targeted for sequencing. I foresee two useful
sets of techniques to select for edited transcripts. The first involves RNA immunoprecipitation
followed by high-throughput sequencing, where RNAs bound by dsRBPs are precipitated (251).
RNA with structured regions can also be directly precipitated using dsRNA-specific antibodies
(68, 210, 252). The second set of methods involves selecting for transcripts with specific
properties such as intron lariats or nascent transcripts (pre-mRNAs). These methods would not
only reduce the sequencing cost but they may aid in elucidating the biological context of A-to-I
RNA editing by determining what proteins appear to target edited transcripts and their cellular
localization.
Immunoprecipitation of structured RNAs associated with dsRBPs could be used to explore
which proteins bind inosine-containing RNA and the possibility that some dsRBPs may play
a regulatory role by competing with ADARs for dsRNA targets. In C. elegans, adr-1
immunoprecipitation with RNA-seq was used to sequence the targets of ADAR1 in vivo (62).
This method could be extended and combined with deeper sequencing and paired-end reads to
isolate structured RNAs that may also have A-to-I edits. The one caveat of these methods is that
if ADAR2 editing disrupts the dsRNA structure, the RNAs may no longer be bound by these
proteins. Other proteins known to bind dsRNAs could also be used for precipitation, such as
Staufen, NONO, RNAi components, or Vigilins (57, 58, 68, 253).
Transcripts with 3’-UTR A-to-I edits have been shown to co-localize with nuclear paraspeckles
by NONO binding (57, 254). However, a cursory literature search has not revealed evidence of
nuclear paraspeckles in C. elegans. Homologues to some of the proteins required for paraspeckle
formation such as NONO (nono-1 on WormBase) are present but have not been characterized. It
would be interesting to verify if C. elegans NONO binds inosine containing dsRNA and if it
leads to the nuclear retention of edited dsRNAs.
The Vigilin proteins have been shown to have high affinity for A-to-I containing RNAs and
localize to heterochromatin (58). This is consistent with the observation that hyper-edited
RNA transcripts are commonly associated with heterochromatin in C. elegans. Knockouts of the
C. elegans Vigilin homolog increased chromosomal nondisjunction (191). It would be interesting
to verify RNA targets bound by the C. elegans Vigilin homolog and to explore whether ADAR
knockouts have an effect on Vigilin binding or heterochromatin deposition.
Previous studies profiling RNA editing in C. elegans have taken advantage of the monoclonal J2
anti-dsRNA antibody, which is capable of binding dsRNA helices of at least 40 bp in length (68,
210, 252). The study by Saldi et al. used single end reads making it difficult to resolve
ambiguously mapped reads and the study by Whipple et al. had extensive rRNA contamination
(Figure 3.1). J2 precipitation combined with improved rRNA depletion and paired-end
sequencing could provide a method to select for and profile A-to-I editing in long
structured RNAs.
RNase R, a 3’ -> 5’ exoribonuclease, is capable of digesting ssRNA but does not digest most
dsRNA, intron lariats and circRNAs (225). RNase R digestion in combination with RNA-seq has been
used to identify circRNAs in both humans and C. elegans. Since the majority of edits (53.3%)
identified in the C. elegans samples processed in this study were intronic, RNase R digestion
may select for intron lariats containing edits. This could permit the deep profiling of intronic
edits, which could be combined with alternative splicing analysis to determine whether RNA
editing regulates splicing in C. elegans. This method could also be used to further explore
evidence that structured RNAs and RNA editing may play a role in circular RNA biogenesis
(71, 222).
ADAR RNA editing has previously been shown to be co-transcriptional in Drosophila (17, 19,
37). Moreover, the majority of the edits I observed were intronic and associated with transposons
and/or inverted repeats. Therefore, it would be worthwhile to sequence nascent RNA transcripts
to isolate intronic sequences prior to splicing (19). This method could test whether RNA editing
occurs co-transcriptionally in C. elegans.
Long non-coding RNAs are a recently discovered class of non-coding transcripts with a length
of at least 200 nt and have been associated with development and disease (13). Many of these
lncRNA transcripts contain a transposable element such as endogenous retroviral elements or
Alu elements (255). The extensive secondary structure of lncRNA transcripts could make them targets for
RNA editing. The best example of this is the C. elegans lncRNA rncs-1, which is a small RNA
pathway antagonist and a target for extensive A-to-I editing (256). I think it would be worthwhile to further explore how
many lncRNA genes undergo editing and whether there are dsRNA binding proteins that inhibit
this to protect the lncRNA from editing or induced structural changes. Another possibility is that
the lncRNA requires protein binding to fold properly. LncRNA editing has the potential to have
a profound impact on cellular and disease development.
The complete biological role of A-to-I RNA editing has yet to be elucidated; however, as
sequencing technologies and library preparation methods improve, the biology will become
clearer. Tools to analyze and interpret these data will be essential, and my thesis project has
contributed RNASequel and a sensitive pipeline for the identification of RNA edits that will be
useful to aid in the analysis of A-to-I RNA editing as new datasets are generated and new
methods are devised.
References
1. Watson,J.D. and Crick,F.H. (1974) Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. JD Watson and FHC Crick. Published in Nature, number 4356 April 25, 1953. Nature.
2. Sanger,F. (1988) Sequences, sequences, and sequences. Annu. Rev. Biochem.
3. Hutchison,C.A. (2007) DNA sequencing: bench to bedside and beyond. Nucleic Acids Research, 35, 6227–6237.
4. Kircher,M. and Kelso,J. (2010) High-throughput DNA sequencing - concepts and limitations. BioEssays, 32, 524–536.
5. Mardis,E.R. (2011) A decade's perspective on DNA sequencing technology. Nature, 470, 198–203.
6. Metzker,M.L. (2009) Sequencing technologies — the next generation. Nature Reviews Genetics, 11, 31–46.
7. Stein,L.D. (2010) The case for cloud computing in genome informatics. Genome biology, 11, 207.
8. Lander,E.S., Linton,L.M., Birren,B., Nusbaum,C., Zody,M.C., Baldwin,J., Devon,K., Dewar,K., Doyle,M., FitzHugh,W., et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.
9. The Human Genome Project Completion: Frequently Asked Questions. National Human Genome Research Institute.
10. Wang,Z., Gerstein,M. and Snyder,M. (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10, 57–63.
11. Zhao,S., Fung-Leung,W.-P., Bittner,A., Ngo,K. and Liu,X. (2014) Comparison of RNA-Seq and Microarray in Transcriptome Profiling of Activated T Cells. PLoS ONE, 9, e78644–13.
12. Ramaswami,G., Lin,W., Piskol,R., Tan,M.H., Davis,C. and Li,J.B. (2012) Accurate identification of human Alu and non-Alu RNA editing sites. Nature Methods, 9, 1–5.
13. Gibb,E.A., Brown,C.J. and Lam,W.L. (2011) The functional role of long non-coding RNA in human carcinomas. Mol Cancer, 10, 38.
14. Ponting,C.P., Oliver,P.L. and Reik,W. (2009) Evolution and Functions of Long Noncoding RNAs. Cell, 136, 629–641.
15. Lee,T.I. and Young,R.A. (2013) Transcriptional Regulation and Its Misregulation in Disease. Cell, 152, 1237–1251.
16. Halbeisen,R.E., Galgano,A., Scherrer,T. and Gerber,A.P. (2007) Post-transcriptional gene regulation: From genome-wide studies to principles. Cell. Mol. Life Sci., 65, 798–813.
17. Bentley,D.L. (2014) Coupling mRNA processing with transcription in time and space. Nature Reviews Genetics, 15, 163–175.
18. Maniatis,T. and Reed,R. (2002) An extensive network of coupling among gene expression machines. Nature, 416, 499–506.
19. Rodriguez,J., Menet,J.S. and Rosbash,M. (2012) Nascent-Seq Indicates Widespread Cotranscriptional RNA Editing in Drosophila. Mol Cell, 47, 27–37.
20. Kornblihtt,A.R., la Mata,de,M., Fededa,J.P., Munoz,M.J. and Nogues,G. (2004) Multiple links between transcription and splicing. RNA (New York, N.Y.), 10, 1489–1498.
21. Ryman,K., Fong,N., Bratt,E., Bentley,D.L. and Ohman,M. (2007) The C-terminal domain of RNA Pol II helps ensure that editing precedes splicing of the GluR-B transcript. RNA (New York, N.Y.), 13, 1071–1078.
22. Berget,S.M., Moore,C. and Sharp,P.A. (1977) Spliced segments at the 5' terminus of adenovirus 2 late mRNA. In.
23. Chow,L.T., Gelinas,R.E., Broker,T.R. and Roberts,R.J. (1977) An amazing sequence arrangement at the 5' ends of adenovirus 2 messenger RNA. Cell, 12, 1–8.
24. Sharp,P.A. (2005) The discovery of split genes and RNA splicing. Trends in biochemical sciences, 30, 279–281.
25. Chen,M. and Manley,J.L. (2009) Mechanisms of alternative splicing regulation: insights from molecular and genomics approaches. Nature reviews. Molecular cell biology, 10, 741–754.
26. William Roy,S. and Gilbert,W. (2006) The evolution of spliceosomal introns: patterns, puzzles and progress. Nature Reviews Genetics, 7, 211–221.
27. Reed,R. (1989) The organization of 3' splice-site sequences in mammalian introns. Genes & Development, 3, 2113–2123.
28. Irimia,M. and Blencowe,B.J. (2012) Alternative splicing: decoding an expansive regulatory layer. Current Opinion in Cell Biology, 24, 323–332.
29. Ghigna,C., Valacca,C. and Biamonti,G. (2008) Alternative splicing and tumor progression. Curr. Genomics, 9, 556–570.
30. Hallegger,M., Llorian,M. and Smith,C.W.J. (2010) Alternative splicing: global insights. FEBS J, 277, 856–866.
31. Knoop,V. (2010) When you can’t trust the DNA: RNA editing changes transcript sequences. Cell. Mol. Life Sci., 68, 567–586.
32. Keegan,L.P., Gallo,A. and O'Connell,M.A. (2001) The many roles of an RNA editor. Nature Reviews Genetics, 2, 869–878.
33. Moris,A., Murray,S. and Cardinaud,S. (2014) AID and APOBECs span the gap between innate and adaptive immunity. Front Microbiol, 5, 534.
34. Bass,B.L. (2002) RNA editing by adenosine deaminases that act on RNA. Annu. Rev. Biochem., 71, 817–846.
35. Vendeix,F.A.P., Munoz,A.M. and Agris,P.F. (2009) Free energy calculation of modified base-pair formation in explicit solvent: A predictive model. RNA (New York, N.Y.), 15, 2278–2287.
36. Zinshteyn,B. and Nishikura,K. (2009) Adenosine-to-inosine RNA editing. Wiley interdisciplinary reviews Systems biology and medicine, 1, 202–209.
37. Laurencikiene,J., Källman,A.M., Fong,N., Bentley,D.L. and Ohman,M. (2006) RNA editing and alternative splicing: the importance of co-transcriptional coordination. EMBO reports, 7, 303–307.
38. Valente,L. and Nishikura,K. (2005) ADAR gene family and A-to-I RNA editing: diverse roles in posttranscriptional gene regulation. Progress in nucleic acid research and molecular biology, 79, 299–338.
39. Grice,L.F. and Degnan,B.M. (2015) The origin of the ADAR gene family and animal RNA editing. BMC Evol Biol, 15, 4.
40. Kuttan,A. and Bass,B.L. (2012) Mechanistic insights into editing-site specificity of ADARs. Proc Natl Acad Sci USA, 109, E3295–E3304.
41. Wang,Q. (2000) Requirement of the RNA Editing Deaminase ADAR1 Gene for Embryonic Erythropoiesis. Science, 290, 1765–1768.
42. Higuchi,M., Maas,S., Single,F.N., Hartner,J., Rozov,A., Burnashev,N., Feldmeyer,D., Sprengel,R. and Seeburg,P.H. (2000) Point mutation in an AMPA receptor gene rescues lethality in mice deficient in the RNA-editing enzyme ADAR2. Nature, 406, 78–81.
43. Jepson,J.E.C. and Reenan,R.A. (2009) Adenosine-to-Inosine Genetic Recoding Is Required in the Adult Stage Nervous System for Coordinated Behavior in Drosophila. The Journal of biological chemistry, 284, 31391–31400.
44. Tonkin,L.A., Saccomanno,L., Morse,D.P., Brodigan,T., Krause,M. and Bass,B.L. (2002) RNA editing by ADARs is important for normal behavior in Caenorhabditis elegans. The EMBO Journal, 21, 6025–6035.
45. Yang,J.H., Sklar,P., Axel,R. and Maniatis,T. (1997) Purification and characterization of a human RNA adenosine deaminase for glutamate receptor B pre-mRNA editing. Proceedings of the National Academy of Sciences, 94, 4354–4359.
46. Melcher,T., Maas,S., Herb,A., Sprengel,R., Seeburg,P.H. and Higuchi,M. (1996) A mammalian RNA editing enzyme. Nature, 379, 460–464.
47. Horsch,M., Seeburg,P.H., Adler,T., Aguilar-Pimentel,J.A., Becker,L., Calzada-Wack,J., Garrett,L., Götz,A., Hans,W., Higuchi,M., et al. (2011) Requirement of the RNA-editing enzyme ADAR2 for normal physiology in mice. The Journal of biological chemistry, 286, 18614–18622.
48. Hideyama,T., Yamashita,T., Aizawa,H., Tsuji,S., Kakita,A., Takahashi,H. and Kwak,S. (2012) Profound downregulation of the RNA editing enzyme ADAR2 in ALS spinal motor neurons. Neurobiol. Dis., 45, 1121–1128.
49. Kim,D.D.Y. (2004) Widespread RNA Editing of Embedded Alu Elements in the Human Transcriptome. Genome Research, 14, 1719–1725.
50. Athanasiadis,A., Rich,A. and Maas,S. (2004) Widespread A-to-I RNA editing of Alu-containing mRNAs in the human transcriptome. PLoS Biology, 2, e391.
51. Bass,B.L. and Weintraub,H. (1987) A developmentally regulated activity that unwinds RNA duplexes. Cell, 48, 607–613.
52. Bass,B.L. and Weintraub,H. (1988) An unwinding activity that covalently modifies its double-stranded RNA substrate. Cell, 55, 1089–1098.
53. Wu,D., Lamm,A.T. and Fire,A.Z. (2011) Competition between ADAR and RNAi pathways for an extensive class of RNA targets. Nature Structural & Molecular Biology, 18, 1094–1101.
54. Warf,M.B., Shepherd,B.A., Johnson,W.E. and Bass,B.L. (2012) Effects of ADARs on small RNA processing pathways in C. elegans. Genome Research, 22, 1488–1498.
55. Rueter,S.M., Dawson,T.R. and Emeson,R.B. (1999) Regulation of alternative splicing by RNA editing. Nature, 399, 75–80.
56. Savva,Y.A., Jepson,J.E.C., Chang,Y.-J., Whitaker,R., Jones,B.C., St Laurent,G., Tackett,M.R., Kapranov,P., Jiang,N., Du,G., et al. (2013) RNA editing regulates transposon-mediated heterochromatic gene silencing. Nature Communications, 4, 2745.
57. Zhang,Z. and Carmichael,G.G. (2001) The fate of dsRNA in the nucleus: a p54(nrb)-containing complex mediates the nuclear retention of promiscuously A-to-I edited RNAs. Cell, 106, 465–475.
58. Wang,Q., Zhang,Z., Blackwell,K. and Carmichael,G.G. (2005) Vigilins bind to promiscuously A-to-I-edited RNAs and are involved in the formation of heterochromatin. Curr. Biol., 15, 384–391.
59. Prasanth,K.V., Prasanth,S.G., Xuan,Z., Hearn,S., Freier,S.M., Bennett,C.F., Zhang,M.Q. and Spector,D.L. (2005) Regulating gene expression through RNA nuclear retention. Cell, 123, 249–263.
60. Zheng,H., Fu,T.B., Lazinski,D. and Taylor,J. (1992) Editing on the genomic RNA of human hepatitis delta virus. J. Virol., 66, 4693–4697.
61. Luciano,D.J., Mirsky,H., Vendetti,N.J. and Maas,S. (2004) RNA editing of a miRNA precursor. RNA (New York, N.Y.), 10, 1174–1177.
62. Washburn,M.C., Kakaradov,B., Sundararaman,B., Wheeler,E., Hoon,S., Yeo,G.W. and Hundley,H.A. (2014) The dsRBP and Inactive Editor ADR-1 Utilizes dsRNA Binding to Regulate A-to-I RNA Editing across the C. elegans Transcriptome. CellReports, 6, 599–607.
63. Palladino,M.J., Keegan,L.P., O'Connell,M.A. and Reenan,R.A. (2000) A-to-I pre-mRNA editing in Drosophila is primarily involved in adult nervous system function and integrity. Cell, 102, 437–449.
64. XuFeng,R., Boyer,M.J., Shen,H., Li,Y., Yu,H., Gao,Y., Yang,Q., Wang,Q. and Cheng,T. (2009) ADAR1 is required for hematopoietic progenitor cell survival via RNA editing. Proc Natl Acad Sci USA, 106, 17763–17768.
65. Hoogstrate,S.W., Volkers,R.J., Sterken,M.G., Kammenga,J.E. and Snoek,L.B. (2014) Nematode endogenous small RNA pathways. Worm, 3, e28234–11.
66. Tonkin,L.A. and Bass,B.L. (2003) Mutations in RNAi rescue aberrant chemotaxis of ADAR mutants. Science (New York, N.Y.), 302, 1725–1725.
67. Liddicoat,B.J., Piskol,R., Chalk,A.M., Ramaswami,G., Higuchi,M., Hartner,J.C., Li,J.B., Seeburg,P.H. and Walkley,C.R. (2015) RNA editing by ADAR1 prevents MDA5 sensing of endogenous dsRNA as nonself. Science, 349, 1115–1120.
68. Whipple,J.M., Youssef,O.A., Aruscavage,P.J., Nix,D.A., Hong,C., Johnson,W.E. and Bass,B.L. (2015) Genome-wide profiling of the C. elegans dsRNAome. RNA (New York, N.Y.), 21, 786–800.
69. Zhao,H.Q., Zhang,P., Gao,H., He,X., Dou,Y., Huang,A.Y., Liu,X.M., Ye,A.Y., Dong,M.Q. and Wei,L. (2014) Profiling the RNA editomes of wild-type C. elegans and ADAR mutants. Genome Research, 10.1101/gr.176107.114.
70. Hundley,H.A. and Bass,B.L. (2010) ADAR editing in double-stranded UTRs and other noncoding RNA sequences. Trends in biochemical sciences, 35, 377–383.
71. Ivanov,A., Memczak,S., Wyler,E., Torti,F., Porath,H.T., Orejuela,M.R., Piechotta,M., Levanon,E.Y., Landthaler,M., Dieterich,C., et al. (2014) Analysis of Intron Sequences Reveals Hallmarks of Circular RNA Biogenesis in Animals. CellReports, 10, 1–9.
72. Morse,D.P., Aruscavage,P.J. and Bass,B.L. (2002) RNA hairpins in noncoding regions of human brain and Caenorhabditis elegans mRNA are edited by adenosine deaminases that act on RNA. Proc Natl Acad Sci USA, 99, 7906–7911.
73. Holley,R.W., Everett,G.A., Madison,J.T. and Zamir,A. (1965) Nucleotide Sequences in the Yeast Alanine Transfer Ribonucleic Acid. The Journal of biological chemistry, 240, 2122–2128.
74. Sanger,F., Brownlee,G.G. and Barrell,B.G. (1965) A two-dimensional fractionation procedure for radioactive nucleotides. J. Mol. Biol., 13, 373–IN4.
75. Min Jou,W., Haegeman,G., Ysebaert,M. and Fiers,W. (1972) Nucleotide sequence of the gene coding for the bacteriophage MS2 coat protein. Nature, 237, 82–88.
76. Fiers,W., Contreras,R., Duerinck,F., Haegeman,G., Iserentant,D., Merregaert,J., Min Jou,W., Molemans,F., Raeymaekers,A., Van den Berghe,A., et al. (1976) Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene. Nature, 260, 500–507.
77. Wu,R. (1994) Development of the primer-extension approach: a key role in DNA sequencing. Trends in biochemical sciences, 19, 429–433.
78. Wu,R. (1970) Nucleotide sequence analysis of DNA. I. Partial sequence of the cohesive ends of bacteriophage lambda and 186 DNA. J. Mol. Biol., 51, 501–521.
79. Wu,R. and Taylor,E. (1971) Nucleotide sequence analysis of DNA. II. Complete nucleotide sequence of the cohesive ends of bacteriophage lambda DNA. J. Mol. Biol., 57, 491–511.
80. Sanger,F. and Coulson,A.R. (1975) A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J. Mol. Biol., 94, 441–448.
81. Sanger,F., Air,G.M., Barrell,B.G., Brown,N.L., Coulson,A.R., Fiddes,C.A., Hutchison,C.A., Slocombe,P.M. and Smith,M. (1977) Nucleotide sequence of bacteriophage phi X174 DNA. Nature, 265, 687–695.
82. Maxam,A.M. and Gilbert,W. (1977) A new method for sequencing DNA. Proceedings of the National Academy of Sciences, 74, 560–564.
83. Peattie,D.A. (1979) Direct chemical method for sequencing RNA. In.
84. Sanger,F., Nicklen,S. and Coulson,A.R. (1977) DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences, 74, 5463–5467.
85. Sanger,F. and Coulson,A.R. (1978) The use of thin acrylamide gels for DNA sequencing. FEBS Lett., 87, 107–110.
86. Smith,L.M., Sanders,J.Z., Kaiser,R.J., Hughes,P., Dodd,C., Connell,C.R., Heiner,C., Kent,S.B. and Hood,L.E. (1986) Fluorescence detection in automated DNA sequence analysis. Nature, 321, 674–679.
87. Prober,J.M., Trainor,G.L., Dam,R.J., Hobbs,F.W., Robertson,C.W., Zagursky,R.J., Cocuzza,A.J., Jensen,M.A. and Baumeister,K. (1987) A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science, 238, 336–341.
88. Zagursky,R.J. and Berman,M.L. (1984) Cloning vectors that yield high levels of single-stranded DNA for rapid DNA sequencing. Gene, 27, 183–191.
89. Sinville,R. and Soper,S.A. (2007) High resolution DNA separations using microchip electrophoresis. J. Sep. Sci., 30, 1714–1728.
90. Karger,B.L. and Guttman,A. (2009) DNA sequencing by Capillary Electrophoresis. Electrophoresis, 30 Suppl 1, S196–202.
91. Swerdlow,H., Wu,S.L., Harke,H. and Dovichi,N.J. (1990) Capillary gel electrophoresis for DNA sequencing. Laser-induced fluorescence detection with the sheath flow cuvette. J. Chromatogr., 516, 61–67.
92. Swerdlow,H. and Gesteland,R. (1990) Capillary gel electrophoresis for rapid, high resolution DNA sequencing. Nucleic Acids Research, 18, 1415–1419.
93. Luckey,J.A., Drossman,H., Kostichka,A.J., Mead,D.A., D'Cunha,J., Norris,T.B. and Smith,L.M. (1990) High speed DNA sequencing by capillary electrophoresis. Nucleic Acids Research, 18, 4417–4421.
94. Pariat,Y.F., Berka,J., Heiger,D.N., Schmitt,T., Vilenchik,M., Cohen,A.S., Foret,F. and Karger,B.L. (1993) Separation of DNA fragments by capillary electrophoresis using replaceable linear polyacrylamide matrices. J Chromatogr A, 652, 57–66.
95. Huang,X.C., Quesada,M.A. and Mathies,R.A. (1992) DNA sequencing using capillary array electrophoresis. Anal. Chem., 64, 2149–2154.
96. Saiki,R.K., Gelfand,D.H., Stoffel,S., Scharf,S.J., Higuchi,R., Horn,G.T., Mullis,K.B. and Erlich,H.A. (1988) Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase. Science, 239, 487–491.
97. Chien,A., Edgar,D.B. and Trela,J.M. (1976) Deoxyribonucleic acid polymerase from the extreme thermophile Thermus aquaticus. J. Bacteriol., 127, 1550–1557.
98. Murray,V. (1989) Improved double-stranded DNA sequencing using the linear polymerase chain reaction. Nucleic Acids Research, 17, 8889.
99. Ronaghi,M. (2001) Pyrosequencing sheds light on DNA sequencing. Genome Research, 11, 3–11.
100. Ronaghi,M., Karamohamed,S., Pettersson,B., Uhlén,M. and Nyrén,P. (1996) Real-time DNA sequencing using detection of pyrophosphate release. Anal. Biochem., 242, 84–89.
101. Bentley,D.R., Balasubramanian,S., Swerdlow,H.P., Smith,G.P., Milton,J., Brown,C.G., Hall,K.P., Evers,D.J., Barnes,C.L., Bignell,H.R., et al. (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456, 53–59.
102. Johnson,D.S., Mortazavi,A., Myers,R.M. and Wold,B. (2007) Genome-wide mapping of in vivo protein-DNA interactions. Science (New York, N.Y.), 316, 1497–1502.
103. Fedurco,M. (2006) BTA, a novel reagent for DNA attachment on glass and efficient generation of solid-phase amplified DNA colonies. Nucleic Acids Research, 34, e22–e22.
104. Southern,E.M., Maskos,U. and Elder,J.K. (1992) Analyzing and comparing nucleic acid sequences by hybridization to arrays of oligonucleotides: evaluation using experimental models. Genomics, 13, 1008–1017.
105. Adessi,C., Matton,G., Ayala,G., Turcatti,G., Mermod,J.J., Mayer,P. and Kawashima,E. (2000) Solid phase DNA amplification: characterisation of primer attachment and amplification mechanisms. Nucleic Acids Research, 28, E87.
106. Turcatti,G., Romieu,A., Fedurco,M. and Tairi,A.P. (2007) A new class of cleavable fluorescent nucleotides: synthesis and optimization as reversible terminators for DNA sequencing by synthesis. Nucleic Acids Research, 36, e25–e25.
107. Schirmer,M., Ijaz,U.Z., D'Amore,R., Hall,N., Sloan,W.T. and Quince,C. (2015) Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform. Nucleic Acids Research, 43, e37.
108. Fuller,C.W., Middendorf,L.R., Benner,S.A., Church,G.M., Harris,T., Huang,X., Jovanovich,S.B., Nelson,J.R., Schloss,J.A., Schwartz,D.C., et al. (2009) The challenges of sequencing by synthesis. Nature Biotechnology, 27, 1013–1023.
109. Kircher,M., Stenzel,U. and Kelso,J. (2009) Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome biology, 10, R83.
110. Aird,D., Ross,M.G., Chen,W.-S., Danielsson,M., Fennell,T., Russ,C., Jaffe,D.B., Nusbaum,C. and Gnirke,A. (2011) Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome biology, 12, R18.
111. Benjamini,Y. and Speed,T.P. (2012) Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Research, 40, e72.
112. Thomson,J.M. (2006) Extensive post-transcriptional regulation of microRNAs and its implications for cancer. Genes & Development, 20, 2202–2207.
113. Shi,Y. (2012) Alternative polyadenylation: new insights from global analyses. RNA (New York, N.Y.), 18, 2105–2117.
114. Sleator,R.D. (2010) An overview of the current status of eukaryote gene prediction strategies. Gene, 461, 1–4.
115. Trapnell,C., Williams,B.A., Pertea,G., Mortazavi,A., Kwan,G., van Baren,M.J., Salzberg,S.L., Wold,B.J. and Pachter,L. (2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology, 28, 511–515.
116. Wang,E.T., Sandberg,R., Luo,S., Khrebtukova,I., Zhang,L., Mayr,C., Kingsmore,S.F., Schroth,G.P. and Burge,C.B. (2008) Alternative isoform regulation in human tissue transcriptomes. Nature, 456, 470–476.
117. Temin,H.M. and Mizutani,S. (1970) RNA-dependent DNA polymerase in virions of Rous sarcoma virus. Nature, 226, 1211–1213.
118. Sim,G.K., Kafatos,F.C., Jones,C.W., Koehler,M.D., Efstratiadis,A. and Maniatis,T. (1979) Use of a cDNA library for studies on evolution and developmental expression of the chorion multigene families. Cell, 18, 1303–1316.
119. Cocquet,J., Chong,A., Zhang,G. and Veitia,R.A. (2006) Reverse transcriptase template switching and false alternative transcripts. Genomics, 88, 127–131.
120. Houseley,J. and Tollervey,D. (2010) Apparent Non-Canonical Trans-Splicing Is Generated by Reverse Transcriptase In Vitro. PLoS ONE, 5, e12271.
121. Menéndez-Arias,L. (2002) Molecular basis of fidelity of DNA synthesis and nucleotide specificity of retroviral reverse transcriptases. Progress in nucleic acid research and molecular biology, 71, 91–147.
122. Champoux,J.J. and Schultz,S.J. (2009) Ribonuclease H: properties, substrate specificity and roles in retroviral reverse transcription. FEBS Journal, 276, 1506–1516.
123. Velculescu,V.E., Zhang,L., Vogelstein,B. and Kinzler,K.W. (1995) Serial analysis of gene expression. Science, 270, 484–487.
124. Sutcliffe,J.G., Milner,R.J. and Bloom,F.E. (1982) Common 82-nucleotide sequence unique to brain RNA. In.
125. Putney,S.D., Herlihy,W.C. and Schimmel,P. (1983) A new troponin T and cDNA clones for 13 different muscle proteins, found by shotgun sequencing. Nature, 302, 718–721.
126. Brosseau,J.-P., Lucier,J.-F.C., Lapointe,E., Durand,M., Gendron,D., Gervais-Bird,J., Tremblay,K., Perreault,J.-P. and Elela,S.A. (2010) High-throughput quantification of splicing isoforms. RNA (New York, N.Y.), 16, 442–449.
127. Adams,M.D., Kelley,J.M., Gocayne,J.D., Dubnick,M., Polymeropoulos,M.H., Xiao,H., Merril,C.R., Wu,A., Olde,B. and Moreno,R.F. (1991) Complementary DNA sequencing: expressed sequence tags and human genome project. Science, 252, 1651–1656.
128. Sakharkar,M.K., Chow,V.T.K. and Kangueane,P. (2004) Distributions of exons and introns in the human genome. In Silico Biol. (Gedrukt), 4, 387–393.
129. Parkinson,J. (2009) Expressed Sequence Tags (ESTs).
130. Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) dbEST--database for "expressed sequence tags". Nature genetics, 4, 332–333.
131. Harbers,M. and Carninci,P. (2005) Tag-based approaches for transcriptome research and genome annotation. Nature Methods, 2, 495–502.
132. Lister,R., O'Malley,R.C., Tonti-Filippini,J., Gregory,B.D., Berry,C.C., Millar,A.H. and Ecker,J.R. (2008) Highly Integrated Single-Base Resolution Maps of the Epigenome in Arabidopsis. Cell, 133, 523–536.
133. Mortazavi,A., Williams,B.A., McCue,K., Schaeffer,L. and Wold,B. (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5, 621–628.
134. Nagalakshmi,U., Wang,Z., Waern,K., Shou,C., Raha,D., Gerstein,M. and Snyder,M. (2008) The transcriptional landscape of the yeast genome defined by RNA sequencing. Science (New York, N.Y.), 320, 1344–1349.
135. Trapnell,C., Hendrickson,D.G., Sauvageau,M., Goff,L., Rinn,J.L. and Pachter,L. (2012) Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology, 31, 46–53.
136. Hodgkinson,A., Idaghdour,Y., Gbeha,E., Grenier,J.-C., Hip-Ki,E., Bruat,V., Goulet,J.-P., de Malliard,T. and Awadalla,P. (2014) High-resolution genomic analysis of human mitochondrial RNA sequence variation. Science (New York, N.Y.), 344, 413–415.
137. Guttman,M., Garber,M., Levin,J.Z., Donaghey,J., Robinson,J., Adiconis,X., Fan,L., Koziol,M.J., Gnirke,A., Nusbaum,C., et al. (2010) Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotechnology, 28, 503–510.
138. Pan,Q., Shai,O., Lee,L.J., Frey,B.J. and Blencowe,B.J. (2008) Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature genetics, 40, 1413–1415.
139. Pickrell,J.K., Marioni,J.C., Pai,A.A., Degner,J.F., Engelhardt,B.E., Nkadori,E., Veyrieras,J.-B., Stephens,M., Gilad,Y. and Pritchard,J.K. (2010) Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature, 464, 768–772.
140. Levin,J.Z., Yassour,M., Adiconis,X., Nusbaum,C., Thompson,D.-A., Friedman,N., Gnirke,A. and Regev,A. (2010) Comprehensive comparative analysis of strand-specific RNA sequencing methods. Nature Methods, 7, 709–715.
141. van Dijk,E.L., Jaszczyszyn,Y. and Thermes,C. (2014) Library preparation methods for next-generation sequencing: tone down the bias. Exp. Cell Res., 322, 12–20.
142. Peng,Z., Cheng,Y., Tan,B.C.-M., Kang,L., Tian,Z., Zhu,Y., Zhang,W., Liang,Y., Hu,X., Tan,X., et al. (2012) Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. Nature Biotechnology, 30, 1–10.
143. He,S., Wurtzel,O., Singh,K., Froula,J.L., Yilmaz,S., Tringe,S.G., Wang,Z., Chen,F., Lindquist,E.A., Sorek,R., et al. (2010) Validation of two ribosomal RNA removal methods for microbial metatranscriptomics. Nature Methods, 7, 807–812.
144. Yi,H., Cho,Y.-J., Won,S., Lee,J.-E., Jin Yu,H., Kim,S., Schroth,G.P., Luo,S. and Chun,J. (2011) Duplex-specific nuclease efficiently removes rRNA for prokaryotic RNA-seq. Nucleic Acids Research, 39, e140.
145. Hansen,K.D., Brenner,S.E. and Dudoit,S. (2010) Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research, 38, e131–e131.
146. van Gurp,T.P., McIntyre,L.M. and Verhoeven,K.J.F. (2013) Consistent Errors in First Strand cDNA Due to Random Hexamer Mispriming. PLoS ONE, 8, e85583.
147. Roberts,A., Trapnell,C., Donaghey,J., Rinn,J.L. and Pachter,L. (2011) Improving RNA-Seq expression estimates by correcting for fragment bias. Genome biology, 12, R22.
148. Ameur,A., Zaghlool,A., Halvardson,J., Wetterbom,A., Gyllensten,U., Cavelier,L. and Feuk,L. (2011) Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain. Nature Structural & Molecular Biology, 18, 1435–1440.
149. Haas,B.J., Chin,M., Nusbaum,C., Birren,B.W. and Livny,J. (2012) How deep is deep enough for RNA-Seq profiling of bacterial transcriptomes? BMC Genomics, 13, 734.
150. Toung,J.M., Lahens,N., Hogenesch,J.B. and Grant,G. (2014) Detection Theory in Identification of RNA-DNA Sequence Differences Using RNA-Sequencing. PLoS ONE, 9, e112040.
151. Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197.
152. Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.
153. Gotoh,O. (1982) An improved algorithm for matching biological sequences. J. Mol. Biol., 162, 705–708.
154. Farrar,M. (2007) Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics, 23, 156–161.
155. Zhao,M., Lee,W.-P., Garrison,E.P. and Marth,G.T. (2013) SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications. PLoS ONE, 8, e82138.
156. Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nature Methods, 9, 357–359.
157. Myers,E.W. and Miller,W. (1988) Optimal alignments in linear space. Comput. Appl. Biosci., 4, 11–17.
158. Zhang,Z., Schwartz,S., Wagner,L. and Miller,W. (2000) A greedy algorithm for aligning DNA sequences. Journal of computational biology : a journal of computational molecular cell biology, 7, 203–214.
159. Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85, 2444–2448.
160. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
161. Kent,W.J. (2002) BLAT - The BLAST-Like Alignment Tool. Genome Research, 12, 656–664.
162. Li,H. and Durbin,R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760.
163. Li,H. and Homer,N. (2010) A survey of sequence alignment algorithms for next-generation sequencing. Briefings in bioinformatics, 11, 473–483.
164. Ma,B., Tromp,J. and Li,M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics (Oxford, England), 18, 440–445.
165. Li,H., Ruan,J. and Durbin,R. (2008) Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18, 1851–1858.
166. Li,R., Li,Y., Kristiansen,K. and Wang,J. (2008) SOAP: short oligonucleotide alignment program. Bioinformatics, 24, 713–714.
167. Homer,N., Merriman,B. and Nelson,S.F. (2009) BFAST: an alignment tool for large scale genome resequencing. PLoS ONE, 4, e7767.
168. Rumble,S.M., Lacroute,P., Dalca,A.V., Fiume,M., Sidow,A. and Brudno,M. (2009) SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Comput Biol, 5, e1000386.
169. Weiner,P. (1973) Linear pattern matching algorithms. Switching and Automata Theory, 19, 331–353.
170. Gusfield,D. (1997) Algorithms on Strings, Trees and Sequences. Cambridge University Press.
171. Delcher,A.L., Kasif,S., Fleischmann,R.D., Peterson,J., White,O. and Salzberg,S.L. (1999) Alignment of whole genomes. Nucleic Acids Research, 27, 2369–2376.
172. Shrestha,A.M.S., Frith,M.C. and Horton,P. (2014) A bioinformatician's guide to the forefront of suffix array construction algorithms. Briefings in bioinformatics, 15, 138–154.
173. Manber,U. and Myers,G. (1989) Suffix Arrays.
174. Abouelhoda,M.I., Kurtz,S. and Ohlebusch,E. (2004) Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, 2, 53–86.
175. Ferragina,P. and Manzini,G. (2000) Opportunistic data structures with applications. In. IEEE Comput. Soc, pp. 390–398.
176. Burrows,M. and Wheeler,D.J. (1994) A Block-sorting Lossless Data Compression Algorithm.
177. Lam,T.W., Sung,W.K., Tam,S.L., Wong,C.K. and Yiu,S.M. (2008) Compressed indexing and local alignment of DNA. Bioinformatics (Oxford, England), 24, 791–797.
178. Langmead,B., Trapnell,C., Pop,M. and Salzberg,S.L. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10, R25.
179. Li,R., Yu,C., Li,Y., Lam,T.-W., Yiu,S.-M., Kristiansen,K. and Wang,J. (2009) SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics (Oxford, England), 25, 1966–1967.
180. Li,H. (2014) Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics (Oxford, England), 30, 2843–2851.
181. Hastings,M.L. and Krainer,A.R. (2001) Pre-mRNA splicing in the new millennium. Current Opinion in Cell Biology, 13, 302–309.
182. Garber,M., Grabherr,M.G., Guttman,M. and Trapnell,C. (2011) Computational methods for transcriptome annotation and quantification using RNA-seq. Nature Methods, 8, 469–477.
183. Alamancos,G.P., Agirre,E. and Eyras,E. (2014) Methods to Study Splicing from High-Throughput RNA Sequencing Data. In Methods in Molecular Biology. Humana Press, Totowa, NJ, Vol. 1126, pp. 357–397.
184. Engström,P.G., Sipos,B., Alioto,T., Behr,J., Bohnert,R., Campagna,D., Davis,C.A., Dobin,A., Gingeras,T.R., Goldman,N., et al. (2013) Systematic evaluation of spliced alignment programs for RNA-seq data. Nature Methods, 10.1038/nmeth.2722.
185. Trapnell,C., Pachter,L. and Salzberg,S.L. (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (Oxford, England), 25, 1105–1111.
186. Wang,K., Singh,D., Zeng,Z., Coleman,S.J., Huang,Y., Savich,G.L., He,X., Mieczkowski,P., Grimm,S.A., Perou,C.M., et al. (2010) MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Research, 38, e178–e178.
187. Kim,D., Pertea,G., Trapnell,C., Pimentel,H., Kelley,R. and Salzberg,S.L. (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome biology, 14, R36.
188. Wu,T.D. and Nacu,S. (2010) Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics (Oxford, England), 26, 873–881.
189. Dobin,A., Davis,C.A., Schlesinger,F., Drenkow,J., Zaleski,C., Jha,S., Batut,P., Chaisson,M. and Gingeras,T.R. (2012) STAR: ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England), 29, 15–21.
190. Stein,L.D., Bao,Z., Blasiar,D., Blumenthal,T., Brent,M.R., Chen,N., Chinwalla,A., Clarke,L., Clee,C., Coghlan,A., et al. (2003) The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biology, 1, E45.
191. Sijen,T. and Plasterk,R.H.A. (2003) Transposon silencing in the Caenorhabditis elegans germ line by natural RNAi. Nature, 426, 310–314.
192. C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282, 2012–2018.
193. Piskol,R., Peng,Z., Wang,J. and Li,J.B. (2013) Lack of evidence for existence of noncanonical RNA editing. Nature Biotechnology, 31, 19–20.
194. Li,J.B., Levanon,E.Y., Yoon,J.-K., Aach,J., Xie,B., Leproust,E., Zhang,K., Gao,Y. and Church,G.M. (2009) Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing. Science (New York, N.Y.), 324, 1210–1213.
195. Li,M., Wang,I.X., Li,Y., Bruzel,A., Richards,A.L., Toung,J.M. and Cheung,V.G. (2011) Widespread RNA and DNA Sequence Differences in the Human Transcriptome. Science (New York, N.Y.), 10.1126/science.1207018.
196. Sakurai,M., Yano,T., Kawabata,H., Ueda,H. and Suzuki,T. (2010) Inosine cyanoethylation identifies A-to-I RNA editing sites in the human transcriptome. Nature Chemical Biology, 6, 733–740.
197. Wilson,G.W. and Stein,L.D. (2015) RNASequel: accurate and repeat tolerant realignment of RNA-seq reads. Nucleic Acids Research, 10.1093/nar/gkv594.
198. Li,H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997.
199. Au,K.F., Jiang,H., Lin,L., Xing,Y. and Wong,W.H. (2010) Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Research, 38, 1–9.
200. Grant,G.R., Farkas,M.H., Pizarro,A., Lahens,N., Schug,J., Brunk,B., Stoeckert,C.J., Hogenesch,J.B. and Pierce,E.A. (2011) Comparative Analysis of RNA-Seq Alignment Algorithms and the RNA-Seq Unified Mapper (RUM). Bioinformatics, 10.1093/bioinformatics/btr427.
201. Odelberg,S.J., Weiss,R.B., Hata,A. and White,R. (1995) Template-switching during DNA synthesis by Thermus aquaticus DNA polymerase I. Nucleic Acids Research, 23, 2049–2057.
202. Djebali,S., Davis,C.A., Merkel,A., Dobin,A., Lassmann,T., Mortazavi,A., Tanzer,A., Lagarde,J., Lin,W., Schlesinger,F., et al. (2013) Landscape of transcription in human cells. Nature, 488, 101–108.
203. Gott,J.M. and Emeson,R.B. (2000) Functions and mechanisms of RNA editing. Annu. Rev. Genet., 34, 499–531.
204. Nishikura,K. (2010) Functions and Regulation of RNA Editing by ADAR Deaminases. Annu. Rev. Biochem., 79, 321–349.
205. Karolchik,D., Barber,G.P., Casper,J., Clawson,H., Cline,M.S., Diekhans,M., Dreszer,T.R., Fujita,P.A., Guruvadoo,L., Haeussler,M., et al. (2013) The UCSC Genome Browser database: 2014 update. Nucleic Acids Research, 42, D764–D770.
206. Wang,J., Wang,W., Li,R., Li,Y., Tian,G., Goodman,L., Fan,W., Zhang,J., Li,J., Zhang,J., et al. (2008) The diploid genome sequence of an Asian individual. Nature, 456, 60–65.
207. The 1000 Genomes Project Consortium (2013) An integrated map of genetic variation from 1,092 human genomes. Nature, 490, 56–65.
208. Benson,G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27, 573–580.
209. Rice,P., Longden,I. and Bleasby,A. (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends in Genetics, 16, 276–277.
210. Saldi,T.K., Ash,P.E., Wilson,G., Gonzales,P., Garrido-Lecca,A., Roberts,C.M., Dostal,V., Gendron,T.F., Stein,L.D., Blumenthal,T., et al. (2014) TDP-1, the Caenorhabditis elegans ortholog of TDP-43, limits the accumulation of double-stranded RNA. The EMBO Journal, 10.15252/embj.201488740.
211. Gerstein,M.B., Lu,Z.J., Van Nostrand,E.L., Cheng,C., Arshinoff,B.I., Liu,T., Yip,K.Y., Robilotto,R., Rechtsteiner,A., Ikegami,K., et al. (2010) Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project. Science (New York, N.Y.), 330, 1775–1787.
212. Hastings,K.E.M. (2005) SL trans-splicing: easy come or easy go? Trends in Genetics, 21, 240–247.
213. Vendeix,F.A.P., Munoz,A.M. and Agris,P.F. (2009) Free energy calculation of modified base-pair formation in explicit solvent: A predictive model. RNA (New York, N.Y.), 15, 2278–2287.
214. Garrigues,J.M., Sidoli,S., Garcia,B.A. and Strome,S. (2015) Defining heterochromatin in C. elegans through genome-wide analysis of the heterochromatin protein 1 homolog HPL-2. Genome Research, 25, 76–88.
215. Liu,T., Rechtsteiner,A., Egelhofer,T.A., Vielle,A., Latorre,I., Cheung,M.-S., Ercan,S., Ikegami,K., Jensen,M., Kolasinska-Zwierz,P., et al. (2011) Broad chromosomal domains of histone modification patterns in C. elegans. Genome Research, 21, 227–236.
216. Mignone,F., Gissi,C., Liuni,S. and Pesole,G. (2002) Untranslated regions of mRNAs. Genome biology, 3, REVIEWS0004.
217. Flomen,R. and Makoff,A. (2011) Increased RNA editing in EAAT2 pre-mRNA from amyotrophic lateral sclerosis patients: involvement of a cryptic polyadenylation site. Neurosci. Lett., 497, 139–143.
218. Hogg,M., Paro,S., Keegan,L.P. and O'Connell,M.A. (2011) RNA Editing by Mammalian ADARs, 1st ed. Elsevier Inc.
219. Vastenhouw,N.L., Fischer,S.E.J., Robert,V.J.P., Thijssen,K.L., Fraser,A.G., Kamath,R.S., Ahringer,J. and Plasterk,R.H.A. (2003) A genome-wide screen identifies 27 genes involved in transposon silencing in C. elegans. Curr. Biol., 13, 1311–1316.
220. Pothof,J., van Haaften,G., Thijssen,K., Kamath,R.S., Fraser,A.G., Ahringer,J., Plasterk,R.H.A. and Tijsterman,M. (2003) Identification of genes that protect the C. elegans genome against mutations by genome-wide RNAi. Genes & Development, 17, 443–448.
221. Emmons,S.W. and Yesner,L. (1984) High-frequency excision of transposable element Tc1 in the nematode Caenorhabditis elegans is limited to somatic cells. Cell, 36, 599–605.
222. Jeck,W.R., Sorrentino,J.A., Wang,K., Slevin,M.K., Burd,C.E., Liu,J., Marzluff,W.F. and Sharpless,N.E. (2013) Circular RNAs are abundant, conserved, and associated with ALU repeats. RNA (New York, N.Y.), 19, 141–157.
223. Lev-Maor,G., Ram,O., Kim,E., Sela,N., Goren,A., Levanon,E.Y. and Ast,G. (2008) Intronic Alus influence alternative splicing. PLoS genetics, 4, e1000204.
224. Keren,H., Lev-Maor,G. and Ast,G. (2010) Alternative splicing and evolution: diversification, exon definition and function. Nature Reviews Genetics, 11, 345–355.
225. Suzuki,H. (2006) Characterization of RNase R-digested cellular RNA source that consists of lariat and circular RNAs from pre-mRNA splicing. Nucleic Acids Research, 34, e63–e63.
226. He,L. and Hannon,G.J. (2004) MicroRNAs: small RNAs with a big role in gene regulation. Nature Reviews Genetics, 5, 522–531.
227. Chang,H., Lim,J., Ha,M. and Kim,V.N. (2014) TAIL-seq: genome-wide determination of poly(A) tail length and 3' end modifications. Mol Cell, 53, 1044–1052.
228. Stapleton,M., Carlson,J.W. and Celniker,S.E. (2006) RNA editing in Drosophila melanogaster: New targets and functional consequences. RNA (New York, N.Y.), 12, 1922–1932.
229. Levanon,E.Y. (2005) Evolutionarily conserved human targets of adenosine to inosine RNA editing. Nucleic Acids Research, 33, 1162–1168.
230. Clutterbuck,D.R., Leroy,A., O'Connell,M.A. and Semple,C.A.M. (2005) A bioinformatic screen for novel A-I RNA editing sites reveals recoding editing in BC10. Bioinformatics (Oxford, England), 21, 2590–2595.
231. Guérin,T.M., Palladino,F. and Robert,V.J. (2014) Transgenerational functions of small RNA pathways in controlling gene expression in C. elegans. Epigenetics, 9, 37–44.
232. Holoch,D. and Moazed,D. (2015) RNA-mediated epigenetic regulation of gene expression. Nature Reviews Genetics, 16, 71–84.
233. Vitali,P. and Scadden,A. (2010) Double-stranded RNAs containing multiple IU pairs are sufficient to suppress interferon induction and apoptosis. Nature Structural & Molecular Biology, 17, 1043–1050.
234. Ermolaeva,M.A. and Schumacher,B. (2014) Insights from the worm: The C. elegans model for innate immunity. Seminars in immunology, 26, 303–309.
235. Harris,T.W., Antoshechkin,I., Bieri,T., Blasiar,D., Chan,J., Chen,W.J., De La Cruz,N., Davis,P., Duesbury,M., Fang,R., et al. (2010) WormBase: a comprehensive resource for nematode research. Nucleic Acids Research, 38, D463–7.
236. Wang,K., Li,M. and Hakonarson,H. (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research, 38, e164–e164.
237. Finn,R.D., Bateman,A., Clements,J., Coggill,P., Eberhardt,R.Y., Eddy,S.R., Heger,A., Hetherington,K., Holm,L., Mistry,J., et al. (2014) Pfam: the protein families database. Nucleic Acids Research, 42, D222–30.
238. Park,E., Williams,B., Wold,B.J. and Mortazavi,A. (2012) RNA editing in the human ENCODE RNA-seq data. Genome Research, 22, 1626–1633.
239. Maher,C.A., Kumar-Sinha,C., Cao,X., Kalyana-Sundaram,S., Han,B., Jing,X., Sam,L., Barrette,T., Palanisamy,N. and Chinnaiyan,A.M. (2009) Transcriptome sequencing to detect gene fusions in cancer. Nature, 458, 97–101.
240. Wu,C.S., Yu,C.Y., Chuang,C.Y., Hsiao,M., Kao,C.F., Kuo,H.C. and Chuang,T.J. (2014) Integrative transcriptome sequencing identifies trans-splicing events with important roles in human embryonic stem cell pluripotency. Genome Research, 24, 25–36.
241. Haas,B.J., Papanicolaou,A., Yassour,M., Grabherr,M., Blood,P.D., Bowden,J., Couger,M.B., Eccles,D., Li,B., Lieber,M., et al. (2013) De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols, 8, 1494–1512.
242. Grabherr,M.G., Haas,B.J., Yassour,M., Levin,J.Z., Thompson,D.A., Amit,I., Adiconis,X., Fan,L., Raychowdhury,R., Zeng,Q., et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology, 29, 644–652.
243. Paro,S., Li,X., O'Connell,M.A. and Keegan,L.P. (2012) Regulation and functions of ADAR in Drosophila. Curr. Top. Microbiol. Immunol., 353, 221–236.
244. Spencer,W.C., McWhirter,R., Miller,T., Strasbourger,P., Thompson,O., Hillier,L.W., Waterston,R.H. and Miller,D.M. (2014) Isolation of specific neurons from C. elegans larvae for gene expression profiling. PLoS ONE, 9, e112102.
245. Shapiro,E., Biezuner,T. and Linnarsson,S. (2013) Single-cell sequencing-based technologies will revolutionize whole-organism science. Nature Reviews Genetics, 14, 618–630.
246. Stegle,O., Teichmann,S.A. and Marioni,J.C. (2015) Computational and analytical challenges in single-cell transcriptomics. Nature Reviews Genetics, 16, 133–145.
247. Kolodziejczyk,A.A., Kim,J.K., Svensson,V., Marioni,J.C. and Teichmann,S.A. (2015) The Technology and Biology of Single-Cell RNA Sequencing. Mol Cell, 58, 610–620.
248. Navin,N.E. (2014) Cancer genomics: one cell at a time. Genome biology, 15, 452.
249. Eid,J., Fehr,A., Gray,J., Luong,K., Lyle,J., Otto,G., Peluso,P., Rank,D., Baybayan,P., Bettman,B., et al. (2009) Real-time DNA sequencing from single polymerase molecules. Science (New York, N.Y.), 323, 133–138.
250. Clarke,J., Wu,H.-C., Jayasinghe,L., Patel,A., Reid,S. and Bayley,H. (2009) Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol, 4, 265–270.
251. Milek,M., Wyler,E. and Landthaler,M. (2012) Transcriptome-wide analysis of protein-RNA interactions using high-throughput sequencing. Seminars in Cell and Developmental Biology, 23, 206–212.
252. Schönborn,J., Oberstrass,J., Breyel,E., Tittgen,J., Schumacher,J. and Lukacs,N. (1991) Monoclonal antibodies to double-stranded RNA as probes of RNA structure in crude nucleic acid extracts. Nucleic Acids Research, 19, 2993–3000.
253. LeGendre,J.B., Campbell,Z.T., Kroll-Conner,P., Anderson,P., Kimble,J. and Wickens,M. (2013) RNA Targets and Specificity of Staufen, a Double-stranded RNA-binding Protein in Caenorhabditis elegans. The Journal of biological chemistry, 288, 2532–2545.
254. Bond,C.S. and Fox,A.H. (2009) Paraspeckles: nuclear bodies built on long noncoding RNA. J. Cell Biol., 186, 637–644.
255. Kelley,D. and Rinn,J. (2012) Transposable elements reveal a stem cell-specific class of long noncoding RNAs. Genome biology, 13, R107.
256. Hellwig,S. and Bass,B.L. (2008) A starvation-induced noncoding RNA modulates expression of Dicer-regulated genes. Proc Natl Acad Sci USA, 105, 12897–12902.