transaction on computational biology and …jkalita/papers/2018/... · steps of rna-seq data...

20
TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 1 Differential Expression Analysis of RNA-seq Reads: Overview, Taxonomy and Tools Hussain A. Chowdhury, Dhruba K. Bhattacharyya, Jugal K. Kalita Abstract—Analysis of RNA-sequence (RNA-seq) data is widely used in transcriptomic studies and it has many applications. We review RNA-seq data analysis from RNA-seq reads to the results of differential expression analysis. In addition, we perform descriptive comparison of tools used in each step of RNA-seq data analysis along with a discussion of important characteristics of these tools. A taxonomy of tools is also provided. A discussion of issues in quality control and visualization of RNA-seq data is also included along with useful tools. Finally, we provide some guidelines for the RNA-seq data analyst, along with research issues and challenges which should be addressed. Index Terms—RNA-seq Data Analysis, Differential Expression Analysis, Alignment, De novo Reconstruction, Transcriptome, Quantification, Normalization, Functional Profiling, Visualization, Quality Control, RNA-seq, scRNA-seq, Microarray. 1 I NTRODUCTION Transcriptome analysis enables researchers to under- stand the molecular basis of phenotype variation in general, and in the context of diseases. Microarrays have been widely used for such analysis, but the evolution of RNA-seq has made transcriptome analysis more effective. Gene expres- sion refers to the amount of mRNA produced by a gene at a particular time. Gene expression data consist of expression levels of genes over conditions such as development stages of diseases. Gene expression data are generated using two technologies, namely Microarrays [1] and RNA-sequencing (RNA-Seq) [2]. Both of these technologies have pros and cons, and limitations and similarities. Most recent advances in RNA-seq include the development of scRNA-seq where quantification of distribution of expression levels for each gene across a population of cells [3] is performed. RNA-seq experiments allow studying the whole tran- scriptome in a single experiment. In typical RNA-seq ex- periments, RNAs are first converted into a library of cDNA fragments and then sequencing adaptors are added to the cDNA fragments, and short sequences are obtained using high throughput sequencing technology to get the raw reads dataset. Reads are the sequences of DNA fragments present in the raw data [4]. The reads are aligned to an annotated reference sequence: genome or transcriptome and the expressions of genes are obtained. In scRNA-seq, before library preparation, Unique Molecular Identifiers (UMI) 1 are added during reverse transcription to produce cDNAs and then amplification is performed. UMI enables assigning reads to individual molecules. If no reference genome is available, de novo transcript reconstruction is required to assemble RNA-seq reads into a H. A. Chowdhury is with the Department of Computer Science and Engineering, Tezpur University, Assam, India, 784028. E-mail: [email protected] D. K. Bhattacharyya and J. K. Kalita are with Tezpur University and University of Colorado, respectively. E-mail: [email protected], [email protected] 1. https://hemberg-lab.github.io/scRNA.seq.course/ construction-of-expression-matrix.html#umichapter trascriptome [5]. RNA-seq data has different inherent biases such as hexamer, 3 0 , GC, amplification, mapping, sequence- specific, and fragment-size biases [6]. So, in every step of RNA-seq data analysis, quality control is required to reduce biases that arise during sequencing, alignment, de novo assembly and expression quantification. Expression quantification is an important application of RNA-seq. It can be performed at exon or gene or transcript level. In expression quantification, we count the number of reads mapped to a transcript sequence. The outcome is a count data matrix. For further analysis, normalization of the count data and handling batch effects are neces- sary to ensure accurate inference of gene expression and subsequent analysis. For effective normalization, one can take into account factors such as transcript size, GC-content, sequencing depth, sequencing error rate and insert size [7]. The most common use of transcriptome profiling is to find differentially expressed genes (DEG), i.e., determine whether for a given gene, an observed difference in read count is significant between conditions. In other words, whether the gene expression level differences between con- ditions are greater than what is expected due to natural random variations. Differential expression (DE) analysis is usually performed based on fold changes of read counts or in the presence of significant differences. Power analysis estimates the likelihood of successfully finding the statistical significance in a dataset [8]. Functional profiling is also an important step in tran- scriptomic studies to characterize molecular functions or pathways in which DEGs are involved. To gain insight and to find biological meaning in the data through human interpretation and judgment, data visualization is needed. It is difficult to survey the entire gamut of work that constitutes RNA-Seq analysis in a single review. Therefore, we restrict ourselves to reporting recent advances in RNA- seq data analysis and new RNA-seq technologies, which are changing the state-of-the-art in transcriptome studies. Interested readers can refer to recent articles by Dal et al. [9] for detailed review of workflow of scRNA-seq and

Upload: others

Post on 19-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 1

Differential Expression Analysis of RNA-seqReads: Overview, Taxonomy and Tools

Hussain A. Chowdhury, Dhruba K. Bhattacharyya, Jugal K. Kalita

Abstract—Analysis of RNA-sequence (RNA-seq) data is widely used in transcriptomic studies and it has many applications. Wereview RNA-seq data analysis from RNA-seq reads to the results of differential expression analysis. In addition, we perform descriptivecomparison of tools used in each step of RNA-seq data analysis along with a discussion of important characteristics of these tools. Ataxonomy of tools is also provided. A discussion of issues in quality control and visualization of RNA-seq data is also included alongwith useful tools. Finally, we provide some guidelines for the RNA-seq data analyst, along with research issues and challenges whichshould be addressed.

Index Terms—RNA-seq Data Analysis, Differential Expression Analysis, Alignment, De novo Reconstruction, Transcriptome,Quantification, Normalization, Functional Profiling, Visualization, Quality Control, RNA-seq, scRNA-seq, Microarray.

F

1 INTRODUCTION

Transcriptome analysis enables researchers to under-stand the molecular basis of phenotype variation in general,and in the context of diseases. Microarrays have been widelyused for such analysis, but the evolution of RNA-seq hasmade transcriptome analysis more effective. Gene expres-sion refers to the amount of mRNA produced by a gene at aparticular time. Gene expression data consist of expressionlevels of genes over conditions such as development stagesof diseases. Gene expression data are generated using twotechnologies, namely Microarrays [1] and RNA-sequencing(RNA-Seq) [2]. Both of these technologies have pros andcons, and limitations and similarities. Most recent advancesin RNA-seq include the development of scRNA-seq wherequantification of distribution of expression levels for eachgene across a population of cells [3] is performed.

RNA-seq experiments allow studying the whole tran-scriptome in a single experiment. In typical RNA-seq ex-periments, RNAs are first converted into a library of cDNAfragments and then sequencing adaptors are added to thecDNA fragments, and short sequences are obtained usinghigh throughput sequencing technology to get the rawreads dataset. Reads are the sequences of DNA fragmentspresent in the raw data [4]. The reads are aligned to anannotated reference sequence: genome or transcriptome andthe expressions of genes are obtained. In scRNA-seq, beforelibrary preparation, Unique Molecular Identifiers (UMI)1

are added during reverse transcription to produce cDNAsand then amplification is performed. UMI enables assigningreads to individual molecules.

If no reference genome is available, de novo transcriptreconstruction is required to assemble RNA-seq reads into a

• H. A. Chowdhury is with the Department of Computer Science andEngineering, Tezpur University, Assam, India, 784028.E-mail: [email protected]

• D. K. Bhattacharyya and J. K. Kalita are with Tezpur University andUniversity of Colorado, respectively.E-mail: [email protected], [email protected]

1. https://hemberg-lab.github.io/scRNA.seq.course/construction-of-expression-matrix.html#umichapter

trascriptome [5]. RNA-seq data has different inherent biasessuch as hexamer, 3′, GC, amplification, mapping, sequence-specific, and fragment-size biases [6]. So, in every step ofRNA-seq data analysis, quality control is required to reducebiases that arise during sequencing, alignment, de novoassembly and expression quantification.

Expression quantification is an important application ofRNA-seq. It can be performed at exon or gene or transcriptlevel. In expression quantification, we count the numberof reads mapped to a transcript sequence. The outcomeis a count data matrix. For further analysis, normalizationof the count data and handling batch effects are neces-sary to ensure accurate inference of gene expression andsubsequent analysis. For effective normalization, one cantake into account factors such as transcript size, GC-content,sequencing depth, sequencing error rate and insert size [7].

The most common use of transcriptome profiling is tofind differentially expressed genes (DEG), i.e., determinewhether for a given gene, an observed difference in readcount is significant between conditions. In other words,whether the gene expression level differences between con-ditions are greater than what is expected due to naturalrandom variations. Differential expression (DE) analysis isusually performed based on fold changes of read countsor in the presence of significant differences. Power analysisestimates the likelihood of successfully finding the statisticalsignificance in a dataset [8].

Functional profiling is also an important step in tran-scriptomic studies to characterize molecular functions orpathways in which DEGs are involved. To gain insightand to find biological meaning in the data through humaninterpretation and judgment, data visualization is needed.

It is difficult to survey the entire gamut of work thatconstitutes RNA-Seq analysis in a single review. Therefore,we restrict ourselves to reporting recent advances in RNA-seq data analysis and new RNA-seq technologies, whichare changing the state-of-the-art in transcriptome studies.Interested readers can refer to recent articles by Dal etal. [9] for detailed review of workflow of scRNA-seq and

Page 2: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2

by Haque et al. [10] for different computational methodsdeveloped for scRNA-seq data analysis.

1.1 Prior Surveys and MotivationThere are several published surveys on RNA-seq data analy-sis [11], [5], [12], [13], [14]. Chen et al. [11], Oshlack et al. [13],Conesa et al. [5], Zhao et al. [12] and Han et al. [14], do notprovide sufficient comparison among the tools needed in allsteps of RNA-seq data analysis. Chen et al. [11] and Oshlacket al. [13] provide an overview of RNA-seq data analysis inbrief with a few tools, but do not discuss quality controland visualization tools. Chen et al. [11] do not discussnormalization of RNA-seq data. Zhao et al. [12] discuss theimportant steps in RNA-seq data analysis in a general way,and provide a discussion on advanced technologies that arechanging the state-of-the-art of transcriptomics, but do notdiscuss any tools for de novo transcript reconstruction andthe scope of discussion on quality control tools is limitedto raw read quality checking only. Conesa et al. [5] reviewall the major steps and provide a generic roadmap forRNA-seq computational analysis. Han et al. [14] providean overview of the applications of RNA-seq technologyand the challenges that need to be addressed in all majorsteps of RNA-seq data analysis. They discuss tools used ineach step of RNA-seq data analysis in a limited way, butno comparisons are provided. None of these RNA-seq dataanalysis papers include a taxonomy, any kind of comparisonof tools at each step with proper description or any basicguidelines for the RNA-seq data analyst. Our survey differsfrom these previous surveys in the following ways.a) Like Oshlack et al. [13] and Chen et al. [11], we discuss

the workflow of RNA-seq data analysis with all majorsteps from raw reads to results of DE analysis of RNA-seq data. In addition, we present a pipeline for RNA-seq data analysis and include all recent tools includingRNA-seq data visualization. We also include importantresearch issues and challenges for the RNA-seq dataanalyst.

b) Like Han et al. [14], we provide a description of eachmentioned tool, and also provide descriptive compari-son among them by considering crucial parameters. Inaddition, we include workflow, taxonomy of tools andgeneral guidelines for the RNA-seq data analyst.

c) Unlike Zhao et al. [12] and Conesa et al. [5], our survey isnot restricted to descriptive comparison among tools inRNA-seq data analysis. We also provide a taxonomy ofall discussed tools and general guidelines for the RNA-seq data analyst.

d) Unlike Zhao et al. [12], Oshlack et al. [13] and Chen etal. [11], we do not restrict ourselves to discussing onlyquality control issues in RNA-seq data analysis.

e) Unlike other surveys, we discuss widely used RNA-seqdatasets and statistical tests for DE analysis.

f) Unlike other surveys, we mention tools used in bothRNA-seq and scRNA-seq data analysis.

1.2 Our ContributionsIn this paper, we present a structured and comprehen-sive survey of RNA-seq data analysis in terms of generaloverview, taxonomy, tools with descriptive comparison,

useful guidelines, and research issues and challenges forresearchers. Our paper is detailed with ample comparisonswhere necessary and intended for readers who wish tobegin research in this field. The major contributions of thissurvey are the following.

• We discuss microarray and RNA-seq technologiesconsidering pros, cons, similarities and differences,limitations and applications.

• We present a workflow for RNA-seq data analysiswith diagrams.

• We review each step from RNA-seq read to the resultof DE analysis in detail. In addition, we presenttools that are used in each step with descriptivecomparisons among them.

• We include a discussion of widely used RNA-seqdatasets and test statistics used for analysis.

• We emphasize quality control in RNA-seq analysis,and enumerate widely used tools for this purpose.

• To understand the RNA-seq data intuitively, visual-ization is important. We present a list of visualizationtools and compare them.

• If a reference sequence is not available for furtheranalysis of RNA-seq data, de novo transcript recon-struction is required, but most existing surveys donot pay much attention to it. We discuss this stepand refer to many tools used for de novo trascript re-construction and quality control, including necessaryquality measures.

• We discuss power analysis of DE methods.• We include a discussion on gene fusion detection

with useful tools.• We provide a list of useful guidelines for the RNA-

seq data analyst.• We highlight important research issues from both

theoretical and practical viewpoints.• We mention tools used in both RNA-seq and scRNA-

seq data analysis.

2 MICROARRAY VERSUS RNA-SEQ

2.1 Microarrays

The main purpose of microarrays is to study the expressionlevels of many genes simultaneously in response to biolog-ical activities or environmental conditions. This technologyuses a hybridization process, which is the basis for manyexperiments in molecular biology.

There are many inherent limitations of microarray tech-nology. For example, knowledge of the sequences is a pre-requisite for array design, and analysis of highly correlatedsequences is difficult because of cross-hybridization. A ma-jor challenge is the difficulty in reproducibility of results be-tween laboratories and across platforms. In addition, DNAmicroarrays lack sensitivity to genes expressed either at lowor very high levels, and therefore have a much smallerdynamic range (few-hundred folds) whereas RNA-seq datahas very large dynamic range (greater than 9000 folds) ofexpression levels over which transcripts can be detected[15]. The non-specific binding of cDNA, which are only com-plementary to the probes, creates background noise in geneexpression measurements. It is also unreliable to compare

Page 3: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 3

between two transcripts in the same microarray because ofnon-specific binding of cDNA, and this technology cannotdetect differential expressions of the same probe targetsbetween samples [16].

2.2 RNA-seqRNA-seq is a sequencing based technology to measure geneexpression. There are many high throughput sequencingtechnologies for RNA-seq. Some of the recent technologiesinclude Illumina HiSeq/NextSeq/MiSeq/NovaSeq, Roche454 FLX Titanium/FLX+, and Life Technologies SOLiD/IonPGM/Ion Proton [17], [18]. Following sequencing, the re-sulting reads are either aligned to a reference genome orreference transcriptome, or assembled de novo without thegenomic sequence to produce a genome-scale transcrip-tion map that consists of both the transcriptional structureand/or level of expression for each gene. Usually the readsare classified as three types: exonic reads, junction readsand poly(A) end-reads. These three types are used to gen-erate a base-resolution expression profile for each gene [19].scRNA-seq is one of the most recent advances in RNA-seqthat enables the identification of new, uncharacterized celltypes in tissues [5].

RNA-seq technology overcomes most limitations of mi-croarrays. In a single RNA-seq experiment, it is possible toinvestigate not only gene expression, but also alternativesplicing [20], novel transcript expression [21], allele specificexpression [22], gene fusion events [23] and genetic varia-tion. This technology shows highly accurate measurementof expression levels [15]. One major advantage of RNA-seqis that it can measure almost 70000 non-coding RNAs, whichcannot be done by microarrays. Such non-coding RNAs playvital roles in diseases [24]. The results of RNA-Seq also showhigh levels of reproducibility, for technical replicates [15].Although a decade ago, RNA-seq data analysis was a trou-blesome task due to non-availability of a standard pipelineand gold standards, in recent years, the complexity of thetask has been significantly minimized due to availabilityof relatively standard and well-defined pipelines. RNA-seqgenerates big volumes of data, and therefore, needs special-ized algorithms and powerful servers to conduct analysis.

These two technologies are governed by same statisticalprinciples and they generate statistically relevant sampleswhich have no major statistically significant differences.Both methods suffer from background noise and biase. Ithas been reported that correlation between gene expressionprofiles generated by Affymetrix one-channel microarrayand RNA-seq is a Spearman correlation coefficient of 0.8[25]. Moreover, it has been observed that measuring expres-sion of low abundance genes with both the technologies ischallenging, due to the presence of background noise. Theselow abundance genes have poorer correlations than highabundance genes.

A microarray covers approximately 20% of all genes onaverage whereas RNA-seq provides a comprehensive viewof the transcriptome. Microarray analysis is straightforwardand it has lower cost than RNA-seq. However, there is thepossibility of reducing the cost of RNA-seq using lowersequencing depths. Unlike continuous probe intensity basedmicroarrays, de-novo analysis of samples without a ref-erence genome is possible in RNA-seq. When analyzing

microarray data, researchers assume that gene intensitiesacross experimental replicates follow a normal distributionor a heavier tailed distribution whereas in RNA-seq, theyassume that the read counts follow Poisson or NegativeBinomial distribution [26]. The higher the superiority andsensitivity of the RNA-seq data, higher is the ability toidentify accurate DEGs.

3 TRANSCRIPTOME PROFILING

The starting point for RNA-seq data analysis is RNA-seqreads. Subsequent analysis steps include quality control,mapping of reads with or without reference sequence:genome or transcriptome, expression quantification, DEanalysis, and pathway or network analysis to gain biologicalinsight. In this section, we address three major analysissteps, viz., quality control and read alignment with orwithout reference sequence. In every step of RNA-seq dataanalysis, there is scope for adoption of big data technologiesto handle massive sizes of data [27]. We present a workflowfor DE analysis for RNA-seq data in Figure 1. In Figure2, we present another view of the pipeline to illustrateeach step better. A taxonomy of RNA-seq data analysistools is presented in Figure 3. Well-known repositoriesand blogs for RNA-seq data analysis are available at Bio-conductor [28], https://omictools.com/rna-seq-category,http://bioinformaticssoftwareandtools.co.in/ngs.phpand http://www.rna-seqblog.com/category/technology/methods/data-analysis/.

Apart from the mentioned workflow of RNA-seq dataanalysis in Figure 1, scRNA-seq requires an additional stepcalled cell quality control. Here, detection and filtration oflow quality cells are performed. The low quality cells maybe considered dead cells [29]. SCell [30] is a tool whichcan filter such low quality cells by comparing the numberof genes expressed in the background level to the averagenumber of genes expressed for a given sample. scPipe [31]is another effective tool (pipeline) which supports differentpreprocessing steps including filtration of low quality cells.SinQC [32] can find the low quality samples by integratinggene expression pattern and data quality information.

3.1 RNA-seq Reads Quality Checking

Low quality reads can arise due to problems in librarypreparation and sequencing. In addition, PCR artifacts,untrimmed adaptor sequences, and sequence specific biascan also lead to low quality reads. So, to analyze the rawread data, it is necessary to perform quality checking tohandle information loss and to improve downstream bio-logical analysis. Quality checking helps identify and reducebias from RNA-seq data. To deal with parameters suchas read quality, presence of adaptors, GC content, overrepresentation of k-mer and reads that are duplicated, anappropriate quality control mechanism is necessary. Severaltools are available for quality analysis. These are discussedbelow and a comparison is given in Table 1. Detailed per-formance assessment of these tools can be seen in [33]. Mosttools run on Linux, Windows and MAC operating systems.All tools discussed below for quality control of RNA-seqreads support reads generated from the Illumina platform,

Page 4: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 4

Fig. 2. RNA-seq data analysis pipeline for DE analysis. Reads of samples generated from the sequencing machine are aligned to the referencegenome or transcriptome, and mapped reads are summed to quantify the expression. Each RNA-seq experiment generates a vector of read countsof p genes. Combining results of experiments generates a p×n matrix, DE analysis identifies list of DEGs. Finally, biological insight can be gained.

Fig. 3. Taxonomy of RNA-seq data analysis tools. RNA-seq analysis is divided into three categories: (I) Pre-analysis where read quality controlis performed (II) Core analysis performs categorization into three ways: (a) transcriptome profiling, (b) differential expression analysis and (c)functional profiling. In transcriptome profiling, alignment and expression quantification is performed. Expression quantification may be done at exon,gene or transcript level. Based on whether reference sequence is available or not, alignment are performed by mapping reads to the referencesequence by a splice aware or unaware aligner or by constructing a de novo trascriptome assembly by taking strand information as well and qualitycontrol. DE analysis tools can be classified into parametric and non-parametric approaches. Power analysis is performed before DE analysis. Twocommonly used approaches for functional profiling are enrichment analysis and comparing DEGs against the rest of the genome. (III) Advancedanalysis includes gene fusion detection and visualization. For visualization of analysis, web-based or stand-alone tools can be used.

RQC: Reads Quality Control; TP: Transcriptome Profiling; EQ: Expression Quantification; UB: Union-exon Based; TB:Transcript Based; DNT: De novo Transcriptome; DEA: Differential Expression Analysis; FP: Functional Profiling; EA:Enrichment Analysis; CAG: Comparing Against Genome; SA: Stand-alone; WB: Web-Based.

whereas a few tools support reads generated from otherplatforms.

a) FastQC1: FastQC provides a quality control (QC) report,which can eliminate problems that originate during thesequencing process or when starting library preparation.It can run in interactive or non-interactive mode. Thisstandard tool can be used to perform quality control ofscRNA-seq reads. This well-documented tool has beendeveloped using the Java programming language andrequires a suitable Java Runtime Environment.

b) NGSQC [34]: This comprehensive quality control toolcan be used in any deep sequencing platform. It checkswhether the most interesting sequencing data are in-fluenced by quality issues. It supports cluster comput-ing, and can be used in a large sequencing project.This well-documented tool has been developed using aGNU make, shell script and the Python programminglanguage, and runs in Linux environment. It requiresBOWTIE, gnuplot and Sun Grid Engine or TORQUE ascluster manager if NGSQC is running on a Linux cluster.Non-academician need license to use it.

c) HTQC [35]: The High-Throughput Quality Control toolincludes programs for read quality control checking, read

1. https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

filtration and generation of graphic reports. This well-maintained tool runs in Linux and was developed usingC++ and Perl programming languages. It has high run-time efficiency.

d) PRINSEC [36]: This tool is for easy and rapid qualitycontrol. It provides a variety of options such as filteringof sequences, reformatting and trimming to improvedownstream analysis. This Linux based tool was devel-oped using the Perl language. However, the tool is notwell-documented or well-maintained.

e) Trimmomatic [37]: Trimmomatic is a flexible and efficienttool, which can correctly handle paired end data. It canbe used to drop low quality reads present through 3′ end.It works well for both reference based and de novo tasks.This OS independent tool was developed in Java 1.5 andit is well-documented.

f) Cutadapt [38]: This command-line tool offers trimming ofadaptors added during library preparation and can han-dle contaminating sequences based on user parameters.This very well-documented and maintained tool wasdeveloped using Python and C programming languages.It runs on Windows, Linux and Mac OS X.

g) SolexaQA [39]: SolexaQA is a fast, automated user-friendly package that can compute and control readquality for data from Illumina sequencing technology.

Page 5: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 5

Fig. 1. RNA-seq data analysis workflow. If reference sequence genomeor transcriptome is present, quality reads are mapped to the genome ortranscriptome using reference sequence, else de novo transcript recon-struction is required to map reads to the trascriptome. Next, mappedreads for each sample are assembled into gene-level, exon-level ortranscript-level for expression quantification. Then, the summarized dataare normalized to generate count data. DE analysis is performed next tofind DEGs. Finally, biological insight from these DEG lists can be gained.

It was developed using Perl and R, and runs in Unixenvironments. The maintenance and documentation ofthis tool are satisfactory.

3.2 Read Alignment

The conventional pipeline for RNA-Seq data includes align-ing reads to an annotated reference genome or transcrip-tome, and quantifying the expression of genes. Alignmentis necessary to discover origins of reads with respect to thereference sequence. It is important to consider parameterssuch as type of read, single- or paired-end (SE or PE),strandedness of RNA-seq library and length of sequencedfragments.

The main aim of alignment is to accurately align se-quence reads to intron boundaries. Some features of the ref-erence genome may create problems during read alignmentfor a subset of reads. These features include assembly errors,and the presence of repetitive regions and assembly gaps.Polymorphism is another problem which occurs when readsare aligned to multiple locations on the reference, also calledmultireads. It is important to resolve mapping ambiguityfor accurate detection and quantification of a transcript [40].This problem can be handled by discarding multireads or byrandomly assigning multireads to a position from possiblematches.

Read mappers for RNA-Seq data are divided broadlyinto two categories, unspliced aligner and spliced aligner[11]. Splice aware aligners are of two types, seed-and-extend and Burrows-Wheeler transform (BWT) [41] based.An unspliced aligner aligns reads against the transcriptomefor expression quantification estimation. On the other hand,spliced read aligners are generally used to align readsonto reference genomes. Spliced read aligners allow largegaps to match introns during alignment. When a reference

genome is available, after mapping of reads using a splicedaligner to the genome, transcript discovery and identifica-tion are performed with or without an annotation file [5]. Iftranscript discovery is not required, reads can be mappedto the transcriptome using an unspliced aligner and thentranscript identification is performed [5]. In principle, mostmethods developed for RNA-seq alignment can also be usedfor scRNA-seq alignment [42], [43]. But, the barcodes (UMI)attached to cDNAs need to be removed before alignment[44]. This can be done using UMI-based tools [9], [45], [46]. Asystematic evaluation of spliced and unspliced aligners canbe found in [47]–[49]. For more information on alignmentalgorithms, refer to [50]. We discuss tools for mapping oralignment below, with a comparison among them in Table2.a) TopHat [43]: It uses a fast RNA-seq mapping strategy,

and can quickly align reads to a reference genome with-out relying on unknown splice sites. Initially, it mapsunspliced reads to locate exons, and finally, unmappedreads are split and aligned independently to identifyexon junctions [51]. TopHat2 [51] is scalable and supportsa wide range of read lengths, making it useful for mostRNA-seq and scRNA-seq experiments [44]. It is a well-documented and well-maintained tool, implemented inC++ and Python, and runs on Linux and Mac OS. It is afast tool and can map 2.2 million reads in an hour.

b) GSNAP [52]: It allows fast detection of complex variantsand splicing in short SE and PE reads. It can detect shortand long-distance splicing, using probabilistic models ora database of known splice sites. It can perform SNP-tolerant alignment to a reference space. This command-line tool is implemented in C and Perl and can be usedin scRNA-seq reads alignment [44]. This tool is well-documented and updated regularly.

c) MapSplice [53]: This efficient tool is not dependent on anyprior knowledge of splice sites to detect splice junctions.It finds non-canonical junctions and other novel splicingevents using the quality and diversity of read align-ments. It performs well in terms of sensitivity, specificity,CPU utilization and memory efficiency.

d) STAR [54]: Spliced Transcript Alignment to a Reference(STAR) is an ultrafast sequence aligner. It uses sequentialmaximum mappable seed search in uncompressed suffixarrays, followed by seed clustering and stitching. STARoutperforms other aligners in mapping speed by a factorof 50 or more, while improving alignment sensitivityand precision. It can map full-length RNA sequences.This tool is also used in scRNA-seq read alignment3.This well-documented and maintained stand-alone toolis implemented in C++ and can run on Linux.

e) GEM mapper [55]: Genome Multitool (GEM) mapper usesstring matching to search the alignment space efficiently.It has high precision and speed, and returns all possi-ble matched strings, including gapped ones. This well-documented command-line tool is implemented in C andPython.

f) Bowtie [56]: This ultrafast, memory-efficient tool alignsshort DNA sequence reads to large genomes. It can-

3. https://hemberg-lab.github.io/scRNA.seq.course/introduction-to-single-cell-rna-seq.html#scrna-seq

Page 6: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 6

TABLE 1RNA-seq Reads Quality Control Tools and their Comparisons

Name & Year Input Output Functionalities WB/SA GUI/CLAvailabilityFastQC, 2014 FASTQ, SAM, BAM HTML Reporting SA GUI www.bioinformatics.babraham.ac.uk/

projects/fastqc/NGSQC, 2012 FASTA, FASTQ HTML Trimming SA CL nipgr.res.in/ngsqctoolkit.htmlHTQC, 2013 FASTQ FASTQ Filtering and trimming SA CL https://sourceforge.net/projects/htqc/PRINSEQ, 2011 FASTA, FASTQ FASTA, FASTQ, HTML Filtering and trimming Both CL prinseq.sourceforge.net/Trimmomatic, 2014 FASTQ FASTQ Filtering and trimming SA CL www.usadellab.org/cms/index.php?page=

trimmomaticCutadapt, 2011 FASTQ FASTQ Trimming SA CL cutadapt.readthedocs.io/en/stable/SolexaQA, 2010 FASTQ FASTQ, PNG Trimming SA CL solexaqa.sourceforge.net/

SAM: Sequence Alignment/Map, BAM: Binary SAM, HTML: HyperText Markup Language, PNG: Portable Network Graphics, WB: Web-Based, SA: Stand-Alone,GUI: Graphical User Interface, CL: Command-Line, FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, FASTQformat is a text-based format for storing both a biological sequence and its corresponding quality scores.

not handle spliced transcripts. Alignments can be par-allelized by distributing reads across search threads.Bowtie 2 is faster and aligns a bigger fraction of readsthan Bowtie [57]. It can run on all platforms and binariesare currently available for 64-bit Intel architectures. It iswell-documented and well-maintained.

g) HISAT [58]: A widely used tool to align reads to agenome, HISAT2 is the successor of HISAT and TopHat2and is expected to be the core of the next version ofTopHat (TopHat3). It can detect splice variants and usesless memory than STAR. This stand-alone command-linetool can run on any platform. It is well-documented andmaintained.

3.3 Quality Control for Mapping Reads

RNA-Seq data has numerous inherent sources of bias, in-cluding hexamer, 3′, GC, amplification, mapping, sequence-specific, and fragment-size biases. Therefore, quality controlis important for downstream biological analysis. The successof reference-based alignment depends on the quality ofthe reference genomes. These techniques may suffer frommissing or erroneous information, because they rely on thereference genome and annotation information [59]. Often,the sequencing technology, protocol and/or the selectedmapping algorithm introduce unwanted biases in alignmentdata as well in results. The detection of such biases is non-trivial. Most tools for quality control display their outputin graphs. Here are a few tools to ensure quality control ofmapping reads and these are compared in Table 3.

a) Picard4: Picard is a group of command-line tools forquality control of mapping reads. This tool producesmetrics detailing the quality of read alignments. Thisis a well-documented and well-maintained stand-alonecommand-line tool, implemented in Java.

b) RSeQC [60]: This package takes into account different as-pects of RNA-seq experiments, such as sequence quality,GC bias, polymerase chain reaction bias, nucleotide com-position bias, sequencing depth, strand specificity, cov-erage uniformity and read distribution in the genome.This tool is also used with scRNA-seq data3. It is well-documented and maintained. It was implemented inPython and C. This also runs using GenePattern5 for webinterface.

4. http://broadinstitute.github.io/picard5. http://www.GenePattern.org

c) AlignerBoost [61]: AlignerBoost can estimate the qualityof ambiguously mapped reads, and can increase map-ping precision. It uses a Bayesian framework to findquality of mapping reads. If we can provide knownSNPs, it can achieve higher quality alignment. It isa Java based command-line stand-alone tool that canrun on any platform. This well-maintained and well-documented tool needs an aligner of user choice.

d) QoRTs [62]: It can detect and identify errors, biases andartifacts produced by paired-end RNA-seq technology.

e) RNA-SeQC: [63]: This is a tool that provides qualitymeasures for alignment and duplication rates, GC bias,rRNA content, regions of alignment (exon, intron andintragenic), continuity of coverage, 3′/5′ bias and countof detectable transcripts, among others to estimate thequality of data. It provides three quality control metrics:Read Counts, Coverage and Correlation. This tool is alsoused with scRNA-seq data. It can be used in any platformthat supports Java and R. It is a well-documented andwell-maintained command-line tool.

f) QuaCRS [64]: This integrated quality control tool com-bines FastQC, RNA-SeQC, and selected functions fromRSeQC to provide a detailed quality report. This is astand-alone, well-documented GUI based tool imple-mented in Python.

g) MultiQC [65]: It has the ability to support different align-ers, processing tools and QC programs. Currently thesupported tools can be viewed at https://github.com/ewels/MultiQC. It scans given directories for recognizedlog files and generates a summary of parsed log files inHTML format. This command-line and well-documentedtool are implemented in Python.

3.4 De Novo Transcript Assembly

When there is no reference genome for transcript assem-bly, it is necessary to perform de novo reconstruction thatleverages the redundancy of short-read sequencing to findoverlaps between the reads and assemble them into a tran-script [66]. However, de novo methods are computationallyintensive and may require long PE reads and high levelsof coverage to work reliably. In general, long reads andPE strand-specific sequencing are preferable because theycontain more information than short and SE reads [67]. Afterde novo reconstruction, quantification can be performed.For expression quantification, reads are aligned to the novelreference transcriptome, and further analysis is performed

Page 7: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 7

TABLE 2Comparison among Reference-based Alignment Tools

Aligner & version Year Pairedend?

Paralleli-zation?

Algorithm type Splice-aware?

IA? Alignmentformat

Availability

TopHat (2.1.1) 2009 Yes Yes BWT Yes No BAM tophat.cbcb.umd.edu/GSNAP 2010 Yes No Seed and extend Yes Yes SAM http://research-pub.gene.com/gmap/MapSplice (1.15.2) 2010 Yes No BWT Yes No SAM www.netlab.uky.edu/p/bioinfo/MapSpliceSTAR (2.5.3a) 2013 Yes Yes Seed and stitch Yes Yes BAM code.google.com/p/rna-star/GEM mapper 2012 Yes No Seed and extend No Yes SAM gemlibrary.sourceforge.netBowtie (1.2.0) 2009 No Yes BWT No Yes SAM bowtie.cbcb.umd.eduHISAT (2) 2015 Yes Yes BWT Yes Yes SAM www.ccb.jhu.edu/software/hisat/index.shtml

IA: Independent alignment, BWT: Burrows Wheeler transform.

TABLE 3RNA-seq Reads Alignment Quality Control Tools and their Comparisons

Name & Year Input Other functionalities AvailabilityPicard SAM, BAM,

CRAM, VCFValidating library construction including the insert size distribution and readorientation of paired-end libraries.

broadinstitute.github.io/picard/

RSeQC, 2012 SAM, BAM Visualization of results. http://rseqc.sourceforge.net/AlignerBoost,2015

SAM, BAM,BED, VCF

Removing all non-uniquely mapping reads and conversion of files format. https://github.com/Grice-Lab/AlignerBoost

QoRTs, 2012 SAM, BAM Visualization and data processing. hartleys.github.io/QoRTsRNA-SeQC,2012

BAM, RefGen,GTF,

RNA-SeQC allows investigators to make informed decisions about sampleinclusion in downstream analysis.

www.broadinstitute.org/rna-seqc/

QuaCRS, 2014 FASTQ, BAM New database approach for RNA-seq QC analyses and report generation. http://bioserv.mps.ohio-state.edu/QuaCRS/

MultiQC, 2016 Log files Single cell data and population studies, Sequencing facilities http://multiqc.infoCRAM: Presents a compressed version of the alignment, VCF: Variant Call Format, GTF: Gene Transfer Format, RefGen: Reference gene information in the genome .

[5]. After alignment, transcript identification is performed,followed by functional annotation of that transcript [5].There are many tools to assemble a de novo transcriptome.These are discussed below with a comparison in Table 4. De-tailed assessment of transcript reconstruction methods canbe seen in [68], [69]. These tools can broadly be divided intotwo groups based on whether stranded reads are supportedor not. The reason stranded read based tools are better is thatwe can tell which strand of RNA is being transcribed withsuch tools. Before assembly, some preprocessing is requiredon raw reads to obtain high quality reads for subsequentassembly. We need a trimming procedure to remove adaptorsequences, ambiguous nucleotides and reads with qualityscores less than 20 and length below 30 bp. De novo trascriptassembly has advantages as it does not depend on a refer-ence genome. As a result, for organisms which do not havea reference genome, de novo assembly can provide an initialstrategy. This method is suitable only for highly expressedtranscripts, although using a large sequencing depth thislimitation can also be eliminated.a) Oases [70]: It is a heuristic RNA-seq reads assembler

in the absence of a reference genome. It works acrossa broad spectrum of expression values, and in thepresence of alternative isoforms. It uses dynamic errorremoval from RNA-seq data to predict a full lengthtranscript. This well-documented command-line tool isimplemented in C and runs in 64 bit Unix environments.It requires 12 GB physical memory for installation.

b) Trinity [71]: Trinity is a popular tool that uses dynamicprogramming to solve the splicing problem by identify-ing potential paths in the de Bruijn graph generated fromreads or PE reads. This tool can detect isoforms. Therunning time of the approach increases exponentiallywith respect to the number of branches in the graph.This command-line, well-documented and maintainedtool requires 1 GB RAM per 1 million pairs of Illumina

reads, and is implemented in Java.c) SOAPdenovo-Trans [72]: This assembler uses the error-

removal model from Trinity and the robust heuristicgraph traversal method from Oases. It also uses stricttransitive reduction to provide accurate results by simpli-fying the scaffolding graph. It provides higher contiguity,lower redundancy and faster execution. This command-line tool is implemented in C and C++. This is well-documented and requires 64-bit Linux system with largephysical memory.

d) Rnnotator [73]: It is an automated software pipeline thatgenerates transcript models with highly accurate contigs.It addresses limitations due to short reads, and assemblyerrors due to poor quality of reads. This tool producesfull-length transcript assemblies with deep sequencingcoverage. This tool is only available for free for collabo-rators of the developer. Other users need a commerciallicense.

e) IDBA-tran [59]: IDBA-Tran is able to assemble bothhighly and lowly expressed transcripts. It uses a prob-abilistic progressive approach to remove erroneous ver-tices or edges generated from highly expressed isoformsfrom a de Bruijn graph while keeping the correct oneswith lowly expressed isoforms. This well-documentedand maintained command-line tool is suitable for Unix-like systems with GCC installed.

f) ABySS [74]: The Assembly By Short Sequences tool isfor parallel short read sequencing data, and can handlegenome of any size. It supports larger k-mer sizes, whichimprove the assemblies and complexity of the graph.It represents the graph in a distributed way to supportparallel computing. This is a well-documented and main-tained stand-alone tool that can be run in Linux and MacOS X. It is implemented in Python.

g) Bridger [75]: It combines ideas used in Cufflink andTrinity to produce a de novo transcriptome assembly.

Page 8: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 8

It is used to assemble full-length reference transcripts.It outperforms many other assemblers in running timeand memory requirements. It overcomes the limitationsof Trinity, and is comparable to the reference-based as-sembler Cufflink in both sensitivity and specificity. Thisis a command-line tool, for which documentation andmaintenance are poor.

h) Trans-AbySS [66]: It is a de novo short-read transcriptomeassembly and analysis pipeline that addresses variationsin local read densities by assembling read substringswith varying stringencies, and then merging the result-ing contigs. It has very high sensitivity and specificity.This is a well-documented and maintained stand-alonetool, that runs in Linux. It is implemented in Python andrequires ABySS and BLAT [76].

i) BinPacker [77]: It is a packing-based full-length de novotranscriptome assembler and is reported to outperformalmost all other existing de novo assemblers. It runs fastand takes less memory than most other assemblers. Thisis a well-maintained command-line tool. It is a very ef-fective tool. However, documentation is not satisfactory.

j) StringTie [78]: It is an efficient assembler of RNA-Seqalignments into transcripts. It uses the optimization tech-nique of maximum flow to construct a network to deter-mine gene expression levels. It incorporates alignmentsto both a genome and a de novo assembly of reads. Avery well-documented and well-maintained command-line tool, it can be run on Linux and Mac OS X and isimplemented in C++.

k) Scripture [21]: It can reconstruct a trascriptome usingreads and the genome sequence. It uses gapped align-ments of reads across splice junctions and reconstructsreads into statistically significant transcript structuresby performing segmentation. This tool can be used toreconstruct full-length gene structures. This is a well-documented command-line tool implemented in Javaand can run on any platform.

3.5 Quality Control of De Novo Assemblies

It is important to assess the quality of de novo transcrip-tome assemblies to identify parameters which can improvedownstream biological analysis. Some quality metrics usedwith transcriptome assemblies are listed in Table 5.

TransRate [79] is a tool which can assess the quality of denovo assemblies by evaluating accuracy and completenessusing only input reads. It calculates these through twonovel reference-free statistics: the TransRate contig score andthe TransRate assembly score. The contig score measures aquantitative score for the accuracy of an assembly of indi-vidual contigs. The assembly score measures accuracy andcompleteness of the assembly. DETONATE [80] is anothertool for the evaluation of quality of de novo transcriptomeassemblies. Wang and Gribskov [81] performed a com-prehensive evaluation of different de novo transcriptomeassembly programs and their effect on downstream DEanalyses. After assembly quality assessment, the authorsreported that almost 70% of the DEGs were common be-tween reference-based sequences and de novo assemblies.The remaining 30% were not common due to factors suchas incomplete gene annotations, exon level differences, and

transcript fragmentation. The authors suggested that it isa good idea to perform de novo analysis even when areference genome is available to eliminate biases.

3.6 Expression Quantification

Transcript quantification is the most important applicationof RNA-seq analyses. To measure gene and transcript ex-pressions, one needs to compute the number of reads thatmap to a transcript sequence. The algorithms for genequantification can be divided into two categories: transcriptbased and union-exon based approaches [12].

In union-exon based methods which are simple, alloverlapping exons of the same gene are merged into aunion of exons. Although commonly used, this approachsignificantly underestimates gene expression levels. Whenquantifying expressions, special care must be taken not todouble-count the so-called multi-mapping reads, althoughdiscarding multi-mapping reads leads to significant infor-mation loss.

There are a few union-exon counting tools, e.g., HT-Seq, easyRNASeq and FeatureCounts, which aggregate rawcounts of mapped reads. For expression quantification, it isnecessary to start from the genome coordinates of exons andgenes. The outcome is raw read counts data. The generatedread count data contain different types of biases because offactors such as transcript length, total number of reads, andsequencing biases. So, before DE analysis, for reliable com-putation of expression levels of genes, normalization mustbe performed. Multi-mapping of reads creates problems inachieving highly accurate quantification [83].

There are many sophisticated algorithms to esti-mate transcript-level expression. Algorithms for transcriptquantification for transcriptome mappings include eX-press, RSEM (RNA-Seq by Expectation Maximization) andKallisto. The main challenge of transcript level quantifica-tion is to handle problems related to the sharing of reads byrelated transcripts. Teng et al. [84] performed a comparativestudy and found that RSEM out-performed competitors. Afew tools are discussed below and a comparison is given inTable 6. Many repositories of publicly available count datacan be found in [85].a) HTSeq [86]: HTSeq is a tool that assigns expression values

to genes based on reads that they have been aligned withby an aligner such as STAR and HISTAT. HTSeq providesparsers for reference sequences, short reads and short-read alignments, and for genomic features, annotationand score data. This tool is also used for scRNA-seqexpression quantification3. It also provides functions forquality control and preprocessing for DE analysis. It canrun on any platform and is implemented in Python.This is a well-documented and maintained stand-alonecommand-line tool.

b) easyRNASeq [87]: It goes through several steps for ex-pression quantification, such as reading in sequencedreads, retrieving annotations, summarizing read countsby features of interest, e.g., exons and genes, and fi-nally reporting results. This well-documented stand-alone command-line tool is implemented in R.

c) FeatureCounts [88]: FeatueCounts is similar to HTSeq, butis much faster and requires less memory. It works with

Page 9: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 9

TABLE 4Comparison of De Novo Assembly Reconstructions Tools

Assembler Year Parallelization?

Pairedend reads?

Strandedreads?

Multiple insertsizes?

Availability

Rnnotator 2010 Yes Yes Yes Yes Contact David Gilbert ([email protected])Trans-ABySS (1.5.5) 2010 Yes Yes No Yes www.bcgsc.ca/platform/bioinfo/software/trans-abyssOases (0.2.08 ) 2012 Yes Yes Yes Yes www.ebi.ac.uk/∼zerbino/oases/Trinity (2.4.0) 2010 Yes Yes Yes No trinityrnaseq.sourceforge.net/SOAPdenovo-Trans (1.04) 2014 Yes Yes No Yes https://sourceforge.net/projects/soapdenovotrans/IDBA-tran (1.1.1 ) 2013 No Yes No No www.cs.hku.hk/∼alse/idba tranABySS (2.0.2) 2009 Yes Yes Yes Yes www.bcgsc.ca/platform/bioinfo/software/abyssBridger 2015 No Yes Yes Yes https://sourceforge.net/projects/rnaseqassembly/files/

?source=navbarBinPacker 2016 Yes Yes Yes – https://sourceforge.net/projects/transcriptomeassembly/

files/StringTie (v1.3.3) 2015 Yes Yes Yes – ccb.jhu.edu/software/stringtieScripture 2010 Yes No Yes Yes software.broadinstitute.org/software/scripture/

TABLE 5Transcriptome quality metrics [82]

Metric name Formula Descriptions

Accuracy 100×

∑M

i=1Ai∑M

i=1Li

It is the percentage of correctly assembled bases estimated with the help of a set of expressed referencetranscripts (N) or a reference genome. Li is the length of the alignment between a reference transcript and anassembled transcript Ti, Ai is the number of correct bases in transcript Ti, and M represents the number ofbest alignments between assembled transcripts and the reference.

Completeness 100×

∑N

i=1l(Ci≥δ)

N

It is the percentage of expressed reference transcripts covered by all the assembled transcripts. l is an indicatorfunction, andCi (the percentage of a reference transcript, i, that is covered by assembled transcripts) is greaterthan some arbitrary threshold, δ: for example, 80%

Contiguity 100×

∑N

i=1l(Ci≥δ)

N

It is the same as completeness, the only difference being that in place of all the assembled transcripts, we needto consider a single longest transcript.

Variant reso-lution

100×

∑N

i=1

max((Ci−Ei),0)Vi

N

It is the percentage of transcript variants assembled. This can be calculated by the average of the percentageof assembled variants within the reference. Ci and Ei are the numbers of correctly and incorrectly assembledvariants for reference gene i, respectively, and Vi is the total number of variants for i.

Chimerism – The percentage of chimaeras that occur owing to misassemblies among all of the assembled transcripts.

both SE and PE reads. This standard tool is also used forscRNA-seq expression quantification3. It is implementedin C and can run on Unix. It also supports multithreadingfor speed-up.

d) Kallisto [89]: Kallisto is a program for quantifying abun-dance of transcripts in RNA-Seq data. It can process bothSE and PE reads. This is a well-documented and a widelyused command-line tool. It can run on Mac OS X, Ubuntuand CentOS.

e) EMSAR [90]: It is a tool for transcript quantification. Itcan handle multiread mapping and is efficient in CPUtime and memory. It supports both SE and PE reads. Thiscommand-line tool is implemented in C.

f) RSEM [91]: The RSEM algorithm is based on expecta-tion maximization and returns transcripts per million(TPM) values [91]. Teng et al. [84] performed a compar-ative study based on specificity and sensitivity amongdifferent quantification methods and found that RSEMperforms the best. It can also perform read alignment.It shows linear increase in running time with increasein read alignment. This stand-alone command-line, andwidely used tool is implemented in C++ and Perl andruns on Linux and Mac OS X. By default, it uses Bowtieas an alignment tool. It is a well-documented and main-tained tool.

g) Salmon [92]: This lightweight tool measures transcriptabundance. Most methods suffer from GC-content biasbut this method can correct this bias on its own, making iteffective. It can also perform mapping if required, beforequantification. It supports parallel processing as well.This command-line tool is implemented in C++.

4 NORMALIZATION OF RNA-SEQ DATA

Despite initial optimistic claims that RNA-seq read countdata do not require sophisticated normalization [19], ingeneral it is an important issue for downstream biologicalanalysis. After getting the read counts, it is essential toensure accurate inference of gene expression and propersubsequent analysis, and normalization helps theese pro-cesses. A number of normalization methods for count dataare enumerated in Table 7.

For effective normalization, one must take into accountfactors such as transcript size, GC-content, sequencingdepth, sequencing error rate, and insert size [7]. To eliminatebias in normalization for a dataset, one can use different nor-malization methods and compare corresponding estimatedperformance parameters using measurement error models[7]. Comparative analysis or integrative analysis shows thatan approach like quantile normalization can improve theRNA-seq data quality including those with low amountsof RNA [6]. An R package called EDASeq can reduce GC-content bias [93]. The NVT package [94] can identify the bestnormalization approach for a RNA-seq dataset by analyzingand evaluating multiple methods via visualization, based ona user-defined set of uniformly expressed genes. Zyprych-Walczak et al. [95] provide a procedure to find the optimalnormalization for a specific dataset. They report that aninappropriate normalization method affects DE analysis.Please refer to [96] for detailed performance evaluation ofnormalization methods.

Batch effects are common and powerful sources of vari-ation and systematic error and generally arise because ofthe measurements made in RNA-seq and are affected by

Page 10: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 10

TABLE 6Expression Quantification Methods: A comparison

Method Year Input Output Aggregation Level Annotation File AvailabilityHTSeq 2014 FASTA/Q, SAM, BAM Count GTF, GFF Exon, Gene http://htseq.readthedocs.io/easyRNASeq 2012 BAM RPKM GFF, GTF Exon, Gene http://bioconductor.org/packages/release/bioc/

html/easyRNASeq.htmlFeatureCounts 2014 SAM, BAM Count GFF, SAF Exon, Gene http://subread.sourceforge.net/EMSAR 2015 SAM, BAM FPKM,

TPM– Transcript https://github.com/parklab/emsar

RSEM 2011 FASTA/Q TPM GTF Transcript http://deweylab.biostat.wisc.edu/rsemKallisto 2015 FASTQ TPM – Transcript https://pachterlab.github.io/kallisto/Salmon 2017 FASTA/Q, SAM, BAM TPM – Transcript https://github.com/COMBINE-lab/Salmon

GTF: Gene Transfer Format, GFF: Gene Feature Format, SAF: Simplified Annotation Format.

TABLE 7Normalization Methods

Method Abbreviation Details CommentsTrimmed meanof M-values[106]

TMM After removal of genes with the highest log expression ratios between samplesand the genes with highest expression, the weighted mean of log ratiosbetween the compared samples is used as scaling factor.

Normalized read counts are obtained by dividingraw read counts by TMM adjusted library sizes. Canbe calculated with edgeR’s calcNormFactor() function(method=”TMM”).

Relative Log Ex-pression [4]

RLE This normalization method is based on size factors that renders counts fromdifferent samples comparable.

Implemented in edgeR and DESeq packages. Can becalculated with edgeR.

Relative logexpressionmethodimplemented inDESeq [26]

DESeq This method uses a scaling factor for a sample (or lane). Per gene, the ratioof read count to its geometric mean across all samples (lanes) is calculatedand the median of this ratio is used. It is specifically developed to find DEGsbetween two conditions for RNA-seq data.

Size factor is applied to all read counts of a sample.More robust than total count normalization.Can be calculated with edgeR’s calcNormFactor()function (method=”RLE”) also implemented by Rlibrary (estimateSizeFacors() function).

Reads perkilobase permillion mappedreads [2]

RPKM It corrects for different library sizes and gene length. The number of mappedreads per gene is divided by the total number of million mapped reads (RPM)and then divided by the gene length in kilobases to yield RPKM. It assumesthat the total mRNA amount was identical in the compared samples.

It introduces a bias in per-gene variance, in particu-lar for lowly expressed genes.It is implemented in edgeR’s rpkm() function.The RPKM method is popular in practice.

Fragments perkilobase million

FPKM Mainly used for applications of paired-end sequencing, where two paired-reads can map. The number of fragments (defined by two reads each) is used.

It is implemented in edgeR’s fpkm() function.

Reads permillion mappedreads

RPM The number of mapped reads per gene is divided by the total number ofmillion mapped reads.

Transcripts permillion mappedreads [83]

TPM For TPM normalization, the read counts are divided by the gene length (inkilobases) and then divided by a scaling factor, which is the sum of the readper kilobase values.

Counts per mil-lion

CPM Each gene count is divided by the corresponding library size (in millions). –

HypothesisTesting basedNormalization[107]

HTN It is based on hypothesis testing. It incorporates knowledge about housekeep-ing genes during normalization. It reduces type I error.

It performs better than state-of-the-art normaliza-tion methods.

RAIDA [108] – It uses ratios between counts of genes in each sample for normalization Avoids problems due to highly abundant tran-scripts.

RUV [109] RUV-seq

It removes unwanted variations (RUV) that are arose due to the differentlibrary sizes, sequencing depths and other technical effects.

It enhances DE analysis by performing more accu-rate normalization than state-of-the-art methods.

Notes: All of these methods can be divided into two subgroups: ones that refer to knowledge from library (TMM, DESeq, CPM, G, HTN, RLE, MRN) or distributionadjustment of read counts (TC, UQ, Med, Q, RPKM, FPKM, RPM, TPM, RAIDA, RUV-seq).UQ, CPM, RPKM, FPKM, and TPM can be equally applied to scRNA-seq3.TC, ME, UQ, Q, MRN, RPM and G are presented in the Supplementary file (Table 1).

laboratory conditions. Batch effects impact RNA-seq orscRNA-seq analysis and can bias the results and lead to anincorrect conclusion [97]. Luckily, high throughput sequenc-ing generates enough data to detect and minimize batcheffects. Hicks et al. [98] examined scRNA-seq data from fivepublished studies and found that batch effect can create aconsiderable level of variability in cell-to-cell expression.Normalization cannot remove batch effects because it re-moves biases from individual samples by considering globalproperties of data. Batch effects, generally, affect specificsubsets of genes differently and lead to strong variabilityand decreased ability to detect real biological signals [99].COMBAT [100] and ARSyN [101] are two methods initiallydeveloped for microarray data for removal of batch effects.Later, it was found that these work well for RNA-seq dataas well [5]. Reese et al. present gPCA [102], an extension ofPCA, to quantify the amount of batch effects that exist inRNA-seq data. Swamp [103] is a GUI based R package thatcan be used to control batch effects. SVAseq [104] is a tool forremoving batch effects and noises from RNA-seq data. Liuet al. [105], in their empirical study, reported that SVAseq is

the best method for batch effect removal whereas COMBATover-corrects the batch effects.

5 DIFFERENTIAL EXPRESSION ANALYSIS

DE analysis is the most common use of transcriptomeprofiling for effective disease diagnosis. The main aim ofDE analysis is to find DEGs, i.e., determine whether geneexpression level differences between conditions are greaterthan what would be expected due to natural random vari-ations only. An important application of RNA-seq is thecomparison of transcriptomes across developmental stages,and across disease states, compared to normal cells. Thistype of analysis can be performed using DE analysis.

Many difficulties are faced during RNA-seq analysis.These include 1) Biases and errors present in the NGStechnology [110], 2) Bias introduced by nucleotide compo-sition and varying lengths of genes or transcripts duringmeasurement of read counts [13], 3) Bias due to sequenc-ing depth and the number of replicates, 4) Difficulty inaccurately discriminating real biological differences amonggroups due to combined variations in both biological and

Page 11: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 11

technical replicates, and 5) Difficulties due to the existenceof alternative gene isoforms [111].

The basic tasks of DE analysis tools are the following: 1)Find the magnitude of differences in expressions betweentwo or more conditions based on read counts from repli-cated samples, i.e., calculate the fold changes in read counts,taking into account the differences in sequence depths andvariability, 2) Estimate the significance of differences usingstatistical tests. Some commonly use such test statistics aret-test, F-test, Likelihood Ratio Test (LRT), Fisher’s exact testand Wilcoxon signed rank sum test. A brief description ofeach test is reported in Supplementary material (Table 2).

To determine the genes whose read count differences be-tween two conditions are greater than expected by chance,DGE tools make assumptions about the distribution ofread counts. The null hypothesis for DE analysis betweentwo conditions typically states that for each feature (gene,transcript, or another feature), the means µ for the twoconditions A and B are equal. One of the most popularchoices to model read counts is the Poisson distributionbecause it provides a good fit for counts arising fromtechnical replicates [16], and it is suitable when a smallnumber of biological replicates are available. However, thismodel predicts smaller variations than those seen betweenbiological replicates [112]. The data show an over-dispersionthat cannot be captured by the equal mean and varianceassumptions of a Poisson distribution. Robinson et al. [112]propose the use of a Negative Binomial (NB) distribution tomodel counts across samples and capture the extra-Poissonvariability, known as over-dispersion. The NB distributionhas parameters, which are uniquely determined by meanand variance. With the introduction of a scaling factor forthe variance, NB outperforms Poisson, and is a widely usedmodel for feature counting [13]. However, the number ofreplicates in datasets of interest is often too small to estimateboth parameters, mean and variance, reliably for each gene.Several empitical studies and evaluations of performance ofDE analysis methods can be found in [113]–[115].

5.1 Tools for Differential Expression AnalysisMany algorithms have been introduced for the identificationof DEGs from RNAseq data. Most techniques accept theread count data matrix as input. Read count is the number ofreads mapped to each gene in the samples. Some methodsdirectly accept count data as input, but others transformcount data before taking it as input. The methods that di-rectly work on count data can be classified into two groups,1) parametric and 2) non-parametric. Parametric methodsare generally influenced by outliers, but this limitation canbe overcome by non-parametric methods. To control FDR(False Discovery Rate), the Benjamini-Hochberg procedureis used [116] by most tools. A comparison among the DEanalysis tools is given in Table 8. RNA-Seq 2G [117] isa web portal containing many DE analysis tools. Sometools support RNA-seq analysis workflows and they aretreated as global tools. A few global tools are reported inSupplementary file in the section called Global Tools. Mostof these are implemented in R, making most tools platformindependent.a) Limma+Voom: It was initially developed for microarray

data for DE analysis, and later modified to count data.

It is based on linear modeling. Limma+Voom computesmean-variance relationships at the observation level. Thenormalized count data are transformed to logarithmicscale to estimate the mean variance relationship. It usesa variance stabilizing strategy to enhance performance.It is computationally fast and is relatively unaffected byoutliers. It needs at least 3 samples per condition to detectDEGs with high accuracy [114]. It assumes that rowswith zero or very low counts have been removed. It isa well-documented and widely used command-driventool, implemented in R. It can be used for time courseexperiments as well.

b) DESeq: It performs classical hypothesis testing. A smallsample size causes difficulty in RNA-seq analysis be-cause of the need to estimate the gene-wise dispersionparameters accurately. To eliminate this problem, DESeqincorporates information sharing across all genes in thedataset. The dispersion of a gene is the largest of thevalues among the individual dispersion estimates for thegene, and the value obtained from the fitted dispersion.After estimation of the mean and the dispersion param-eter for each gene, two-group comparison test statisticsare used. It is a well-documented and widely used toolthat can be run on Windows and Mac OS X, and isimplemented in R. This tool can also be used in DEanalysis of scRNA-seq reads3.

c) edgeR: It also takes a classical hypothesis testing ap-proach. Like DESeq, edgeR uses information sharingacross all genes. It computes the dispersion parameterusing a weighted likelihood approach [118]. After esti-mation of the mean and the dispersion parameter foreach gene, edgeR performs two-group comparisons oruses a Generalized Linear Model (GLM). This tool is alsoused in DE analysis of scRNA-seq reads3. It is a well-documented and widely used tool, implemented in R.

d) DESeq 2: A new version of the DESeq package, namedDESeq2 (v1.6.3), detects more DEGs, but also producesmore false positives [119]. It sets the results to NA forthe following cases: 1) all samples have zero counts ina row, 2) if a row contains an outlier, and 3) a row isautomatically filtered for having a low mean normalizedcount. It is a well-documented and widely used tool,implemented in R.

e) Cuffdiff 2: It is a part of the Cufflink package developedto identify DEGs and transcripts. This package estimatesdifferential expressions at transcript as well as at genelevels. A new version of Cuffdiff2, named Cuffdiff 2.1(v.2.1.1) released recently, detects more DEGs, althoughit also increases the number of false positives [119]. Thiscommand-line tool is implemented in C++. This widelyused tool with good documentation can run on Linuxand Mac OS X.

f) DEApp: It is the only web-based package for DE analysisof count data to the best knowledge of the authors. Thisapplication offers model selection, parameter tuning,cross validation and visualization of results in a user-friendly interface. For cross validation it has three DEanalysis methods, Limma+Voom, edgeR and DESeq2. Ituses a filtering strategy to eliminate genes with very lowcount values. This easy to use tool can be run on anyoperating system. It was implemented in R with Shiny.

Page 12: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 12

g) DSS: It uses empirical Bayes shrinkage estimate of thedispersion parameters to control count data distribu-tion. It uses the Wald statistic, and to handle differentsequence depths, it uses two-group comparison thatprovides dispersion shrinkage for multiple factors. Itfits GLMs using edgeR. In this package, underestima-tion of FDR in the top-ranked genes gives an over-optimistic certainty in the reported DEGs. This is a well-documented and maintained tool, implemented in R.

h) ImpulseDE: It detects DEGs in time-course experiments.This tool accepts two kinds of datasets: a single timecourse dataset to detect differential behavior over time(to identify DEGs across time points), or a dataset con-taining multiple conditions as well as a time-coursedataset (to identify DEGs between conditions). To in-crease efficiency, this tool allows multiple cores to fit themodel to the genes. It returns q-values for DEGs as wellas impulse model parameters, and fitted values for eachgene. This recent tool, implemented in R, can run on anyoperating system. It works on time course RNA-seq dataand scRNA-seq data.

i) SARTools: It uses DESeq2 or edgeR, discussed earlier.SARTools requires two types of input files: count datafiles containing raw counts and a target file that describesexperimental design [120]. The analysis process includesthree steps: normalization, dispersion estimation and testfor differential expression. This well-documented web-based recent tool is implemented in R.

5.2 Power Analysis of RNA-seqPower analysis estimates the likelihood of successfully find-ing statistical significance in a dataset [8]. It is very im-portant to know the number of samples required beforeconducting DE analysis to achieve desirable power. Due tothe complexity of the negative binomial model, it is difficultto estimate power and satisfactory sample size for RNA-seqanalysis [127]. In addition, normalization factor, multiplehypothesis testing, and p-value estimation create problemsin power estimation and satisfactory sample size estimation.

Ching et al. [128] conduct an empirical study by evaluat-ing the performance as power of five DE analysis methodsand they observe some interesting patterns, (1) Up to acertain number of samples, the power increment is pro-portional to the number of samples. (2) Higher power canbe achieved by increasing the sequencing depth. However,beyond 5-20 million reads, the power increase is minimaldepending on the dataset. (3) Generally, high fold changeand high expression values show more power than lowfold changes and low expression values. Dispersion showsa great impact on power analysis. Lower the dispersion inthe dataset, more easy it is to achieve higher power [128].Yu et al. [129] provide a framework to calculate the powerof RNA-seq analysis. Their method is able to control falsepositives at a nominal level. Some of the available tools forpower estimation are reported in Table 9. More details onpower analysis can be found in [130].

6 FUNCTIONAL PROFILING WITH RNA-SEQ

Differentially expressed biological pathways or molecularfunctions provide more explanatory results than a longlist of seemingly unrelated genes. Molecular function or

pathway characterization is performed in transcriptomestudies, and DEGs are generally involved in molecularfunctions or pathways. So, it is important in a transcriptomicstudy to characterize molecular functions or pathways inwhich DEGs are involved. Two widely used approachesfor functional characterization are (a) gene set enrichmentanalysis (GSEA) and (b) comparing DEGs against the rest ofthe genome. To study functional enrichment between twogroups, GSEA is widely used because of its effectiveness[135]. These approaches were initially developed for mi-croarray technology, but later these have been adapted toRNA-seq. Many tools have also been developed for RNA-seq data for functional profiling. GOseq [136] can estimatethe effect of bias, and can adapt accordingly in functionalprofiling. Gene Set Variation Analysis (GSVA) [137] or Se-qGSEA [138] can perform enrichment analyses similar toGSEA, using prior knowledge such as splicing informa-tion. SeqGSEA detects more biologically meaningful genesets without bias toward longer or more highly expressedgenes [138]. GAGE is another method for pathway analy-sis. It is applicable to both microarray and RNA-seq data.GSAASeqSP [139] is a powerful platform for investigatingdifferential molecular activities within biological pathways.SeqGSA [140] is a promising tool for testing significant genepathways within RNA-seq data that can take into accountinherent gene length effects.

Functional annotation is required to study transcriptomefunctionality. Resources containing functional annotationdata include Gene Ontology, Bioconductor [141], PANTHER[142], g:Profiler [143], clusterProfiler [144], Enrichr [145],ToppGene [146], WebGIVI [147], GOEAST [148], GOrilla[149], and DAVID [150]. PANTHER uses a comprehensiveprotein library with human curated pathways and evolu-tionary ontology for functional enrichment. g:Profiler is aweb-based tool that performs enrichment analysis for geneontologies, KEGG pathways and protein-protein interaction.clusterProfiler automates the process of biological term clas-sification and enrichment analysis of gene clusters. Enrichris a tool for performing gene overrepresentation analysisusing a comprehensive set of functional annotations. Topp-Gene is used for DEG prioritization or functional enrich-ment using either functional annotations or by identifyingand prioritizing genes responsible for diseases. WebGIVI isan interactive web-based visualization and enrichment toolto explore gene:iTerm pairs. GOEAST is a GO enrichmentanalysis tool, whereas GOrilla is a tool for visualization ofenriched GO terms. DAVID uses knowledge from annota-tion databases such as GO [151], KEGG pathways [152],BioCarta [153] and SwissProt [154] for functional clustering.

Unannotated gene probes are excluded in functionalclustering. Transcripts discovered during de novo transcriptassembly lack functional information, and so, annotation isvery important for functional profiling of such analysis re-sults. External knowledge such as information from proteindatabases such as SwissProt [154], Pfam [155] and InterPro[156] can be used for functional annotation of protein-codingtranscripts. Gene Ontology (GO) is a controlled vocabullaryof terms for describing gene characteristics and gene prod-ucts. GO enables exchange of functional information acrossorthologs. Blast2GO [157] is another tool which allows anno-tation of DEGs using databases and controlled vocabularies.

Page 13: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 13

TABLE 8Comparisons of Common Tools for Differential Expression Analysis of RNA-Seq Data

Tools Year Model Experimental Design Normalization DE Test Statistic NP/P AvailabilityDESeq2 (v1.6.3)[121]

2014 NB Two-group comparison; multi-factor de-sign (GLM); time-series experiments

DESeq Wald test, LRT P www.bioconductor.org/packages/release/bioc/html/DESeq2.html

DESeq (v1.18.0)[26]

2010 NB Two-group comparison; multi-factor de-sign (GLM)

DESeq, TMM Exact test, LRT P http://bioconductor.org/packages/DESeq/

DSS (v2.6.0)[122]

2013 – Two-group comparison; multi-factor de-sign, and model fit via GLM.

Q, TC, M, RLE Wald test NP bioconductor.org/packages/release/bioc/html/DSS.html

edgeR (v3.8.6)[112]

2010 NB Comparison (paired or unpaired); multi-factor design (GLM); blocking;

TMM, UQ, RLE,DESeq

Exact test; LRT, F-test

P http://bioconductor.org/packages/release/bioc/html/edgeR.html

Limma+Voom(v3.22.7) [123]

2014 LND Two group comparison (paired orunpaired); multi-factor design (GLM);blocking; time course experiment

TMM t-test; F-test NP http://bioconductor.org/packages/release/bioc/html/limma.html

Cuffdiff 2 [111] 2013 NB – DESeq, Q, FPKM t-test – https://github.com/cole-trapnell-lab/cufflinks

ImpulseDE[124]

2016 – Clustering Q F-test P bioconductor.org/packages/ImpulseDE/

SARTools [125] 2016 – – – – P https://github.com/PF2-pasteur-fr/SARTools

DEApp [126] 2017 – Two group comparison; multi-factor de-sign

CPM Multiple tests – gallery.shinyapps.io/DEApp

NB: Negative Binomial, LND: Log normal distribution, NP: Non-parametric, P: Parametric.

TABLE 9Comparisons of Power Estimation Tools

Tools Year Description CL/GUI SA/WB AvailabilityRNAseqPS[131]

2014 It assumes Poisson and negative binomial distributions. This online tool was builtusing the Shiny package in R. It has a very good web interface.

GUI WB http://cqs.mc.vanderbilt.edu/shiny/RNAseqPS/

RNASeqPower[132]

2013 It can calculate statistical power required to identify DEGs in RNA-seq analysis.It allows users to set values of important parameters such as power, desired foldchange, sequencing dept and Type I error rate.

GUI SA http://bioconductor.org/packages/release/bioc/html/RNASeqPower.html

powsimR [133] 2017 This power analysis tool is for both RNA-seq and scRNA-seq experiments. Itcan assess sample size requirement and power for DE analysis. This is well-documented.

CL SA https://github.com/bvieth/powsim

Scotty [134] 2013 It assists analysts in identifying required sample size and read depth to achieveexperimental objectives.

GUI WB http://euler.bc.edu/marthlab/scotty/scotty.php

CL: Command-Line, GUI: Graphics User Interface, SA: Stand-Alone, WB: Web-Based.

6.1 Applications of scRNA-seq and Reconstruction ofTranscriptional NetworkThere are numerous applications of scRNA-seq. Some ofthem are as follows [44]. (1) Cellular state and cell typeidentification, where clustering of cells is performed basedon count data to find hidden tissue heterogeneity, (2) Oncecell type is identified, DE can be used to find out DEGsacross different cell types, (3) Highly variable gene iden-tification without knowing the cells types a priori acrossa population of cell using statistical approaches based onbiological variability, (4) Co-regulated network and generegulatory network (GRN) identification, (5) Characteriza-tion of diversity in transcription among individual cells.

RNA-seq measures average expression values of geneslike microarrays do, but, from all cells. RNA-seq data havealso applications in Network Inference (NI). NI is a reverseengineering process which aims to infer GRN structures andparameters using high-throughput data. In NI, additionaldata can be integrated based on prior knowledge, suchas predicted or known relationships or knowledge sources[158]. Gene Regulatory Network (GRN) predicts relation-ships among genes with the help of gene expression data[159]. In a GRN, nodes represent objects of interest andedges represent relationships such as activation or repres-sion between them. These relationships between a pair ofgenes can happen through binding of Transcription Factors(TF) or via signaling cascades or pathways [159]. A GRNhelps in extracting regulatory information from expressionpatterns. Generally, genes with similar expression patterns

are regulated by same TF. So, the reason for creating a GRNis to infer (a) TF activation or repression events, (b) targetgenes for TFs and (c) extraction of the master regulators[160]. Two widely used tools to construct regulatory net-works include ARACNE [161] and GENIE3 [162]. Detailsof a comparative study of GRNs construction tools canbe found in [163]. Some biological insights that are notobtainable from RNA-seq based GRN can be obtained inscRNA-seq based GRNs such as a set of genes activatedindependently by two transcription factors and one of themis expressed in one cell and another is expressed in a secondcell would never been discovered as co-expressed [44] usingRNA-seq.

We mentioned earlier that RNA-seq measures averageexpression values of genes from all cells. These cells maybe of diverse types. So, studying heterogeneity in terms ofGRNs is difficult. scRNA-seq can provide gene expressiondata for individual cells, which enable the study of hetero-geneity in GRNs and provide insights as to the stochasticnature of gene expression and regulatory mechanisms. InscRNA-seq data, temporal information is not present. So,only static features of the underlying cellular mechanismcan be captured. But, building a GRN from scRNA-seqrequires cell ordering to infer relationships between targetsand regulators. Recently, for scRNA-seq data, Oncone et al.[3] used a cell-time ordering algorithm to generate pseudotime series observations to reverse engineer the gene expres-sion data. The authors of [164] use diffusion map and statetransition graphs to reconstruct transcriptional regulatorynetworks. SCENIC [165] is a package for simultaneous re-

Page 14: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 14

construction of GRNs and identification of stable cell states,using single-cell RNA-seq data. Please refer to [160] for moredetails on GRN inference from scRNA-seq data along withrequired tools.

7 GENE FUSION DETECTION

In gene fusion, two distinct genes are fused into a sin-gle gene as a result of chromosomal rearrangement suchas translocation, deletion, or chromosomal inversion [166].These chromosomal rearrangements can be detected byPE information, fragment lengths and orientations of NGS[167]. RNA-seq has become a useful tool for detection ofdisease associated gene fusion, and now it is consideredthe technology for gene fusion discovery [168]. Fusion tran-scripts are formed by two events called gene fusion or trans-splicing [169]. These events usually come from differentchromosomal rearrangements.

Liu et al. [169] summarize 23 state-of-the-art tools forfusion detection. Out of these 23 tools, they performed anexperimental study using 15. They reported that no singlemethod always performed the best. They recommendedcombining the results of the three best performing tools,i.e., SOAPfuse [170], FusionCatche [171] and JAFFA [172]to rank the fusion transcript candidates with high confi-dence. They characterized fusion detection tools based onalignment methods, criteria for detection of gene fusion,filtering criteria and output information [169]. Some toolstake the help of other alignment tools during read alignmentwhereas some others have in-built alignment algorithms.Filtering criteria such as minimal threshold of spanning andsplit reads, minimum anchor length to be set properly toget the best performance. The most recent method GFusion[173] has been found to be most effective method for fusiondiscovery in RNA-seq data.

8 RNA-SEQ DATA VISUALIZATION

Data visualization is an important component of genomicdata analysis. A Next Generation Sequencing (NGS) methodgenerates a large amount of diverse types of RNA-seqdata. To gain insight and find biological meaning in datathrough human judgment, data visualization is needed.Visualization of RNA-seq data can be performed at RNA-seq read level, read alignment level, trascriptome leveland read count level (normalized or unnormalized). Toolsavailable for genomic data visualization can be dividedinto major types: web-based (WB) applications running ondedicated web servers and stand-alone (SA) tools runningon platforms like Windows, Mac and Linux. Web-basedgenome browsers support a variety of annotations withoutthe necessity of installation. A comparison among RNA-seqdata visualization tools is given in Table 10.

Some RNA-seq data analysis tools have built-in visu-alization ability, for example, DESeq2 and DEApp. Othertools use external packages to visualize results. For example,Sashimi plots can be used for visualization of CuffDiffresults. TRAPR [186] has the ability to perform qualitychecking, normalization, statistical analysis and visualiza-tion of RNA-seq data. Read alignment quality control toolsRSeQC and QoRTs have the ability to visualize the output.The NVT package can be used to visualize RNA-seq data

to choose the best normalization method. The DE analysistool MultiRankSeq is also able to visualize DE analysisresults. WebGIVI is an interactive web-based visualizationand enrichment tool to explore gene:iTerm pairs.

9 GUIDELINES FOR THE PRACTITIONER

Based on our extensive review of tools and relevant lit-erature, the following are some informal guidelines andrecommendations for RNA-seq data analysis.a) It necessary to perform read quality checking very ag-

gressively to improve overall quality of the data withoutincurring sequence-specific bias. So, it is recommendedthat the reads of quality below 30% be discarded foreffective downstream biological analysis [5].

b) We have mentioned numerous tools throughout the pa-per for RNA-seq data analysis. Most are able to work onspecific and limited formats of data. So, it is necessary tohave a data format converter. NGS-FC [187] is one suchconverter of sequence data from one format to another.It supports conversion among 14 different data formats.

c) It is established that the number of replicates has greaterimpact on DE analysis than sequencing depth for highlyexpressed genes [188]. For lowly expressed genes, bothof these have almost similar impact. So, it is good a ideato consider lower sequencing depths, but more biologicalreplicates to increase power of differential expression ofRNA-seq under budgetary constraints [189].

d) Use of the STAR tool for splice aware alignment isrecommended because it is a ultra fast and and highlysensitive aligner [190].

e) Success of any data analysis technique depends on thequality of the data provided to it. So, preprocessing stepslike removal of outlier genes, removal of genes withvery low read counts, and detection and filtration oflow quality cells in case of scRNA-seq, and selectionof appropriate normalization and batch effect removaltechniques should be done very carefully.

f) One of DESeq, TPM and RUV-seq can be used for RNA-seq data normalization as default because they are robustin the presence of varying library sizes and varyingsequencing depths.

g) Sample size is an important factor to consider in analyz-ing RNA-seq data, given budgetary constraints. RNAse-qPS [131] and RNAseqPower [132] are tools that can beused to find best trade-offs between sequence depth andsample size for a specific dataset.

h) Generally, the power of DE methods depends on thenumber of samples considered in each group. Largesample size improves measurements in spite of exper-imental variations, increase precision on averaging thegene expressions and detection of outlier genes. It isrecommended to have at least 3 samples per group toperform DE analysis [190].

i) Fischer et al. [115] reported that ImpulseDE2 [191] is thebest DE analysis method for time course RNA-seq dataanalysis because of its high statistical testing power.

j) Consensus among multiple DE tools guarantees findinga more accurate set of DEGs [113].

k) For a set of DEGs identified for a disease, a multi-objective approach considering topological characteris-tics, pathway sharing and roles as regulator may be

Page 15: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 15

TABLE 10RNA-seq Data Visualization Tools: A Comparison

Name& Year Input FileFormat

WB/SA Descriptions Availability

Artemis [174],2012

BAM, SAM,VCF

SA It was the first genome annotation tool, which has been extended to visualize high-throughput read mapping data. This well-documented and maintained command-line toolcan run on any platform and requires Java.

http://www.sanger.ac.uk/science/tools/artemis

IntegratedGenomeBrowser (IGB)[175], 2009

SAM, BAM SA IGB is a viewer for real-time zooming and panning through genomes. It can displayannotations and position specific numerical graphs. This well-documented tool can be runon any computer and is implemented in Java.

igb.bioviz.org

IntegrativeGenomicsViewer (IGV)[176], 2013

SAM, BAM,VCF, GFF

SA IGV is a flexible viewer for NGS data formats but does not include automatic analysiscapabilities. It supports loading of tracks, a reference genome as well as annotation datafrom local or remote data sources. This GUI-based, well-documented and maintainedwidely used tool requires Java and can run on any platform.

www.broadinstitute.org/igv

GenomeView[177], 2012

SAM, BAM,FASTA, GFF

SA It enables users to dynamically browse high volumes of aligned short-read data, withdynamic navigation and semantic zooming. The tool enables visualization of whole genomealignments of dozens of genomes relative to a reference sequence. It requires Java to run onany platform. It is a GUI based well-documented tool.

genomeview.org

ReadXplorer-2[178], 2016

Genbank,GFF, FASTA,SAM or BAM

WB ReadXplorer is used for visualization of genomic and transcriptomic DNA sequencesmapped to a reference. It also has abilities like SNP detection, DE analysis, detection of TSS,and novel trascriptome identification. It extracts and adds quantity and quality measuresto each alignment to classify the mapped reads. This recent well-documented tool requiresJava and the Netbeans platform and can run on Windows, Linux and Mac OS X.

www.readxplorer.org

JBrowse [179],2012

FASTA, BED,GFF, BAM

WB It is a web-based genome browser for large sequencing dataset visualization. It usesJavaScript to delegate a large amount of the workload to the client machine

www.github.com/jbrowse/

Cascade [180],2016

GE WB It is a web-based tool for the intuitive 3D visualization of RNA-seq data from cancergenomics experiments. It provides an intuitive interface which is both scalable andcustomizable. It is implemented in HTML5, HTML+, CSS + and JavaScript. This well-documented tool can run on any platform.

http://bioinfo.iric.ca/∼shifmana/Cascade/

ATGC tran-scriptomics[181], 2017

FASTA, VCF,GFF

WB ATGC transcriptomics is a free and online open-source application. It is designed tovisualize, explore, analyze and share de novo transcriptomic data generated by NGSplatforms. This GUI based well-documented tool is implemented in Python and can runon Linux only.

https://sourceforge.net/projects/atgcinta/

RNASeqExpressionBrowser [182],2014

GE matrix,annotationinformation

WB It is an open-source web interface and allows browsing of genes by identifiers, annotationsor sequence similarity. It provides a bridge between preprocessing and the final steps of theDE analysis.

http://mips.helmholtz-muenchen.de/plant/RNASeqExpressionBrowser/

UBiT2 [183],2017

FASTQ, GE SA It is a client-side web-application, that supports visualization and offline alignment.It incorporates www.browsergenome.org/s ability to perform client-side alignment andquantification. It can also be used for scRNA-seq.

pklab.med.harvard.edu/jean/ubit2/index.html

DEIVA [184],2017

Text file WB DEIVA is a platform independent web application for interactive visual DE analysis. It is awell-documented tool that provides a nice interface. Its stand-alone version is also available.

https://github.com/Hypercubed/DEIVA

iDEP [185],2017

CSV WB This integrated Differential Expression and Pathway analysis tool can perform cluster, PCA,DE and GO analysis. It provide a GUI which enables in-depth bioinformatic analysis. Thisonline tool is well-documented.

http://ge-lab.org/idep/

SA: Stand-alone, WB: Web-based, ACE: ACE is a proprietary data compression archive file format, EGL: Enterprise Generation Language, GE: Gene Expression, SVG:Scalable Vector Graphics, TSS: transcription start site.

suitable for identification of strong association amongthe causal genes with potential non-causal genes acrossprogression states.

l) For most steps of RNA-seq data analysis, there are nogold standard tools. So, for validation and finding ac-curate sets of DEGs, DE analysis over multiple datasetsgenerated by different combinations of standard toolsused in different steps, through appropriate consensusbuilding is a promising approach.

10 ISSUES AND CHALLENGES

Based on our both theoretical and experimental study, weobserve that in the past decade, a large number of methodsand tools have been developed to support effective RNA-seq data analysis. However, there are some important issuesand research challenges that still need attention. In thissection, we highlight some such issues and challenges.a) We explored the NVT tool [94] which has the ability to

select best normalization method from a set of 11 meth-ods for a datsaset. Our experimental study reveals thatthese normalization techniques are dataset dependent.The presence of a few highly expressed genes repress thecounts for all other genes, leading to the possibility thata lot of genes are falsely called differentially expressed[114]. An appropriate normalization technique can help

improve effectiveness of DE analysis. So, there is a needfor the development of effective statistical and compu-tational methods for RNA-seq data normalization whichis dataset independent and can handle the misleading ofDE analysis mentioned above by a few highly DEGs.

b) RNA-seq experiments sometime generate very smallsample sizes, creating problems for further analysis. Ourexperiments sow that with increase in the size of sam-ples, the power of detection of DEGs of a method alsoincreases. It is also our observation that more than 80%of RNA-seq count data available in public repositorieslike recount2 and ARCHS4 [85] have very few samples.So, developing a DE method which can work well evenin the presence of smaller sample size, is another taskthat requires attention.

c) There are almost 40 structural and 7 semantic proximitymeasures to support unsupervised analysis of gene ex-pression data. Our experiments reveal that none of themeasures is able to capture both structural and semanticsimilarity between a pair of genes at the same time.So, there is a need for developing a proximity measurethat can capture both structural and semantic similaritybetween two genes simultaneously.

d) There is no strategy available to identify which DE anal-ysis method is optimal. In other words, a proper analysis

Page 16: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 16

approach that can judge a DE analysis method in termsof reproducibility, accuracy and robustness is unavailable[192].

e) Transformation of count data into continuous variableswithout considering distribution of count data createsinconsistencies [193]. There are a few transformationtechniques available [194] and they transform data byconsidering mean, variance and covariates of data with-out considering finer distribution patterns of count data.So, there is need for a proper transformation algorithm,which considers count data distribution at a more de-tailed level than mean and variance.

f) The absence of a sufficient number of positive and neg-ative controls for gene fusion creates a major problemwhen evaluating an algorithm for discovering gene fu-sion [195]. This can be considered another importantresearch issue.

g) Handling of multi-mapping reads, i.e., the presence ofhomologous and repetitive reads between genes createsambiguity and decreases the efficiency and reliabilityin alignment and also creates challenges in gene fusiondiscovery [173]. It also creates problems in accurate tran-script abundance estimation [91].

h) Limited sensitivity of scRNA-seq is still an issue. Dis-tinguishing technical noise and biological variability forlowly expressed transcripts is still challenging [196].

i) Zero inflation, technical noise, batch effects and dropoutevents are inherent problem in scRNA-seq. Accuratehandling of these problems using proper imputation,normalization and batch effect removal improves DEanalysis of scRNA-seq and accuracy in cell heterogeneitystudy [197]. But, their is no strategy available to deal withall the problems accurately together in single frameworkwithout compromising result quality.

j) Proprcessing steps like alignment and counting are car-ried out individually in case of scRNA-seq. However,with the growing number of cells considered in scRNA-seq and with a requirement for cell-to-cell comparison,the analysis task becomes computationally intensive andit necessitates high computational resources having ap-propriate support of parallelisation.

k) Developing an effective integrated DE analysis methodthat enables handling of bulk RNA-seq as well as scRNA-seq data for non-homogeneous tumor cell towards iden-tification of interesting biomarkers needs to be explored.

11 CONCLUSIONS

In this paper, we presented an overview of the steps in RNA-seq data analysis. We also presented a taxonomy of tools thatwe discussed at length later in the paper. We performeddescriptive comparisons of the tools commonly used ineach step of RNA-seq data analysis and presented themin several tables throughout the paper. We also mentionedtools used in both RNA-seq and scRNA-seq data analysis.Our discussion emphasizes workflow from RNA-seq readsto results of DE analysis. We also presented approaches tonormalization of count data and performance evaluationof DE analysis techniques. We discussed different qualitycontrol tools that can be used in each step to ensure qualityRNA-seq data analysis. In addition, we provided a discus-sion of multiple RNA-seq data visualization tools that help

understand RNA-seq data intuitively. Functional profilingfor results of DE analysis is necessary to gain biologicalinsight. We presented some recommendations, guidelinesfor the practicing RNA-seq data analyst. Finally, we outlinedseveral research issues and challenges for future researchersand practitioners.

ACKNOWLEDGMENT

We are thankful to Dr. Pankaj Barah of Molecular Biology &Biotechnology Department at Tezpur University for his sup-port in describing scRNA-seq data with reverse engineeringmethods to construct transcriptional network.

REFERENCES

[1] M. Schena, D. Shalon, R. W. Davis et al., “Quantitative moni-toring of gene expression patterns with a complementary DNAmicroarray,” Science, vol. 270, no. 5235, p. 467, 1995.

[2] A. Mortazavi, B. A. Williams, K. McCue et al., “Mapping andquantifying mammalian transcriptomes by RNA-Seq,” NatureMethods, vol. 5, no. 7, pp. 621–628, 2008.

[3] A. Ocone, L. Haghverdi, N. S. Mueller et al., “Reconstructing generegulatory dynamics from high-dimensional single-cell snapshotdata,” Bioinformatics, vol. 31, no. 12, pp. i89–i96, 2015.

[4] V. M. Kvam, P. Liu, and Y. Si, “A comparison of statisticalmethods for detecting differentially expressed genes from RNA-seq data,” American Journal of Botany, vol. 99, no. 2, pp. 248–256,2012.

[5] A. Conesa, P. Madrigal, S. Tarazona et al., “A survey of bestpractices for RNA-seq data analysis,” Genome Biology, vol. 17,no. 1, p. 13, 2016.

[6] E. Ager-Wick, C. V. Henkel, T. M. Haug et al., “Using normal-ization to resolve RNA-Seq biases caused by amplification fromminimal input,” Physiological Genomics, vol. 46, no. 21, pp. 808–820, 2014.

[7] C. Filloux, M. Cedric, P. Romain et al., “An integrative method tonormalize RNA-Seq data,” BMC Bioinformatics, vol. 15, no. 1, p.188, 2014.

[8] C.-I. Li, D. C. Samuels, Y.-Y. Zhao et al., “Power and samplesize calculations for high-throughput sequencing-based experi-ments,” Briefings in Bioinformatics, 2017.

[9] A. Dal Molin and B. Di Camillo, “How to design a single-cellRNA-sequencing experiment: pitfalls, challenges and perspec-tives,” Briefings in Bioinformatics, 2018.

[10] A. Haque, J. Engel, S. A. Teichmann et al., “A practical guide tosingle-cell RNA-sequencing for biomedical research and clinicalapplications,” Genome Medicine, vol. 9, no. 1, p. 75, 2017.

[11] G. Chen, C. Wang, and T. Shi, “Overview of available methodsfor diverse RNA-Seq data analyses,” Science China Life Sciences,vol. 54, no. 12, pp. 1121–1128, 2011.

[12] S. Zhao, B. Zhang, Y. Zhang et al., “Bioinformatics for RNA-SeqData Analysis,” Bioinformatics-Updated Features and Applications,p. 125, 2016.

[13] A. Oshlack, M. D. Robinson, and M. D. Young, “From RNA-seqreads to differential expression results,” Genome Biology, vol. 11,no. 12, p. 220, 2010.

[14] Y. Han, S. Gao, K. Muegge et al., “Advanced applications of RNAsequencing and challenges,” Bioinformatics and Biology Insights,vol. 9, no. Suppl 1, p. 29, 2015.

[15] U. Nagalakshmi, Z. Wang, K. Waern et al., “The transcriptionallandscape of the yeast genome defined by RNA sequencing,”Science, vol. 320, no. 5881, pp. 1344–1349, 2008.

[16] J. C. Marioni, C. E. Mason, S. M. Mane et al., “RNA-seq: anassessment of technical reproducibility and comparison withgene expression arrays,” Genome Research, vol. 18, no. 9, pp. 1509–1517, 2008.

[17] S. Pillai, V. Gopalan, and A. K.-Y. Lam, “Review of sequencingplatforms and their applications in phaeochromocytoma andparagangliomas,” Critical Reviews in Oncology/Hematology, vol.116, pp. 58–67, 2017.

[18] I. Allali, J. W. Arnold, J. Roach et al., “A comparison of sequencingplatforms and bioinformatics pipelines for compositional analy-sis of the gut microbiome,” BMC Microbiology, vol. 17, no. 1, p.194, 2017.

Page 17: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 17

[19] Z. Wang, M. Gerstein, and M. Snyder, “RNA-Seq: a revolutionarytool for transcriptomics,” Nature Reviews Genetics, vol. 10, no. 1,pp. 57–63, 2009.

[20] Q. Pan, O. Shai, L. J. Lee et al., “Deep surveying of alterna-tive splicing complexity in the human transcriptome by high-throughput sequencing,” Nature Genetics, vol. 40, no. 12, pp.1413–1415, 2008.

[21] M. Guttman, M. Garber, J. Z. Levin et al., “Ab initio reconstructionof cell type-specific transcriptomes in mouse reveals the con-served multi-exonic structure of lincRNAs,” Nature Biotechnology,vol. 28, no. 5, pp. 503–510, 2010.

[22] J. F. Degner, J. C. Marioni, A. A. Pai et al., “Effect of read-mapping biases on detecting allele-specific expression fromRNA-sequencing data,” Bioinformatics, vol. 25, no. 24, pp. 3207–3212, 2009.

[23] H. Edgren, A. Murumagi, S. Kangaspeska et al., “Identification offusion genes in breast cancer by paired-end RNA-sequencing,”Genome Biology, vol. 12, no. 1, p. R6, 2011.

[24] S. van Dam, U. Vosa, A. van der Graaf et al., “Gene co-expressionanalysis for functional classification and gene–disease predic-tions,” Briefings in Bioinformatics, p. bbw139, 2017.

[25] Y. Guo, Q. Sheng, J. Li et al., “Large scale comparison of gene ex-pression levels by microarrays and RNAseq using TCGA data,”PLoS ONE, vol. 8, no. 8, p. e71462, 2013.

[26] S. Anders and W. Huber, “Differential expression analysis forsequence count data,” Genome Biology, vol. 11, no. 10, p. R106,2010.

[27] B. Schmidt and A. Hildebrandt, “Next-generation sequencing:Big Data meets high performance computing,” Drug DiscoveryToday, 2017.

[28] R. C. Gentleman, V. J. Carey et al., “Bioconductor: open softwaredevelopment for computational biology and bioinformatics,”Genome Biology, vol. 5, no. 10, p. R80, 2004.

[29] T. Ilicic, J. K. Kim, A. A. Kolodziejczyk et al., “Classification oflow quality cells from single-cell RNA-seq data,” Genome Biology,vol. 17, no. 1, p. 29, 2016.

[30] A. Diaz, S. J. Liu, C. Sandoval et al., “SCell: integrated analysisof single-cell RNA-seq data,” Bioinformatics, vol. 32, no. 14, pp.2219–2220, 2016.

[31] L. Tian, S. Su, D. Amann-Zalcenstein et al., “scPipe: a flexible datapreprocessing pipeline for single-cell RNA-sequencing data,”bioRxiv, p. 175927, 2017.

[32] P. Jiang, J. A. Thomson, and R. Stewart, “Quality control of single-cell RNA-seq by SinQC,” Bioinformatics, vol. 32, no. 16, pp. 2514–2516, 2016.

[33] Q. Zhou, X. Su, and K. Ning, “Assessment of quality controlapproaches for metagenomic data analysis,” Scientific Reports,vol. 4, p. 6957, 2014.

[34] R. K. Patel and M. Jain, “NGS QC Toolkit: a toolkit for qualitycontrol of next generation sequencing data,” PloS One, vol. 7,no. 2, p. e30619, 2012.

[35] X. Yang, D. Liu, F. Liu et al., “HTQC: a fast quality control toolkitfor Illumina sequencing data,” BMC Bioinformatics, vol. 14, no. 1,p. 33, 2013.

[36] R. Schmieder and R. Edwards, “Quality control and preprocess-ing of metagenomic datasets,” Bioinformatics, vol. 27, no. 6, pp.863–864, 2011.

[37] A. M. Bolger, M. Lohse, and B. Usadel, “Trimmomatic: a flexibletrimmer for Illumina sequence data,” Bioinformatics, p. btu170,2014.

[38] M. Martin, “Cutadapt removes adapter sequences from high-throughput sequencing reads,” EMBnet. journal, vol. 17, no. 1,pp. pp–10, 2011.

[39] M. P. Cox, D. A. Peterson, and P. J. Biggs, “SolexaQA: At-a-glancequality assessment of Illumina second-generation sequencingdata,” BMC Bioinformatics, vol. 11, no. 1, p. 485, 2010.

[40] P. Dao, I. Numanagic, Y.-Y. Lin et al., “Optimal resolution ofambiguous RNA-Seq multimappings in the presence of novelisoforms,” Bioinformatics, p. btt591, 2013.

[41] M. Burrows and D. J. Wheeler, “A block-sorting lossless datacompression algorithm,” SRC Research Report, 124, 1994.

[42] N. A. Fonseca, J. Rung, A. Brazma et al., “Tools for mapping high-throughput sequencing data,” Bioinformatics, vol. 28, no. 24, pp.3169–3177, 2012.

[43] C. Trapnell, L. Pachter, and S. L. Salzberg, “TopHat: discoveringsplice junctions with RNA-Seq,” Bioinformatics, vol. 25, no. 9, pp.1105–1111, 2009.

[44] O. Stegle, S. A. Teichmann, and J. C. Marioni, “Computationaland analytical challenges in single-cell transcriptomics,” NatureReviews Genetics, vol. 16, no. 3, pp. 133–145, 2015.

[45] T. S. Smith, A. Heger, and I. Sudbery, “UMI-tools: Modellingsequencing errors in Unique Molecular Identifiers to improvequantification accuracy,” Genome Research, pp. gr–209 601, 2017.

[46] C. Girardot, J. Scholtalbers, S. Sauer et al., “Je, a versatile suiteto handle multiplexed NGS libraries with unique molecularidentifiers,” BMC Bioinformatics, vol. 17, no. 1, p. 419, 2016.

[47] P. G. Engstrom, T. Steijger, B. Sipos et al., “Systematic evaluationof spliced alignment programs for RNA-seq data,” Nature Meth-ods, vol. 10, no. 12, p. 1185, 2013.

[48] A. Hatem, D. Bozdag, A. E. Toland et al., “Benchmarking shortsequence mapping tools,” BMC Bioinformatics, vol. 14, no. 1, p.184, 2013.

[49] P. K. R. Kumar, T. V. Hoang, M. L. Robinson et al., “CADBURE:A generic tool to evaluate the performance of spliced aligners onRNA-Seq data,” Scientific Reports, vol. 5, p. 13443, 2015.

[50] H. Li and N. Homer, “A survey of sequence alignment algo-rithms for next-generation sequencing,” Briefings in Bioinformat-ics, vol. 11, no. 5, pp. 473–483, 2010.

[51] D. Kim, G. Pertea, C. Trapnell et al., “TopHat2: accurate alignmentof transcriptomes in the presence of insertions, deletions andgene fusions,” Genome Biology, vol. 14, no. 4, p. R36, 2013.

[52] T. D. Wu and S. Nacu, “Fast and SNP-tolerant detection ofcomplex variants and splicing in short reads,” Bioinformatics,vol. 26, no. 7, pp. 873–881, 2010.

[53] K. Wang, D. Singh, Z. Zeng et al., “MapSplice: Accurate mappingof RNA-seq reads for splice junction discovery,” Nucleic AcidsResearch, vol. 38, no. 18, p. e178, 2010.

[54] A. Dobin, C. A. Davis, F. Schlesinger et al., “STAR: ultrafastuniversal RNA-seq aligner,” Bioinformatics, vol. 29, no. 1, pp. 15–21, 2013.

[55] S. Marco-Sola, M. Sammeth, R. Guigo et al., “The GEM mapper:fast, accurate and versatile alignment by filtration,” Nature Meth-ods, vol. 9, no. 12, pp. 1185–1188, 2012.

[56] B. Langmead, C. Trapnell, M. Pop et al., “Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome,” Genome Biology, vol. 10, no. 3, p. R25, 2009.

[57] B. Langmead and S. L. Salzberg, “Fast gapped-read alignmentwith Bowtie 2,” Nature Methods, vol. 9, no. 4, pp. 357–359, 2012.

[58] D. Kim, B. Langmead, and S. L. Salzberg, “HISAT: a fast splicedaligner with low memory requirements,” Nature Methods, vol. 12,no. 4, pp. 357–360, 2015.

[59] Y. Peng, H. C. Leung, S.-M. Yiu et al., “IDBA-tran: a more robustde novo de Bruijn graph assembler for transcriptomes withuneven expression levels,” Bioinformatics, vol. 29, no. 13, pp. i326–i334, 2013.

[60] L. Wang, S. Wang, and W. Li, “RSeQC: quality control of RNA-seq experiments,” Bioinformatics, vol. 28, no. 16, pp. 2184–2185,2012.

[61] Q. Zheng and E. A. Grice, “AlignerBoost: A Generalized SoftwareToolkit for Boosting Next-Gen Sequencing Mapping AccuracyUsing a Bayesian-Based Mapping Quality Framework,” PLoSComputational Biology, vol. 12, no. 10, p. e1005096, 2016.

[62] S. W. Hartley and J. C. Mullikin, “QoRTs: a comprehensivetoolset for quality control and data processing of RNA-Seq ex-periments,” BMC Bioinformatics, vol. 16, no. 1, p. 224, 2015.

[63] D. S. DeLuca, J. Z. Levin, A. Sivachenko et al., “RNA-SeQC:RNA-seq metrics for quality control and process optimization,”Bioinformatics, vol. 28, no. 11, pp. 1530–1532, 2012.

[64] K. W. Kroll, N. E. Mokaram, A. R. Pelletier et al., “Quality controlfor RNA-seq (QuaCRS): An integrated quality control pipeline,”Cancer Informatics, vol. 13, no. Suppl 3, p. 7, 2014.

[65] P. Ewels, M. Magnusson, S. Lundin et al., “MultiQC: summarizeanalysis results for multiple tools and samples in a single report,”Bioinformatics, vol. 32, no. 19, pp. 3047–3048, 2016.

[66] G. Robertson, J. Schein, R. Chiu et al., “De novo assembly andanalysis of RNA-seq data,” Nature Methods, vol. 7, no. 11, pp.909–912, 2010.

[67] B. J. Haas, A. Papanicolaou, M. Yassour et al., “De novo tran-script sequence reconstruction from RNA-seq using the Trinityplatform for reference generation and analysis,” Nature Protocols,vol. 8, no. 8, pp. 1494–1512, 2013.

[68] T. Steijger, J. F. Abril, P. G. Engstrom et al., “Assessment oftranscript reconstruction methods for RNA-seq,” Nature methods,vol. 10, no. 12, p. 1177, 2013.

Page 18: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 18

[69] K. Clarke, Y. Yang, R. Marsh et al., “Comparative analysis of denovo transcriptome assembly,” Science China Life Sciences, vol. 56,no. 2, pp. 156–162, 2013.

[70] M. H. Schulz, D. R. Zerbino, M. Vingron et al., “Oases: robust denovo RNA-seq assembly across the dynamic range of expressionlevels,” Bioinformatics, vol. 28, no. 8, pp. 1086–1092, 2012.

[71] M. G. Grabherr, B. J. Haas, M. Yassour et al., “Trinity: reconstruct-ing a full-length transcriptome without a genome from RNA-Seqdata,” Nature Biotechnology, vol. 29, no. 7, p. 644, 2011.

[72] Y. Xie, G. Wu, J. Tang et al., “SOAPdenovo-Trans: de novo tran-scriptome assembly with short RNA-Seq reads,” Bioinformatics,vol. 30, no. 12, pp. 1660–1666, 2014.

[73] J. Martin, V. M. Bruno, Z. Fang et al., “Rnnotator: an automatedde novo transcriptome assembly pipeline from stranded RNA-Seq reads,” BMC Genomics, vol. 11, no. 1, p. 663, 2010.

[74] J. T. Simpson, K. Wong, S. D. Jackman et al., “ABySS: a parallelassembler for short read sequence data,” Genome Research, vol. 19,no. 6, pp. 1117–1123, 2009.

[75] Z. Chang, G. Li, J. Liu et al., “Bridger: a new framework forde novo transcriptome assembly using RNA-seq data,” GenomeBiology, vol. 16, no. 1, p. 30, 2015.

[76] W. J. Kent, “BLAT–the BLAST-like alignment tool,” Genome Re-search, vol. 12, no. 4, pp. 656–664, 2002.

[77] J. Liu, G. Li, Z. Chang et al., “BinPacker: packing-based de novotranscriptome assembly from RNA-seq data,” PLoS ComputationalBiology, vol. 12, no. 2, p. e1004772, 2016.

[78] M. Pertea, G. M. Pertea, C. M. Antonescu et al., “StringTieenables improved reconstruction of a transcriptome from RNA-seq reads,” Nature Biotechnology, vol. 33, no. 3, pp. 290–295, 2015.

[79] R. Smith-Unna, C. Boursnell, R. Patro et al., “TransRate: reference-free quality assessment of de novo transcriptome assemblies,”Genome Research, vol. 26, no. 8, pp. 1134–1144, 2016.

[80] B. Li, N. Fillmore, Y. Bai et al., “Evaluation of de novo transcrip-tome assemblies from RNA-Seq data,” Genome Biology, vol. 15,no. 12, p. 553, 2014.

[81] S. Wang and M. Gribskov, “Comprehensive evaluation of de novotranscriptome assembly programs and their effects on differentialgene expression analysis,” Bioinformatics, p. btw625, 2016.

[82] J. A. Martin and Z. Wang, “Next-generation transcriptome assem-bly,” Nature Reviews Genetics, vol. 12, no. 10, pp. 671–682, 2011.

[83] B. Li, V. Ruotti, R. M. Stewart et al., “RNA-Seq gene expression es-timation with read mapping uncertainty,” Bioinformatics, vol. 26,no. 4, pp. 493–500, 2010.

[84] M. Teng, M. I. Love, C. A. Davis et al., “A benchmark for RNA-seq quantification pipelines,” Genome Biology, vol. 17, no. 1, p. 74,2016.

[85] A. Lachmann, D. Torre, A. B. Keenan et al., “Massive miningof publicly available RNA-seq data from human and mouse,”Nature communications, vol. 9, no. 1, p. 1366, 2018.

[86] S. Anders, P. T. Pyl, and W. Huber, “HTSeq–a Python frameworkto work with high-throughput sequencing data,” Bioinformatics,p. btu638, 2014.

[87] N. Delhomme, I. Padioleau, E. E. Furlong et al., “easyRNASeq:a bioconductor package for processing RNA-Seq data,” Bioinfor-matics, vol. 28, no. 19, pp. 2532–2533, 2012.

[88] Y. Liao, G. K. Smyth, and W. Shi, “featureCounts: an efficient gen-eral purpose program for assigning sequence reads to genomicfeatures,” Bioinformatics, vol. 30, no. 7, pp. 923–930, 2014.

[89] N. Bray, H. Pimentel, P. Melsted et al., “Near-optimal RNA-Seqquantification,” arXiv preprint arXiv:1505.02710, 2015.

[90] S. Lee, C. H. Seo, B. H. Alver et al., “EMSAR: estimation oftranscript abundance from RNA-seq data by mappability-basedsegmentation and reclustering,” BMC Bioinformatics, vol. 16,no. 1, p. 278, 2015.

[91] B. Li and C. N. Dewey, “RSEM: accurate transcript quantificationfrom RNA-Seq data with or without a reference genome,” BMCBioinformatics, vol. 12, no. 1, p. 323, 2011.

[92] R. Patro, G. Duggal, M. I. Love et al., “Salmon provides fastand bias-aware quantification of transcript expression,” NatureMethods, 2017.

[93] D. Risso, K. Schwartz, G. Sherlock et al., “GC-content normaliza-tion for RNA-Seq data,” BMC Bioinformatics, vol. 12, no. 1, p. 480,2011.

[94] T. Eder, F. Grebien, and T. Rattei, “NVT: a fast and simpletool for the assessment of RNA-seq normalization strategies,”Bioinformatics, p. btw521, 2016.

[95] J. Zyprych-Walczak, A. Szabelska, L. Handschuh et al., “Theimpact of normalization methods on RNA-Seq data analysis,”BioMed Research International, vol. 2015, 2015.

[96] J. H. Bullard, E. Purdom, K. D. Hansen et al., “Evaluation ofstatistical methods for normalization and differential expressionin mRNA-Seq experiments,” BMC Bioinformatics, vol. 11, no. 1,p. 94, 2010.

[97] R. S. Spielman, L. A. Bastone, J. T. Burdick et al., “Commongenetic variants account for differences in gene expression amongethnic groups,” Nature Genetics, vol. 39, no. 2, pp. 226–231, 2007.

[98] S. C. Hicks, M. Teng, and R. A. Irizarry, “On the widespread andcritical impact of systematic bias and batch effects in single-cellRNA-Seq data,” bioRxiv, p. 025528, 2015.

[99] J. T. Leek, R. B. Scharpf, H. C. Bravo et al., “Tacklingthe widespread and critical impact of batch effects in high-throughput data,” Nature Reviews Genetics, vol. 11, no. 10, pp.733–739, 2010.

[100] W. E. Johnson, C. Li, and A. Rabinovic, “Adjusting batch effectsin microarray expression data using empirical Bayes methods,”Biostatistics, vol. 8, no. 1, pp. 118–127, 2007.

[101] M. J. Nueda, A. Ferrer, and A. Conesa, “ARSyN: a method for theidentification and removal of systematic noise in multifactorialtime course microarray experiments,” Biostatistics, vol. 13, no. 3,pp. 553–566, 2012.

[102] S. E. Reese, K. J. Archer, T. M. Therneau et al., “A new statisticfor identifying batch effects in high-throughput genomic datathat uses guided principal component analysis,” Bioinformatics,vol. 29, no. 22, pp. 2877–2883, 2013.

[103] M. Lauss, I. Visne, A. Kriegner et al., “Monitoring of technicalvariation in quantitative high-throughput datasets,” Cancer infor-matics, vol. 12, p. 193, 2013.

[104] J. T. Leek, “SVAseq: removing batch effects and other unwantednoise from sequencing data,” Nucleic Acids Research, vol. 42,no. 21, pp. e161–e161, 2014.

[105] Q. Liu and M. Markatou, “Evaluation of methods in removingbatch effects on RNA-seq data,” Infectious Diseases and Transla-tional Medicine, vol. 2, no. 1, pp. 3–9, 2016.

[106] M. D. Robinson and A. Oshlack, “A scaling normalizationmethod for differential expression analysis of RNA-seq data,”Genome Biology, vol. 11, no. 3, p. R25, 2010.

[107] Y. Zhou, G. Wang, J. Zhang et al., “A Hypothesis Testing BasedMethod for Normalization and Differential Expression Analysisof RNA-Seq Data,” PLoS ONE, vol. 12, no. 1, p. e0169594, 2017.

[108] M. B. Sohn, R. Du, and L. An, “A robust approach for identi-fying differentially abundant features in metagenomic samples,”Bioinformatics, p. btv165, 2015.

[109] D. Risso, J. Ngai, T. P. Speed et al., “Normalization of RNA-seqdata using factor analysis of control genes or samples,” NatureBiotechnology, vol. 32, no. 9, p. 896, 2014.

[110] K. D. Hansen, S. E. Brenner, and S. Dudoit, “Biases in Illuminatranscriptome sequencing caused by random hexamer priming,”Nucleic Acids Research, vol. 38, no. 12, pp. e131–e131, 2010.

[111] C. Trapnell, D. G. Hendrickson, M. Sauvageau et al., “Differentialanalysis of gene regulation at transcript resolution with RNA-seq,” Nature Biotechnology, vol. 31, no. 1, pp. 46–53, 2013.

[112] M. D. Robinson, D. J. McCarthy, and G. K. Smyth, “edgeR:a Bioconductor package for differential expression analysis ofdigital gene expression data,” Bioinformatics, vol. 26, no. 1, pp.139–140, 2010.

[113] J. Costa-Silva, D. Domingues, and F. M. Lopes, “RNA-Seq differ-ential expression analysis: An extended review and a softwaretool,” PloS One, vol. 12, no. 12, p. e0190152, 2017.

[114] C. Soneson and M. Delorenzi, “A comparison of methods fordifferential expression analysis of RNA-seq data,” BMC Bioinfor-matics, vol. 14, no. 1, p. 91, 2013.

[115] D. Spies, P. F. Renz, T. A. Beyer et al., “Comparative analysisof differential gene expression tools for RNA sequencing timecourse data,” Briefings in Bioinformatics, 2017.

[116] Y. Benjamini and Y. Hochberg, “Controlling the false discoveryrate: a practical and powerful approach to multiple testing,”Journal of the Royal Statistical Society. Series B (Methodological), pp.289–300, 1995.

[117] Z. Zhang, Y. Zhang, P. Evans et al., “RNA-Seq 2G: Online AnalysisOf Differential Gene Expression With Comprehensive Options OfStatistical Methods,” bioRxiv, p. 122747, 2017.

Page 19: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 19

[118] M. D. Robinson and G. K. Smyth, “Moderated statistical tests forassessing differences in tag abundance,” Bioinformatics, vol. 23,no. 21, pp. 2881–2887, 2007.

[119] F. Seyednasrollah, A. Laiho, and L. L. Elo, “Comparison ofsoftware packages for detecting differential expression in RNA-seq studies,” Briefings in Bioinformatics, vol. 16, no. 1, pp. 59–70,2015.

[120] M. E. Ritchie, B. Phipson, D. Wu et al., “Limma powers differ-ential expression analyses for RNA-sequencing and microarraystudies,” Nucleic Acids Research, p. gkv007, 2015.

[121] M. I. Love, W. Huber, and S. Anders, “Moderated estimationof fold change and dispersion for RNA-seq data with DESeq2,”Genome Biology, vol. 15, no. 12, p. 550, 2014.

[122] H. Wu, C. Wang, and Z. Wu, “A new shrinkage estimator fordispersion improves differential expression detection in RNA-seqdata,” Biostatistics, p. kxs033, 2012.

[123] C. W. Law, Y. Chen, W. Shi et al., “Voom: precision weights unlocklinear model analysis tools for RNA-seq read counts,” GenomeBiology, vol. 15, no. 2, p. R29, 2014.

[124] J. Sander, J. L. Schultze, and N. Yosef, “ImpulseDE: detection ofdifferentially expressed genes in time series data using impulsemodels,” Bioinformatics, p. btw665, 2016.

[125] H. Varet, L. Brillet-Gueguen, J.-Y. Coppee et al., “SARTools: aDESeq2-and edgeR-based R pipeline for comprehensive differ-ential analysis of RNA-Seq data,” PLoS ONE, vol. 11, no. 6, p.e0157022, 2016.

[126] Y. Li and J. Andrade, “DEApp: an interactive web interfacefor differential expression analysis of next generation sequencedata,” Source Code for Biology and Medicine, vol. 12, no. 1, p. 2,2017.

[127] T. V. Pham and C. R. Jimenez, “An accurate paired sample testfor count data,” Bioinformatics, vol. 28, no. 18, pp. i596–i602, 2012.

[128] T. Ching, S. Huang, and L. X. Garmire, “Power analysis andsample size estimation for RNA-Seq differential expression,”RNA, vol. 20, no. 11, pp. 1684–1696, 2014.

[129] L. Yu, S. Fernandez, and G. Brock, “Power analysis for RNA-Seqdifferential expression studies,” BMC Bioinformatics, vol. 18, no. 1,p. 234, 2017.

[130] A. Poplawski and H. Binder, “Feasibility of sample size calcula-tion for RNA-seq studies,” Briefings in Bioinformatics, p. bbw144,2017.

[131] Y. Guo, S. Zhao, C.-I. Li et al., “RNAseqPS: a web tool forestimating sample size and power for RNAseq experiment,”Cancer Informatics, no. Suppl. 6, p. 1, 2014.

[132] S. N. Hart, T. M. Therneau, Y. Zhang et al., “Calculating samplesize estimates for RNA sequencing data,” Journal of ComputationalBiology, vol. 20, no. 12, pp. 970–978, 2013.

[133] B. Vieth, C. Ziegenhain, S. Parekh et al., “powsimR: Power anal-ysis for bulk and single cell RNA-seq experiments,” bioRxiv, p.117150, 2017.

[134] M. A. Busby, C. Stewart, C. A. Miller et al., “Scotty: a web toolfor designing RNA-Seq experiments to measure differential geneexpression,” Bioinformatics, vol. 29, no. 5, pp. 656–657, 2013.

[135] V. K. Mootha, C. M. Lindgren, K.-F. Eriksson et al., “PGC-1α-responsive genes involved in oxidative phosphorylation are co-ordinately downregulated in human diabetes,” Nature Genetics,vol. 34, no. 3, pp. 267–273, 2003.

[136] M. D. Young, M. J. Wakefield, G. K. Smyth et al., “Gene ontologyanalysis for RNA-seq: accounting for selection bias,” GenomeBiology, vol. 11, no. 2, p. R14, 2010.

[137] S. Hanzelmann, R. Castelo, and J. Guinney, “GSVA: gene setvariation analysis for microarray and RNA-seq data,” BMC Bioin-formatics, vol. 14, no. 1, p. 7, 2013.

[138] X. Wang and M. J. Cairns, “Gene set enrichment analysis of RNA-Seq data: integrating differential expression and splicing,” BMCBioinformatics, vol. 14, no. 5, p. S16, 2013.

[139] Q. Xiong, S. Mukherjee, and T. S. Furey, “GSAASeqSP: a toolsetfor gene set association analysis of RNA-Seq data,” ScientificReports, vol. 4, p. 6347, 2014.

[140] X. Ren, Q. Hu, S. Liu et al., “Gene set analysis controlling forlength bias in RNA-seq experiments,” BioData Mining, vol. 10,no. 1, p. 5, 2017.

[141] W. Huber, V. J. Carey, R. Gentleman et al., “Orchestrating high-throughput genomic analysis with Bioconductor,” Nature Meth-ods, vol. 12, no. 2, pp. 115–121, 2015.

[142] H. Mi, A. Muruganujan, J. T. Casagrande et al., “Large-scalegene function analysis with the PANTHER classification system,”Nature Protocols, vol. 8, no. 8, pp. 1551–1566, 2013.

[143] J. Reimand, T. Arak, and J. Vilo, “g: Profiler-a web server forfunctional interpretation of gene lists (2011 update),” NucleicAcids Research, vol. 39, no. suppl 2, pp. W307–W315, 2011.

[144] G. Yu, L.-G. Wang, Y. Han et al., “clusterProfiler: an R packagefor comparing biological themes among gene clusters,” Omics: aJournal of Integrative Biology, vol. 16, no. 5, pp. 284–287, 2012.

[145] E. Y. Chen, C. M. Tan, Y. Kou et al., “Enrichr: interactive andcollaborative HTML5 gene list enrichment analysis tool,” BMCBioinformatics, vol. 14, no. 1, p. 128, 2013.

[146] J. Chen, E. E. Bardes, B. J. Aronow et al., “ToppGene Suite forgene list enrichment analysis and candidate gene prioritization,”Nucleic Acids Research, vol. 37, no. suppl 2, pp. W305–W311, 2009.

[147] L. Sun, Y. Zhu, A. A. Mahmood et al., “WebGIVI: a web-basedgene enrichment analysis and visualization tool,” BMC Bioinfor-matics, vol. 18, no. 1, p. 237, 2017.

[148] Q. Zheng and X.-J. Wang, “GOEAST: a web-based softwaretoolkit for Gene Ontology enrichment analysis,” Nucleic AcidsResearch, vol. 36, no. suppl 2, pp. W358–W363, 2008.

[149] E. Eden, R. Navon, I. Steinfeld et al., “GOrilla: a tool for discoveryand visualization of enriched GO terms in ranked gene lists,”BMC Bioinformatics, vol. 10, no. 1, p. 48, 2009.

[150] D. W. Huang, B. T. Sherman, and R. A. Lempicki, “Bioinformaticsenrichment tools: paths toward the comprehensive functionalanalysis of large gene lists,” Nucleic Acids Research, vol. 37, no. 1,pp. 1–13, 2009.

[151] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene Ontology: toolfor the unification of biology,” Nature Genetics, vol. 25, no. 1, pp.25–29, 2000.

[152] H. Ogata, S. Goto, K. Sato et al., “KEGG: Kyoto Encyclopedia ofGenes and Genomes,” Nucleic Acids Research, vol. 27, no. 1, pp.29–34, 1999.

[153] D. Nishimura, “BioCarta,” Biotech Software & Internet Report: TheComputer Software Journal for Scientist, vol. 2, no. 3, pp. 117–120,2001.

[154] A. Bairoch, B. Boeckmann, S. Ferro et al., “Swiss-Prot: jugglingbetween evolution and stability,” Briefings in Bioinformatics, vol. 5,no. 1, pp. 39–55, 2004.

[155] R. D. Finn, P. Coggill, R. Y. Eberhardt et al., “The Pfam proteinfamilies database: towards a more sustainable future,” NucleicAcids Research, vol. 44, no. D1, pp. D279–D285, 2016.

[156] S. Hunter, P. Jones, A. Mitchell et al., “InterPro in 2011: newdevelopments in the family and domain prediction database,”Nucleic Acids Research, p. gkr948, 2011.

[157] A. Conesa, S. Gotz, J. M. Garcıa-Gomez et al., “Blast2GO: a uni-versal tool for annotation, visualization and analysis in functionalgenomics research,” Bioinformatics, vol. 21, no. 18, pp. 3674–3676,2005.

[158] D. Gomez-Cabrero, I. Abugessaisa, D. Maier, et al., “Data inte-gration in the era of omics: current and future challenges,” BMCSystems Biology, vol. 8, no. 2, p. I1, 2014.

[159] J. Linde, S. Schulze, S. G. Henkel et al., “Data-and knowledge-based modeling of gene regulatory networks: an update,” EXCLIJournal, vol. 14, p. 346, 2015.

[160] M. W. Fiers, L. Minnoye, S. Aibar et al., “Mapping gene regulatorynetworks from single-cell omics data,” Briefings in FunctionalGenomics, 2018.

[161] A. A. Margolin, I. Nemenman, K. Basso et al., “ARACNE: analgorithm for the reconstruction of gene regulatory networks ina mammalian cellular context,” BMC Bioinformatics, vol. 7, no. 1,p. S7, 2006.

[162] A. Irrthum, L. Wehenkel, P. Geurts et al., “Inferring regulatorynetworks from expression data using tree-based methods,” PLoSONE, vol. 5, no. 9, p. e12776, 2010.

[163] J. D. Allen, Y. Xie, M. Chen et al., “Comparing statistical methodsfor constructing large scale gene networks,” PloS One, vol. 7,no. 1, p. e29348, 2012.

[164] V. Moignard, S. Woodhouse, L. Haghverdi et al., “Decoding theregulatory network of early blood development from single-cellgene expression measurements,” Nature biotechnology, vol. 33,no. 3, pp. 269–276, 2015.

[165] S. Aibar, C. B. Gonzalez-Blas et al., “SCENIC: Single-Cell Reg-ulatory Network Inference And Clustering,” bioRxiv, p. 144501,2017.

Page 20: TRANSACTION ON COMPUTATIONAL BIOLOGY AND …jkalita/papers/2018/... · steps of RNA-seq data analysis. Chen et al. [11] and Oshlack et al. [13] provide an overview of RNA-seq data

TRANSACTION ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 20

[166] S. Kalyana-Sundaram, S. Shankar, S. DeRoo et al., “Gene fusionsassociated with recurrent amplicons represent a class of passen-ger aberrations in breast cancer,” Neoplasia, vol. 14, no. 8, pp.702IN7–708IN13, 2012.

[167] L. Yang, L. J. Luquette, N. Gehlenborg et al., “Diverse mechanismsof somatic structural variations in human cancer genomes,” Cell,vol. 153, no. 4, pp. 919–929, 2013.

[168] C. A. Maher, C. Kumar-Sinha, X. Cao et al., “Transcriptomesequencing to detect gene fusions in cancer,” Nature, vol. 458,no. 7234, pp. 97–101, 2009.

[169] S. Liu, W.-H. Tsai, Y. Ding et al., “Comprehensive evaluationof fusion transcript detection algorithms and a meta-caller tocombine top performing methods in paired-end RNA-seq data,”Nucleic Acids Research, vol. 44, no. 5, pp. e47–e47, 2016.

[170] W. Jia, K. Qiu, M. He et al., “SOAPfuse: an algorithm for identi-fying fusion transcripts from paired-end RNA-Seq data,” GenomeBiology, vol. 14, no. 2, p. R12, 2013.

[171] D. Nicorici, M. Satalan, H. Edgren et al., “FusionCatcher-a toolfor finding somatic fusion genes in paired-end RNA-sequencingdata,” BioRxiv, p. 011650, 2014.

[172] N. M. Davidson, I. J. Majewski, and A. Oshlack, “JAFFA: Highsensitivity transcriptome-focused fusion gene detection,” GenomeMedicine, vol. 7, no. 1, p. 43, 2015.

[173] J. Zhao, Q. Chen, J. Wu et al., “GFusion: an Effective Algorithmto Identify Fusion Genes from Cancer RNA-Seq Data,” ScientificReports, vol. 7, no. 1, p. 6880, 2017.

[174] T. Carver, S. R. Harris, M. Berriman et al., “Artemis: an integratedplatform for visualization and analysis of high-throughputsequence-based experimental data,” Bioinformatics, vol. 28, no. 4,pp. 464–469, 2012.

[175] J. W. Nicol, G. A. Helt, S. G. Blanchard et al., “The IntegratedGenome Browser: free software for distribution and explorationof genome-scale datasets,” Bioinformatics, vol. 25, no. 20, pp.2730–2731, 2009.

[176] H. Thorvaldsdottir, J. T. Robinson, and J. P. Mesirov, “IntegrativeGenomics Viewer (IGV): high-performance genomics data visual-ization and exploration,” Briefings in Bioinformatics, vol. 14, no. 2,pp. 178–192, 2013.

[177] T. Abeel, T. Van Parys, Y. Saeys et al., “GenomeView: a next-generation genome browser,” Nucleic Acids Research, vol. 40, no. 2,pp. e12–e12, 2012.

[178] R. Hilker, K. B. Stadermann, O. Schwengers et al., “ReadXplorer2–detailed read mapping analysis and visualization from one singlesource,” Bioinformatics, vol. 32, no. 24, pp. 3702–3708, 2016.

[179] O. Westesson, M. Skinner, and I. Holmes, “Visualizing next-generation sequencing data with JBrowse,” Briefings in Bioinfor-matics, p. bbr078, 2012.

[180] A. R. Shifman, R. M. Johnson, and B. T. Wilhelm, “Cascade: anRNA-seq visualization tool for cancer genomics,” BMC Genomics,vol. 17, no. 1, p. 75, 2016.

[181] S. Gonzalez, B. Clavijo, M. Rivarola, et al., “ATGC transcrip-tomics: a web-based application to integrate, explore and analyzede novo transcriptomic data,” BMC Bioinformatics, vol. 18, no. 1,p. 121, 2017.

[182] T. Nussbaumer, K. G. Kugler, K. C. Bader et al.,“RNASeqExpressionBrowser-a web interface to browse andvisualize high-throughput expression data,” Bioinformatics, p.btu334, 2014.

[183] J. Fan, D. Fan, K. Slowikowski et al., “UBiT2: a client-side web-application for gene expression data analysis,” bioRxiv, p. 118992,2017.

[184] J. Harshbarger, A. Kratz, and P. Carninci, “DEIVA: a web applica-tion for interactive visual analysis of differential gene expressionprofiles,” BMC Genomics, vol. 18, no. 1, p. 47, 2017.

[185] S. X. Ge, “iDEP: An integrated web application for differentialexpression and pathway analysis,” bioRxiv, p. 148411, 2017.

[186] J. H. Lim, S. Y. Lee, and J. H. Kim, “TRAPR: R Package for Statis-tical Analysis and Visualization of RNA-Seq Data,” Genomics &Informatics, vol. 15, no. 1, pp. 51–53, 2017.

[187] Y. Chunjiang, W. Wentao, J. Wang et al., “NGS-FC: A Next-Generation Sequencing Data Format Converter,” IEEE/ACMTransactions on Computational Biology and Bioinformatics, 2017.

[188] S. Lamarre, P. Frasse, M. Zouine et al., “Optimization of anRNA-Seq Differential Gene Expression Analysis Depending onBiological Replicate Number and Library Size,” Frontiers in PlantScience, vol. 9, p. 108, 2018.

[189] Y. Liu, J. Zhou, and K. P. White, “RNA-seq differential expres-sion studies: more sequence or more replication?” Bioinformatics,vol. 30, no. 3, pp. 301–304, 2014.

[190] A. G. Williams, S. Thomas, S. K. Wyman et al., “RNA-seq Data:Challenges in and Recommendations for Experimental Designand Analysis,” Current Protocols in Human Genetics, pp. 11–13,2014.

[191] D. S. Fischer, F. J. Theis, and N. Yosef, “Impulse model-baseddifferential expression analysis of time course sequencing data,”bioRxiv, p. 113548, 2017.

[192] Z. H. Zhang, D. J. Jhaveri, V. M. Marshall et al., “A comparativestudy of techniques for differential expression analysis on RNA-Seq data,” PLoS ONE, vol. 9, no. 8, p. e103207, 2014.

[193] L. Lopez-Kleine and C. Gonzalez-Prieto, “Challenges analyzingRNA-seq gene expression data,” Open Journal of Statistics, vol. 6,no. 04, pp. 628–636, 2016.

[194] I. Zwiener, B. Frisch, and H. Binder, “Transforming RNA-Seq datato improve the performance of prognostic gene signatures,” PloSOne, vol. 9, no. 1, p. e85150, 2014.

[195] A. McPherson, F. Hormozdiari, A. Zayed et al., “deFuse: analgorithm for gene fusion discovery in tumor RNA-Seq data,”PLoS Computational Biology, vol. 7, no. 5, p. e1001138, 2011.

[196] A.-E. Saliba, A. J. Westermann, S. A. Gorski et al., “Single-cell RNA-seq: advances and future challenges,” Nucleic AcidsResearch, vol. 42, no. 14, pp. 8845–8860, 2014.

[197] L. Haghverdi, A. T. Lun, M. D. Morgan et al., “Batch effectsin single-cell RNA-sequencing data are corrected by matchingmutual nearest neighbors,” Nature Biotechnology, vol. 36, no. 5, p.421, 2018.

Hussain Ahmed Chowdhury was born in As-sam, India in 1990. He received his BE in Com-puter Science & Engineering from Jorhat Engi-neering College in 2013 and M.Tech in Informa-tion Technology in 2016 from Tezpur University.Currently, he is pursuing Ph.D. in Computer Sci-ence at Tezpur University. His current researchinterests include Machine Learning and Compu-tational Biology.

Dhruba Kumar Bhattacharyya received hisPh.D. in Computer Science from Tezpur Univer-sity in 1999. He is a Professor in Computer Sci-ence & Engineering at Tezpur University. His re-search areas include Data Mining, Network Se-curity and Content-based Image Retrieval. Prof.Bhattacharyya has published 250+ research pa-pers in the leading international journals andconference proceedings. In addition, Dr Bhat-tacharyya has written/edited many books. He isa Program Committee/Advisory Body member of

several international conferences/workshops.

Jugal Kumar Kalita is a professor of ComputerScience at the University of Colorado at Col-orado Springs. He received his Ph.D. from theUniversity of Pennsylvania in 1990. His researchinterests are in natural language processing, ar-tificial intelligence and bioinformatics. He haspublished more than 200 papers in internationaljournals and referred conferences.