A comparison of analog and Next-Generation transcriptomic tools for mammalian studies
Nicole C. Roy, Eric Altermann, Zaneta A. Park and Warren C. McNabb
Abstract
This review focuses on tools for studying a cell’s transcriptome, the collection of all RNA transcripts produced at a specific time, and the tools available for determining how these changes in gene expression relate to the functional changes in an organism. While the microarray-based (analog) gene-expression profiling technology has dominated the ‘omics’ era, Next-Generation Sequencing based gene-expression profiling (RNA-Seq) is likely to replace this analog technology in the future. RNA-Seq shows much promise for transcriptomic studies as the genes of interest do not have to be known a priori, new classes of RNA, SNPs and alternative splice variants can be detected, and it is also theoretically possible to detect transcripts from all biologically relevant abundance classes. However, the technology also brings with it new issues to resolve: the specific technical properties of RNA-Seq data differ to those of analog data, leading to novel systematic biases which must be accounted for when analysing this type of data. Additionally, multireads and splice junctions can cause problems when mapping the sequences back to a genome, and concepts such as cloud computing may be required because of the massive amounts of data generated.
Keywords: transcriptomics; microarray; Next-Generation sequencing; RNA-Seq
INTRODUCTION
Understanding biological systems comprising
large numbers of genes (approximately 20 000–25 000
protein-coding sequences for humans [1] and at least
22 000 in cattle, broadly similar to gene counts in
other mammals [2]) is challenging. Fortunately, the
tools available for studying a cell’s transcriptome and
for transforming the large volume of data that these
techniques generate into knowledge and new hypo-
theses have improved over recent years. Traditional
methods for gene expression analysis, such as northern
blotting, quantitative real-time polymerase chain
reaction (qRT–PCR) or differential display, require
the pre-selection of single genes. These methods
Nicole C. Roy is the Team Leader of Food Nutrition Genomics, Agri-Foods and Health Section at AgResearch, Palmerston North,
New Zealand, an Associate Investigator and an Adjunct Senior Lecturer at the Riddet Institute, Massey University, Palmerston North,
New Zealand and a member of Nutrigenomics New Zealand. Current research is focused on factors which affect the nutrient–gene
interactions (nutrigenomics, food–microbe–host interactions, intestinal barrier function) that regulate the supply of nutrients to tissues.
Eric Altermann is a Senior Research Scientist in Rumen Microbial Genomics in the Rumen, Nutrition and Microbiology Section at
AgResearch, New Zealand and an Associate Investigator at the Riddet Institute, Massey University, Palmerston North, New Zealand.
Current main research is focused on applied bioinformatic analyses of prokaryotes and development and application of customised
bioinformatic algorithms to identify key genetic elements within microbial genetic blueprints.
Zaneta A. Park is a Bioinformatician in the Bioinformatics, Mathematics and Statistics Group at AgResearch Grasslands, Palmerston
North, New Zealand and a member of Nutrigenomics New Zealand. Current research focuses on transcriptomic analyses of mam-
malian and other tissues.
Warren C. McNabb is Science and Technology General Manager of the Food and Textiles Group at AgResearch, Palmerston North,
New Zealand, a Professor of Nutrition at the Riddet Institute, Massey University, Palmerston North, New Zealand and a member of
Nutrigenomics New Zealand. Current research focuses on factors which affect the nutrient–gene interactions that ultimately regulate
the supply of nutrients to tissues.
Corresponding author. Nicole Roy, Team Leader, Food Nutrition Genomics, Agri-Foods and Health Section, AgResearch Grasslands,
Tennent Drive, Private Bag 11008, Palmerston North 4442, New Zealand. Tel.: +64-6-351-8110; Fax: +64-6-351-8003;
E-mail: [email protected]
BRIEFINGS IN FUNCTIONAL GENOMICS. page 1 of 16 doi:10.1093/bfgp/elr005
© The Author 2011. Published by Oxford University Press. All rights reserved. For permissions, please email: [email protected]
Briefings in Functional Genomics Advance Access published March 9, 2011
have been and are still useful, but they miss important
effects in biological processes, such as metabolic and
signaling pathways and transcriptional networks
across several pathways [3] because they are confined
to the analysis of single genes or a very limited num-
ber of selected genes of interest in a few samples.
The advent of high-throughput DNA sequencing
and the subsequent development of analog gene
expression techniques such as microarrays, repre-
sented a critical breakthrough as the simultaneous
measurement of the expression of many thousands
of genes in a sample was finally possible. The recent
development of Next-Generation Sequencing and
its use in transcriptomic analysis (RNA-Seq) now
potentially enables the quantitative measurement of
‘all’ genes expressed in a sample.
The completion of the Human Genome Project in
2003 [4, 5], the bovine genome (first draft from 2006
and publication in 2009; http://www.hgsc.bcm.tmc
.edu/project-species-m-Bovine.hgsc), followed by
other initiatives for porcine and ovine, etc. (see
http://www.genomesonline.org/ for a more com-
plete list), and the subsequent availability of complete
genomic DNA sequence data are only a few rea-
sons why these newer technologies have advanced
so quickly. Initiatives that take advantage of these
technologies have been established (see Table 1 for
examples). They span disciplines as diverse as
nutrition, medicine, parasitology, molecular
biology, chemistry, mathematics and bioinformatics
in human and livestock research. These programmes
are leading recent advances in human and animal
genomics [6, 7].
This review presents an overview of currently
existing and emerging transcriptomic technologies,
including their advantages and limitations, and data
management and analysis strategies and concerns.
Available software solutions and the integration of
different data sets are also discussed.
ANALOG METHODS FOR TRANSCRIPTOMIC ANALYSIS
Background
The measurement of mRNA levels for a complete
set of transcripts in a single assay can be accomplished
using DNA and oligonucleotide microarray (chip)
technology. This analog high-throughput method-
ology has become a standard tool for gene expression
profiling, facilitating the analysis of genome-wide
expression patterns, whether there is a sequenced
genome or not (although the sequence of the
probes on an array is usually known), to establish
gene networks and identify new genes involved in a
phenotype.
Microarrays have been used to study many differ-
ent organisms. Their use includes nutritional studies
[3, 8–15] and livestock research [16–21] where they
are used to characterize metabolic pathways in tissues
or cells important to phenotypic outcomes. For ex-
ample, in cattle, they have been used to investigate
folliculogenesis, ovulation and oocyte quality, and
early embryonic development [22].
Array types
Hybridization arrays include macroarrays (spot diameter
of each probe >300 µm) and microarrays (spot
diameter of each probe <250 µm). Microarrays, the
most commonly used of the hybridization arrays,
usually consist of a predefined arrangement of a
large number of probe (DNA) sequences (whole
genome or partial) immobilized on a solid surface
that serve as a hybridization substrate for cRNA or
cDNA fragments generated from a tissue or cell
sample (target). RNA extracted from a tissue or
Table 1: Examples of worldwide genomic initiatives
Initiative | Scope | Website | Country
Nutrigenomics organization | Nutrigenomics | www.nugo.org | Europe
Nutrigenomics New Zealand | Nutrigenomics | www.nutrigenomics.org.nz | New Zealand
Fugato | Bovine, porcine, ovine and equine | www.fugato-forschung.de | Germany
Milk genomics and human health | Mammalian milk genomics | www.milkgenomics.org | USA
International sheep genomics consortium | Ovine | https://isgcdata.agresearch.co.nz/ | International
The bovine genome sequencing and analysis consortium | Bovine | http://www.bcm.edu/news/packages/bovinegenome.cfm | International
cell sample is amplified and reverse transcribed to
produce cDNA labelled with a fluorophore or radi-
olabel which is hybridized to the microarray chip
under stringent conditions. After hybridization is
completed, the microarray slides are washed then
scanned using a microarray scanner to detect the
intensities of the target signal and the background
noise [23].
While in-house microarray production still exists
(especially for incomplete, non-sequenced or confi-
dential genomes), several commercial platforms offer
microarray products of high quality. There are many
variants of this technology that can be grouped in
two categories based on the length and origin of the
spotted DNA: (i) the DNA microarray, where PCR
amplicons of a few hundred base pairs of denatured
double-stranded DNA are spotted onto glass slides;
(ii) the oligonucleotide microarray, where chemically
synthesized single-stranded DNA is immobilized
on glass slides. Examples include the Agilent
platform, where probes of 40–70 nt in length are
tiled, and the Affymetrix platform, where 25-mers
are tiled, with typically 11–16 25-mers tiled for
each gene. For the latter platform, ‘mismatch’
probes are also tiled for each probe (i.e. the middle
nucleotide is replaced with another) to allow adjust-
ment of the data for cross-hybridization, therefore
increasing the reliability of the microarray data
although at the expense of capacity. More recently,
Affymetrix introduced all-exon arrays (HuEx arrays,
i.e. Human, Mouse and Rat Exon/Gene 1.0 ST
arrays) which differ significantly from the traditional
3′-expression arrays described above. Here, exons
are covered by only four probes and T7-linked
random hexamers used for cDNA synthesis eliminate
the need for intact poly-A tails. Studies comparing
3′-expression arrays to HuEx arrays reported a high
level of cross-platform comparability with only a
limited number of recognized problems, such as dif-
ferences in detection thresholds [24]. Numerous
mammalian DNA and oligonucleotide microarrays
are available, including for humans, livestock species
(e.g. cattle, pig, horse, sheep and chickens) and other
mammals (e.g. canine, rabbit, Rhesus macaque, etc).
For species where microarrays are less readily avail-
able (e.g. goat), cross-species hybridization to exist-
ing arrays is possible though not ideal [22, 25].
Custom microarrays can also be used to tile a
chosen set of genes of interest.
Additionally, a third category of microarray exists,
the beadarray (Illumina platform), which consists of
thousands of three-micron silica beads each coated
with hundreds of thousands of copies of a specific
oligonucleotide sequence, which self-allocate ran-
domly across an array. A decoding process is used
to determine which bead occupied each well.
Beadarrays typically comprise multiple copies
(around 30) of each bead type per array. The beads
are combined with more than 1000 control bead
types on the arrays (used as negative controls)
which, with the random allocation of multiple
copies of each bead type, result in high quality data
at relatively low cost and sample input [26].
Variations on the microarray theme [e.g. exon
junction arrays, tiling arrays, fusion chips and single
nucleotide polymorphism (SNP) chips] allow micro-
array technology to be used for more than gene ex-
pression profiling. Exon junction, tiling and fusion
arrays allow detection of alternative splice variants of
a (fused) gene [27, 28]. SNP chips allow the identi-
fication of SNPs within and between populations
[29, 30] enabling uses such as assessing levels of
genetic variability including loss of heterozygosity
[31, 32], detection of allelic imbalance [33] and
construction of linkage disequilibrium maps which
facilitate the association of genetic variation with
economically important traits, for example as
developed and characterized in bovine [34–36].
The recent development of an ovine SNP chip con-
taining 1536 SNPs represents the first time that the
sheep genome has been assayed on a genome-wide
basis. Using the allele frequencies at each SNP, calculation
of genetic parameters (e.g. genetic distance)
has allowed the levels of genetic variability both
within and between a diverse group of ovine popu-
lations to be determined. This in combination with
cluster analysis has shown that sheep are character-
ized by weak phylogeographic structure, overlapping
genetic similarity and generally low differentiation
which is consistent with their short evolutionary
history [31].
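As a rough illustration of how a genetic distance can be derived from per-SNP allele frequencies, the sketch below computes Nei's standard genetic distance between two populations for biallelic SNPs. The choice of Nei's measure, the function name and the example frequencies are assumptions made for illustration only; the study cited above [31] may have used a different distance measure.

```python
import math

def nei_standard_distance(freqs_x, freqs_y):
    """Nei's standard genetic distance from per-SNP allele frequencies.

    freqs_x and freqs_y hold the frequency of one (arbitrarily chosen) allele
    at each biallelic SNP in populations X and Y, in the same SNP order.
    """
    if len(freqs_x) != len(freqs_y) or not freqs_x:
        raise ValueError("need the same non-empty set of SNPs in both populations")
    # Probability that two alleles drawn within X, within Y, or one from each
    # population are identical, summed over SNPs.
    j_x = sum(p * p + (1 - p) ** 2 for p in freqs_x)
    j_y = sum(q * q + (1 - q) ** 2 for q in freqs_y)
    j_xy = sum(p * q + (1 - p) * (1 - q) for p, q in zip(freqs_x, freqs_y))
    # Normalized identity; the log is undefined if the populations share no alleles.
    identity = j_xy / math.sqrt(j_x * j_y)
    return -math.log(identity)

# Hypothetical allele frequencies at four SNPs in two breeds.
breed_a = [0.50, 0.20, 0.90, 0.65]
breed_b = [0.45, 0.30, 0.80, 0.70]
print(round(nei_standard_distance(breed_a, breed_b), 4))
```

A distance near zero indicates very similar allele frequencies, which is the pattern one would expect for weakly differentiated populations.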
Microarray limitations
Microarrays are now an affordable technique that
provides RNA expression pattern data based on a
high-throughput and semi-quantitative analysis of
light signal intensity. They do, however, have
their limitations [37]. The data need to be normal-
ized to remove spatial artefacts and systematic biases,
and appropriate statistical analysis must be used to
reduce the number of false positives obtained from
testing so many genes at once. Furthermore, as the
technique relies on hybridization, it brings a range of
related potential problems such as background
hybridization levels (including cross-hybridization),
differential probe hybridization properties and dye
binding variances [37–40]. These variables mean
that microarrays do not easily quantify the expression
pattern of low abundance transcripts as the low in-
tensity fluorescence signal is difficult to distinguish
numerically and statistically from background noise.
Conversely, signal saturation can occur at high inten-
sities therefore limiting the ability to compare the
level of expression of transcripts which are expressed
at very high levels [41–43]. The information gener-
ated with hybridization arrays is also limited to the
number of probes on the microarray slide and usually
to genes with known sequence. Microarrays are also
constrained in their ability to detect splice variants,
either because not all the forms are tiled or they
are too closely related for hybridization methods to
distinguish [42, 44].
ANALOG-DERIVED TRANSCRIPTOMIC DATA
Analysis methods
Determining the biologically significant changes in
gene expression levels from a large amount of gene
data is still a challenge for microarray-based transcrip-
tomic data.
Analysis of microarrays typically comprises quality
control and normalization, followed by determin-
ation of a list of differentially expressed genes based
on fold change and some type of significance criterion
(most commonly the P-value).
The latter is usually calculated from a t-test, with
the recommendation that the variance estimate
used should be determined using both gene-specific
information and information from across all genes
[22, 45, 46]. Volcano plots are an effective way of
summarizing the results for the two criteria [46].
Additionally, adjustments for multiple testing are
usually applied to control the number of false posi-
tives, with the false discovery rate (FDR) [47] a
popular choice. Mixture model methods which
treat genes as being composed of two populations:
one differentially expressed and one not, are also an
option [45].
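The selection step described above can be sketched as follows. This is a minimal illustration, assuming that per-gene P-values have already been obtained from a suitable test; it computes a log2 fold change, applies the Benjamini-Hochberg adjustment to control the FDR, and keeps genes passing both criteria. The function names, thresholds and example values are hypothetical, and the moderated-variance t-test recommended in the text is not implemented here.

```python
import math

def log2_fold_change(treated, control):
    """Log2 ratio of mean treated to mean control intensity (all values > 0)."""
    mean_t = sum(treated) / len(treated)
    mean_c = sum(control) / len(control)
    return math.log2(mean_t / mean_c)

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original gene order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_genes(names, lfc, p_values, min_abs_lfc=1.0, max_fdr=0.05):
    """Keep genes that pass both the fold-change and the FDR threshold."""
    adj = benjamini_hochberg(p_values)
    return [n for n, f, q in zip(names, lfc, adj)
            if abs(f) >= min_abs_lfc and q <= max_fdr]

# Hypothetical results for four genes.
names = ["g1", "g2", "g3", "g4"]
lfc = [2.0, 0.5, -1.5, 3.0]
p_values = [0.01, 0.04, 0.03, 0.005]
print(select_genes(names, lfc, p_values))
```

Plotting each gene's log2 fold change against the negative log10 of its P-value gives the volcano plot mentioned above.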
An alternative analysis approach involves directly
determining sets of genes which can be used to suc-
cessfully differentiate samples from different treat-
ments. The random forest procedure has been
shown to be a useful method for doing this as it
yields very small sets of genes (often smaller than
alternative methods) while preserving predictive
accuracy [48].
Available software
Behind these analysis tools are sophisticated mathematical
and statistical models which have been implemented
in a variety of open-source and commercial
software packages. Some of the best-known
packages include GeneSpring
(http://www.chem.agilent.com/en-US/products/software/lifesciencesinformatics/genespringgx/Pages/default.aspx),
GenStat (http://www.vsni.co.uk/software/genstat/, [49]),
Spotfire and the Bioconductor suite
of microarray analysis packages written in the R pro-
gramming language (www.bioconductor.org, freely
available). The latter includes linear models for
microarray analysis (limma, [50]), affy [37] and sim-
pleaffy [51] for Affymetrix data analysis, lumi [52] for
Illumina beadarray analysis and arrayQualityMetrics
[53] for quality control.
Most analysis programs accept data from a variety
of sources (e.g. Affymetrix, Agilent, Illumina and in-
house data). The Bioconductor suite of packages has
the advantage that it is open source and readily ac-
cessible, thus facilitating collaborative projects. Many
software packages can also be further customized as
they include application programming interfaces
which allow them to interact directly with other
tools. For example, GeneSpring and Spotfire both
include modules which allow R code to be used
within the package while providing easy interactive
visualization of the data and results. A more compre-
hensive summary of many packages and their inter-
activity can be found in Supplementary Table S1
of [54].
Interfaces to a range of underlying tools have also
been developed, including Chipster (http://nami
.csc.fi/features.shtml, freely available for local instal-
lations or alternatively the CSC server can be used
for a fee) which allows one to perform DNA micro-
array data analysis with R/Bioconductor and other
tools through an intuitive graphical user interface, and
GenePattern (freely available). GenePattern was de-
veloped at the Broad Institute of Massachusetts
Institute of Technology and Harvard University. It
is a software environment which incorporates a wide
range of already developed tools and has the ability
to adopt new methods. The GenePattern platform
consequently integrates a large number of analytical
tools for genomic data, facilitates data entry and
allows parameters to be set with regard to quality
control, statistics, gene enrichment analysis, etc.
Furthermore, custom modules in R/Bioconductor,
MATLAB or the Perl or Java programming lan-
guages can be implemented. GenePattern includes
60–100 pre-packaged analysis modules and more
complex methodologies (not only for microarray,
but also for epigenetic, SNP, proteomic and
sequence analyses) aligning analyses into a single, re-
producible pipeline that caters for users at all levels
of computational experience [54–56].
Validation of results
Selected differentially expressed genes are further
investigated by qRT–PCR to confirm the expression
patterns seen in microarrays. The use of PCR pri-
mers targeting transcript variants is one reason for
discrepancies between these two methods, thus
care has to be taken to design PCR primers which
recognize the same transcript as each microarray
probe [57]. The minimum information for publication
of quantitative real-time PCR experiments (MIQE)
guidelines provide best experimental practice for
qRT–PCR to generate data that are more uniform,
more comparable and ultimately, more reliable [58].
RNA-Seq METHODS FOR TRANSCRIPTOMIC ANALYSIS
Background
RNA-Seq transcriptomics replaces the hybridization
of nucleotide probes with sequencing individual
cDNA species followed by counting and mapping al-
gorithms. Emerging methods for these fully quantita-
tive transcriptomic analyses have the potential
to overcome the limitations of microarray technology
and may replace them. Early attempts reach as far back
as 1997 [59] where random clones derived from
cDNA libraries were sequenced using fluorescent
well and capillary DNA sequencers (i.e. Applied
Biosystems ABI PRISM 373, 377 and 3730xl DNA
Sequencer). Variations in gene expression levels
were deduced from the counts of respective sequence
tags. However, restrictions in technology meant
sequencing and analysing large numbers of sequences
was slow, expensive and labor intensive so that only
a relatively small number of clones (usually in the
thousands) were sequenced. Besides the small sample
size, cloning bias also introduced new drawbacks.
Today, Next-Generation Sequencing techniques
have surpassed these relatively low throughput
technologies. In principle, all competing products
are based on the sequencing-by-synthesis technique [60]
or sequencing-by-ligation
(http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_058265.pdf).
Sequencing-by-synthesis relies on the
detection of nucleotides immediately after incorpor-
ation into a newly synthesized DNA strand, whereas
sequencing-by-ligation involves the binding of
known probes to the sequence. While the principle
remains unchanged, a number of variants and im-
provements have been introduced [61].
Number and length of reads obtained using different platforms
Currently, three Next-Generation Sequencing systems
[Roche 454 Life Sciences (Pyrosequencing),
Illumina and Applied Biosystems (SOLiD 3 Plus,
SOLiD 4 and SOLiD 4hq)] are dominating the market
for RNA-Seq with emerging alternative systems
(e.g. Helicos BioSciences [62]). These systems can
also be used for other specific applications such as
the targeted resequencing of genomes, ChIP-Seq
or copy number variation analyses [63]. Next-
Generation Sequencing systems have brought a
dramatic change in scale to DNA sequencing.
Traditional Sanger-type plate and capillary sequencers
can provide up to 96 reads per run, albeit at
longer read lengths in excess of 1000 nt. Roche
454 Pyrosequencing systems generate hundreds of
thousands of reads per run and, with the introduc-
tion of the latest Flx-Titanium system upgrade, have
extended read length up to 500 nt. This extension in
sequence read length has been achieved by the use of
a thin metal coating (titanium) that is applied to the
pico titer plate walls eliminating crosstalk between
individual wells. The coating improves the signal
to noise ratio and also increases the number of sam-
ples that can be analysed in one plate. Recent an-
nouncements by Roche 454 unveiled yet another
doubling of read length, targeting the 1000 nt barrier
(http://454.com/about-454/news/index.asp?display=detail&id=137). Finally, solid state sequencing
systems, such as Illumina or ABI SOLiD, can gener-
ate sequence reads in excess of 100 million per
run, albeit with the shortcoming of very short
read-lengths of <50–100 nt. Indeed, the Applied
Biosystems SOLiD 4 and SOLiD 5500xl (which delivers
on the promise of the SOLiD 4hq) systems have recently
been released with up to 300 Gb mappable
data throughput. While the scalability factor reduces
the cost per run ($US3000 per genome or $US120
per transcriptome, http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/solid-next-generation-sequencing/next-generation-systems/solid4hq.html;
http://www.lifetechnologies.com/news-gallery/press-releases/2010/life-techologies-lauches-ew-solid-sequecer-to-drive-advaces-i-c.html), other potentially limiting factors
such as short read lengths of 75 bp or less or the
necessity of creating fragment libraries still remain.
RNA-Seq gene expression profiling
Due to their read-length restrictions these ultra-
high-throughput systems may not necessarily be
suited for de novo genome sequencing (with the
recent exception of 454 pyrosequencing with its
longer reads). However, these limitations are irrele-
vant for RNA-Seq gene expression profiling. Here,
the promise of hundreds of millions of reads creates
for the first time the realistic opportunity to assess the
transcriptome of an organism on a holistic level
without cloning bias (although each platform may
have its own associated biases, see below) [64, 65].
Briefly, current technologies utilize a common
principle: RNA is converted to a library of cDNA
fragments via either RNA or cDNA fragmentation
and adapters are attached to one or both ends of the
fragments. Individual cDNA species are then ampli-
fied into separate clusters to amplify the signal inten-
sity. These clusters are then sequenced from one
(single-end sequencing) or both ends (paired-end
sequencing) and the reads aligned to a reference
genome or transcriptome [43, 65, 66]. The number
of sequencing reads mapped to each gene is then
tabulated and normalized. In general, the larger the
genome and the more complex the transcriptome,
the higher the number of reads and the greater the
sequencing depth that is required for adequate
coverage [43].
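The tabulate-and-normalize step can be sketched as follows. This is a minimal illustration, assuming that raw per-gene read counts and gene lengths are already available; it shows two widely used conventions, counts per million (CPM) and reads per kilobase of transcript per million mapped reads (RPKM). The function names and example values are hypothetical, and other normalization schemes exist.

```python
def counts_per_million(counts):
    """Scale raw per-gene read counts by the library size (total mapped reads)."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}

def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene length per million mapped reads."""
    total = sum(counts.values())
    # 1e9 combines the per-million (1e6) and per-kilobase (1e3) scaling factors.
    return {gene: c * 1e9 / (total * lengths_bp[gene])
            for gene, c in counts.items()}

# Hypothetical counts for a tiny library of 1000 mapped reads.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}
print(counts_per_million(counts))
print(rpkm(counts, lengths_bp))
```

In this example geneB receives three times as many reads as geneA but is also three times as long, so the two genes end up with the same length-normalized value.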
Standard protocols use random fragmentation of
cDNA or RNA which means that more than one
fragment may be produced from a single transcript.
However, different processing methods can be com-
bined with RNA-Seq and give the advantage that
only a single tag for each transcript is produced. Such
processing methods include serial analysis of gene
expression (SAGE) (SOLiD-SAGE; http://tools
.invitrogen.com/content/sfs/manuals/SOLiD_
SAGE_man.pdf), cap analysis of gene expression
(CAGE) and massively parallel signature sequencing
(MPSS) [67]. Disadvantages include that non-coding
RNAs may not be detected, the methods rely on the
presence of particular restriction sites [68], and the
fragments may be considerably shorter (for example,
27 bp for the SOLiD-SAGE protocol) and therefore
do not detect splice isoforms [42].
More recently, experimental evidence indicates
that RNA fragmentation, compared to cDNA frag-
mentation, may significantly improve the uniformity
of sequence coverage across transcripts, so allowing
greater sensitivity of detection, accuracy of quantifi-
cation and completeness of splice and exon maps.
RNA fragmentation may work better because of
3′-bias induced during cDNA synthesis and also
the secondary RNA structure may mean that prim-
ing is not truly random but instead some sites are
favoured over others [42].
Additionally, whether a strand-specific protocol is
used in RNA-Seq is an important factor to consider
[68], particularly when studying mammals where
antisense transcription has been shown to be a ubi-
quitous phenomenon [69].
Advantages of RNA-Seq
Initially, Next-Generation Sequencing techniques
were used to provide insights into the way genes
are expressed and regulated in cells. More recently,
they have also been heavily utilized to determine
new classes of RNA, SNPs, unknown transcripts,
splicing events, etc. which were inaccessible on a
global level using older technologies [42, 44, 70–74].
The digital nature of RNA-Seq gene expression
studies holds the promise of true quantitative analyses
[43]. While the implementation of this technique is
still in its infancy and essentially in the validation stage, its
current prohibitive cost is expected to decrease and
this technology will become more and more access-
ible. Out of the currently dominating Next-
Generation Sequencing technologies, Illumina and
SOLiD platforms are better suited for RNA-Seq applications
than Roche 454 Pyrosequencing. This is
largely due to the much greater number of individual
sequence reads and the resulting increased depth of
coverage. If RNA-Seq is to be combined with
de novo genome sequencing, Illumina/SOLiD data can
easily be augmented by paired-end pyrosequencing
reads, creating a robust genome scaffold for the
shorter Illumina/SOLiD reads.
RNA-Seq creates the new gold standard for gene
expression studies as transcripts from all biologically
relevant abundance classes should theoretically be
able to be detected assuming enough reads are
collected from a sample [42, 43, 70]. Indeed, using
one of the earlier RNA-Seq gene expression tech-
nologies (digital gene expression profiling) with 20
million tags per library, 10–20% more transcripts
were detected than with microarrays, a majority of
which were expressed at levels below the sensitivity
threshold of microarray platforms [41]. A comparison
of the Illumina sequencing platform with the
Affymetrix microarray platform showed that 81%
of differentially expressed genes from arrays were de-
tected with Illumina and more of these genes were
true positives with the Illumina technology [71].
Additionally, comparison of relative RNA-Seq read
densities to published qRT–PCR measurements for
787 genes in two reference RNA samples yielded a
nearly linear relationship across five orders of mag-
nitude, indicating that RNA-Seq read counts give
accurate relative gene expression measurements
across a very broad dynamic range [44].
Alternative splice variants have been proposed as a
primary driver of the evolution of phenotypic com-
plexity in mammals [44]. The promise that RNA-Seq
gene expression studies can easily identify these using
direct sequencing may prove to be a major advantage
compared to hybridization methods which cannot
identify closely related forms except via expensive
high-resolution tiling arrays. For example, RNA-
Seq methods found that 3500 mouse genes were
alternatively spliced [42] and 4096 previously un-
known splice junctions in 3106 human genes were
detected in a recent study [70]. Indeed, in humans
there is evidence for multiple isoforms for >95% of all
multi-exon genes. These transcripts are the result of
alternative transcription starts, alternative splicing,
RNA editing and alternative poly-adenylation [74].
RNA-Seq is also useful for discovering novel
microRNAs (small RNAs ~20–25 nt in length)
[75]. However, sequence read variations due to ma-
chine error may be higher than the variation found
within a microRNA family, so one must be cautious
in interpreting these data. Also, the accuracy of RNA-
Seq readings may not necessarily be better than that
of microarrays: indeed, a recent study which used
synthetic microRNAs found that microarrays mea-
sured the expression levels of microRNAs better
than RNA-Seq [76].
Examples of biological applications of RNA-Seq
Although RNA-Seq can still be considered an emerging
technology, it has generated new knowledge of
biological systems. For example, RNA-Seq was used
to characterize the total non-ribosomal transcriptome
of human, chimpanzee and rhesus macaque brain
[77]. In this study, the authors showed that while
transcriptome divergence between species increases
with evolutionary time, intergenic transcripts show
more expression differences among species and exons
show less. These as yet uncharacterized, evolutionarily
conserved transcripts that exist in the human brain
may play roles in transcriptional regulation and con-
tribute to evolution of human-specific phenotypic
traits. Another example relates to the use of a
novel, strand-specific RNA-Seq method. Using
this method with tumors and matched normal
tissue from three patients with oral squamous cell
carcinomas, Tuch et al. [78] showed that it accurately
measures allelic imbalance and that measurement on
the genome-wide scale yields novel insights into
cancer etiology. Cancer-related functions such as
cell adhesion and differentiation functions were
found to be enriched in the set of genes differentially
expressed in the tumors, but, unexpectedly, also in
the set of allelically imbalanced genes.
RNA-Seq TRANSCRIPTOMIC DATA
Volume of data produced
As with the analog technique, a large volume of data
is produced. Indeed, the arrival of the RNA-Seq era
increased the data volume by several orders of magnitude and
handling of this amount of data is an important con-
sideration, both in terms of collecting and managing
the data and the computer hardware (server space)
and software required [43, 79, 80]. The amounts of
data are so large that if current trends continue, it will
soon cost less to sequence a base of DNA than to
store it on a hard disk [80]. Alternative data manage-
ment concepts such as cloud-based computing (essen-
tially renting server space) are already available [81].
For example, scientists can currently establish an
account with Amazon Web Services or Microsoft
Azure, attach any one of several large public
genome-oriented data sets to the virtual machine
and analyse this data using any one of several installed
software packages. There are also a growing number
of academic-based clouds, for example the Open
Cloud Consortium (http://opencloudconsortium.org/).
These may be a better option long-term
as academic clouds are more likely to be able to
tune their performance to the specific needs of the
scientific community, for example data read and
write speeds need to be very high for genomic data
[80].
Statistical tests
Various tests of differential expression have been
proposed for replicated RNA-Seq data using bino-
mial, Poisson, negative binomial or pseudo-
likelihood models for the counts [66, 82]. How to
best analyse RNA-Seq data is an active field of re-
search. Robinson and Smyth [82] have recently de-
veloped a method using the negative binomial
distribution to model over-dispersion relative to
the Poisson, and use conditional weighted likelihood
to moderate the level of over-dispersion across genes.
This method is suitable even when the number of
replicates is very small. Additionally, methods from
the SAGE literature may be useful for analysing
RNA-Seq data [66].
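The idea of over-dispersion relative to the Poisson can be illustrated with a small sketch. Under a negative binomial model the variance of a count is mu + phi * mu^2, so a simple method-of-moments estimate of the dispersion phi is (variance - mean) / mean^2. This is an illustration only, not the conditional weighted likelihood estimator of Robinson and Smyth [82]; the function name and the replicate counts are hypothetical.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments estimate of phi in Var = mu + phi * mu**2.

    Returns 0.0 when the counts are no more variable than a Poisson.
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance (n - 1 denominator)
    return max(0.0, (var - mu) / mu ** 2)


# Hypothetical counts for one gene across four replicate libraries
replicate_counts = [90, 130, 70, 150]
phi = moment_dispersion(replicate_counts)
print(f"estimated dispersion: {phi:.3f}")
```

With very few replicates such a per-gene estimate is unstable, which is the motivation for moderating the dispersion across genes as described above.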
Analysing counts of alternative isoforms creates
particular analytical problems. Initial analysis meth-
ods, for example the Poisson test, have focused
on first assigning reads to transcripts and then testing
for differential expression. Stegle et al. [74] describe a
modification to this approach, the Poisson Region
test, which only utilizes information about the
discriminative regions of a gene. They also present
a non-parametric kernel method, the Maximum
Mean Discrepancy (MMD) test, which directly
tests for differences of the observed read distri-
bution from different samples in the complete
absence of any annotation information. In compar-
ing these three methods using simulated and
real data, the Poisson Region test was the most
sensitive. However, the MMD test was still able to
detect 75% of the differentially expressed tran-
scripts that the Poisson Region test could.
Additionally, the MMD test has the advantage that
it can detect differential expression even if only one
annotation is currently known for a gene. It also does
not depend on the accuracy of existing gene
annotations.
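The annotation-free comparison of read distributions can be sketched as follows. This is a minimal, biased estimate of the squared Maximum Mean Discrepancy between two sets of read start positions, using a Gaussian kernel. It is not the implementation of Stegle et al. [74]; the kernel choice, the bandwidth and the read positions are assumptions made for illustration.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two read positions."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(sample_a, sample_b, bandwidth=50.0):
    """Biased estimate of the squared MMD between two samples."""

    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(sample_a, sample_a)
        + mean_kernel(sample_b, sample_b)
        - 2.0 * mean_kernel(sample_a, sample_b)
    )


# Hypothetical read start positions within one gene
condition_1 = [100, 120, 140, 400, 420]
condition_2 = [100, 120, 140, 160, 180]
same = mmd_squared(condition_1, condition_1)
different = mmd_squared(condition_1, condition_2)
print(same, different)
```

A statistic of this kind is usually turned into a P-value by permuting the sample labels and recomputing it, although that step is omitted here.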
Experimental design and quality control
Many issues must be considered when planning an
RNA-Seq experiment (Figure 1). No matter which
method is used or how many reads are generated,
generally accepted experimental design principles,
such as randomization of samples to lanes or
plates and sufficient biological replication, should be
followed when designing RNA-Seq experiments.
Biological replication is essential as otherwise the
results from an experiment cannot be generalized.
Similarly, randomization and blocking are equally
important factors in reducing the effects of batch,
lane or flowcell variations. We refer the reader to
the excellent paper which Auer and Doerge have
recently written [66]. This clearly explains key stat-
istical principles which should be incorporated when
designing and analysing RNA-Seq experiments.
They also provide practical suggestions, for example
barcoding may be a useful tool for creating balanced
block designs.
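A balanced block design of the kind barcoding makes possible can be sketched in a few lines. The example assumes one barcoded replicate of every condition per lane, so that lane effects are not confounded with treatment; the function name and the sample labels are hypothetical.

```python
import random


def balanced_lane_design(samples_by_condition, n_lanes, seed=0):
    """Randomly assign one sample of each condition to every lane."""
    rng = random.Random(seed)  # fixed seed so the design is reproducible
    lanes = [[] for _ in range(n_lanes)]
    for condition, samples in samples_by_condition.items():
        if len(samples) != n_lanes:
            raise ValueError("this sketch needs one replicate per condition per lane")
        shuffled = list(samples)
        rng.shuffle(shuffled)  # randomization within each condition
        for lane, sample in zip(lanes, shuffled):
            lane.append((condition, sample))
    return lanes


design = balanced_lane_design(
    {"control": ["C1", "C2", "C3"], "treated": ["T1", "T2", "T3"]},
    n_lanes=3,
)
for number, lane in enumerate(design, start=1):
    print(f"lane {number}: {lane}")
```

Each lane then acts as a complete block containing both conditions, so that lane-to-lane variation can be separated from the treatment effect in the analysis.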
Quality control is also an important aspect of
RNA-Seq data analysis. For example, it is useful to
plot both the proportions of each nucleotide type,
and the base quality scores, for each sequence pos-
ition. A filter can then be applied to trim the se-
quence ends if they contain bases which are of low
quality or which have atypical nucleotide
proportions.
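The two quality checks described above can be sketched as follows, assuming reads of equal length and Phred-scaled base qualities. The quality threshold of 20 and the example reads are arbitrary illustrative choices, and the function names are hypothetical.

```python
def base_proportions(reads):
    """Per-position proportion of each nucleotide across equal-length reads."""
    table = []
    for pos in range(len(reads[0])):
        column = [read[pos] for read in reads]
        table.append({base: column.count(base) / len(column) for base in "ACGTN"})
    return table


def trim_low_quality_tail(sequence, qualities, min_quality=20):
    """Remove trailing bases whose Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


reads = ["ACGT", "ACGA", "TCGT", "ACGN"]
proportions = base_proportions(reads)
trimmed_seq, trimmed_qual = trim_low_quality_tail(
    "ACGTACGT", [38, 37, 36, 35, 30, 18, 12, 5]
)
print(proportions[0], trimmed_seq)
```

In practice, dedicated tools such as the ShortRead package or the FASTX-Toolkit mentioned later in this review would be used for this step.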
Mapping considerations
With analog expression data, one usually knows
what the genes are in advance, whereas with
RNA-Seq, all transcripts need to be mapped back
to a reference genome or transcriptome.
Difficulties in mapping transcripts to genes can
occur and mammalian genomes in particular create
difficulties as they are large, complex and often con-
tain families of paralogous genes, repeats and retro-
posed pseudogenes for highly expressed
housekeeping genes. Therefore individual reads, par-
ticularly shorter ones, may map to more than one
gene. Such multiread transcripts cannot be simply
discarded, as these genes, for example those in the
ubiquitin family, will then be undercounted or not
even reported. Alternative approaches such as distri-
buting multireads in proportion to the number of
unique and splice reads recorded at similar loci or
using orthogonal data (for example RNA polymer-
ase II occupancy data) have been proposed to help
resolve these issues [42, 73].
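Proportional allocation of multireads can be sketched as below. The sketch is a simplification: it weights each candidate gene only by its uniquely mapped read count, whereas the approach described above also uses splice reads, and it falls back to an equal split when no candidate gene has unique reads. The function name and the counts are hypothetical.

```python
def allocate_multireads(unique_counts, multiread_hits):
    """Add fractional multiread counts to genes in proportion to unique reads.

    unique_counts: dict mapping gene name to uniquely mapped read count.
    multiread_hits: one tuple of candidate genes per multiread.
    """
    totals = {gene: float(count) for gene, count in unique_counts.items()}
    for genes in multiread_hits:
        weight_sum = sum(unique_counts.get(gene, 0) for gene in genes)
        for gene in genes:
            if weight_sum > 0:
                share = unique_counts.get(gene, 0) / weight_sum
            else:
                share = 1.0 / len(genes)  # no unique evidence: split equally
            totals[gene] = totals.get(gene, 0.0) + share
    return totals


# Hypothetical counts: eight reads map equally well to two ubiquitin genes
unique_counts = {"UBB": 30, "UBC": 10, "GAPDH": 200}
multiread_hits = [("UBB", "UBC")] * 8
totals = allocate_multireads(unique_counts, multiread_hits)
print(totals)
```

Discarding the eight multireads instead would leave both ubiquitin genes undercounted, which is the problem noted above.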
Mapping splice junctions is also an important issue
to consider when mapping reads from complex
mammalian (and other) genomes where reads may
span large introns [42, 73]. Two main approaches are
currently used: the reference genome may be sup-
plemented with known splice junction information
(including information from gene models) or alter-
natively the splice junctions can be determined with-
out a reference annotation. TopHat (http://tophat.cbcb.umd.edu/;
[83]) is a powerful freely available
Figure 1: Description of RNA-Seq platform, protocol considerations and workflow.
mapping program which can map reads using any
combination of these methods. A range of other
mapping programs also exist: see Table 2 in Pepke
et al. [73] for an excellent summary. Developing
in silico methods to map splice junctions agnostically
(i.e. independently of existing genome annotations)
is still an active area of research though. Hammer
et al. [84] have recently developed a series of novel
bioinformatic tools which advance RNA-Seq bio-
informatics toward unbiased transcriptome capture.
It is likely that further tools and methodology will be
developed in the short to medium term.
Normalization and biases
As for the analog transcriptomic analyses, the efficiency
of RNA extraction and the quality of
cDNA synthesis remain as variables. Additionally,
the specific technical properties of RNA-Seq data
differ to those of analog data, leading to novel sys-
tematic biases which must be accounted for in the
analysis.
The number of reads obtained per sample usually
differs for RNA-Seq. Thus a range of normaliza-
tion methods for RNA-Seq based on the total
number of reads for each sample have been reported.
However it has recently been shown that the com-
position of the RNA population is also import-
ant [85]. Transcripts which are highly expressed in
only some samples, due to true biological differences
(e.g. genes which are only expressed in liver and
not kidney) or contamination, reduce the sequen-
cing ‘real estate’ available for the remaining genes,
meaning that these genes will be under-represented
if the data is normalized solely using a total read
count approach. A normalization method,
Trimmed Mean of M values (TMM), which accounts
for this issue, has been proposed by Robinson
and Oshlack [85]. The method assumes, similarly
to common microarray normalization methods
(e.g. loess and quantile) that the majority of genes
are not differentially expressed. It then determines
the relative RNA production for all genes in a
sample using a global fold change approach calcu-
lated by using trimmed means. This normaliza-
tion method can be implemented using the edgeR
package in Bioconductor ([86]; www.bioconductor.org).
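The composition effect and the trimmed-mean idea can be illustrated with a deliberately simplified sketch. It computes per-gene log2 ratios of read proportions (M values) between a sample and a reference library, trims a fixed fraction from each end and returns 2 raised to the mean of what remains. The published method additionally trims on absolute expression and weights genes by their precision, neither of which is done here; the function name, the trim fraction and the counts are assumptions.

```python
import math


def simple_tmm_factor(sample, reference, trim=0.3):
    """Unweighted trimmed mean of M values, returned as a scaling factor."""
    n_sample, n_ref = sum(sample), sum(reference)
    m_values = sorted(
        math.log2((s / n_sample) / (r / n_ref))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0  # genes with a zero count are skipped
    )
    k = int(len(m_values) * trim)
    kept = m_values[k:len(m_values) - k]
    if not kept:
        raise ValueError("trim fraction leaves no genes")
    return 2.0 ** (sum(kept) / len(kept))


# Nine genes are unchanged; one gene is very highly expressed in the sample
reference = [100] * 10
sample = [100] * 9 + [1100]
factor = simple_tmm_factor(sample, reference)
print(factor)
```

Here the sample library is twice the size of the reference only because of the single highly expressed gene. Multiplying the sample library size by the factor gives an effective library size equal to that of the reference, so the nine unchanged genes are no longer reported as under-represented.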
Furthermore, current RNA-Seq protocols usually
use random fragmentation of the RNA (or cDNA)
which implies that the expected count for a transcript
is proportional to the gene’s expression level
multiplied by its transcript length, as longer tran-
scripts generate more fragments. This means that
longer genes have higher transcript counts and so,
relative to shorter genes, are more likely to be
found to be differentially expressed, particularly if
the gene is also a lowly expressed one [67, 87].
Normalization methods which account for gene
length, for example reads per kilobase per million
mapped (RPKM) [42] have been developed.
However, the problem cannot be corrected by
simply dividing by the length of the transcript or
some modification of this, as while this results in
an unbiased measure of expression, the data variance
is still affected in a length dependent manner. This
problem has been observed for a variety of different
analysis methods, experimental designs and sequen-
cing platforms. The bias causes most problems when
the results for different genes are compared, when
creating ranked gene lists, or gene category
over-representation analysis is undertaken; with a
proposed suggestion for the latter recently published
([87]; discussed further below).
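The RPKM calculation itself is simple arithmetic: the read count is divided by the transcript length in kilobases and by the number of mapped reads in millions. A minimal sketch, with hypothetical counts and lengths:

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Two hypothetical genes expressed at the same level in a 20 million read library
short_gene = rpkm(read_count=500, transcript_length_bp=1_000, total_mapped_reads=20_000_000)
long_gene = rpkm(read_count=5_000, transcript_length_bp=10_000, total_mapped_reads=20_000_000)
print(short_gene, long_gene)
```

Both genes receive the same RPKM value, but the estimate for the long gene rests on ten times as many reads. Its sampling variance is therefore smaller and a test has more power to call it differentially expressed, which is the length-dependent effect on variance described above.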
This bias is elevated by the fact that current tech-
nologies require both amplification and fragmenta-
tion steps for mRNA/cDNA species used in the
analyses [72]. Emerging technologies such as single
molecule real-time DNA sequencing (SMRT)
[62, 88] and long-read sequencing such as nanopore
DNA sequencing [89, 90] and direct-read genetic
sequencing using Transmission Electron Microscope
(http://www.zsgenetics.com) show promise to overcome
this new system-based bias.
Systematic biases in the bases sequenced and
sequencing errors also need to be considered.
These result from the combined effects of the manu-
facturer recommended laboratory methods, se-
quence read alignment tools and base calling
algorithms utilized. Recent advances such as the ability
to obtain longer reads and paired-end sequencing
alleviate these issues; however, further optimizations
are desirable [64, 72].
Available software
The tools available for RNA-Seq derived transcriptomic
data analyses are not yet as mature as those for
analog data. However, in this fast moving science
application, software developers are rapidly closing
the gap. Recently, several new software packages
and modules for existing tools have been released.
One of the most well known and established bio-
informatic companies, DNASTAR, has recently
released version 3 of its gene expression analysis
software, ArrayStar (a beta version of ArrayStar 4 is
also available). An optional module, QSeq (http://
www.dnastar.com/products/QSeq.php), has been
developed specifically for analysis of RNA-Seq
gene expression data. Data sets from the most
widely used platforms such as Illumina and Roche
454 can be imported directly. The range of sup-
ported analyses includes transcript discovery and
mapping, detection of alternative splicing events
and transcriptome quantification. Similarly, CLCbio
(http://www.clcbio.com) has released its CLC
Genomics Workbench software suite which integrates
genomics (genome de novo and re-sequencing),
transcriptomics (RNA-Seq gene expression) and epi-
genomics (ChIP-seq analysis) in one environment.
GenomeQuest takes a decentralized approach by
offering a web-based service. Its RNA-Seq work-
flow (http://wiki.genomequest.com/index.php/
RNA_Seq) accesses the databases of transcriptomes
and genomes while being able to utilize third-party
tools such as GeneSpring and Spotfire for further
analyses.
Open source software also plays an important part
in the analysis of RNA-Seq data as it is able to adapt
quickly to changes in technology, and is not slowed
by the need to wait for official release dates like its
commercial counterparts. With a large Bioconductor
community developing much of this software using
the R programming language ([91, 92], http://
www.bioconductor.org), such software is usually
of similar quality to that of commercial software or
may even surpass it. The DEGseq package [93] for
analysing RNA-Seq data and edgeR package
[82, 86], both from the Bioconductor suite are two
examples of open source software that are freely
available for use in analysing RNA-Seq data.
Additionally, Cufflinks (http://cufflinks.cbcb.umd.edu/;
[94]) is an open source program which can
be used to assemble transcripts, estimate their abun-
dances, and test for differential expression and regu-
lation in RNA-Seq samples. Cufflinks is particularly
useful for researchers who are interested in alterna-
tive transcript or splice variants as it can identify novel
transcripts and probabilistically assign reads to iso-
forms without the need for prior gene annota-
tion knowledge. Finally, the ShortRead R package
and FASTX-Toolkit (http://hannonlab.cshl.edu/
fastx_toolkit/) are two freely available packages
which enable quality control of short read
RNA-Seq data.
FUNCTIONAL ANALYSIS OF TRANSCRIPTOMIC DATA
Classification concepts
For both analog and RNA-Seq data, gene filtering
methods aim to find a list of differentially expressed
genes that are significantly associated with the
phenotype studied. Tens to hundreds of genes, or
an entire gene network, may be the causal link to
a specific phenotype in response to a particular
stimulus. Targeting networks which affect a given
phenotype is likely to require the identification of
genes that serve as key nodes in the network (key
information points). There may also be interest in
the response across multiple species (comparative
genomics).
How variations in gene expression relate to func-
tional changes in an organism is a question of key
biological interest. Gene category over-representa-
tion analysis is a widely used method which helps
determine which biological classes (functional
groups) are significantly overrepresented in a gene
list. The analysis comprises grouping genes into
classes by some biological property (commonly
Gene Ontology (GO) categories, although alternatives
such as Kyoto Encyclopedia of Genes
and Genomes (KEGG) pathways are possible) and testing
whether differentially expressed genes are over-represented
in any categories [87]. This information
combined with knowledge about which pathways
the genes are found in, if available, can result in a
powerful analysis and deepen the biological under-
standing of the gene–organism relationship.
Applications of gene classification
Many tools are available for gene category
over-representation analysis, including GOstats
[95], FUNC [96], EASE [97] and DAVID [98].
More comprehensive summaries are given at:
http://www.geneontology.org/GO.tools.microarray.shtml
and in the Supplementary Data S1 of
Huang et al. [98]. Tools for specialized purposes
also exist, for example AgriGO, the successor of
EasyGO ([99]; http://bioinfo.cau.edu.cn/agriGO/),
is a web-based tool which is especially useful for
agricultural studies as it supports Affymetrix
GeneChips for both crops and farm animals and pro-
vides excellent capabilities for visualization of the
results. The general assumption underlying the
methodology for each tool is that, under the null
hypothesis, each gene has an equal probability of
being detected as differentially expressed, hence the
number of genes associated with a category that
overlap with the set of differentially expressed
genes follows a hypergeometric distribution [87].
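The hypergeometric calculation underlying these tools can be sketched directly. Given a universe of N genes, a category containing K of them and n differentially expressed genes of which k fall in the category, the over-representation P-value is the probability of observing k or more category members by chance. The function name and the numbers are hypothetical, and individual tools differ in details such as the choice of universe.

```python
from math import comb


def hypergeom_upper_tail(universe, category_size, de_total, de_in_category):
    """P(X >= de_in_category) when de_total genes are drawn without replacement."""
    upper = min(category_size, de_total)
    favourable = sum(
        comb(category_size, k) * comb(universe - category_size, de_total - k)
        for k in range(de_in_category, upper + 1)
    )
    return favourable / comb(universe, de_total)


# Hypothetical example: 12 of 300 differentially expressed genes fall in a
# category of 150 genes, out of a universe of 15 000 genes
p_value = hypergeom_upper_tail(15_000, 150, 300, 12)
print(f"P = {p_value:.3g}")
```

The calculation treats every gene as equally likely to be drawn, which is the assumption stated above.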
Standard gene category over-representation ana-
lysis for RNA-Seq data has the problem that genes
with longer transcripts are more likely to be differ-
entially expressed ([67, 87], see above) except for
protocols which result in only a single transcript
per gene. Thus any category with a preponderance
of long genes will be more likely to be determined as
over-represented than a category with shorter genes.
Young et al. [87] have recently published a method
which corrects for this selection bias: the likelihood
of differential expression as a function of transcript
length is first quantified and then incorporated in the
statistical test of each category’s significance either by
using this information in a weighted resampling pro-
cedure or to calculate success and failure probabilities
for the Wallenius non-central hypergeometric distri-
bution. Similar results are obtained using either
method; however the latter is considerably less com-
putationally intense. Adjusting the results in this way
compared to a standard GO analysis was found to
have a substantial effect (~20% of significant GO
categories changed) on the results for a prostate
cancer data set, with the adjusted results being
more consistent with previous biological results.
Additionally, this adjustment may be useful for
both analog and RNA-Seq gene expression data
with respect to more highly expressed genes or
those with multiple probes for some genes, as both
these factors also increase the probability of a gene
being called differentially expressed [87].
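The weighted resampling idea can be sketched as follows. Each gene is given a weight representing its probability of being called differentially expressed given its length, random gene sets of the same size as the observed list are drawn without replacement according to these weights, and the P-value is the fraction of draws containing at least as many category members as observed. This is a loose illustration and not the implementation of Young et al. [87]: the weights are supplied by hand rather than fitted, and the sampling scheme (random keys u^(1/w), keeping the largest) is an assumption chosen for brevity.

```python
import random


def weighted_sample(items, weights, k, rng):
    """Weighted sampling without replacement using random keys u ** (1 / w)."""
    keyed = [(rng.random() ** (1.0 / w), item) for item, w in zip(items, weights)]
    keyed.sort(reverse=True)
    return [item for _, item in keyed[:k]]


def resampling_p_value(genes, weights, category, n_de, observed,
                       n_resamples=2000, seed=1):
    """Fraction of weighted resamples with >= observed category members."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        drawn = weighted_sample(genes, weights, n_de, rng)
        if sum(gene in category for gene in drawn) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


# Hypothetical universe: 10 long genes (all in the category) and 30 short genes
genes = [f"long{i}" for i in range(10)] + [f"short{i}" for i in range(30)]
category = set(genes[:10])
length_weights = [0.4] * 10 + [0.1] * 30  # long genes are called DE more often
uniform_weights = [1.0] * 40

p_adjusted = resampling_p_value(genes, length_weights, category, n_de=8, observed=5)
p_standard = resampling_p_value(genes, uniform_weights, category, n_de=8, observed=5)
print(p_standard, p_adjusted)
```

With uniform weights the category looks strongly over-represented, whereas allowing for the higher detection probability of long genes makes the same observation unremarkable.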
Finally, consolidating multiple probes that map to
the same gene into a single count, and determining
which genes to include in the ‘universe’ for an ana-
lysis are issues already considered for microarray data
[95] and are equally important for RNA-Seq data in
gene category over-representation analysis.
Applications of pathway interaction networks
Pathway analysis is also a useful tool for both analog
and RNA-Seq data, as it allows the identification of
nodes that are central to interactions between differ-
entially expressed genes. Ingenuity Pathway Analysis
software (IPA, Ingenuity Systems, Inc., Redwood
City, CA, USA; www.ingenuity.com) is a valuable
package for determining biological networks for
mammalian data. Although based on human, rat
and mouse data, because of species homologies, it
is useful for mammalian studies in general [16, 25].
The pathway information in IPA is extracted from
the scientific literature. Used in combination with
gene ontology enrichment, pathway enrichment
analysis, network construction and comparison ana-
lysis it can lead to novel biological insights. To use
IPA, the full data set from an analysis (including gene
identifications (e.g. GenBank), fold changes and
FDR or P-values) is uploaded into IPA. The IPA
library of canonical pathways identifies those path-
ways that are the most significant to the set of dif-
ferentially expressed genes, as defined using selected
fold change and FDR (or P-value) cut-offs. The sig-
nificance of the association between this set of dif-
ferentially expressed genes and a specific canonical
pathway is estimated in two ways: (i) the proportion
of genes in the data set included in the canonical
pathway and (ii) Fisher’s exact test which is used to
calculate a P-value determining the probability of
the association between the data set and the canon-
ical pathway.
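The two measures can be illustrated generically. The sketch below is not IPA's implementation: it selects differentially expressed genes using fold change and FDR cut-offs, computes the ratio as the number of selected genes in the pathway divided by the pathway size (one reading of measure (i)), and computes a right-tailed Fisher's exact test P-value from the corresponding 2 x 2 table. All gene names, values and cut-offs are hypothetical.

```python
from math import comb


def select_de_genes(results, min_abs_log2_fc=1.0, max_fdr=0.05):
    """Genes passing both the fold change and the FDR cut-off."""
    return {
        gene
        for gene, (log2_fc, fdr) in results.items()
        if abs(log2_fc) >= min_abs_log2_fc and fdr <= max_fdr
    }


def pathway_association(de_genes, pathway_genes, universe):
    """Return (ratio, right-tailed Fisher's exact P-value) for one pathway."""
    overlap = len(de_genes & pathway_genes)
    n, k, m = len(universe), len(pathway_genes), len(de_genes)
    tail = sum(
        comb(k, x) * comb(n - k, m - x) for x in range(overlap, min(k, m) + 1)
    )
    return overlap / k, tail / comb(n, m)


# Hypothetical results: gene -> (log2 fold change, FDR)
results = {
    "G1": (2.1, 0.001), "G2": (-1.6, 0.010), "G3": (1.2, 0.030), "G4": (0.3, 0.400),
    "G5": (-0.2, 0.700), "G6": (1.8, 0.200), "G7": (0.1, 0.900), "G8": (-0.4, 0.600),
}
de_genes = select_de_genes(results)
ratio, p_value = pathway_association(de_genes, {"G1", "G2", "G4"}, set(results))
print(sorted(de_genes), ratio, p_value)
```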
The IPA software was designed for microarray
data. However, assuming that RNA-Seq gene ex-
pression data can be successfully mapped, it should
be feasible to also use IPA for this type of data.
However, it is important to note that the aforemen-
tioned problem of longer genes having a greater
probability of being differentially expressed is likely
to also be an issue in IPA analyses. We are not aware
of a publication that currently provides a solution to
this problem. It seems likely that a similar approach
to that used to correct the problem for gene category
over-representation analysis may be able to be used.
An interesting alternative to IPA can be found in
Cytoscape (http://www.cytoscape.org/), an open
source software platform for visualizing and integrat-
ing networks, biological pathways, annotation and
gene expression profiles. Cytoscape’s modular
design means that community-based solutions can
be easily incorporated via plugins, meaning that
new features are often more readily available than
with commercial applications. For example, the
Genoscape plugin integrates data from GenoScript
(a transcriptome database) with the KEGG database
to highlight gene expression changes and their re-
spective statistical significances. While Cytoscape was
originally designed for biologists, more recent ver-
sions have expanded its functionality to a general
platform for network analyses. This will potentially
facilitate the development of novel plugins with a
synergy effect beyond their original purpose.
Finally, network language can also be used to de-
scribe pair-wise relationships among genes and to
cluster genes with similar expression patterns into
pathways or regulatory networks, which can be de-
picted in heat maps. A heat map is a commonly used
tool to visualize data generated from microarrays,
and potentially RNA-Seq data, reflecting the level
of expression of many genes across a number of com-
parable samples (e.g. different physiological status,
different breeds, etc.) in a graphical representation
where the changes in values of the chosen variable
are represented as colours in a two-dimensional map
[100, 101].
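Pair-wise relationships among genes are often summarized with a correlation matrix, which is also the usual starting point for the clustering shown alongside a heat map. A minimal sketch using Pearson correlation on hypothetical expression profiles (four samples per gene):

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical log-expression values across four comparable samples
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
genes = sorted(expression)
similarity = {
    (g, h): pearson(expression[g], expression[h]) for g in genes for h in genes
}
for g in genes:
    print(g, [round(similarity[(g, h)], 2) for h in genes])
```

Genes with highly correlated profiles (geneA and geneB here) would be placed next to each other in a clustered heat map, while geneC, with the opposite pattern, would fall in a separate cluster.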
CONCLUSION
The analysis of gene expression has evolved from the
investigation of individual genes, through the analysis of
thousands of genes, to now the measurement of
potentially all genes in a sample. While these tech-
nologies promise a much deeper understanding of
the intricate relationships between gene expression
and internal and external stimuli, the rapidly increas-
ing amount of data to be analysed and to be put in
context creates veritable challenges to biologists and
bioinformaticians. As the cost of storing data be-
comes prohibitive, concepts such as cloud comput-
ing may become critical for the success of future
RNA-Seq experiments. Integrating the data from
both microarray and RNA-Seq experiments with
other ‘omics’ data sets opens up new possibilities for
creating meaningful informational networks which
will aid our understanding of biological systems.
SUPPLEMENTARY DATA
Supplementary data are available online at http://bfgp.oxfordjournals.org/.
Key Points
• DNA and oligonucleotide microarray (chip) technology is a high-throughput analog method that has become a standard tool for the analysis of genome-wide expression patterns, whether there is a sequenced genome or not, to establish gene networks and identify new genes involved in a phenotype.
• As microarrays are an analog technology, they have certain limitations. For example, they rely on hybridization which affects their ability to detect low abundance genes or distinguish alternative forms. Also, the knowledge obtained is restricted to the tiled genes. However, the lower cost and established protocols of microarray technology mean that it currently remains a viable option.
• RNA-Seq is an emerging method for fully quantitative transcriptomic analysis (i.e. transcripts are counted) and has the potential to overcome the limitations of microarray technology, eventually replacing these analog methods. It is clear that RNA-Seq may be the new gold standard. However, the data volume is increased by several orders of magnitude, while the tools available for data analyses are not yet as mature as those used for microarray analyses.
• Changes in technological platforms often require the development of naive software and analysis applications and care must be taken when applying algorithms developed for different underlying principles.
• Gene category over-representation analysis and pathway analysis are useful tools for analysing gene expression data from microarrays or RNA-Seq and deepen the understanding of the gene–organism relationship.
References
1. Collins FS, Lander ES, Rogers J, et al. Finishing the euchromatic sequence of the human genome. Nature 2004;431:931–45.
2. Burt DW. The cattle genome reveals its secrets. J Biol 2009;8:36.
3. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005;102:15545–50.
4. Venter JC, Adams MD, Myers EW, et al. The sequence of the human genome. Science 2001;291:1304–51.
5. Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature 2001;409:860–921.
6. Lunshof JE, Bobe J, Aach J, et al. Personal genomes in progress: from the human genome project to the personal genome project. Dialogues Clin Neurosci 2010;12:47–60.
7. Collins FS, Morgan M, Patrinos A. The human genome project: lessons from large-scale biology. Science 2003;300:286–90.
8. de Vogel-van den Bosch HM, Bunger M, de Groot PJ, et al. PPARalpha-mediated effects of dietary lipids on intestinal barrier gene expression. BMC Genomics 2008;9:231.
9. Knoch B, Barnett MPG, McNabb WC, et al. Dietary arachidonic acid-mediated effects on colon inflammation using transcriptome analysis. Mol Nutr Food Res 2010;54:1–13.
10. Knoch B, Barnett MPG, Zhu S, et al. Genome-wide analysis of dietary eicosapentaenoic acid- and oleic acid-induced modulation of colon inflammation in interleukin-10 gene-deficient mice. J Nutrigenet Nutrigenomics 2009;2:9–28.
11. Langmann T, Moehle C, Mauerer R, et al. Loss of detoxification in inflammatory bowel disease: dysregulation of pregnane X receptor target genes. Gastroenterology 2004;127:26–40.
12. Rakhshandehroo M, Sanderson LM, Matilainen M, et al. Comprehensive analysis of PPARalpha-dependent regulation of hepatic lipid metabolism by expression profiling. PPAR Res 2007;2007:26839.
13. Rivera E, Flores I, Rivera E, et al. Molecular profiling of a rat model of colitis: validation of known inflammatory genes and identification of novel disease-associated targets. Inflamm Bowel Dis 2006;12:950–66.
14. Roy N, Barnett M, Knoch B, et al. Nutrigenomics applied to an animal model of inflammatory bowel diseases: transcriptomic analysis of the effects of eicosapentaenoic acid- and arachidonic acid-enriched diets. Mutat Res-Fund Mol M 2007;622:103–16.
15. Sanderson LM, de Groot PJ, Hooiveld GJEJ, et al. Effect of synthetic dietary triglycerides: a novel research paradigm for nutrigenomics. PLoS ONE 2008;3:e1681.
16. Bonnet A, La Cao KA, SanCristobal M, et al. In vivo gene expression in granulosa cells during pig terminal follicular development. Reproduction 2008;136:211–24.
17. Diez-Tascon C, Keane OM, Wilson T, et al. Microarray analysis of selection lines from outbred populations to identify genes involved with nematode parasite resistance in sheep. Physiol Genomics 2005;21:59–69.
18. Everts RE, Band MR, Liu ZL, et al. A 7872 cDNA microarray and its use in bovine functional genomics. Vet Immunol Immunopathol 2005;105:235–45.
19. Gunther J, Koczan D, Yang W, et al. Assessment of the immune capacity of mammary epithelial cells: comparison with mammary tissue after challenge with Escherichia coli. Vet Res 2009;40:31.
20. Keane OM, Zadissa A, Wilson T, et al. Gene expression profiling of Naive sheep genetically resistant and susceptible to gastrointestinal nematodes. BMC Genomics 2006;7:42.
21. Lehnert SA, Wang YH, Byrne KA. Development and application of a bovine cDNA microarray for expression profiling of muscle and adipose tissue. Aust J Exp Agr 2004;44:1127–33.
22. Smith GW, Rosa GJ. Interpretation of microarray data: trudging out of the abyss towards elucidation of biological significance. J Anim Sci 2007;85:E20–3.
23. Spielbauer B, Stahl F. Impact of microarray technology in nutrition and food research. Mol Nutr Food Res 2005;49:908–17.
24. Abdueva D, Wing MR, Schaub B, et al. Experimental comparison and evaluation of the Affymetrix exon and U133Plus2 GeneChip arrays. PLoS ONE 2007;2:e913.
25. Faucon F, Rebours E, Bevilacqua C, et al. Terminal differentiation of goat mammary tissue during pregnancy requires the expression of genes involved in immune functions. Physiol Genomics 2009;40:61–82.
26. Xie Y, Wang X, Story M. Statistical methods of background correction for Illumina BeadArray data. Bioinformatics 2009;25:751–7.
27. Bertone P, Gerstein M, Snyder M. Applications of DNA tiling arrays to experimental genome annotation and regulatory pathway discovery. Chromosome Res 2005;13:259–74.
28. Kechris K, Yang YH, Yeh RF. Prediction of alternatively skipped exons and splicing enhancers from exon junction arrays. BMC Genomics 2008;9:551.
29. Hacia JG, Fan JB, Ryder O, et al. Determination of ancestral alleles for human single-nucleotide polymorphisms using high-density oligonucleotide arrays. Nat Genetics 1999;22:164–7.
30. Wang DG, Fan JB, Siao CJ, et al. Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science 1998;280:1077–82.
31. Kijas JW, Townley D, Dalrymple BP, et al. A genome wide survey of SNP variation reveals the genetic structure of sheep breeds. PLoS ONE 2009;4:e4668.
32. Lindblad-Toh K, Tanenbaum DM, Daly MJ, et al. Loss-of-heterozygosity analysis of small-cell lung carcinomas using single-nucleotide polymorphism arrays. Nat Biotechnol 2000;18:1001–5.
33. Mei R, Galipeau PC, Prass C, et al. Genome-wide detection of allelic imbalance using human SNPs and high-density DNA arrays. Genome Res 2000;10:1126–37.
34. Matukumalli LK, Lawley CT, Schnabel RD, et al. Development and characterization of a high density SNP genotyping assay for cattle. PLoS ONE 2009;4:e5350.
35. McKay SD, Schnabel RD, Murdoch BM, et al. Whole genome linkage disequilibrium maps in cattle. BMC Genetics 2007;8:74.
36. Sargolzaei M, Schenkel FS, Jansen GB, et al. Extent of linkage disequilibrium in Holstein cattle in North America. J Dairy Sci 2008;91:2106–17.
37. Gautier L, Cope L, Bolstad BM, et al. Affy - Analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004;20:307–15.
38. Kapur K, Jiang H, Xing Y, et al. Cross-hybridization modeling on Affymetrix exon arrays. Bioinformatics 2008;24:2887–93.
39. Potter DP, Yan P, Huang TH, et al. Probe signal correction for differential methylation hybridization experiments. BMC Bioinformatics 2008;9:453.
40. Wu Z, Irizarry R, Gentleman R, et al. A model-based background adjustment for oligonucleotide expression arrays. J Am Stat Assoc 2004;99:909–17.
41. Asmann YW, Klee EW, Thompson EA, et al. 3′ tag digital gene expression profiling of human brain and universal reference RNA using Illumina Genome Analyzer. BMC Genomics 2009;10:531.
42. Mortazavi A, Williams BA, McCue K, et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008;5:621–8.
43. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2009;10:57–63.
44. Wang ET, Sandberg R, Luo S, et al. Alternative isoform regulation in human tissue transcriptomes. Nature 2008;456:470–6.
45. Allison DB, Cui X, Page GP, et al. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 2006;7:55–65.
46. Cui X, Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol 2003;4:210.
47. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc Ser B 1995;57:289–300.
48. Diaz-Uriarte R, Alvarez de Andres S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006;7:3.
49. Baird D, Johnstone P, Wilson T. Normalization of microarray data using a spatial mixed model analysis which includes splines. Bioinformatics 2004;20:3196–205.
50. Smyth GK. Limma: Linear models for microarray data. In: Gentleman R, Carey V, Dudoit S, et al (eds). Bioinformatics and Computational Biology Solutions Using R and Bioconductor. New York: Springer, 2005:397–420.
51. Wilson CL, Miller CJ. Simpleaffy: a BioConductor package for Affymetrix Quality Control and data analysis. Bioinformatics 2005;21:3683–5.
52. Du P, Kibbe WA, Lin SM. lumi: a pipeline for processing Illumina microarray. Bioinformatics 2008;24:1547–8.
53. Kauffmann A, Gentleman R, Huber W. arrayQualityMetrics - a bioconductor package for quality assessment of microarray data. Bioinformatics 2009;25:415.
54. Reich M, Liefeld T, Gould J, et al. GenePattern 2.0. Nat Genet 2006;38:500–1.
55. De Groot P, Reiff C, Mayer C, et al. NuGO contributions to GenePattern. Genes Nutr 2008;3:143–6.
56. Kuehn H, Liberzon A, Reich M, et al. Using GenePattern for gene expression analysis. Curr Protoc Bioinform 2008: Chapter 7, Unit 7.12.
57. Dallas PB, Gottardo NG, Firth MJ, et al. Gene expression levels assessed by oligonucleotide microarray analysis and quantitative real-time RT-PCR - How well do they correlate? BMC Genomics 2005;6:59.
58. Bustin SA, Benes V, Garson JA, et al. The MIQE guidelines: minimum information for publication of quantitative real-time PCR experiments. Clin Chem 2009;55:611–22.
59. Audic S, Claverie JM. The significance of digital gene expression profiles. Genome Res 1997;7:986–95.
60. Hyman ED. A new method of sequencing DNA. Anal Biochem 1988;174:423–36.
61. Seo TS, Bai X, Kim DH, et al. Four-color DNA sequencing by synthesis on a chip using photocleavable fluorescent nucleotides. Proc Natl Acad Sci USA 2005;102:5926–31.
62. Ozsolak F, Ting D, Wittner B, et al. Amplification-free digital gene expression profiling from minute cell quantities. Nat Methods 2010;7:619–21.
63. Morozova O, Hirst M, Marra MA. Applications of new sequencing technologies for transcriptome analysis. Annu Rev Genomics Hum Genet 2009;10:135–51.
64. Harismendy O, Ng PC, Strausberg RL, et al. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol 2009;10:R32.
65. Mardis ER. The impact of Next-Generation Sequencing technology on genetics. Trends Genet 2008;24:133–41.
66. Auer P, Doerge R. Statistical design and analysis of RNA sequencing data. Genetics 2010;185:405–16.
67. Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biol Direct 2009;4:14.
68. Parkhomchuk D, Borodina T, Amstislavskiy V, et al. Transcriptome analysis by strand-specific sequencing of complementary DNA. Nucleic Acids Res 2009;37:e123.
69. Vivancos A, Guell M, Dohm J, et al. Strand-specific deep sequencing of the transcriptome. Genome Res 2010;20:989–99.
70. Flintoft L. Transcriptomics: Digging deep with RNA-Seq. Nat Rev Genet 2008;9:568.
71. Marioni JC, Mason CE, Mane SM, et al. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 2008;18:1509–17.
72. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet 2010;11:31–46.
73. Pepke S, Wold B, Mortazavi A. Computation for ChIP-seq and RNA-seq studies. Nat Methods 2009;6:S22–32.
74. Stegle O, Drewe P, Bohnert R, et al. Statistical tests for detecting differential RNA-transcript expression from read counts. Nature Precedings, No. 713. (11 May 2010) doi:10.1038/npre.2010.4437.1.
75. Git A, Dvinge H, Salmon-Divon M, et al. Systematic comparison of microarray profiling, real-time PCR, and next-generation sequencing technologies for measuring differential microRNA expression. RNA 2010;16:991.
76. Willenbrock H, Salomon J, Søkilde R, et al. Quantitative miRNA expression analysis: comparing microarrays with next-generation sequencing. RNA 2009;15:2028.
77. Xu AG, He L, Li Z, et al. Intergenic and repeat transcription in human, chimpanzee and macaque brains measured by RNA-Seq. PLoS Comput Biol 2010;6:e1000843.
78. Tuch BB, Laborde RR, Xu X, et al. Tumor transcriptome sequencing reveals allelic expression imbalances associated with copy number alterations. PLoS ONE 2010;5:e9317.
79. Burgun A, Bodenreider O. Accessing and integrating data and knowledge for biomedical research. Yearb Med Inform 2008:91–101.
80. Stein LD. The case for cloud computing in genome informatics. Genome Biol 2010;11:207.
81. Baker M. Next-generation sequencing: adjusting to data overload. Nat Methods 2010;7:495–9.
82. Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics 2007;23:2881–7.
83. Trapnell C, Pachter L, Salzberg S. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 2009;25:1105–11.
84. Hammer P, Banck M, Amberg R, et al. mRNA-seq with agnostic splice site discovery for nervous system transcriptomics tested in chronic pain. Genome Res 2010;20:847–60.
85. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol 2010;11:R25.
86. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010;26:139–40.
87. Young MD, Wakefield MJ, Smyth GK, et al. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol 2010;11:R14.
88. Eid J, Fehr A, Gray J, et al. Real-time DNA sequencing from single polymerase molecules. Science 2009;323:133–8.
89. Clarke J, Wu HC, Jayasinghe L, et al. Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol 2009;4:265–70.
90. Deamer DW, Akeson M. Nanopores and nucleic acids: prospects for ultrarapid sequencing. Trends Biotechnol 2000;18:147–51.
91. R: A language and environment for statistical computing. http://www.r-project.org/ (10 January 2011, date last accessed).
92. Gentleman RC, Carey VJ, Bates DM, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004;5:R80.
93. Wang L, Feng Z, Wang X, et al. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics 2010;26:136–8.
94. Trapnell C, Williams B, Pertea G, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 2010;28:511–5.
95. Falcon S, Gentleman R. Using GOstats to test gene lists for GO term association. Bioinformatics 2007;23:257–8.
96. Prufer K, Muetzel B, Do HH, et al. FUNC: a package for detecting significant associations between gene sets and ontological annotations. BMC Bioinformatics 2007;8:41.
97. Hosack DA, Dennis G, Jr, Sherman BT, et al. Identifying biological themes within lists of genes with EASE. Genome Biol 2003;4:R70.
98. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protocols 2009;4:44–57.
99. Zhou X, Su Z. EasyGO: gene ontology-based annotation and functional enrichment analysis tool for agronomical species. BMC Genomics 2007;8:246.
100. Eisen MB, Spellman PT, Brown PO, et al. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998;95:14863–8.
101. Wilkinson L, Friendly M. The history of the cluster heat map. Am Stat 2009;63:179–84.