

A comparison of analog and Next-Generation transcriptomic tools for mammalian studies

Nicole C. Roy, Eric Altermann, Zaneta A. Park and Warren C. McNabb

Abstract
This review focuses on tools for studying a cell’s transcriptome, the collection of all RNA transcripts produced at a specific time, and the tools available for determining how these changes in gene expression relate to the functional changes in an organism. While the microarray-based (analog) gene-expression profiling technology has dominated the ‘omics’ era, Next-Generation Sequencing based gene-expression profiling (RNA-Seq) is likely to replace this analog technology in the future. RNA-Seq shows much promise for transcriptomic studies as the genes of interest do not have to be known a priori, new classes of RNA, SNPs and alternative splice variants can be detected, and it is also theoretically possible to detect transcripts from all biologically relevant abundance classes. However, the technology also brings with it new issues to resolve: the specific technical properties of RNA-Seq data differ from those of analog data, leading to novel systematic biases which must be accounted for when analysing this type of data. Additionally, multireads and splice junctions can cause problems when mapping the sequences back to a genome, and concepts such as cloud computing may be required because of the massive amounts of data generated.

Keywords: transcriptomics; microarray; Next-Generation sequencing; RNA-Seq

Nicole C. Roy is the Team Leader of Food Nutrition Genomics, Agri-Foods and Health Section at AgResearch, Palmerston North, New Zealand, an Associate Investigator and an Adjunct Senior Lecturer at the Riddet Institute, Massey University, Palmerston North, New Zealand and a member of Nutrigenomics New Zealand. Current research is focused on factors which affect the nutrient–gene interactions (nutrigenomics, food–microbe–host interactions, intestinal barrier function) that regulate the supply of nutrients to tissues.

Eric Altermann is a Senior Research Scientist in Rumen Microbial Genomics in the Rumen, Nutrition and Microbiology Section at AgResearch, New Zealand and an Associate Investigator at the Riddet Institute, Massey University, Palmerston North, New Zealand. Current main research is focused on applied bioinformatic analyses of prokaryotes and development and application of customised bioinformatic algorithms to identify key genetic elements within microbial genetic blueprints.

Zaneta A. Park is a Bioinformatician in the Bioinformatics, Mathematics and Statistics Group at AgResearch Grasslands, Palmerston North, New Zealand and a member of Nutrigenomics New Zealand. Current research focuses on transcriptomic analyses of mammalian and other tissues.

Warren C. McNabb is Science and Technology General Manager of the Food and Textiles Group at AgResearch, Palmerston North, New Zealand, a Professor of Nutrition at the Riddet Institute, Massey University, Palmerston North, New Zealand and a member of Nutrigenomics New Zealand. Current research focuses on factors which affect the nutrient–gene interactions that ultimately regulate the supply of nutrients to tissues.

Corresponding author. Nicole Roy, Team Leader, Food Nutrition Genomics, Agri-Foods and Health Section, AgResearch Grasslands, Tennent Drive, Private Bag 11008, Palmerston North 4442, New Zealand. Tel.: +64-6-351-8110; Fax: +64-6-351-8003; E-mail: [email protected]

BRIEFINGS IN FUNCTIONAL GENOMICS. page 1 of 16. doi:10.1093/bfgp/elr005. Advance Access published March 9, 2011.
© The Author 2011. Published by Oxford University Press. All rights reserved. For permissions, please email: [email protected]
Downloaded from http://bfgp.oxfordjournals.org/ by guest on December 4, 2014

INTRODUCTION
Understanding biological systems that comprise large numbers of genes (approximately 20 000–25 000 protein-coding sequences in humans [1] and at least 22 000 in cattle, broadly similar to gene counts in other mammals [2]) is challenging. Fortunately, the tools available for studying a cell’s transcriptome and for transforming the large volume of data that these techniques generate into knowledge and new hypotheses have improved over recent years. Traditional methods for gene expression analysis, such as northern blotting, quantitative real-time polymerase chain reaction (qRT–PCR) or differential display, require the pre-selection of single genes. These methods have been and are still useful, but they miss important effects in biological processes, such as metabolic and signaling pathways and transcriptional networks across several pathways [3], because they are confined to the analysis of single genes or a very limited number of selected genes of interest in a few samples. The advent of high-throughput DNA sequencing and the subsequent development of analog gene expression techniques such as microarrays represented a critical breakthrough, as the simultaneous measurement of the expression of many thousands of genes in a sample was finally possible. The recent development of Next-Generation Sequencing and its use in transcriptomic analysis (RNA-Seq) now potentially enables the quantitative measurement of ‘all’ genes expressed in a sample.

The completion of the Human Genome Project in 2003 [4, 5] and of the bovine genome (first draft in 2006 and publication in 2009; http://www.hgsc.bcm.tmc.edu/project-species-m-Bovine.hgsc), followed by other initiatives for porcine, ovine and other species (see http://www.genomesonline.org/ for a more complete list), and the subsequent availability of complete genomic DNA sequence data are only a few reasons why these newer technologies have advanced so quickly. Initiatives that take advantage of these technologies have been established (see Table 1 for examples). They span disciplines as diverse as nutrition, medicine, parasitology, molecular biology, chemistry, mathematics and bioinformatics in human and livestock research. These programmes are leading recent advances in human and animal genomics [6, 7].

This review presents an overview of currently existing and emerging transcriptomic technologies, including their advantages and limitations, and data management and analysis strategies and concerns. Available software solutions and the integration of different data sets are also discussed.

ANALOG METHODS FOR TRANSCRIPTOMIC ANALYSIS
Background
The measurement of mRNA levels for a complete set of transcripts in a single assay can be accomplished using DNA and oligonucleotide microarray (chip) technology. This analog high-throughput methodology has become a standard tool for gene expression profiling. It facilitates the analysis of genome-wide expression patterns, whether or not a sequenced genome is available (although the sequences of the probes on an array are usually known), to establish gene networks and identify new genes involved in a phenotype.

Microarrays have been used to study many different organisms. Their use includes nutritional studies [3, 8–15] and livestock research [16–21] where they are used to characterize metabolic pathways in tissues or cells important to phenotypic outcomes. For example, in cattle, they have been used to investigate folliculogenesis, ovulation and oocyte quality, and early embryonic development [22].

Array types
Hybridization arrays include macroarrays (spot diameter of each probe >300 µm) and microarrays (spot diameter of each probe <250 µm). Microarrays, the most commonly used of the hybridization arrays, usually consist of a predefined arrangement of a large number of probe (DNA) sequences (whole genome or partial) immobilized on a solid surface that serve as a hybridization substrate for cRNA or cDNA fragments generated from a tissue or cell sample (target). RNA extracted from a tissue or cell sample is amplified and reverse transcribed to produce cDNA labelled with a fluorophore or radiolabel which is hybridized to the microarray chip under stringent conditions. After hybridization is completed, the microarray slides are washed then scanned using a microarray scanner to detect the intensities of the target signal and the background noise [23].

Table 1: Examples of worldwide genomic initiatives

Initiative | Scope | Website | Country
Nutrigenomics organization | Nutrigenomics | www.nugo.org | Europe
Nutrigenomics New Zealand | Nutrigenomics | www.nutrigenomics.org.nz | New Zealand
Fugato | Bovine, porcine, ovine and equine | www.fugato-forschung.de | Germany
Milk genomics and human health | Mammalian milk genomics | www.milkgenomics.org | USA
International sheep genomics consortium | Ovine | https://isgcdata.agresearch.co.nz/ | International
The bovine genome sequencing and analysis consortium | Bovine | http://www.bcm.edu/news/packages/bovinegenome.cfm | International

While in-house microarray production still exists (especially for incomplete, non-sequenced or confidential genomes), several commercial platforms offer microarray products of high quality. There are many variants of this technology that can be grouped into two categories based on the length and origin of the spotted DNA: (i) the DNA microarray, where PCR amplicons of a few hundred base pairs of denatured double-stranded DNA are spotted onto glass slides; and (ii) the oligonucleotide microarray, where chemically synthesized single-stranded DNA is immobilized on glass slides. Examples of the latter are the Agilent platform, where probes of 40–70 nt in length are tiled, and the Affymetrix platform, where 25-mers are tiled, with typically 11–16 25-mers tiled for each gene. For the Affymetrix platform, a ‘mismatch’ probe is also tiled for each probe (i.e. the middle nucleotide is replaced with another) to allow adjustment of the data for cross-hybridization, therefore increasing the reliability of the microarray data although at the expense of capacity. More recently, Affymetrix introduced all-exon arrays (HuEx arrays, i.e. Human, Mouse and Rat Exon/Gene 1.0 ST arrays) which differ significantly from the traditional 3′-expression arrays described above. Here, exons are covered by only four probes, and the T7-linked random hexamers used for cDNA synthesis eliminate the need for intact poly-A tails. Studies comparing 3′-expression arrays to HuEx arrays reported a high level of cross-platform comparability with only a limited number of recognized problems, such as differences in detection thresholds [24]. Numerous mammalian DNA and oligonucleotide microarrays are available, including for humans, livestock species (e.g. cattle, pig, horse, sheep and chickens) and other mammals (e.g. canine, rabbit, Rhesus macaque, etc.). For species where microarrays are less readily available (e.g. goat), cross-species hybridization to existing arrays is possible though not ideal [22, 25]. Custom microarrays can also be used to tile a chosen set of genes of interest.
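The idea of adjusting perfect-match intensities by their paired mismatch probes can be illustrated with a minimal sketch. This is a deliberately simplified illustration and not the algorithm used by any vendor or Bioconductor package: the function name `probe_set_signal`, the floor value and the intensities are all hypothetical.

```python
from statistics import mean

def probe_set_signal(pm, mm, floor=1.0):
    """Average cross-hybridization-adjusted intensity for one probe set.

    pm and mm hold perfect-match and mismatch intensities for the same
    probe pairs. Each difference is floored (hypothetical choice) so that a
    mismatch probe brighter than its partner cannot yield a negative signal.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    adjusted = [max(p - m, floor) for p, m in zip(pm, mm)]
    return mean(adjusted)

# Illustrative intensities for four probe pairs of one gene (made-up values).
pm = [1200.0, 950.0, 1100.0, 400.0]
mm = [200.0, 150.0, 300.0, 450.0]
print(probe_set_signal(pm, mm))
```

The last probe pair shows why a floor is needed in this sketch: its mismatch probe is brighter than the perfect-match probe, which would otherwise contribute a negative value to the average.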

Additionally, a third category of microarray exists, the beadarray (Illumina platform), which consists of thousands of three-micron silica beads each coated with hundreds of thousands of copies of a specific oligonucleotide sequence, which self-allocate randomly across an array. A decoding process is used to determine which bead occupied each well. Beadarrays typically comprise multiple copies (around 30) of each bead type per array. The beads are combined with more than 1000 control bead types on the arrays (used as negative controls) which, with the random allocation of multiple copies of each bead type, result in high quality data at relatively low cost and sample input [26].

Variations on the microarray theme [e.g. exon junction arrays, tiling arrays, fusion chips and single nucleotide polymorphism (SNP) chips] allow microarray technology to be used for more than gene expression profiling. Exon junction, tiling and fusion arrays allow detection of alternative splice variants of a (fused) gene [27, 28]. SNP chips allow the identification of SNPs within and between populations [29, 30], enabling uses such as assessing levels of genetic variability including loss of heterozygosity [31, 32], detection of allelic imbalance [33] and construction of linkage disequilibrium maps which facilitate the association of genetic variation with economically important traits, for example as developed and characterized in bovine [34–36]. The recent development of an ovine SNP chip containing 1536 SNPs represents the first time that the sheep genome has been assayed on a genome-wide basis. Using the allele frequencies at each SNP, calculation of genetic parameters (e.g. genetic distance) has allowed the levels of genetic variability both within and between a diverse group of ovine populations to be determined. This, in combination with cluster analysis, has shown that sheep are characterized by weak phylogeographic structure, overlapping genetic similarity and generally low differentiation, which is consistent with their short evolutionary history [31].
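As an illustration of how allele frequencies at each SNP can be turned into a genetic distance, the sketch below computes Nei's standard genetic distance for biallelic markers. The choice of Nei's measure is an assumption made here for illustration; the text does not state which distance was used in [31], and the function names and genotype coding are hypothetical.

```python
import math

def allele_frequency(genotypes):
    """Frequency of one allele from genotypes coded as 0, 1 or 2 copies."""
    return sum(genotypes) / (2 * len(genotypes))

def nei_distance(freqs_x, freqs_y):
    """Nei's standard genetic distance between two populations.

    freqs_x and freqs_y give, SNP by SNP, the frequency of the same allele
    in each population (biallelic markers assumed). The distance is
    undefined when the populations share no alleles at any SNP.
    """
    jx = jy = jxy = 0.0
    for p, q in zip(freqs_x, freqs_y):
        jx += p * p + (1 - p) ** 2
        jy += q * q + (1 - q) ** 2
        jxy += p * q + (1 - p) * (1 - q)
    identity = jxy / math.sqrt(jx * jy)
    return -math.log(identity)

# Made-up genotypes at two SNPs for two small populations.
pop_a = [allele_frequency([2, 2, 1, 2]), allele_frequency([0, 1, 0, 0])]
pop_b = [allele_frequency([1, 0, 1, 0]), allele_frequency([1, 2, 1, 1])]
print(nei_distance(pop_a, pop_b))
```

Populations with identical allele frequencies give a distance of zero, and the value grows as the frequencies diverge, which is the kind of quantity a subsequent cluster analysis would operate on.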

Microarray limitations
Microarrays are now an affordable technique that provides RNA expression pattern data based on a high-throughput and semi-quantitative analysis of light signal intensity. They do, however, have their limitations [37]. The data need to be normalized to remove spatial artefacts and systematic biases, and appropriate statistical analysis must be used to reduce the number of false positives obtained from testing so many genes at once. Furthermore, as the technique relies on hybridization, it brings a range of related potential problems such as background hybridization levels (including cross-hybridization), differential probe hybridization properties and dye binding variances [37–40]. These variables mean that microarrays do not easily quantify the expression pattern of low abundance transcripts, as the low intensity fluorescence signal is difficult to distinguish numerically and statistically from background noise. Conversely, signal saturation can occur at high intensities, limiting the ability to compare the level of expression of transcripts which are expressed at very high levels [41–43]. The information generated with hybridization arrays is also limited to the number of probes on the microarray slide and usually to genes with known sequence. Microarrays are also constrained in their ability to detect splice variants, either because not all the forms are tiled or because they are too closely related for hybridization methods to distinguish [42, 44].

ANALOG-DERIVED TRANSCRIPTOMIC DATA
Analysis methods
Determining the biologically significant changes in gene expression levels from a large amount of gene data is still a challenge in microarray-based transcriptomics.

Analysis of microarrays typically comprises quality control and normalization, followed by determination of a list of differentially expressed genes based on fold change and some type of significance criterion (the most commonly used parameter is the P-value). The latter is usually calculated from a t-test, with the recommendation that the variance estimate used should be determined using both gene-specific information and information from across all genes [22, 45, 46]. Volcano plots are an effective way of summarizing the results for the two criteria [46]. Additionally, adjustments for multiple testing are usually applied to control the number of false positives, with the false discovery rate (FDR) [47] a popular choice. Mixture model methods, which treat genes as being composed of two populations (one differentially expressed and one not), are also an option [45].
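The two selection criteria and the multiple-testing adjustment described above can be sketched as follows. The sketch assumes the per-gene P-values and log2 fold changes have already been computed elsewhere (for example by a moderated t-test), applies the Benjamini-Hochberg step-up adjustment as one common FDR procedure, and uses hypothetical gene names, values and cut-offs.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_genes(genes, log2_fc, p_values, fc_cutoff=1.0, fdr_cutoff=0.05):
    """Keep genes passing both the fold-change and the FDR criterion."""
    adjusted = benjamini_hochberg(p_values)
    return [g for g, fc, q in zip(genes, log2_fc, adjusted)
            if abs(fc) >= fc_cutoff and q <= fdr_cutoff]

# Hypothetical results for four genes.
genes = ["geneA", "geneB", "geneC", "geneD"]
log2_fc = [2.1, -1.5, 0.3, 1.8]
p_values = [0.0005, 0.004, 0.03, 0.6]
print(select_genes(genes, log2_fc, p_values))
```

In this made-up example geneC fails the fold-change cut-off and geneD fails the FDR cut-off, which corresponds to the two outer regions a volcano plot separates from the centre and bottom of the plot.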

An alternative analysis approach involves directly determining sets of genes which can be used to successfully differentiate samples from different treatments. The random forest procedure has been shown to be a useful method for doing this as it yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy [48].

Available software
Behind these analysis tools are sophisticated mathematical and statistical models which have been implemented in a variety of open-source and commercial software packages. Some of the most well known packages available include GeneSpring (http://www.chem.agilent.com/en-US/products/software/lifesciencesinformatics/genespringgx/Pages/default.aspx), GenStat (http://www.vsni.co.uk/software/genstat/, [49]), Spotfire and the Bioconductor suite of microarray analysis packages written in the R programming language (www.bioconductor.org, freely available). The latter includes linear models for microarray analysis (limma, [50]), affy [37] and simpleaffy [51] for Affymetrix data analysis, lumi [52] for Illumina beadarray analysis and arrayQualityMetrics [53] for quality control.

Most analysis programs accept data from a variety of sources (e.g. Affymetrix, Agilent, Illumina and in-house data). The Bioconductor suite of packages has the advantage that it is open source and readily accessible, thus facilitating collaborative projects. Many software packages can also be further customized as they include application programming interfaces which allow them to interact directly with other tools. For example, GeneSpring and Spotfire both include modules which allow R code to be used within the package while providing easy interactive visualization of the data and results. A more comprehensive summary of many packages and their interactivity can be found in Supplementary Table S1 of [54].

Interfaces to a range of underlying tools have also been developed. These include Chipster (http://nami.csc.fi/features.shtml, freely available for local installations or alternatively the CSC server can be used for a fee), which allows one to perform DNA microarray data analysis with R/Bioconductor and other tools through an intuitive graphical user interface, and GenePattern (freely available). GenePattern was developed at the Broad Institute of Massachusetts Institute of Technology and Harvard University. It is a software environment which incorporates a wide range of already developed tools and has the ability to adopt new methods. The GenePattern platform consequently integrates a large number of analytical tools for genomic data, facilitates data entry and allows parameters to be set with regard to quality control, statistics, gene enrichment analysis, etc. Furthermore, custom modules in R/Bioconductor, MATLAB or the Perl or Java programming languages can be implemented. GenePattern includes 60–100 pre-packaged analysis modules and more complex methodologies (not only for microarray, but also for epigenetic, SNP, proteomic and sequence analyses), aligning analyses into a single, reproducible pipeline that caters for users at all levels of computational experience [54–56].

Validation of results
Selected differentially expressed genes are further investigated by qRT–PCR to confirm the expression patterns seen in microarrays. The use of PCR primers targeting transcript variants is one reason for discrepancies between these two methods, thus care has to be taken to design PCR primers which recognize the same transcript as each microarray probe [57]. The minimum information for publication of quantitative real-time experiments (MIQE) guidelines provide best experimental practice for qRT–PCR to generate data that are more uniform, more comparable and ultimately, more reliable [58].

RNA-Seq METHODS FOR TRANSCRIPTOMIC ANALYSIS
Background
RNA-Seq transcriptomics replaces the hybridization of nucleotide probes with sequencing individual cDNA species followed by counting and mapping algorithms. Emerging methods for these fully quantitative transcriptomic analyses have the potential to overcome the limitations of microarray technology and may replace them. Early attempts reach as far back as 1997 [59], when random clones derived from cDNA libraries were sequenced using fluorescent well and capillary DNA sequencers (i.e. Applied Biosystems ABI PRISM 373, 377 and 3730xl DNA Sequencer). Variations in gene expression levels were deduced from the counts of respective sequence tags. However, restrictions in technology meant sequencing and analysing large numbers of sequences was slow, expensive and labor intensive so that only a relatively small number of clones (usually in the thousands) were sequenced. Besides the small sample size, cloning bias also introduced new drawbacks.

Today, Next-Generation Sequencing techniques have surpassed these relatively low throughput technologies. In principle, all competing products are based on the sequencing-by-synthesis technique [60] or sequencing-by-ligation (http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_058265.pdf). Sequencing-by-synthesis relies on the detection of nucleotides immediately after incorporation into a newly synthesized DNA strand, whereas sequencing-by-ligation involves the binding of known probes to the sequence. While the principle remains unchanged, a number of variants and improvements have been introduced [61].

Number and length of reads obtained using different platforms
Currently, three Next-Generation Sequencing systems [Roche 454 Life Sciences (Pyrosequencing), Illumina and Applied Biosystems (Solid 3 Plus, Solid 4 and Solid 4hq)] are dominating the market for RNA-Seq, with alternative systems emerging (e.g. Helicos BioSciences [62]). These systems can also be used for other specific applications such as the targeted resequencing of genomes, ChIP-Seq or copy number variation analyses [63]. Next-Generation Sequencing systems have brought a dramatic change in scale to DNA sequencing. Traditional Sanger-type plate and capillary sequencers can provide up to 96 reads per run, albeit with longer read lengths in excess of 1000 nt. Roche 454 Pyrosequencing systems generate hundreds of thousands of reads per run and, with the introduction of the latest Flx-Titanium system upgrade, have extended read length up to 500 nt. This extension in sequence read length has been achieved by the use of a thin metal coating (titanium) that is applied to the pico titer plate walls, eliminating crosstalk between individual wells. The coating improves the signal to noise ratio and also increases the number of samples that can be analysed in one plate. Recent announcements by Roche 454 unveiled yet another doubling of read length, targeting the 1000 nt barrier (http://454.com/about-454/news/index.asp?display=detail&id=137). Finally, solid state sequencing systems, such as Illumina or ABI SOLiD, can generate sequence reads in excess of 100 million per run, albeit with the shortcoming of very short read lengths of <50–100 nt. Indeed, the Applied Biosystems Solid 4 and Solid 5500xl (which delivers on the promise of the Solid 4hq) systems have recently been released with up to 300 Gb mappable data throughput. While the scalability factor reduces the cost per run ($US3000 per genome or $US120 per transcriptome, http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/solid-next-generation-sequencing/next-generation-systems/solid4hq.html; http://www.lifetechnologies.com/news-gallery/press-releases/2010/life-techologies-lauches-ew-solid-sequecer-to-drive-advaces-i-c.html), other potentially limiting factors such as short read lengths of 75 bp or less or the necessity of creating fragment libraries still remain.
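The change in scale can be made concrete with a small calculation of bases per run as reads multiplied by read length. The figures below are round numbers taken from the ranges quoted above; where the text only says "hundreds of thousands" of reads, 400 000 is an assumed value chosen purely for illustration.

```python
def bases_per_run(reads, read_length_nt):
    """Raw sequence output of one run, in nucleotides."""
    return reads * read_length_nt

# (reads per run, read length in nt); values are illustrative round numbers.
platforms = {
    "Sanger-type capillary": (96, 1000),
    "Roche 454 Flx-Titanium": (400_000, 500),   # read count is an assumption
    "Illumina / SOLiD": (100_000_000, 50),
}

for name, (reads, length) in platforms.items():
    megabases = bases_per_run(reads, length) / 1e6
    print(f"{name}: {megabases:g} Mb per run")
```

Even with the shortest reads, the short-read systems produce several orders of magnitude more sequence per run than capillary sequencing under these assumptions, which is why read number rather than read length dominates for counting-based applications.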

RNA-Seq gene expression profiling
Due to their read-length restrictions these ultra-high-throughput systems may not necessarily be suited for de novo genome sequencing (with the recent exception of 454 pyrosequencing with its longer reads). However, these limitations are irrelevant for RNA-Seq gene expression profiling. Here, the promise of hundreds of millions of reads creates for the first time the realistic opportunity to assess the transcriptome of an organism on a holistic level without cloning bias (although each platform may have its own associated biases, see below) [64, 65].

Briefly, current technologies utilize a common principle: RNA is converted to a library of cDNA fragments via either RNA or cDNA fragmentation and adapters are attached to one or both ends of the fragments. Individual cDNA species are then amplified into separate clusters to amplify the signal intensity. These clusters are then sequenced from one end (single-end sequencing) or both ends (paired-end sequencing) and the reads aligned to a reference genome or transcriptome [43, 65, 66]. The number of sequencing reads mapped to each gene is then tabulated and normalized. In general, the larger the genome and the more complex the transcriptome, the higher the number of reads and the greater the sequencing depth that is required for adequate coverage [43].
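The tabulation and normalization step can be sketched as follows. The sketch assumes reads have already been aligned and are represented only by their start coordinate on a single reference sequence, and it normalizes with reads per kilobase per million mapped reads (RPKM), which is one widely used scheme chosen here as an assumption rather than taken from the text. Gene names, coordinates and read positions are made up.

```python
def count_reads(read_starts, genes):
    """Count reads whose start coordinate falls inside each gene.

    genes maps a gene name to a half-open (start, end) interval; the
    intervals are assumed not to overlap.
    """
    counts = {name: 0 for name in genes}
    for pos in read_starts:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
                break
    return counts

def rpkm(counts, genes, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    values = {}
    for name, n in counts.items():
        start, end = genes[name]
        values[name] = n * 1e9 / ((end - start) * total_mapped_reads)
    return values

genes = {"g1": (0, 1000), "g2": (2000, 4000)}
read_starts = [10, 500, 999, 2000, 2500, 3000, 3999, 5000]
counts = count_reads(read_starts, genes)
print(counts)
print(rpkm(counts, genes, total_mapped_reads=len(read_starts)))
```

Although g2 collects more reads than g1 in this example, it is twice as long, so its length-normalized value is lower; this is the reason raw counts are not compared directly between genes.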

Standard protocols use random fragmentation of cDNA or RNA, which means that more than one fragment may be produced from a single transcript. However, different processing methods can be combined with RNA-Seq and give the advantage that only a single tag for each transcript is produced. Such processing methods include serial analysis of gene expression (SAGE) (SOLiD-SAGE; http://tools.invitrogen.com/content/sfs/manuals/SOLiD_SAGE_man.pdf), cap analysis of gene expression (CAGE) and massively parallel signature sequencing (MPSS) [67]. Disadvantages include that non-coding RNAs may not be detected, that the methods rely on the presence of particular restriction sites [68], and that the fragments may be considerably shorter (for example, 27 bp for the SOLiD-SAGE protocol) and therefore do not detect splice isoforms [42].
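The dependence of tag-based methods on a restriction site can be illustrated with a minimal sketch that extracts one tag per transcript. It assumes a SAGE-style scheme in which the tag is the sequence immediately 3′ of the last occurrence of an anchoring site; the site CATG and the function name `sage_tag` are assumptions for illustration and are not taken from the protocol cited above.

```python
def sage_tag(transcript, site="CATG", tag_length=27):
    """Return the tag immediately 3' of the last anchoring site.

    Returns None when the transcript lacks the site or when fewer than
    tag_length nucleotides follow it, i.e. when no tag can be produced.
    """
    pos = transcript.rfind(site)
    if pos == -1:
        return None
    start = pos + len(site)
    tag = transcript[start:start + tag_length]
    return tag if len(tag) == tag_length else None

# Made-up transcripts: one with a usable site, one without any site.
with_site = "GGAACATGTTTCATG" + "ACGT" * 7
without_site = "GGAACCTTGGAACC" * 3
print(sage_tag(with_site))
print(sage_tag(without_site))
```

The second transcript yields no tag at all, which mirrors the limitation noted above: transcripts lacking the restriction site are invisible to such methods, and a single short tag cannot distinguish splice isoforms that share the same 3′ end.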

More recently, experimental evidence indicates

that RNA fragmentation, compared to cDNA frag-

mentation, may significantly improve the uniformity

of sequence coverage across transcripts, so allowing

greater sensitivity of detection, accuracy of quantifi-

cation and completeness of splice and exon maps.

RNA fragmentation may work better because cDNA synthesis introduces a 3′ bias, and because RNA secondary structure may mean that priming is not truly random, with some sites favoured over others [42].

Additionally, whether a strand-specific protocol is

used in RNA-Seq is an important factor to consider

[68], particularly when studying mammals where

antisense transcription has been shown to be a ubi-

quitous phenomenon [69].

Advantages of RNA-Seq

Initially, Next-Generation Sequencing techniques

were used to provide insights into the way genes

are expressed and regulated in cells. More recently,

they have also been heavily utilized to determine

new classes of RNA, SNPs, unknown transcripts,

splicing events, etc. which were inaccessible on a

global level using older technologies [42, 44, 70–74].

The digital nature of RNA-Seq gene expression

studies holds the promise of true quantitative analyses

[43]. While the implementation of this technique is still in its infancy and essentially at the validation stage, its currently prohibitive cost is expected to decrease and the technology will become increasingly accessible. Of the currently dominant Next-Generation Sequencing technologies, the Illumina and SOLiD platforms are better suited to RNA-Seq applications than Roche 454 pyrosequencing. This is largely due to the much greater number of individual sequence reads and the resulting increased depth of coverage. If RNA-Seq is to be combined with de novo genome sequencing, Illumina/SOLiD data can easily be augmented by paired-end pyrosequencing reads, creating a robust genome scaffold for the shorter Illumina/SOLiD reads.

RNA-Seq sets the new gold standard for gene

expression studies as transcripts from all biologically

relevant abundance classes should theoretically be

able to be detected assuming enough reads are

collected from a sample [42, 43, 70]. Indeed, using

one of the earlier RNA-Seq gene expression tech-

nologies (digital gene expression profiling) with 20

million tags per library, 10–20% more transcripts

were detected than with microarrays, a majority of

which were expressed at levels below the sensitivity

threshold of microarray platforms [41]. A comparison

of the Illumina sequencing platform with the

Affymetrix microarray platform showed that 81% of differentially expressed genes from arrays were also detected with Illumina, and more of these genes were true positives with the Illumina technology [71].

Additionally, comparison of relative RNA-Seq read

densities to published qRT–PCR measurements for

787 genes in two reference RNA samples yielded a

nearly linear relationship across five orders of mag-

nitude, indicating that RNA-Seq read counts give

accurate relative gene expression measurements

across a very broad dynamic range [44].

Alternative splice variants have been proposed as a

primary driver of the evolution of phenotypic com-

plexity in mammals [44]. The promise that RNA-Seq

gene expression studies can easily identify these using

direct sequencing may prove to be a major advantage

compared to hybridization methods which cannot

identify closely related forms except via expensive

high-resolution tiling arrays. For example, RNA-

Seq methods found that 3500 mouse genes were

alternatively spliced [42] and 4096 previously un-

known splice junctions in 3106 human genes were

detected in a recent study [70]. Indeed, in humans

there is evidence for multiple isoforms for >95% of all

multi-exon genes. These transcripts are the result of

alternative transcription starts, alternative splicing,

RNA editing and alternative poly-adenylation [74].

RNA-Seq is also useful for discovering novel

microRNAs (small RNAs ~20–25 nt in length)

[75]. However, sequence read variations due to ma-

chine error may be higher than the variation found

within a microRNA family, so one must be cautious

in interpreting this data. Also, the accuracy of RNA-

Seq readings may not necessarily be better than that

of microarrays: indeed, a recent study which used

synthetic microRNAs found that microarrays mea-

sured the expression levels of microRNAs better

than RNA-Seq [76].

Examples of biological applications of RNA-Seq

Although RNA-Seq can still be considered an emerging technology, it has generated new knowledge of

biological systems. For example, RNA-Seq was used

to characterize the total non-ribosomal transcriptome

of human, chimpanzee and rhesus macaque brain

[77]. In this study, the authors showed that while

transcriptome divergence between species increases

with evolutionary time, intergenic transcripts show

more expression differences among species and exons

show less. These as yet uncharacterized, evolutionarily conserved transcripts that exist in the human brain

may play roles in transcriptional regulation and con-

tribute to evolution of human-specific phenotypic

traits. Another example relates to the use of a

novel, strand-specific RNA-Seq method. Using

this method with tumors and matched normal

tissue from three patients with oral squamous cell

carcinomas, Tuch et al. [78] showed that it accurately

measures allelic imbalance and that measurement on

the genome-wide scale yields novel insights into

cancer etiology. Cancer-related functions such as

cell adhesion and differentiation functions were

found to be enriched in the set of genes differentially

expressed in the tumors, but, unexpectedly, also in

the set of allelically imbalanced genes.

RNA-Seq TRANSCRIPTOMIC DATA

Volume of data produced

As with the analog technique, a large volume of data

is produced. Indeed, the arrival of the RNA-Seq era

increased the data volume by several orders of magnitude, and

handling of this amount of data is an important con-

sideration, both in terms of collecting and managing

the data and the computer hardware (server space)

and software required [43, 79, 80]. The amounts of

data are so large that if current trends continue, it will

soon cost less to sequence a base of DNA than to

store it on a hard disk [80]. Alternative data manage-

ment concepts such as cloud-based computing (essen-

tially renting server space) are already available [81].

For example, scientists can currently establish an

account with Amazon Web Services or Microsoft

Azure, attach any one of several large public

genome-oriented data sets to the virtual machine

and analyse this data using any one of several installed

software packages. There are also a growing number

of academic-based clouds, for example the Open

Cloud Consortium (http://opencloudconsortium.org/). These may be a better option long-term

as academic clouds are more likely to be able to

tune their performance to the specific needs of the

scientific community, for example data read and

write speeds need to be very high for genomic data

[80].

Statistical tests

Various tests of differential expression have been

proposed for replicated RNA-Seq data using bino-

mial, Poisson, negative binomial or pseudo-

likelihood models for the counts [66, 82]. How to

best analyse RNA-Seq data is an active field of re-

search. Robinson and Smyth [82] have recently de-

veloped a method using the negative binomial

distribution to model over-dispersion relative to

the Poisson, and use conditional weighted likelihood

to moderate the level of over-dispersion across genes.

This method is suitable even when the number of

replicates is very small. Additionally, methods from

the SAGE literature may be useful for analysing

RNA-Seq data [66].
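
The motivation for a negative binomial model can be sketched in a few lines of Python. The example below compares the variance of replicate counts with the Poisson expectation (variance equal to the mean) and derives a simple method-of-moments dispersion under the parameterization var = mu + phi * mu^2. The counts and the function name are hypothetical, and the moment estimator is a deliberate simplification; it is not the conditional weighted likelihood approach of Robinson and Smyth [82].

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments estimate of phi in var = mu + phi * mu**2.

    Returns 0.0 when the counts are no more variable than a Poisson.
    """
    mu = mean(counts)
    var = variance(counts)  # unbiased sample variance
    if mu == 0:
        return 0.0
    return max(0.0, (var - mu) / mu ** 2)

# Hypothetical read counts for one gene across four biological replicates
# (assumed to come from libraries of equal size).
replicate_counts = [10, 20, 30, 40]
phi = moment_dispersion(replicate_counts)
print(f"mean={mean(replicate_counts)}, "
      f"variance={variance(replicate_counts):.2f}, phi={phi:.3f}")
```

A dispersion well above zero, as in this toy case, suggests that a Poisson model would understate the variability between replicates.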

Analysing counts of alternative isoforms creates

particular analytical problems. Initial analysis meth-

ods, for example the Poisson test, have focused

on first assigning reads to transcripts and then testing

for differential expression. Stegle et al. [74] describe a

modification to this approach, the Poisson Region

test, which only utilizes information about the

discriminative regions of a gene. They also present

a non-parametric kernel method, the Maximum

Mean Discrepancy (MMD) test, which directly

tests for differences of the observed read distri-

bution from different samples in the complete

absence of any annotation information. In compar-

ing these three methods using simulated and

real data, the Poisson Region test was the most

sensitive. However, the MMD test was still able to

detect 75% of the differentially expressed tran-

scripts that the Poisson Region test could.

Additionally, the MMD test has the advantage that

it can detect differential expression even if only one

annotation is currently known for a gene. It also does

not depend on the accuracy of existing gene

annotations.

Experimental design and quality control

Many issues must be considered when planning an

RNA-Seq experiment (Figure 1). No matter which

method is used or how many reads are generated,

using generally accepted experimental design principles, such as randomization of samples to lanes or plates and sufficient biological replication, is recommended when designing RNA-Seq experiments.

Biological replication is essential as otherwise the

results from an experiment cannot be generalized.

Similarly, randomization and blocking are equally

important factors in reducing the effects of batch,

lane or flowcell variations. We refer the reader to

the excellent paper which Auer and Doerge have

recently written [66]. This clearly explains key stat-

istical principles which should be incorporated when

designing and analysing RNA-Seq experiments.

They also provide practical suggestions, for example

barcoding may be a useful tool for creating balanced

block designs.

Quality control is also an important aspect of

RNA-Seq data analysis. For example, it is useful to

plot both the proportions of each nucleotide type,

and the base quality scores, for each sequence pos-

ition. A filter can then be applied to trim the se-

quence ends if they contain bases which are of low

quality or which have atypical nucleotide

proportions.
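
The two checks described above can be sketched as follows. This is a minimal illustration, assuming Phred+33 encoded quality strings and a threshold of 20; the function names, the threshold and the example reads are hypothetical, and dedicated tools such as ShortRead or the FASTX-Toolkit would normally be used in practice.

```python
from collections import Counter

def phred33_scores(quality_string):
    """Decode a Phred+33 quality string into integer scores (assumed encoding)."""
    return [ord(char) - 33 for char in quality_string]

def trim_low_quality_ends(sequence, quality_string, min_quality=20):
    """Strip bases from both ends until a base with score >= min_quality is reached."""
    scores = phred33_scores(quality_string)
    start, end = 0, len(scores)
    while start < end and scores[start] < min_quality:
        start += 1
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    return sequence[start:end], quality_string[start:end]

def base_proportions(reads):
    """Proportion of A, C, G and T at each position across the reads."""
    length = min(len(read) for read in reads)
    table = []
    for position in range(length):
        counts = Counter(read[position] for read in reads)
        table.append({base: counts[base] / len(reads) for base in "ACGT"})
    return table

# Hypothetical read: '#' encodes a score of 2 and 'I' a score of 40.
trimmed_sequence, trimmed_quality = trim_low_quality_ends("ACGTACGT", "##IIII##")
print(trimmed_sequence, trimmed_quality)
print(base_proportions(["ACGT", "AAGT"])[1])
```

Plotting the per-position proportions and mean scores for a whole lane would then show at which positions trimming is warranted.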

Mapping considerations

With analog expression data, one usually knows

what the genes are in advance, whereas with

RNA-Seq, all transcripts need to be mapped back

to a reference genome or transcriptome.

Difficulties in mapping transcripts to genes can

occur and mammalian genomes in particular create

difficulties as they are large, complex and often con-

tain families of paralogous genes, repeats and retro-

posed pseudogenes for highly expressed

housekeeping genes. Therefore individual reads, par-

ticularly shorter ones, may map to more than one

gene. Such multireads cannot simply be

discarded, as these genes, for example those in the

ubiquitin family, will then be undercounted or not

even reported. Alternative approaches such as distri-

buting multireads in proportion to the number of

unique and splice reads recorded at similar loci or

using orthogonal data (for example RNA polymer-

ase II occupancy data) have been proposed to help

resolve these issues [42, 73].
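
The proportional allocation of multireads can be illustrated with a short Python sketch. It assigns each multiread fractionally to its candidate genes according to the number of uniquely mapped reads at each gene. The data structures, counts and the even-split fallback are assumptions made for illustration, and splice reads are ignored for brevity.

```python
def distribute_multireads(unique_counts, multireads):
    """Allocate each multiread fractionally across its candidate genes.

    unique_counts: dict mapping gene -> number of uniquely mapped reads
    multireads: list of tuples, each listing the genes one read maps to
    """
    totals = {gene: float(n) for gene, n in unique_counts.items()}
    for candidates in multireads:
        weights = [unique_counts.get(gene, 0) for gene in candidates]
        weight_sum = sum(weights)
        for gene, weight in zip(candidates, weights):
            # Fall back to an even split when no candidate has unique reads.
            if weight_sum > 0:
                share = weight / weight_sum
            else:
                share = 1 / len(candidates)
            totals[gene] = totals.get(gene, 0.0) + share
    return totals

# Hypothetical counts for two ubiquitin family genes sharing four multireads.
unique = {"UBB": 30, "UBC": 10}
shared = [("UBB", "UBC")] * 4
print(distribute_multireads(unique, shared))
```

Each shared read contributes 0.75 to the first gene and 0.25 to the second, so neither gene is undercounted as it would be if multireads were discarded.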

Mapping splice junctions is also an important issue

to consider when mapping reads from complex

mammalian (and other) genomes where reads may

span large introns [42, 73]. Two main approaches are

currently used: the reference genome may be sup-

plemented with known splice junction information

(including information from gene models) or alternatively the splice junctions can be determined without a reference annotation. TopHat (http://tophat.cbcb.umd.edu/; [83]) is a powerful, freely available mapping program which can map reads using any combination of these methods. A range of other mapping programs also exists: see Table 2 in Pepke et al. [73] for an excellent summary. However, developing in silico methods to map splice junctions agnostically (i.e. independently of existing genome annotations) is still an active area of research. Hammer et al. [84] have recently developed a series of novel bioinformatic tools which advance RNA-Seq bioinformatics toward unbiased transcriptome capture. It is likely that further tools and methodology will be developed in the short to medium term.

Figure 1: Description of RNA-Seq platform, protocol considerations and workflow.

Normalization and biases

As for the analog transcriptomic analyses, the efficiency of RNA extraction and the quality of

cDNA synthesis remain as variables. Additionally,

the specific technical properties of RNA-Seq data

differ from those of analog data, leading to novel systematic biases which must be accounted for in the

analysis.

The number of reads obtained per sample usually

differs for RNA-Seq. Thus a range of normaliza-

tion methods for RNA-Seq based on the total

number of reads for each sample have been reported.

However it has recently been shown that the com-

position of the RNA population is also import-

ant [85]. Transcripts which are highly expressed in

only some samples, due to true biological differences

(e.g. genes which are only expressed in liver and

not kidney) or contamination, reduce the sequen-

cing ‘real estate’ available for the remaining genes,

meaning that these genes will be under-represented

if the data are normalized solely using a total read count approach. A normalization method which accounts for this issue, the Trimmed Mean of M values (TMM), has been proposed by Robinson and Oshlack [85]. The method assumes, similarly

to common microarray normalization methods

(e.g. loess and quantile) that the majority of genes

are not differentially expressed. It then determines

the relative RNA production for all genes in a

sample using a global fold change approach calculated by using trimmed means. This normalization method can be implemented using the edgeR package in Bioconductor ([86]; www.bioconductor.org).
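
The idea behind TMM can be sketched in Python as below. This is a simplified illustration and not the edgeR implementation: it trims only on the log fold changes (M values), uses an unweighted mean, and the function name, trimming percentage and counts are assumptions chosen for the example.

```python
import math

def tmm_factor(sample, reference, trim_percent=30):
    """Simplified TMM scaling factor of `sample` relative to `reference`.

    Simplifications (assumptions of this sketch): only M values are trimmed
    and the trimmed mean is unweighted.
    """
    n_sample, n_reference = sum(sample), sum(reference)
    # M value: log2 ratio of the read proportions of each gene.
    m_values = sorted(
        math.log2((s / n_sample) / (r / n_reference))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    if not m_values:
        raise ValueError("no genes with non-zero counts in both samples")
    k = (len(m_values) * trim_percent) // 100  # genes trimmed from each tail
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))

# Hypothetical counts: nine genes are unchanged, while one gene is highly
# expressed in the sample only and consumes half of its reads.
reference = [100] * 10
sample = [100] * 9 + [1100]
factor = tmm_factor(sample, reference)
effective_library_size = sum(sample) * factor
print(factor, effective_library_size)
```

In this toy case the nine unchanged genes would appear halved under total read count scaling, whereas rescaling the library size by the trimmed-mean factor restores their equality between the two samples.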

Furthermore, current RNA-Seq protocols usually

use random fragmentation of the RNA (or cDNA)

which implies that the expected count for a transcript

is proportional to the gene’s expression level

multiplied by its transcript length, as longer tran-

scripts generate more fragments. This means that

longer genes have higher read counts and so,

relative to shorter genes, are more likely to be

found to be differentially expressed, particularly if

the gene is also a lowly expressed one [67, 87].

Normalization methods which account for gene

length, for example reads per kilobase per million

mapped (RPKM) [42] have been developed.

However, the problem cannot be corrected by

simply dividing by the length of the transcript or

some modification of this, as while this results in

an unbiased measure of expression, the data variance

is still affected in a length dependent manner. This

problem has been observed for a variety of different

analysis methods, experimental designs and sequencing platforms. The bias causes most problems when the results for different genes are compared, when ranked gene lists are created, or when gene category over-representation analysis is undertaken; a correction for the latter has recently been published ([87]; discussed further below).
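
As a worked example of the length correction, the Python sketch below computes RPKM as read count × 10^9 / (transcript length in bp × total mapped reads). It also shows, under an assumed Poisson sampling model, why the correction leaves a length dependence in the variance: the relative sampling error of a count shrinks as the count grows. The function names and numbers are hypothetical.

```python
import math

def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

def poisson_cv(expected_count):
    """Coefficient of variation of a Poisson count (assumed sampling model)."""
    return 1 / math.sqrt(expected_count)

total_reads = 10_000_000  # hypothetical library size
# Two hypothetical genes with the same expression level but different lengths.
short_gene = {"reads": 500, "length": 2000}
long_gene = {"reads": 1000, "length": 4000}

for name, gene in (("short", short_gene), ("long", long_gene)):
    print(name,
          rpkm(gene["reads"], gene["length"], total_reads),
          round(poisson_cv(gene["reads"]), 4))
```

Both genes receive the same RPKM, yet the longer gene is measured with a smaller relative error, which is one way to see why it is more readily called differentially expressed.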

This bias is compounded by the fact that current technologies require both amplification and fragmentation steps for the mRNA/cDNA species used in the analyses [72]. Emerging technologies such as single molecule real-time (SMRT) DNA sequencing [62, 88], long-read approaches such as nanopore DNA sequencing [89, 90], and direct-read genetic sequencing using transmission electron microscopy (http://www.zsgenetics.com) show promise to overcome this new system-based bias.

Systematic biases in the bases sequenced and

sequencing errors also need to be considered.

These result from the combined effects of the manu-

facturer recommended laboratory methods, se-

quence read alignment tools and base calling

algorithms utilized. Recent advances such as the ability to obtain longer reads and paired-end sequencing alleviate these issues; however, further optimizations are desirable [64, 72].

Available software

The tools available for RNA-Seq derived transcriptomic data analyses are not as mature yet as those for

analog data. However, in this fast moving science

application, software developers are rapidly closing

the gap. Recently, several new software packages

and modules for existing tools have been released.

One of the most well known and established bio-

informatic companies, DNASTAR, has recently

released version 3 of its gene expression analysis

software, ArrayStar (a beta version of ArrayStar 4 is

also available). An optional module, QSeq (http://www.dnastar.com/products/QSeq.php), has been

developed specifically for analysis of RNA-Seq

gene expression data. Data sets from the most

widely used platforms such as Illumina and Roche

454 can be imported directly. The range of sup-

ported analyses includes transcript discovery and

mapping, detection of alternative splicing events

and transcriptome quantification. Similarly, CLCbio

(http://www.clcbio.com) has released its CLC

Genomics Workbench software suite which integrates genomics (de novo genome sequencing and re-sequencing),

transcriptomics (RNA-Seq gene expression) and epi-

genomics (ChIP-seq analysis) in one environment.

GenomeQuest takes a decentralized approach by

offering a web-based service. Its RNA-Seq workflow (http://wiki.genomequest.com/index.php/RNA_Seq) accesses the databases of transcriptomes

and genomes while being able to utilize third-party

tools such as GeneSpring and Spotfire for further

analyses.

Open source software also plays an important part

in the analysis of RNA-Seq data as it is able to adapt

quickly to changes in technology, and is not slowed

by the need to wait for official release dates like its

commercial counterparts. With a large Bioconductor

community developing much of this software using

the R programming language ([91, 92]; http://www.bioconductor.org), such software is usually

of similar quality to that of commercial software or

may even surpass it. The DEGseq package [93] for

analysing RNA-Seq data and edgeR package

[82, 86], both from the Bioconductor suite are two

examples of open source software that are freely

available for use in analysing RNA-Seq data.

Additionally, Cufflinks (http://cufflinks.cbcb.umd.edu/; [94]) is an open source program which can

be used to assemble transcripts, estimate their abun-

dances, and test for differential expression and regu-

lation in RNA-Seq samples. Cufflinks is particularly

useful for researchers who are interested in alterna-

tive transcript or splice variants as it can identify novel

transcripts and probabilistically assign reads to iso-

forms without the need for prior gene annota-

tion knowledge. Finally, the ShortRead R package

and FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) are two freely available packages

which enable quality control of short read

RNA-Seq data.

FUNCTIONAL ANALYSIS OF TRANSCRIPTOMIC DATA

Classification concepts

For both analog and RNA-Seq data, gene filtering

methods aim to find a list of differentially expressed

genes that are significantly associated with the

phenotype studied. Tens to hundreds of genes, or

an entire gene network, may be the causal link to

a specific phenotype in response to a particular

stimulus. Targeting networks which affect a given

phenotype is likely to require the identification of

genes that serve as key nodes in the network (key

information points). There may also be interest in

the response across multiple species (comparative

genomics).

How variations in gene expression relate to func-

tional changes in an organism is a question of key

biological interest. Gene category over-representa-

tion analysis is a widely used method which helps

determine which biological classes (functional

groups) are significantly overrepresented in a gene

list. The analysis comprises grouping genes into

classes by some biological property, commonly

Gene Ontology (GO) categories but alternatives

are possible such as Kyoto Encyclopedia of Genes

and Genomes (KEGG) pathways and testing

whether differentially expressed genes are over-

represented in any categories [87]. This information

combined with knowledge about which pathways

the genes are found in, if available, can result in a

powerful analysis and deepen the biological under-

standing of the gene–organism relationship.

Applications of gene classification

Many tools are available for gene category

over-representation analysis, including GOstats

[95], FUNC [96], EASE [97] and DAVID [98].

More comprehensive summaries are given at:

http://www.geneontology.org/GO.tools.microarray.shtml and in the Supplementary Data S1 of

Huang et al. [98]. Tools for specialized purposes

also exist, for example AgriGO, the successor of

EasyGO ([99]; http://bioinfo.cau.edu.cn/agriGO/),

is a web-based tool which is especially useful for

agricultural studies as it supports Affymetrix

GeneChips for both crops and farm animals and pro-

vides excellent capabilities for visualization of the

results. The general assumption underlying the

methodology for each tool is that, under the null

hypothesis, each gene has an equal probability of

being detected as differentially expressed, hence the

number of genes associated with a category that

overlap with the set of differentially expressed

genes follows a hypergeometric distribution [87].
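
The hypergeometric test underlying these tools can be written in a few lines of Python. The sketch below computes the upper-tail probability of seeing at least the observed overlap between a category and the differentially expressed set; the function name and the gene numbers are hypothetical, and no multiple-testing correction is included.

```python
from math import comb

def hypergeom_upper_tail(universe_size, category_size, de_size, overlap):
    """P(X >= overlap) when de_size genes are drawn without replacement
    from a universe containing category_size genes of the category."""
    max_overlap = min(category_size, de_size)
    total = comb(universe_size, de_size)
    favourable = sum(
        comb(category_size, k) * comb(universe_size - category_size, de_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favourable / total

# Hypothetical example: 20 genes in the universe, 5 in the category,
# 4 differentially expressed genes of which 3 belong to the category.
p_value = hypergeom_upper_tail(20, 5, 4, 3)
print(round(p_value, 4))
```

The choice of universe matters here: the same overlap gives a different P-value if genes that could never be detected are included in or excluded from the background set.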

Standard gene category over-representation ana-

lysis for RNA-Seq data has the problem that genes

with longer transcripts are more likely to be differ-

entially expressed ([67, 87], see above) except for

protocols which result in only a single tag per transcript. Thus any category with a preponderance

of long genes will be more likely to be determined as

over-represented than a category with shorter genes.

Young et al. [87] have recently published a method

which corrects for this selection bias: the likelihood

of differential expression as a function of transcript

length is first quantified and then incorporated in the

statistical test of each category’s significance either by

using this information in a weighted resampling pro-

cedure or to calculate success and failure probabilities

for the Wallenius non-central hypergeometric distri-

bution. Similar results are obtained using either

method; however the latter is considerably less com-

putationally intense. Adjusting the results in this way

compared to a standard GO analysis was found to

have a substantial effect (~20% of significant GO

categories changed) on the results for a prostate

cancer data set, with the adjusted results being

more consistent with previous biological results.

Additionally, this adjustment may be useful for

both analog and RNA-Seq gene expression data

with respect to more highly expressed genes or

those with multiple probes for some genes, as both

these factors also increase the probability of a gene

being called differentially expressed [87].
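
The weighted resampling variant of this correction can be sketched in Python as follows. This is only an illustration of the idea: the probability of differential expression as a function of length is approximated here by the proportion of differentially expressed genes in each length bin (a stand-in, assumed for simplicity, for the smooth function fitted by Young et al. [87]), and the gene lengths, labels, function names and number of resamples are all hypothetical.

```python
import random

def binned_weights(lengths, is_de, n_bins=2):
    """Proportion of DE genes in each length bin, assigned back to each gene."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    weights = [0.0] * len(lengths)
    bin_size = -(-len(order) // n_bins)  # ceiling division
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        rate = sum(is_de[i] for i in members) / len(members)
        for i in members:
            weights[i] = rate
    return weights

def resampling_p_value(weights, in_category, n_de, observed,
                       n_resamples=2000, seed=0):
    """Fraction of length-weighted random gene sets whose overlap with the
    category is at least the observed overlap (with an add-one correction)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Efraimidis-Spirakis keys give weighted sampling without replacement.
        keys = [(rng.random() ** (1.0 / w) if w > 0 else 0.0, i)
                for i, w in enumerate(weights)]
        chosen = sorted(keys, reverse=True)[:n_de]
        overlap = sum(in_category[i] for _, i in chosen)
        if overlap >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)

# Hypothetical data: eight genes, the four longest form the category.
lengths = [500, 800, 1000, 1200, 3000, 4000, 5000, 6000]
is_de = [0, 0, 0, 1, 1, 1, 0, 1]
in_category = [0, 0, 0, 0, 1, 1, 1, 1]
weights = binned_weights(lengths, is_de)
observed = sum(d and c for d, c in zip(is_de, in_category))
print(weights)
print(resampling_p_value(weights, in_category, n_de=sum(is_de), observed=observed))
```

Because long genes are drawn more often in the resampled sets, a category made up of long genes needs a larger overlap to reach significance than it would under the standard hypergeometric test.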

Finally, consolidating multiple probes that map to

the same gene into a single count, and determining

which genes to include in the ‘universe’ for an ana-

lysis are issues already considered for microarray data

[95] and are equally important for RNA-Seq data in

gene category over-representation analysis.

Applications of pathway interaction networks

Pathway analysis is also a useful tool for both analog

and RNA-Seq data, as it allows the identification of

nodes that are central to interactions between differ-

entially expressed genes. Ingenuity Pathway Analysis

software (IPA, Ingenuity Systems, Inc., Redwood

City, CA, USA; www.ingenuity.com) is a valuable

package for determining biological networks for

mammalian data. Although based on human, rat

and mouse data, because of species homologies, it

is useful for mammalian studies in general [16, 25].

The pathway information in IPA is extracted from

the scientific literature. Used in combination with

gene ontology enrichment, pathway enrichment

analysis, network construction and comparison ana-

lysis it can lead to novel biological insights. To use

IPA, the full data set from an analysis (including gene

identifications (e.g. GenBank), fold changes and

FDR or P-values) is uploaded into IPA. The IPA

library of canonical pathways identifies those path-

ways that are the most significant to the set of dif-

ferentially expressed genes, as defined using selected

fold change and FDR (or P-value) cut-offs. The sig-

nificance of the association between this set of dif-

ferentially expressed genes and a specific canonical

pathway is estimated in two ways: (i) the proportion

of genes in the data set included in the canonical

pathway and (ii) Fisher’s exact test, which is used to calculate a P-value giving the probability that the association between the data set and the canonical pathway is due to chance alone.

The IPA software was designed for microarray

data. However, assuming that RNA-Seq gene ex-

pression data can be successfully mapped, it should

be feasible to also use IPA for this type of data.

However, it is important to note that the aforemen-

tioned problem of longer genes having a greater

probability of being differentially expressed is likely

to also be an issue in IPA analyses. We are not aware

of a publication that currently provides a solution to

this problem. It seems likely that a similar approach

to that used to correct the problem for gene category

over-representation analysis may be able to be used.

An interesting alternative to IPA can be found in

Cytoscape (http://www.cytoscape.org/), an open

source software platform for visualizing and integrat-

ing networks, biological pathways, annotation and

gene expression profiles. Cytoscape’s modular

design means that community-based solutions can

be easily incorporated via plugins, meaning that

new features are often more readily available than

with commercial applications. For example, the

Genoscape plugin integrates data from GenoScript

(a transcriptome database) with the KEGG database

to highlight gene expression changes and their re-

spective statistical significances. While Cytoscape was

originally designed for biologists, more recent ver-

sions have expanded its functionality to a general

platform for network analyses. This will potentially

facilitate the development of novel plugins with a

synergy effect beyond their original purpose.

Finally, network language can also be used to de-

scribe pair-wise relationships among genes and to

cluster genes with similar expression patterns into

pathways or regulatory networks, which can be de-

picted in heat maps. A heat map is a commonly used

tool to visualize data generated from microarrays,

and potentially RNA-Seq data, reflecting the level

of expression of many genes across a number of com-

parable samples (e.g. different physiological status,

different breeds, etc.) in a graphical representation

where the changes in values of the chosen variable

are represented as colours in a two-dimensional map

[100, 101].

CONCLUSION

The analysis of gene expression has evolved from the investigation of individual genes, through the analysis of thousands of genes, to what is now the measurement of

potentially all genes in a sample. While these tech-

nologies promise a much deeper understanding of

the intricate relationships between gene expression

and internal and external stimuli, the rapidly increas-

ing amount of data to be analysed and to be put in

context creates veritable challenges to biologists and

bioinformaticians. As the cost of storing data be-

comes prohibitive, concepts such as cloud comput-

ing may become critical for the success of future

RNA-Seq experiments. Integrating the data from

both microarray and RNA-Seq experiments with

other ‘omics’ data sets opens up new possibilities for

creating meaningful informational networks which

will aid our understanding of biological systems.

SUPPLEMENTARY DATA

Supplementary data are available online at http://bfgp.oxfordjournals.org/.

Key Points

• DNA and oligonucleotide microarray (chip) technology is a high-throughput analog method that has become a standard tool for the analysis of genome-wide expression patterns, whether there is a sequenced genome or not, to establish gene networks and identify new genes involved in a phenotype.

• As microarrays are an analog technology, they have certain limitations. For example, they rely on hybridization which affects their ability to detect low abundance genes or distinguish alternative forms. Also, the knowledge obtained is restricted to the tiled genes. However, the lower cost and established protocols of microarray technology mean that it currently remains a viable option.

• RNA-Seq is an emerging method for fully quantitative transcriptomic analysis (i.e. transcripts are counted) and has the potential to overcome the limitations of microarray technology, eventually replacing these analog methods. It is clear that RNA-Seq may be the new Gold standard. However, the data volume is increased by several magnitudes, while the tools available for data analyses are not yet as mature as those used for microarray analyses.

• Changes in technological platforms often require the development of naive software and analysis applications and care must be taken when applying algorithms developed for different underlying principles.

• Gene category over-representation analysis and pathway analysis are useful tools for analysing gene expression data from microarrays or RNA-Seq and deepen the understanding of the gene–organism relationship.

References

1. Collins FS, Lander ES, Rogers J, et al. Finishing the euchromatic sequence of the human genome. Nature 2004;431:931–45.

2. Burt DW. The cattle genome reveals its secrets. J Biol 2009;8:36.

3. Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005;102:15545–50.

4. Venter JC, Adams MD, Myers EW, et al. The sequence of the human genome. Science 2001;291:1304–51.

5. Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature 2001;409:860–921.

6. Lunshof JE, Bobe J, Aach J, et al. Personal genomes in progress: from the human genome project to the personal genome project. Dialogues Clin Neurosci 2010;12:47–60.

7. Collins FS, Morgan M, Patrinos A. The human genome project: lessons from large-scale biology. Science 2003;300:286–90.

8. de Vogel-van den Bosch HM, Bunger M, de Groot PJ, et al. PPARalpha-mediated effects of dietary lipids on intestinal barrier gene expression. BMC Genomics 2008;9:231.

9. Knoch B, Barnett MPG, McNabb WC, et al. Dietary arachidonic acid-mediated effects on colon inflammation using transcriptome analysis. Mol Nutr Food Res 2010;54:1–13.

10. Knoch B, Barnett MPG, Zhu S, et al. Genome-wide analysis of dietary eicosapentaenoic acid- and oleic acid-induced modulation of colon inflammation in interleukin-10 gene-deficient mice. J Nutrigenet Nutrigenomics 2009;2:9–28.

11. Langmann T, Moehle C, Mauerer R, et al. Loss of detoxification in inflammatory bowel disease: dysregulation of pregnane X receptor target genes. Gastroenterology 2004;127:26–40.

12. Rakhshandehroo M, Sanderson LM, Matilainen M, et al. Comprehensive analysis of PPARalpha-dependent regulation of hepatic lipid metabolism by expression profiling. PPAR Res 2007;2007:26839.

13. Rivera E, Flores I, Rivera E, et al. Molecular profiling of a rat model of colitis: validation of known inflammatory genes and identification of novel disease-associated targets. Inflamm Bowel Dis 2006;12:950–66.

Analog and Next-Generation transcriptomic tools page 13 of 16 by guest on D

ecember 4, 2014

http://bfgp.oxfordjournals.org/D

ownloaded from

Page 14: A comparison of analog and Next-Generation transcriptomic tools for mammalian studies

14. Roy N, Barnett M, Knoch B, et al. Nutrigenomics appliedto an animal model of inflammatory bowel diseases:transcriptomic analysis of the effects of eicosapentaenoicacid- and arachidonic acid-enriched diets. Mutat Res-FundMolM 2007;622:103–16.

15. Sanderson LM, de Groot PJ, Hooiveld GJEJ, et al. Effect ofsynthetic dietary triglycerides: a novel research paradigm fornutrigenomics. PLoSONE 2008;3:e1681.

16. Bonnet A, La Cao KA, SanCristobal M, et al. In vivo geneexpression in granulosa cells during pig terminal folliculardevelopment. Reproduction 2008;136:211–24.

17. Diez-Tascon C, Keane OM, Wilson T, et al. Microarrayanalysis of selection lines from outbred populations to iden-tify genes involved with nematode parasite resistance insheep. Physiol Genomics 2005;21:59–69.

18. Everts RE, Band MR, Liu ZL, et al. A 7872 cDNA micro-array and its use in bovine functional genomics. Vet ImmunolImmunolpathol 2005;105:235–45.

19. Gunther J, Koczan D, Yang W, et al. Assessment of theimmune capacity of mammary epithelial cells: comparisonwith mammary tissue after challenge with Escherichia coli.Vet Res 2009;40:31.

20. Keane OM, Zadissa A, Wilson T, et al. Gene expres-sion profiling of Naive sheep genetically resistant and

susceptible to gastrointestinal nematodes. BMC Genomics2006;7:42.

21. Lehnert SA, Wang YH, Byrne KA. Development andapplication of a bovine cDNA microarray for expressionprofiling of muscle and adipose tissue. AustJ Exp Agr 2004;44:1127–33.

22. Smith GW, Rosa GJ. Interpretation of microarray data:trudging out of the abyss towards elucidation of biologicalsignificance. JAnim Sci 2007;85:E20–3.

23. Spielbauer B, Stahl F. Impact of microarray technologyin nutrition and food research. Mol Nutr Food Res 2005;49:908–17.

24. Abdueva D, Wing MR, Schaub B, et al. Experimental

comparison and evaluation of the affymetrix exon andU133Plus2 GeneChip arrays. PLoSONE 2007;2:e913.

25. Faucon F, Rebours E, Bevilacqua C, et al. Terminal differ-entiation of goat mammary tissue during pregnancy requiresthe expression of genes involved in immune functions.Physiol Genomics 2009;40:61–82.

26. Xie Y, Wang X, Story M. Statistical methods of back-ground correction for Illumina BeadArray data.Bioinformatics 2009;25:751–7.

27. Bertone P, Gerstein M, Snyder M. Applications of DNAtiling arrays to experimental genome annotation andregulatory pathway discovery. Chromosome Res 2005;13:259–74.

28. Kechris K, Yang YH, Yeh RF. Prediction of alternativelyskipped exons and splicing enhancers from exon junctionarrays. BMCGenomics 2008;9:551.

29. Hacia JG, Fan JB, Ryder O, etal. Determination of ancestralalleles for human single-nucleotide polymorphisms usinghigh-density oligonucleotide arrays. Nat Genetics 1999;22:164–7.

30. Wang DG, Fan JB, Siao CJ, et al. Large-scale identification,mapping, and genotyping of single- nucleotide poly-morphisms in the human genome. Science 1998;280:1077–82.

31. Kijas JW, Townley D, Dalrymple BP, et al. A genome widesurvey of SNP variation reveals the genetic structure ofsheep breeds. PLoSONE 2009;4:e4668.

32. Lindblad-Toh K, Tanenbaum DM, Daly MJ, et al.Loss-of-heterozygosity analysis of small-cell lung carcin-omas using single-nucleotide polymorphism arrays.Nat Biotechnol 2000;18:1001–5.

33. Mei R, Galipeau PC, Prass C, etal. Genome-wide detectionof allelic imbalance using human SNPs and high-densityDNA arrays. Genome Res 2000;10:1126–37.

34. Matukumalli LK, Lawley CT, Schnabel RD, et al.Development and characterization of a high density SNPgenotyping assay for cattle. PLoSONE 2009;4:e5350.

35. McKay SD, Schnabel RD, Murdoch BM, et al. Wholegenome linkage disequilibrium maps in cattle.BMCGenetics 2007;8:74.

36. Sargolzaei M, Scnenkel FS, Jansen GB, et al. Extent of link-age disequilibrium in Holstein cattle in North America.J Dairy Sci 2008;91:2106–17.

37. Gautier L, Cope L, Bolstad BM, et al. Affy - Analysis ofaffymetrix GeneChip data at the probe level. Bioinformatics2004;20:307–15.

38. Kapur K, Jiang H, Xing Y, et al. Cross-hybridization mod-eling on Affymetrix exon arrays. Bioinformatics 2008;24:2887–93.

39. Potter DP, Yan P, Huang TH, et al. Probe signal correctionfor differential methylation hybridization experiments.BMCBioinformatics 2008;9:453.

40. Wu Z, Irizarry R, Gentleman R, etal. A model-based back-ground adjustment for oligonucleotide expression arrays.JAm Stat Assoc 2004;99:909–17.

41. Asmann YW, Klee EW, Thompson EA, et al. 30 tag digitalgene expression profiling of human brain and universalreference RNA using Illumina Genome Analyzer.BMCGenomics 2009;10:531.

42. Mortazavi A, Williams BA, McCue K, et al. Mapping andquantifying mammalian transcriptomes by RNA-Seq.NatMethods 2008;5:621–8.

43. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolu-tionary tool for transcriptomics. Nat Rev Genet 2009;10:57–63.

44. Wang ET, Sandberg R, Luo S, et al. Alternative isoformregulation in human tissue transcriptomes. Nature 2008;456:470–6.

45. Allison DB, Cui X, Page GP, et al. Microarray data analysis:from disarray to consolidation and consensus. NatRevGenet2006;7:55–65.

46. Cui X, Churchill GA. Statistical tests for differential expres-sion in cDNA microarray experiments. GenomeBiol 2003;4:210.

47. Benjamini Y, Hochberg Y. Controlling the false discoveryrate: a practical and powerful approach to multiple testing.J Roy Stat Soc Ser B 1995;57:289–300.

48. Diaz-Uriarte R, Alvarez de Andres S. Gene selection andclassification of microarray data using random forest.BMCBioinformatics 2006;7:3.

49. Baird D, Johnstone P, Wilson T. Normalization of micro-array data using a spatial mixed model analysis whichincludes splines. Bioinformatics 2004;20:3196–205.

50. Smyth GK. Limma: Linear models for microarray data. In:Gentleman R, Carey V, Dudoit S, et al (eds). Bioinformatics

page 14 of 16 Roy et al. by guest on D

ecember 4, 2014

http://bfgp.oxfordjournals.org/D

ownloaded from

Page 15: A comparison of analog and Next-Generation transcriptomic tools for mammalian studies

and Computational Biology Solutions Using R and Bioconductor.New York: Springer, 2005:397–420.

51. Wilson CL, Miller CJ. Simpleaffy: a BioConductor packagefor Affymetrix Quality Control and data analysis.Bioinformatics 2005;21:3683–5.

52. Du P, Kibbe WA, Lin SM. lumi: a pipeline for processingIllumina microarray. Bioinformatics 2008;24:1547–8.

53. Kauffmann A, Gentleman R, Huber W. arrayQualityMetrics–a bioconductor package for quality assess-ment of microarray data. Bioinformatics 2009;25:415.

54. Reich M, Liefeld T, Gould J, et al. GenePattern 2.0.Nat Genet 2006;38:500–1.

55. De Groot P, Reiff C, Mayer C, et al. NuGO contributionsto GenePattern. Genes Nutr 2008;3:143–6.

56. Kuehn H, Liberzon A, Reich M, et al. Using GenePatternfor gene expression analysis. Curr Protoc Bioinform 2008:Chapter 7, Unit 7.12.

57. Dallas PB, Gottardo NG, Firth MJ, et al. Gene expressionlevels assessed by oligonucleotide microarray analysis andquantitative real-time RT-PCR - How well do they cor-relate? BMCGenomics 2005;6:59.

58. Bustin SA, Benes V, Garson JA, etal. The MIQE guidelines:minimum information for publication of quantitativereal-time PCR experiments. Clin Chem 2009;55:611–22.

59. Audic S, Claverie JM. The significance of digital geneexpression profiles. GenomeRes 1997;7:986–95.

60. Hyman ED. A new method of sequencing DNA.Anal Biochem 1988;174:423–36.

61. Seo TS, Bai X, Kim DH, etal. Four-color DNA sequencingby synthesis on a chip using photocleavable flourescentnucleotides. Proc Natl Acad Sci USA 2005;102:5926–31.

62. Ozsolak F, Ting D, Wittner B, et al. Amplification-freedigital gene expression profiling from minute cell quantities.NatMethods 2010;7:619–21.

63. Morozova O, Hirst M, Marra MA. Applications ofnew sequencing technologies for transcriptome analysis.Annu RevGenomics HumGenet 2009;10:135–51.

64. Harismendy O, Ng PC, Strausberg RL, et al. Evaluation ofnext generation sequencing platforms for populationtargeted sequencing studies. Genome Biol 2009;10:R32.

65. Mardis ER. The impact of Next-Generation Sequencingtechnology on genetics. Trends Genet 2008;24:133–41.

66. Auer P, Doerge R. Statistical design and analysis of RNAsequencing data. Genetics 2010;185:405–16.

67. Oshlack A, Wakefield MJ. Transcript length bias inRNA-seq data confounds systems biology. Biol Direct2009;4:14.

68. Parkhomchuk D, Borodina T, Amstislavskiy V, et al.Transcriptome analysis by strand-specific sequencing ofcomplementary DNA. Nucleic Acids Res 2009;37:e123.

69. Vivancos A, Guell M, Dohm J, et al. Strand-specific deepsequencing of the transcriptome. Genome Res 2010;20:989–99.

70. Flintoft L. Transcriptomics: Digging deep with RNA-Seq.Nat RevGenet 2008;9:568.

71. Marioni JC, Mason CE, Mane SM, et al. RNA-seq: Anassessment of technical reproducibility and comparisonwith gene expression arrays. Genome Res 2008;18:1509–17.

72. Metzker ML. Sequencing technologies - the next gener-ation. Nat Rev Genet 2010;11:31–46.

73. Pepke S, Wold B, Mortazavi A. Computation for ChIP-seqand RNA-seq studies. NatMethods 2009;6:S22–32.

74. Stegle O, Drewe P, Bohnert R, et al. Statistical tests fordetecting differential RNA-transcript expression fromread counts. Nature Precedings, No. 713. (11 May 2010)doi:10.1038/npre.2010.4437.1.

75. Git A, Dvinge H, Salmon-Divon M, et al. Systematic com-parison of microarray profiling, real-time PCR, andnext-generation sequencing technologies for measuring dif-ferential microRNA expression. RNA 2010;16:991.

76. Willenbrock H, Salomon J, Søkilde R, et al. QuantitativemiRNA expression analysis: comparing microarrays withnext-generation sequencing. RNA 2009;15:2028.

77. Xu AG, He L, Li Z, etal. Intergenic and repeat transcriptionin human, chimpanzee and macaque brains measured byRNA-Seq. PLoSComput Biol 2010;6:e1000843.

78. Tuch BB, Laborde RR, Xu X, et al. Tumor transcrip-tome sequencing reveals allelic expression imbalancesassociated with copy number alterations. PLoSONE 2010;5:e9317.

79. Burgun A, Bodenreider O. Accessing and integrating dataand knowledge for biomedical research. Yearb Med Inform2008:91–101.

80. Stein LD. The case for cloud computing in genome inform-atics. Genome Biol 2010;11:207.

81. Baker M. Next-generation sequencing: adjusting to dataoverload. NatMeth 2010;7:495–9.

82. Robinson MD, Smyth GK. Moderated statistical tests forassessing differences in tag abundance. Bioinformatics 2007;23:2881–7.

83. Trapnell C, Pachter L, Salzberg S. TopHat: discoveringsplice junctions with RNA-Seq. Bioinformatics 2009;25:1105–11.

84. Hammer P, Banck M, Amberg R, et al. mRNA-seqwith agnostic splice site discovery for nervous system tran-scriptomics tested in chronic pain. Genome Res 2010;20:847–60.

85. Robinson MD, Oshlack A. A scaling normalization methodfor differential expression analysis of RNA-seq data. GenomeBiol 2010;11:R25.

86. Robinson MD, McCarthy DJ, Smyth GK. edgeR: abioconductor package for differential expression analysisof digital gene expression data. Bioinformatics 2010;26:139–40.

87. Young MD, Wakefield MJ, Smyth GK, et al. Gene ontol-ogy analysis for RNA-seq: accounting for selection bias.Genome Biol 2010;11:R14.

88. Eid J, Fehr A, Gray J, et al. Real-time DNA sequencingfrom single polymerase molecules. Science 2009;323:133–8.

89. Clarke J, Wu HC, Jayasinghe L, et al. Continuous baseidentification for single-molecule nanopore DNA sequen-cing. Nat Nanotechnol 2009;4:265–70.

90. Deamer DW, Akeson M. Nanopores and nucleic acids:prospects for ultrarapid sequencing. Trends Biotechnol 2000;18:147–51.

91. R: A language and environment for statistical computing.http://www.r-project.org/ (10 January 2011, date lastaccessed).

92. Gentleman RC, Carey VJ, Bates DM, et al. Bioconductor:open software development for computational biology andbioinformatics. Genome Biol 2004;5:R80.

Analog and Next-Generation transcriptomic tools page 15 of 16 by guest on D

ecember 4, 2014

http://bfgp.oxfordjournals.org/D

ownloaded from

Page 16: A comparison of analog and Next-Generation transcriptomic tools for mammalian studies

93. Wang L, Feng Z, Wang X, et al. DEGseq: an R package foridentifying differentially expressed genes from RNA-seqdata. Bioinformatics;26:136–8.

94. Trapnell C, Williams B, Pertea G, et al. Transcript assemblyand quantification by RNA-Seq reveals unannotatedtranscripts and isoform switching during cell differentiation.Nat Biotechnol 2010;28:511–5.

95. Falcon S, Gentleman R. Using GOstats to test genelists for GO term association. Bioinformatics 2007;23:257–8.

96. Prufer K, Muetzel B, Do HH, et al. FUNC: a pack-age for detecting significant associations between genesets and ontological annotations. BMC Bioinformatics2007;8:41.

97. Hosack DA, Dennis G, Jr, Sherman BT, et al. Identifyingbiological themes within lists of genes with EASE.Genome Biol 2003;4:R70.

98. Huang DW, Sherman BT, Lempicki RA. Systematic andintegrative analysis of large gene lists using DAVID bio-informatics resources. Nat Protocols 2009;4:44–57.

99. Zhou X, Su Z. EasyGO: gene ontology-based annotationand functional enrichment analysis tool for agronomicalspecies. BMCGenomics 2007;8:246.

100.Eisen MB, Spellman PT, Brown PO, et al. Cluster analysisand display of genome-wide expression patterns. Proc NatlAcad Sci USA 1998;95:14863–8.

101.Wilkinson L, Friendly M. The history of the cluster heatmap. AmStat 2009;63:179–84.

page 16 of 16 Roy et al. by guest on D

ecember 4, 2014

http://bfgp.oxfordjournals.org/D

ownloaded from