Genomics 63, 341–353 (2000), doi:10.1006/geno.1999.6105, available online at http://www.idealibrary.com
Expression Profile Viewer (ExProView): A Software Tool for Transcriptome Analysis
Magnus Larsson,*,1 Stefan Ståhl,* Mathias Uhlen,* and Anders Wennborg†
*Department of Biotechnology, Royal Institute of Technology (KTH), S-100 44 Stockholm, Sweden; and †Department of Biosciences, Karolinska Institute, Novum, S-141 57 Huddinge, Sweden
Received June 9, 1999; accepted December 21, 1999
A software tool, Expression Profile Viewer (ExProView), for analysis of gene expression profiles derived from expressed sequence tags (ESTs) and SAGE (serial analysis of gene expression) is presented. The software visualizes a complete set of classified transcript data in a two-dimensional array of dots, a “virtual chip,” in which each dot represents a known gene as characterized in the transcript databases Expressed Gene Anatomy Database or UniGene. The virtual chip display can be changed between representations of different conceptual systems for gene/protein classification and grouping. Four alternative projections are currently available: (i) cellular role, (ii) subcellular compartment, (iii) chromosome localization, and (iv) total UniGene display. However, the chip can be adapted to any other desired layout. By selecting dots, further information about the represented genes is obtained from the local database and WWW links. The software thus provides a visualization of global mRNA expression at the descriptive level and guides in the exploration of patterns of functional expression, while maintaining direct access to detailed information on each individual gene. To evaluate the software, public EST and SAGE gene expression data obtained from the Cancer Genome Anatomy Project at the National Center for Biotechnology Information were analyzed and visualized. A demonstration of the software is available at http://www.biochem.kth.se/exproview/. © 2000 Academic Press
INTRODUCTION
Global analysis of the transcriptome (i.e., all mRNAs) and/or the proteome (i.e., all translation products) has been broadly defined as “functional genomics” (Hieter and Boguski, 1997). The recent technological advances that enable analysis of global gene expression at the mRNA level (Velculescu et al., 1995; Bowtell, 1999), coupled with a rapid growth of available databases with such data (Burks, 1999), create new demands on software tools for data management, visualization, comparison, and exploration.
1 To whom correspondence should be addressed at Department of Biotechnology, Royal Institute of Technology (KTH), Teknikringen 34, S-100 44 Stockholm, Sweden. Telephone: +46 8 7908287. Fax: +46 8 245452. E-mail: [email protected].
For organisms with completely sequenced genomes, e.g., Saccharomyces cerevisiae, essentially complete maps of the transcribable sequences are available (Goffeau et al., 1996). This makes it possible to monitor the expression of the majority of mRNAs simultaneously, using probe-hybridization arrays (DeRisi et al., 1997) or expressed sequence tags (ESTs) from cDNA libraries, including the 10-base tags used for SAGE (serial analysis of gene expression) (Velculescu et al., 1997). For the human transcriptome, a number of databases with different degrees of completeness, validity, and annotation exist. It is likely that almost every transcript from the human genome has been sampled at some time in the large multicenter sequencing projects that have generated more than one million ESTs (Benson et al., 1999). However, it has not yet been possible to assemble a comprehensive transcript catalog. The original data archives (e.g., dbEST in GenBank) contain fragmented and partially redundant sequences, and efforts have been made to reduce redundancy and cluster related ESTs together with different algorithms. Examples of such databases are UniGene from the National Center for Biotechnology Information (NCBI) (Boguski and Schuler, 1995), Human Gene Index (HGI) from The Institute for Genomic Research (TIGR) (www.tigr.org), and Merck Gene Index (MGI) (Eckman et al., 1998). For a subset of the human genes, the structure of the transcription unit(s) has been verified, and a more extensive annotation of the gene product structure and function is available. Such sequences have been collected in specialized databases, e.g., the Expressed Gene Anatomy Database (EGAD) (White and Kerlavage, 1996) from TIGR (www.tigr.org). Functional annotation can also be derived through links between individual records in mRNA databases, e.g., UniGene (Boguski and Schuler, 1995), and protein databases, e.g., SWISS-PROT (Bairoch and Apweiler, 1999), established by similarity searching with BLASTX (Altschul et al., 1994).
0888-7543/00 $35.00. Copyright © 2000 by Academic Press. All rights of reproduction in any form reserved.
FIG. 1. Overview of the ExProView concept. Input gene expression data and copies of relevant transcript databases are stored in a local relational database together with results from database matching. ExProView shows the result of database matching in a virtual chip display window, where each gene in the database is represented by a dot, which is color-coded according to the abundance of the corresponding EST in the current data set. A histogram profile of the abundance distribution for the data set and information about selected genes (dots) are also displayed. The dots are clickable with immediate access to the underlying database as well as WWW links to the original databases.
In addition to databases aiming at describing all possible human transcripts based on clusters of ESTs pooled from different tissue sources, several databases with ESTs from individual cDNA libraries are now publicly available. One example is the Cancer Genome Anatomy Project (CGAP) (Strausberg et al., 1997) at NCBI, in which EST data from several cancer types are continuously accumulated. These large gene expression data sets constitute a novel type of bioinformatic data entity, which cannot be readily analyzed with manual methods. Furthermore, there exists a need for repeated analysis of these growing data sets and for comparison between public and proprietary data.
We describe here a software tool, ExProView, that has been designed for efficient analysis of large data sets from mRNA expression studies obtained from EST sequencing, including method variants that rely on very short sequence tags, e.g., SAGE. ExProView is a universal and flexible platform that allows different types of data access and visualization. It reads the output obtained from a local database search for identification of different types of EST tags and is capable of displaying this data set in various formats. The EST expression data are represented on a two-dimensional array of dots, called a “virtual chip,” in which each dot represents a known gene that is color-coded according to the abundance of corresponding ESTs in the current data set. The virtual chip array is subdivided based on a given gene classification scheme (e.g., cellular role, subcellular compartment, or chromosome localization), and the same data set can be viewed using different virtual chip layouts. Custom-made virtual chip layouts can be adapted to meet the specific needs of a particular user.
EST data are digital in nature (as opposed to hybridization data, which are analog) and therefore directly suitable for various statistical analyses, including comparisons of two different data sets (Audic and Claverie, 1997). Functionality for such comparisons is integrated in the ExProView software. Two data sets are normalized to the total tag count in each sample, and the resulting differences are plotted in a color-coded display in the virtual chip. The desired degree of difference to be displayed can be set as boundaries of statistical significance or manually adjusted with a clickable tool in the virtual chip.
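The normalization to total tag count can be illustrated with a minimal Python sketch. This is not the ExProView code (which was written in Java); the function name, gene labels, and tag counts are hypothetical, and only the library sizes (5689 and 5209 ESTs) are taken from the data sets analyzed in this article.

```python
def relative_abundance(counts, total_tags):
    """Convert raw tag counts per gene into fractions of the whole sample."""
    if total_tags <= 0:
        raise ValueError("total_tags must be positive")
    return {gene: n / total_tags for gene, n in counts.items()}


# Hypothetical counts for two genes; library sizes as reported in Methods.
normal = relative_abundance({"geneA": 12, "geneB": 3}, 5689)
tumor = relative_abundance({"geneA": 11, "geneB": 9}, 5209)

# Difference in relative abundance for every gene seen in either sample.
differences = {
    gene: tumor.get(gene, 0.0) - normal.get(gene, 0.0)
    for gene in sorted(set(normal) | set(tumor))
}
```

Working with fractions rather than raw counts keeps a gene from appearing differentially expressed merely because one library was sequenced more deeply than the other.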
The computational analysis of a set of ESTs can be regarded on three levels: qualitative, descriptive, and explorative. The qualitative analysis includes preprocessing to remove known sequence artifacts, scoring the copy number of identical ESTs, and describing abundance classes. In the descriptive analysis, ESTs are identified by gene names by searching against appropriate databases. Explorative analyses aim at deducing regulatory or functional patterns at higher levels of complexity, e.g., metabolic pathways, from the global mRNA expression data. ExProView is a tool for descriptive analysis and serves as a guide for exploration of gene expression patterns, thus providing a basis for subsequent deduction of functional patterns.
FIG. 2. Display of a single EST data set (normal prostate epithelium) after matching to EGAD. The virtual chip shows gene matches to the ESTs as dots, which are color-coded according to abundance, as indicated in the histogram to the right. Genes matched to singleton ESTs are shown in gray. The color-coding can be changed by selecting between linear and logarithmic scales. Specific boundaries for minimum or maximum abundance to be displayed can also be set. Genes/dots can be selected by clicking or dragging on the virtual chip. These selected genes (surrounded by white squares on the virtual chip) are identified by name in the panel below the virtual chip. A miniature representation of the virtual chip is displayed in the lower right-hand corner.
MATERIALS AND METHODS
Systems and hardware. The ExProView program was written in Java and runs as an applet or application on Java-compliant platforms. The program uses Java database connectivity to communicate with a relational database containing input EST data sets, information extracted from downloaded sequence databases, and results from sequence matching (Fig. 1). The database was implemented on an Oracle Workgroup server 7.3 for Windows NT, but is constructed to be easily adaptable to other platforms. ExProView and the database are designed in a flexible manner to allow for new sources of information. Running ExProView in applet mode requires a browser corresponding to Netscape 4.5 or higher and a screen resolution of 800 × 600 or more. On a PC system, a 200 MMX+ processor and at least 64 MB of RAM are recommended. A demonstration of the software applet with access to the data described in this article is available at http://www.biochem.kth.se/exproview/.
Local sequence databases. To construct the local copies of external sequence databases, sequence data from EGAD and UniGene were downloaded as FASTA files and used to produce BLAST databases for similarity searches. Sequences and annotation information for each entry were obtained by parsing the corresponding flat files and importing the relevant features into the local relational database.
For UniGene (Build 89, August 1999), the selected best representative sequence for each cluster in the file Hs_seq_uniq was used, giving a database of 83,240 entries, against which matching of input ESTs was performed. Active hyperlinks to the original database Web sites were established based on information obtained from the Hs_data flat file. The page for each UniGene (Boguski and Schuler, 1995) entry is accessed through its Hs. number. The best BLASTX hits, as specified for a subset of the UniGene entries, are used to classify the UniGene entries according to whether they match a human or, if not, a nonhuman protein sequence. The chromosomal localization, if known, was obtained by record linking of the sequence tagged site (STS) ID numbers for the UniGene entries that have these data and the mapping information in the Radiation Hybrid GB4 database flat files (Deloukas et al., 1998), where every mapped STS is associated with a distance measure (centirays) from the chromosome top. In the cases in which more than one STS was associated with a given UniGene entry, only one of them was used, since they normally map close to each other. The STS number is used to establish a direct link to GeneMap ’98 (Deloukas et al., 1998).
EGAD was similarly downloaded and used for EST matching. The classification of cellular roles for each individual entry was extracted from the EGAD Web site files. Active links to the EGAD Web site (www.tigr.org/tdb/egad/) are based on the HG numbers for each entry.
Analysis of EST and SAGE data. In the present study, two EST data sets and one SAGE data set from CGAP at NCBI were downloaded and analyzed. The EST data sets were matched against UniGene and EGAD by similarity searching with BLASTN (version 2.0.10) using a default cut-off of 1 × e−30. The EST libraries originate from normal human prostate epithelium (NCI_CGAP_Pr1) and invasive prostate cancer (NCI_CGAP_Pr3) and contained 5689 and 5209 ESTs, respectively.
The SAGE data were from a primary colon tumor library, SAGE_Tu102, containing 20,050 tags. As an option for short ESTs, e.g., from SAGE, exact text matching can be performed instead of BLASTN. Input EST sequences are then matched against a sequence stretch of defined length (10 bases in the case of SAGE) immediately 3′ of a given restriction enzyme site that has been preextracted out of each entry in the sequence database. NCBI has derived a mapping of UniGene with this information for the combination of the NlaIII restriction enzyme and a tag length of 10 bases (www.ncbi.nlm.nih.gov/CGAP/). The cross-references between a tag and a UniGene cluster are divided into “full” and “reliable” matches, depending on the degree of sequence consensus among the ESTs forming a given UniGene cluster. The reliable map forms the basis of SAGE-tag matching against UniGene in ExProView.
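The exact text matching option can be sketched in Python as follows. This is an illustration under stated assumptions, not the ExProView implementation: the function names are hypothetical, database sequences are assumed to be given 5′ to 3′ on the sense strand, the 3′-most NlaIII site (recognition sequence CATG) is used as is conventional for SAGE, and tags shared by several entries are simply left unassigned rather than resolved through the NCBI "reliable" map.

```python
NLAIII_SITE = "CATG"  # recognition sequence of the NlaIII anchoring enzyme
TAG_LENGTH = 10       # SAGE tag length used in this article


def extract_tag(sequence, site=NLAIII_SITE, tag_length=TAG_LENGTH):
    """Return the tag immediately 3' of the 3'-most site, or None."""
    seq = sequence.upper()
    pos = seq.rfind(site)
    if pos == -1:
        return None  # entry lacks the restriction site
    start = pos + len(site)
    tag = seq[start:start + tag_length]
    # A site too close to the 3' end yields no full-length tag.
    return tag if len(tag) == tag_length else None


def build_tag_index(entries):
    """Map each preextracted tag to the database entries that carry it."""
    index = {}
    for entry_id, sequence in entries.items():
        tag = extract_tag(sequence)
        if tag is not None:
            index.setdefault(tag, []).append(entry_id)
    return index


def match_tags(tag_counts, index):
    """Assign observed tag counts to entries by exact text matching."""
    matched, unmatched = {}, {}
    for tag, count in tag_counts.items():
        hits = index.get(tag.upper(), [])
        if len(hits) == 1:
            matched[hits[0]] = matched.get(hits[0], 0) + count
        else:
            unmatched[tag] = count  # no hit, or ambiguous between entries
    return matched, unmatched
```

Preextracting the tags once per database build turns each lookup into a dictionary access, which is what makes exact matching attractive for libraries of tens of thousands of tags.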
RESULTS
Overview of the ExProView System
Figure 1 shows the design of the ExProView system. The purpose of ExProView is to provide a visualization and comparison tool for large data sets from mRNA expression studies, which can be obtained from EST sequencing, including high-throughput methods such as SAGE (Velculescu et al., 1995) and pyrosequencing (Ronaghi et al., 1998).
A local relational database is used to store (i) the input sequence data, (ii) characterized gene data (e.g., from EGAD and UniGene), and (iii) results from the matching of mRNA transcription data to the characterized gene data. A set of server-side application modules is used for the import and matching of ESTs to gene data. The ExProView software is implemented as a Java applet or application, which reads the EST expression profile results from the local database and displays this data set in various formats.
The central concept in ExProView is the virtual chip, in which dots representing each known gene or EST cluster in a gene database are displayed in arrays that can be designed to represent different conceptual systems for gene/protein classification and grouping, e.g., metabolic function, subcellular compartment, or chromosome localization. The expression level of each gene is color-coded, yielding a three-dimensional information matrix representing all available qualitative and quantitative data. Depending on the interest of the user, the same data set can thus be displayed according to different classification schemes in different virtual chip layouts, and the relative expression levels of individual entries (genes) in each group are then instantly observable.
The dots are clickable and selectable, with immediate access to corresponding information in the underlying database and through further WWW links to the origin of primary data, e.g., EGAD, UniGene, SWISS-PROT, and PubMed, as applicable. A link to the local laboratory data is also provided. This general software interface design offers the researcher a complete and flexible overview of complex data sets, while allowing rapid access to detailed gene-specific information when required.
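The three kinds of records held in the local relational database (input sequences, characterized genes, and match results) could be organized roughly as in the sketch below. The table and column names are hypothetical, and SQLite is used here only as a self-contained stand-in for the Oracle server accessed through Java database connectivity in the actual system.

```python
import sqlite3

# Hypothetical three-table layout: genes, input ESTs, and EST-to-gene matches.
SCHEMA = """
CREATE TABLE gene (gene_id TEXT PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE est (est_id TEXT PRIMARY KEY, dataset TEXT, sequence TEXT);
CREATE TABLE est_match (
    est_id TEXT REFERENCES est(est_id),
    gene_id TEXT REFERENCES gene(gene_id)
);
"""


def create_database():
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def expression_profile(conn, dataset):
    """Return {gene_id: EST count} for one data set, including zero counts."""
    query = """
        SELECT g.gene_id, COUNT(e.est_id)
        FROM gene AS g
        LEFT JOIN est_match AS m ON m.gene_id = g.gene_id
        LEFT JOIN est AS e ON e.est_id = m.est_id AND e.dataset = ?
        GROUP BY g.gene_id
        ORDER BY g.gene_id
    """
    return dict(conn.execute(query, (dataset,)).fetchall())
```

The outer joins keep genes with no matching ESTs in the result with a count of zero, which mirrors the virtual chip showing a dot for every gene in the database whether or not it was detected.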
Characterization of EST Data
Sequencing of an EST library yields data consisting of tag identity (quality) and abundance (quantity), i.e., the number of times the tag has occurred among the current total number of sequenced tags. The input to ExProView consists of information about the genes identified by ID number after searching in the current database and the abundance of each gene, i.e., the number of ESTs identified as that gene. These ESTs can have different origins: (i) ESTs consisting of sequences generated by conventional sequencing of cDNA library clones, usually longer than 100 bases; (ii) SAGE tags with shorter sequences of 10 bases; and (iii) pyrosequencing tags with sequences of 20 bases or more.
Matching of these types of sequence data to appropriate databases can be achieved either with sequence alignment programs like BLAST (Altschul et al., 1990) or by exact text matching. BLAST searches have the advantage of being capable of classifying the EST even if there are some minor sequence errors in the EST or the database, although the EST may sometimes be misclassified as a closely related gene in the same gene family. Due to the relatively short tag length, sequence data generated by SAGE (and pyrosequencing, in some instances) can instead be classified by performing exact text searches against the characterized gene data sources. This introduces a risk for misclassification in searches against databases that have not been verified by repeated sequencing and where the clone orientation is unknown. This is true for a subset of the EST clusters in UniGene, and data from text matches must be interpreted with caution in these cases. An algorithm for weighing the risk of sequence errors occurring in defined SAGE tags compared to UniGene has recently been described (http://www.ncbi.nlm.nih.gov/CGAP/). This results in two alternative mappings of tags to UniGene clusters, a full tag-to-cluster map and a reliable one, which can be used to make the analysis with different degrees of uncertainty in the individual database matches. The set of unique tag-to-cluster matches in the reliable UniGene map allows a nonredundant SAGE-tag matching in ExProView.
FIG. 3. Matching of a SAGE data set (primary colon tumor) to the entire UniGene database according to the reliable and nonredundant SAGE tag mapping.
After database matching as described above, a subset of the ESTs is likely to remain unidentified. These will be stored and displayed by ExProView in order of abundance. For the short, fixed-length ESTs from SAGE and pyrosequencing, identical tags are counted together to determine their abundance. For longer ESTs with varying sequence read-length, the comparison cannot be based on identity, but must involve a clustering of related ESTs into discrete groups. This can be accomplished by using an assembly program such as the TIGR Assembler (Sutton et al., 1995) or tlcluster (http://ratest.uiowa.edu/) and listing the output as the number of original ESTs that constitute each cluster. In the ExProView display of unknown sequences, each cluster will be represented separately.
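For the fixed-length case, counting identical tags and listing them in order of abundance amounts to a simple tally. The Python sketch below is illustrative only; the function name and the tie-breaking rule (alphabetical order among equally abundant tags) are assumptions. It does not cover the clustering of longer, variable-length ESTs, which requires an assembly program as noted above.

```python
from collections import Counter


def tag_abundance(tags):
    """Count identical fixed-length tags, most abundant first."""
    counts = Counter(tag.strip().upper() for tag in tags)
    # Sort by descending count; ties are broken alphabetically (an assumption).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```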
User Interface: The Virtual Chip
To be able to view the entire data set comprising a transcriptome, it would be desirable to project it on a representation of all possible transcripts, i.e., the transcribable genome. In ExProView, this is accomplished by an array of dots representing the known genes in a given database, where those that have been detected in the analyzed data set are color-coded according to abundance. The visual impression is similar to viewing a microarray or chip with probes to which a labeled sample has been hybridized and yields a signal whose intensity is proportional to the amount of hybridized transcript. This display in ExProView is therefore called a virtual chip.
FIG. 4. Schematic views of different virtual chip layouts. See Results for description.
The virtual chip provides a global view of the complete data set with visual cues other than those available in conventional tables or two-axis scatter diagrams, while maintaining immediate point-and-click access to each individual record in the set. Since all genes in the database are represented by a dot, the display also shows the absence of gene expression, which may be as biologically relevant as presence. The array can be subdivided with frames according to any relevant structural and/or functional grouping of the genes, e.g., by chromosomal localization of the gene or metabolic function of the corresponding protein. Using different array layouts, the same transcriptome data set can thus be viewed projected on different gene classification schemes. With this representation, it is directly possible to extrapolate global functional information from the data set, e.g., how much of the current transcriptome is devoted to ribosomal proteins, whether all genes in a given signal pathway are expressed, or whether one region of a chromosome is more transcriptionally active than another.
For S. cerevisiae, the total set of approximately 6000 genes can be represented by an array of 80 × 80 dots, which is easily visible on a conventional computer screen at 7 × 7 pixels each. The corresponding complete human gene map is expected to be one order of magnitude larger (Fields et al., 1994). However, even this amount of information can be displayed in a matrix that is readily observable at 2 pixels per dot. ExProView has a zoom function that allows a choice of dot size down to 1 pixel. A future visualization of the complete set of human genes is thus feasible in the virtual chip format implemented in ExProView, and the individual dots in such a matrix will be clearly visible on a paper printout in A4 format.
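The screen-size arithmetic behind these figures can be checked with a short Python sketch. The helper name is hypothetical, and a square, unframed array is assumed; real chip layouts are subdivided into category frames and need somewhat more space.

```python
import math


def square_chip(n_genes, pixels_per_dot):
    """Return (dots per side, pixels per side) of the smallest square array."""
    side = math.ceil(math.sqrt(n_genes))
    return side, side * pixels_per_dot


yeast = square_chip(6000, 7)       # about 6000 S. cerevisiae genes at 7 pixels
unigene = square_chip(83240, 2)    # UniGene build 89 clusters at 2 pixels
```

For 6000 genes the smallest square is 78 dots per side (546 pixels at 7 pixels per dot), so the 80 × 80 array quoted above is a rounded-up layout of 560 pixels. The 83,240 UniGene clusters fit in 289 dots per side, or 578 pixels at 2 pixels per dot, which is within the 800 × 600 screen resolution stated in Methods.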
Display of a Single Data Set
A data set derived from a single sample can be viewed in a virtual chip for the purpose of visualizing the expression levels of the genes. The user selects a virtual chip layout and data set from those available in the “Expression Load” dialog box. Figure 2 shows an example of an EST data set from normal prostate tissue (NCI_CGAP_Pr1) displayed on the virtual chip for EGAD, and Fig. 3 shows a SAGE primary colon tumor data set (SAGE_Tu102) projected on a view of the total UniGene database.
The virtual chip, which may extend outside the panel, is displayed in a scrollable and zoomable window. The histogram shows a step-gradient of color corresponding to the level of expression for the genes displayed in the virtual chip. The gradient is scaled in seven steps to enhance the contrast in the comparison. ESTs that occur only once in the data set (singletons) are coded with gray, and the rest of the scale is by default divided in equal steps with a linear or logarithmic scale according to the user’s preference.
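One way to realize such a seven-step gradient is sketched below in Python. The exact binning rule used by ExProView is not given in this article, so the rule here (equal-width steps over the fraction of the maximum count, or over the ratio of logarithms) is an assumption, as are the function name and return values.

```python
import math


def color_step(count, max_count, steps=7, scale="linear"):
    """Return None (not detected), "gray" (singleton), or a step index 0..steps-1."""
    if scale not in ("linear", "log"):
        raise ValueError("scale must be 'linear' or 'log'")
    if count <= 0:
        return None      # gene not detected in the data set
    if count == 1:
        return "gray"    # singleton ESTs get their own color
    if count > max_count:
        raise ValueError("count exceeds max_count")
    if scale == "log":
        position = math.log(count) / math.log(max_count)
    else:
        position = count / max_count
    # The most abundant gene would land on index `steps`; clamp it to the top step.
    return min(steps - 1, int(position * steps))
```

With a maximum of 100 tags, a gene with 2 tags falls in the lowest step on the linear scale but one step higher on the logarithmic scale, which is why the logarithmic option spreads low-abundance genes over more colors.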
When a spot in the virtual chip is selected, a text line appears in the selection panel with information for the gene. The amount of this transcript (number of tags and percentage of total) and its name are followed by links to PubMed, SWISS-PROT, and STS, if available. By clicking the name, a link is opened to the original database Web site (UniGene or EGAD). By selecting the tabs labeled “Dataset” and “VirtualChip,” information about the current data set and virtual chip, respectively, is displayed.
A miniature of the virtual chip (Fig. 2, lower right) is used for orientation when zooming and scrolling. By single-clicking a rectangle in the miniature chip, the name of that category is displayed. By double-clicking it, that part of the chip will be centered in the main display.
To facilitate the localization of particular genes on a virtual chip, ExProView provides a search dialog in which genes can be searched by name or ID. The results from an ExProView session can be printed or sent by e-mail using the report generator dialog (this function is disabled in applet mode because of security restrictions for applet access to the client computer’s resources).
Different Virtual Chips
It is possible to view the same EST data set projected on different virtual chips corresponding to different database classifications. The chips can be designed to correspond to a complete database or any subset that is suitable for the investigator.
Figure 4 shows four currently available virtual chips. The “EGAD-HG” chip (5232 dots) shows the cellular roles in two levels of subcategories. “UniGene total” contains the entire set of 83,240 clusters found in the UniGene build 89 database. “UniGene subcellular” (2596 dots) is a representation of UniGene clusters having similarities to SWISS-PROT (Bairoch and Apweiler, 1999) entries whose subcellular localization could be found by parsing the annotation. “UniGene GB4/STS” shows the order of chromosomal localization for the subset of UniGene clusters having an STS that is mapped in the GeneBridge 4 radiation hybrid maps, in which the ESTs are ordered from top to bottom according to their centiray distance. This chromosome location map is thus only schematic, and the current detailed map can be reached through a direct WWW link.
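The ordering used for the “UniGene GB4/STS” chip can be sketched in Python as below. The input structure and function name are hypothetical. Methods states only that one STS is used when several are linked to a cluster; taking the first one listed is an assumption made here for illustration.

```python
from collections import defaultdict


def chromosome_layout(sts_by_cluster):
    """Order clusters on each chromosome by centiray distance from the top.

    sts_by_cluster maps a cluster ID to a list of (chromosome, centiray) pairs.
    """
    columns = defaultdict(list)
    for cluster_id, sts_list in sts_by_cluster.items():
        if not sts_list:
            continue  # clusters without a mapped STS are left off this chip
        chromosome, centiray = sts_list[0]  # assumption: use the first listed STS
        columns[chromosome].append((centiray, cluster_id))
    return {chrom: [cid for _, cid in sorted(dots)] for chrom, dots in columns.items()}
```

Because only the rank order along each chromosome is kept, the resulting layout is schematic rather than to scale, in line with the description above.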
Differential Display of Two Data Sets
Figure 5 shows the program interface displaying two different data sets that have been searched against EGAD and are shown on the EGAD-HG chip. The data derived from normal prostate tissue shown in Fig. 2 have here been compared to a second EST data set from invasive prostate tumor tissue.
Genes in the virtual chip are displayed in red if they are more abundant, with selected sensitivity, in the result set that was first loaded (i.e., normal prostate), and green if more abundant in the second (i.e., invasive prostate tumor), while genes that are regarded as equally expressed are shown in yellow. Maintaining the virtual chip analogy, this color scheme is conceptually similar to the image data obtained from hybridizations to probe arrays of pairs of RNA samples labeled with red or green fluorescence, where similar expression levels for a given probe appear as yellow color, while differential expression produces shades of red or green. Specific fold intervals of interest can be selected by clicking the histogram categories, thus toggling between red/green and yellow color-coding. Alternatively, the yellow dots can be hidden from view on the virtual chip, thereby allowing easier observation of a defined subset of the data. It is possible to enter boundaries for the range of abundance (tag number) that will be displayed, e.g., to exclude the low-abundance range and show only the genes being represented by five or more copies in both data sets. As an initial screening for differential expression, the “fold increase,” i.e., the normalized ratio of transcript numbers, may be indicative of possible abundance differences, as shown in Fig. 5. The selected genes are displayed as for a single data set except that the tag counts from both samples are shown.
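The red/green/yellow assignment can be expressed as a small Python function. This is a sketch of the rule as described in the text, not the program's code; the function name, the parameter names, and the single `equal_fold` threshold (standing in for the fold intervals toggled in the histogram) are assumptions.

```python
def differential_color(count_first, total_first, count_second, total_second,
                       equal_fold=1.0, min_count=0):
    """Return "red", "green", "yellow", or None (dot not displayed)."""
    if count_first < min_count or count_second < min_count:
        return None  # outside the abundance boundaries entered by the user
    if count_first == 0 and count_second == 0:
        return None  # gene not detected in either data set
    # Cross-multiplying compares the normalized abundances
    # count_first/total_first and count_second/total_second exactly.
    a = count_first * total_second
    b = count_second * total_first
    fold = max(a, b) / min(a, b) if min(a, b) > 0 else float("inf")
    if fold <= equal_fold:
        return "yellow"  # regarded as equally expressed
    return "red" if a > b else "green"
```

With the default `equal_fold=1.0`, only genes with exactly equal normalized abundance are yellow. Calling the function with `min_count=5` reproduces the example of showing only genes represented by five or more copies in both data sets.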
Statistical Analysis of Differential Expression
A stringent quantitative comparative analysis requires a test for statistical significance to separate real from random differences. Different approaches have been used for this type of assessment (Audic and Claverie, 1997; Madden et al., 1997; Chen et al., 1998). ExProView currently implements the method of Audic and Claverie (1997) to score for significance. Each comparison will result in a significance value interval, e.g., a probability between 0.95 and 0.96, and the user can select the lowest level to be shown as being different (red or green) in the display. Figure 6 shows the result of applying a significance threshold of 0.95 to the same data sets as in Fig. 5. In this manner, the range of possibly differentially expressed genes can be interactively narrowed down to a desired number of candidate genes and accompanying risk of false-positives. By combining the settings in the tab-marked histograms, a range of fold differences can be selected, so as to be able to indicate, for example, those entries that are between three- and fivefold different and lie in the significance interval above 0.98. The transcripts that cannot be classified are displayed in a separate virtual chip, as described above, with the same options for display of abundance differences. This provides a basis for identifying differentially expressed genes that have no matches in the currently used database.
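For orientation, the Audic and Claverie (1997) statistic gives the probability of observing y tags for a gene in a second library of N2 tags, given x tags in a first library of N1 tags, as p(y|x) = (N2/N1)^y (x+y)! / [x! y! (1 + N2/N1)^(x+y+1)]. The Python sketch below evaluates this formula and a one-sided upper tail. The function names are hypothetical, and how ExProView converts such probabilities into its displayed significance intervals is not specified in this article.

```python
import math


def ac_probability(x, y, n1, n2):
    """p(y | x) for tag counts x and y in libraries of n1 and n2 tags."""
    r = n2 / n1
    # Evaluated in log space so that large counts do not overflow the factorials.
    log_p = (y * math.log(r)
             + math.lgamma(x + y + 1) - math.lgamma(x + 1) - math.lgamma(y + 1)
             - (x + y + 1) * math.log(1 + r))
    return math.exp(log_p)


def upper_tail(x, y, n1, n2):
    """P(Y >= y | x): chance of seeing y or more tags given x in the other library."""
    below = sum(ac_probability(x, k, n1, n2) for k in range(y))
    return max(0.0, 1.0 - below)
```

For two libraries of equal size and a gene absent from the first (x = 0), the tail probability is 2^-y, i.e., about 0.031 for y = 5 and about 0.008 for y = 7. This appears consistent with the statement in the Discussion that five to seven occurrences are needed for a significant difference from 0 at the 95 and 99% levels.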
Observations in the Test Data Sets
The data sets shown in Figs. 2, 5, and 6 were selected to demonstrate the features of ExProView using publicly available EST data. Although a more profound discussion and biological interpretation of the results from these particular data sets are beyond the scope of this article, certain general observations are noteworthy.
In Fig. 2, the category containing the most abundant mRNAs is the group of ribosomal proteins. Overall, the impression is that the identified genes are distributed among all the functional categories. Only three genes are found in the highest tag count range (red color), which in this analysis corresponds to approximately 1% of the total tags. One of these encodes an expected tissue-associated factor, prostate secreted seminal plasma protein, which was selected in the figure from the “unclassified” category and is expressed at the 1.15% level. Figure 5 shows that this particular gene is expressed at a similar level in the invasive cancer sample, although in this comparative display, most dots are green or red since the stringent criterion “exactly equal” was set for the yellow color-coding. In the two data sets, 238 and 229 genes, respectively, have infinite ratios due to their complete absence from either of the samples, while 248 genes reside in the range of equal or up to 15-fold difference. In Fig. 6, the complexity of the comparison has been drastically reduced by applying a statistical significance criterion. In this example, only 27 genes remain scored as differentially expressed, and the 2 prostate-associated genes selected in Fig. 5 are now yellow-coded and hidden from view. The differentially expressed genes reside in all the major “cellular role” categories with some emphasis on ribosomal proteins. It should be noted that a smaller relative difference will score as statistically significant for genes that are more abundantly expressed, as were the ribosomal genes in this example.
From the input data of approximately 11,000 ESTs, the range of candidate genes in this comparison has thus been limited to fewer than 30 functionally classified genes in the well-characterized EGAD, with the aid of the ExProView software tool.
DISCUSSION
The current rapid increase in available data concerning global mRNA expression patterns leads to a need for software that can accept and process large data sets from various public and proprietary sources for descriptive, comparative, and explorative analyses.
The following design principles have governed the development of ExProView: (i) visualization of the entire data set in one image, (ii) the possibility to view the same data set with different layouts/classification schemes, (iii) the ability to import experimental data of different methodological origins, (iv) immediate access to relevant background information for individual data points, and (v) options for display of differences in expression levels.
The complex data sets that are to be analyzed in transcriptome research create a need for display formats at higher levels of organization, where related genes are grouped together according to different criteria. The virtual chip design in ExProView allows the researcher to view all qualitative and quantitative aspects of the data in one image, which may make it possible to perceive patterns that would not be obvious in a list printout or abundance scatter plot. Furthermore, the same data set can be projected on different virtual chip layouts representing various conceptual classification schemes and thereby indicate patterns of gene function suggested by the expression data.
FIG. 5. Display of two EST data sets, from normal prostate and invasive prostate tumor tissue, that have been searched against EGAD.
FIG. 6. Display of statistical significance. The same two data sets as in Fig. 5 are here shown after selection of a confidence interval threshold of 0.95 for the differences. The nonsignificant matches, indicated in yellow in the histogram, have been chosen not to be shown in the virtual chip display.
With the emergence of several different methods for analysis of global gene expression patterns, it is of importance to be able to study data from various sources with a common visualization tool. ExProView is designed to accept data from different types of cDNA sequencing and project them on the same comparison matrix. Thus, the laboratory may choose different experimental methods depending on the needs of a project and also compare results with data obtained from external sources.
Conventional cDNA sequencing yields an extended (>100 bp) sequence that allows a reliable identification of the clone, but with current sequencing technologies, it is difficult to obtain data from a large enough number of clones to allow a quantitative comparison between libraries. At least five to seven occurrences of a given mRNA sequence tag are needed to ensure a statistically significant difference from 0 (95 and 99% confidence intervals, respectively) in a comparison with a sample from another cDNA library (Audic and Claverie, 1997). This limitation can be overcome in the SAGE (Velculescu et al., 1995) and pyrosequencing (Ronaghi et al., 1998) approaches, in which shorter tags are used to extract information from a much larger number of mRNA molecules per sample, allowing a quantitative comparison for a large proportion of the expressed genes. The SAGE method is based on digestion of the EST library at a given restriction enzyme site (the anchoring enzyme, AE) and capture of the sequence immediately downstream of the AE site closest to the poly(A) tail. A certain loss of information is inherent in this approach, since a fraction of the mRNA pool will not contain the AE site or will have it too close to the poly(A) tail, and the short sequence tag may not always be unique, e.g., among members of a gene family. In pyrosequencing, the original clone is directly available for further sequencing, if additional sequence information is desired, while in the SAGE approach, longer clones may be recovered from the initial cDNA library by tag-directed PCR. These advantages and disadvantages of various experimental methods make it of interest to enable a direct comparison of data obtained from them, as is possible with ExProView.
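The thresholds of five and seven tag occurrences can be reproduced with a short calculation based on the Audic and Claverie (1997) probability p(y | x) of observing y tags in one library given x tags in another. The sketch below assumes two libraries of equal size and a one-sided upper tail; the function names are illustrative and do not come from ExProView.

```python
from math import comb


def ac_prob(y, x, n1=1, n2=1):
    """Audic-Claverie p(y | x): y tags in library 2 (size n2) given x in library 1 (size n1)."""
    r = n2 / n1
    return comb(x + y, x) * r**y / (1 + r) ** (x + y + 1)


def upper_tail(y, x, n1=1, n2=1):
    """P(Y >= y | x), computed as 1 minus the probability of fewer than y tags."""
    return 1.0 - sum(ac_prob(k, x, n1, n2) for k in range(y))


def min_count_vs_zero(alpha):
    """Smallest count whose upper tail against x = 0 falls below alpha (equal library sizes)."""
    y = 1
    while upper_tail(y, 0) >= alpha:
        y += 1
    return y


print(min_count_vs_zero(0.05))  # 5
print(min_count_vs_zero(0.01))  # 7
```

With equal library sizes and x = 0, the tail probability reduces to 2^(-y), so five occurrences give 1/32 (below 0.05) and seven give 1/128 (below 0.01).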
Provided that any systematic bias due to differences between the experimental methods for generating the cDNA libraries is considered to be minor, comparisons of expression levels may be performed at different levels of statistical stringency, depending on the proportion of false positives that is acceptable. ExProView allows a simple representation of fold difference or a display based on computed statistical significance, as well as combinations of both.
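A combined filter on fold difference and statistical significance might look like the following sketch. It is an assumption-laden illustration rather than the ExProView algorithm: the function names, the default thresholds (`min_fold`, `alpha`), and the choice of a one-sided Audic and Claverie (1997) tail in the observed direction of change are all hypothetical.

```python
from math import comb, inf


def fold_difference(x, y, n1, n2):
    """Ratio of relative tag frequencies, library 2 over library 1."""
    f1, f2 = x / n1, y / n2
    if f1 == 0 and f2 == 0:
        return 1.0  # no tags in either library: treat as unchanged
    if f1 == 0:
        return inf  # present only in library 2
    return f2 / f1


def ac_p_value(x, y, n1, n2):
    """One-sided Audic-Claverie tail probability in the observed direction."""
    r = n2 / n1

    def pmf(k):
        return comb(x + k, x) * r**k / (1 + r) ** (x + k + 1)

    lower = sum(pmf(k) for k in range(y + 1))    # P(Y <= y | x)
    upper = 1.0 - sum(pmf(k) for k in range(y))  # P(Y >= y | x)
    return min(lower, upper)


def show_dot(x, y, n1, n2, min_fold=2.0, alpha=0.05):
    """Display a gene only if it passes both the fold and the significance filter."""
    fold = fold_difference(x, y, n1, n2)
    large_change = fold >= min_fold or fold <= 1.0 / min_fold
    return large_change and ac_p_value(x, y, n1, n2) < alpha
```

For example, under these assumptions `show_dot(0, 5, 1000, 1000)` passes both filters, while `show_dot(0, 4, 1000, 1000)` passes the fold filter but fails the significance filter.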
ExProView can also be used as a tool for explorative visualization of other types of transcript data in which a group of mRNA sequences of interest has been identified, e.g., a set of transcripts isolated by differential display (Liang and Pardee, 1992) or representational difference analysis (Hubank and Schatz, 1994). Two additional areas of functionality will be added in the next version of ExProView: multisample comparisons based on cluster analysis (Wen et al., 1998) and import of digitized hybridization array data for qualitative or semiquantitative comparison to EST data.
In summary, ExProView allows a comprehensive, convenient, and customizable display of complex gene expression data sets and may serve as a starting point for explorative functional genomics studies with broad applicability.
ACKNOWLEDGMENTS
This work was supported by grants from the Swedish Research Council for Engineering Sciences (TFR), the Swedish Cancer Society, the Swedish Society for Medicine, the Swedish Radiation Protection Institute, the Research Funds of the Karolinska Institute, and EU Biomed 2, BMH4-CT-2284.
REFERENCES
Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C. (1994). Issues in searching molecular sequence databases. Nat. Genet. 6: 119–129.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215: 403–410.
Audic, S., and Claverie, J. M. (1997). The significance of digital gene expression profiles. Genome Res. 7: 986–995.
Bairoch, A., and Apweiler, R. (1999). The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res. 27: 49–54.
Benson, D. A., Boguski, M. S., Lipman, D. J., Ostell, J., Ouellette, B. F. F., Rapp, B. A., and Wheeler, D. L. (1999). GenBank. Nucleic Acids Res. 27: 12–17.
Boguski, M. S., and Schuler, G. D. (1995). ESTablishing a human transcript map. Nat. Genet. 10: 369–371.
Bowtell, D. D. (1999). Options available—from start to finish—for obtaining expression data by microarray. Nat. Genet. 21: 25–32.
Burks, C. (1999). Molecular biology database list. Nucleic Acids Res. 27: 1–19.
Chen, H., Centola, M., Altschul, S. F., and Metzger, H. (1998). Characterization of gene expression in resting and activated mast cells. J. Exp. Med. 188: 1657–1668.
Deloukas, P., Schuler, G. D., Gyapay, G., Beasley, E. M., Soderlund, C., Rodriguez-Tome, P., Hui, L., Matise, T. C., McKusick, K. B., Beckmann, J. S., Bentolila, S., Bihoreau, M., Birren, B. B., Browne, J., Butler, A., Castle, A. B., Chiannilkulchai, N., Clee, C., Day, P. J., Dehejia, A., Dibling, T., Drouot, N., Duprat, S., Fizames, C., Bentley, D. R., et al. (1998). A physical map of 30,000 human genes. Science 282: 744–746.
DeRisi, J. L., Iyer, V. R., and Brown, P. O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278: 680–686.
Eckman, B. A., Aaronson, J. S., Borkowski, J. A., Bailey, W. J., Elliston, K. O., Williamson, A. R., and Blevins, R. A. (1998). The Merck Gene Index browser: An extensible data integration system for gene finding, gene characterization and EST data mining. Bioinformatics 14: 2–13.
Fields, C., Adams, M. D., White, O., and Venter, J. C. (1994). How many genes in the human genome? Nat. Genet. 7: 345–346.
Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J. D., Jacq, C., Johnston, M., Louis, E. J., Mewes, H. W., Murakami, Y., Philippsen, P., Tettelin, H., and Oliver, S. G. (1996). Life with 6000 genes. Science 274: 546.
Hieter, P., and Boguski, M. (1997). Functional genomics: It’s all how you read it. Science 278: 601–602.
Hubank, M., and Schatz, D. G. (1994). Identifying differences in mRNA expression by representational difference analysis of cDNA. Nucleic Acids Res. 22: 5640–5648.
Liang, P., and Pardee, A. B. (1992). Differential display of eukaryotic messenger RNA by means of the polymerase chain reaction. Science 257: 967–971.
Madden, S. L., Galella, E. A., Zhu, J., Bertelsen, A. H., and Beaudry, G. A. (1997). SAGE transcript profiles for p53-dependent growth regulation. Oncogene 15: 1079–1085.
Ronaghi, M., Uhlen, M., and Nyren, P. (1998). A sequencing method based on real-time pyrophosphate. Science 281: 363.
Strausberg, R. L., Dahl, C. A., and Klausner, R. D. (1997). New opportunities for uncovering the molecular basis of cancer. Nat. Genet. 15 (Spec. No.): 415–416.
Sutton, G. G., White, O., Adams, M. D., and Kerlavage, A. R. (1995). TIGR assembler: A new tool for assembling large shotgun sequencing projects. Genome Sci. Tech. 1: 9–19.
Velculescu, V. E., Zhang, L., Vogelstein, B., and Kinzler, K. W. (1995). Serial analysis of gene expression. Science 270: 484–487.
Velculescu, V. E., Zhang, L., Zhou, W., Vogelstein, J., Basrai, M. A., Bassett, D. E., Jr., Hieter, P., Vogelstein, B., and Kinzler, K. W. (1997). Characterization of the yeast transcriptome. Cell 88: 243–251.
Wen, X., Fuhrman, S., Michaels, G. S., Carr, D. B., Smith, S., Barker, J. L., and Somogyi, R. (1998). Large-scale temporal gene expression mapping of central nervous system development. Proc. Natl. Acad. Sci. USA 95: 334–339.
White, O., and Kerlavage, A. R. (1996). TDB: New databases for biological discovery. Methods Enzymol. 266: 27–40.