Expression Profile Viewer (ExProView): A Software Tool for Transcriptome Analysis

Genomics 63, 341–353 (2000), doi:10.1006/geno.1999.6105, available online at http://www.idealibrary.com. 0888-7543/00 $35.00. Copyright © 2000 by Academic Press. All rights of reproduction in any form reserved.

Magnus Larsson,*,1 Stefan Ståhl,* Mathias Uhlén,* and Anders Wennborg†

*Department of Biotechnology, Royal Institute of Technology (KTH), S-100 44 Stockholm, Sweden; and †Department of Biosciences, Karolinska Institute, Novum, S-141 57 Huddinge, Sweden

1 To whom correspondence should be addressed at Department of Biotechnology, Royal Institute of Technology (KTH), Teknikringen 34, S-100 44 Stockholm, Sweden. Telephone: +46 8 7908287. Fax: +46 8 245452. E-mail: [email protected].

Received June 9, 1999; accepted December 21, 1999

A software tool, Expression Profile Viewer (ExProView), for analysis of gene expression profiles derived from expressed sequence tags (ESTs) and SAGE (serial analysis of gene expression) is presented. The software visualizes a complete set of classified transcript data in a two-dimensional array of dots, a “virtual chip,” in which each dot represents a known gene as characterized in the transcript databases Expressed Gene Anatomy Database or UniGene. The virtual chip display can be changed between representations of different conceptual systems for gene/protein classification and grouping. Four alternative projections are currently available: (i) cellular role, (ii) subcellular compartment, (iii) chromosome localization, and (iv) total UniGene display. However, the chip can be adapted to any other desired layout. By selecting dots, further information about the represented genes is obtained from the local database and WWW links. The software thus provides a visualization of global mRNA expression at the descriptive level and guides in the exploration of patterns of functional expression, while maintaining direct access to detailed information on each individual gene. To evaluate the software, public EST and SAGE gene expression data obtained from the Cancer Genome Anatomy Project at the National Center for Biotechnology Information were analyzed and visualized. A demonstration of the software is available at http://www.biochem.kth.se/exproview/. © 2000 Academic Press

INTRODUCTION

Global analysis of the transcriptome (i.e., all mRNAs) and/or the proteome (i.e., all translation products) has been broadly defined as “functional genomics” (Hieter and Boguski, 1997). The recent technological advances that enable analysis of global gene expression at the mRNA level (Velculescu et al., 1995; Bowtell, 1999), coupled with a rapid growth of available databases with such data (Burks, 1999), create new demands on software tools for data management, visualization, comparison, and exploration.

For organisms with completely sequenced genomes, e.g., Saccharomyces cerevisiae, essentially complete maps of the transcribable sequences are available (Goffeau et al., 1996). This makes it possible to monitor the expression of the majority of mRNAs simultaneously, using probe-hybridization arrays (DeRisi et al., 1997) or expressed sequence tags (ESTs) from cDNA libraries, including the 10-base tags used for SAGE (serial analysis of gene expression) (Velculescu et al., 1997). For the human transcriptome, a number of databases with different degrees of completeness, validity, and annotation exist. It is likely that almost every transcript from the human genome has been sampled some time in the large multicenter sequencing projects that have generated more than one million ESTs (Benson et al., 1999). However, it has not yet been possible to assemble a comprehensive transcript catalog. The original data archives (e.g., dbEST in GenBank) contain fragmented and partially redundant sequences, and efforts have been made to reduce redundancy and cluster related ESTs together with different algorithms. Examples of such databases are UniGene from the National Center for Biotechnology Information (NCBI) (Boguski and Schuler, 1995), Human Gene Index (HGI) from The Institute for Genome Research (TIGR) (www.tigr.org), and Merck Gene Index (MGI) (Eckman et al., 1998). For a subset of the human genes, the structure of the transcription unit(s) has been verified, and a more extensive annotation of the gene product structure and function is available. Such sequences have been collected in specialized databases, e.g., the Expressed Gene Anatomy Database (EGAD) (White and Kerlavage, 1996) from TIGR (www.tigr.org). Functional annotation can also be derived through links between individual records in mRNA databases, e.g., UniGene (Boguski and Schuler, 1995), and protein databases,


e.g., SWISS-PROT (Bairoch and Apweiler, 1999), established by similarity searching with BLASTX (Altschul et al., 1994).

FIG. 1. Overview of the ExProView concept. Input gene expression data and copies of relevant transcript databases are stored in a local relational database together with results from database matching. ExProView shows the result of database matching in a virtual chip display window, where each gene in the database is represented by a dot, which is color-coded according to the abundance of the corresponding EST in the current data set. A histogram profile of the abundance distribution for the data set and information about selected genes (dots) are also displayed. The dots are clickable with immediate access to the underlying database as well as WWW links to the original databases.

In addition to databases aiming at describing all possible human transcripts based on clusters of ESTs pooled from different tissue sources, several databases with ESTs from individual cDNA libraries are now publicly available. One example is the Cancer Genome Anatomy Project (CGAP) (Strausberg et al., 1997) at NCBI, in which EST data from several cancer types are continuously accumulated. These large gene expression data sets constitute a novel type of bioinformatic data entity, which cannot be readily analyzed with manual methods. Furthermore, there exists a need for repeated analysis of these growing data sets and for comparison between public and proprietary data.

We describe here a software tool, ExProView, that has been designed for efficient analysis of large data sets from mRNA expression studies obtained from EST sequencing, including method variants that rely on very short sequence tags, e.g., SAGE. ExProView is a universal and flexible platform that allows different types of data access and visualization. It reads the output obtained from a local database search for identification of different types of EST tags and is capable of displaying this data set in various formats. The EST expression data are represented on a two-dimensional array of dots, called a “virtual chip,” in which each dot represents a known gene that is color-coded according to the abundance of corresponding ESTs in the current data set. The virtual chip array is subdivided based on a given gene classification scheme (e.g., cellular role, subcellular compartment, or chromosome localization), and the same data set can be viewed using different virtual chip layouts. Custom-made virtual chip layouts can be adapted to meet the specific needs of a particular user.

EST data are digital in nature (as opposed to hybridization data, which are analog) and therefore directly suitable for various statistical analyses, including comparisons of two different data sets (Audic and Claverie, 1997). Functionality for such comparisons is integrated in the ExProView software. Two data sets are normalized to the total tag count in each sample, and the resulting differences are plotted in a color-coded display in the virtual chip. The desired degree of difference to be displayed can be set as boundaries of statistical significance or manually adjusted with a clickable tool in the virtual chip.
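The normalization step can be illustrated with a minimal sketch in Python. The function name and the handling of absent genes are assumptions made for illustration and are not taken from the ExProView source, which is written in Java. Each tag count is divided by the total tag count of its library, and the ratio of the two relative abundances gives the fold difference.

```python
def normalized_fold(count_a, total_a, count_b, total_b):
    """Ratio of relative abundances (data set A over data set B).

    Hypothetical helper: returns infinity when the gene is absent from B
    but present in A, and None when it is absent from both.
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("library totals must be positive")
    freq_a = count_a / total_a  # fraction of all tags in library A
    freq_b = count_b / total_b  # fraction of all tags in library B
    if freq_b == 0:
        return float("inf") if freq_a > 0 else None
    return freq_a / freq_b


# Library sizes of 5689 and 5209 ESTs are those of the two prostate data sets
# analyzed in this article; the tag counts 10 and 5 are made up.
fold = normalized_fold(10, 5689, 5, 5209)
```

With these made-up counts the raw ratio is 2, whereas the normalized ratio is somewhat lower (about 1.83) because the first library is larger.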

The computational analysis of a set of ESTs can be regarded on three levels: qualitative, descriptive, and explorative. The qualitative analysis includes preprocessing to remove known sequence artifacts, scoring the copy number of identical ESTs, and describing abundance classes. In the descriptive analysis, ESTs are identified by gene names by searching against appropriate databases. Explorative analyses aim at deducing regulatory or functional patterns at higher levels of complexity, e.g., metabolic pathways, from the global mRNA expression data. ExProView is a tool for descriptive analysis and serves as a guide for exploration of gene expression patterns, thus providing a basis for subsequent deduction of functional patterns.

FIG. 2. Display of a single EST data set (normal prostate epithelium) after matching to EGAD. The virtual chip shows gene matches to the ESTs as dots, which are color-coded according to abundance, as indicated in the histogram to the right. Genes matched to singleton ESTs are shown in gray. The color-coding can be changed by selecting between linear and logarithmic scales. Specific boundaries for minimum or maximum abundance to be displayed can also be set. Genes/dots can be selected by clicking or dragging on the virtual chip. These selected genes (surrounded by white squares on the virtual chip) are identified by name in the panel below the virtual chip. A miniature representation of the virtual chip is displayed in the lower right-hand corner.

MATERIALS AND METHODS

Systems and hardware. The ExProView program was written in Java and runs as an applet or application on Java-compliant platforms. The program uses Java database connectivity to communicate with a relational database containing input EST data sets, information extracted from downloaded sequence databases, and results from sequence matching (Fig. 1). The database was implemented on an Oracle Workgroup server 7.3 for Windows NT, but is constructed to be easily adaptable to other platforms. ExProView and the database are designed in a flexible manner to allow for new sources of information. Running ExProView in applet mode requires a browser corresponding to Netscape 4.5 or higher and a screen resolution of 800 × 600 or more. On a PC system, a 200 MMX+ processor and at least 64 MB of RAM are recommended. A demonstration of the software applet with access to the data described in this article is available at http://www.biochem.kth.se/exproview/.

Local sequence databases. To construct the local copies of external sequence databases, sequence data from EGAD and UniGene were downloaded as FASTA files and used to produce BLAST databases for similarity searches. Sequences and annotation information for each entry were obtained by parsing the corresponding flat files and importing the relevant features into the local relational database.

For UniGene (Build 89, August 1999), the selected best representative sequence for each cluster in the file Hs_seq_uniq was used, giving a database of 83,240 entries, against which matching of input ESTs was performed. Active hyperlinks to the original database Web sites were established based on information obtained from the Hs_data flat file. The page for each UniGene (Boguski and Schuler, 1995) entry is accessed through its Hs. number. The best BLASTX hits, as specified for a subset of the UniGene entries, are used to classify the UniGene entries according to whether they match a human or, if not, a nonhuman protein sequence. The chromosomal localization, if known, was obtained by record linking of the sequence tagged site (STS) ID numbers for the UniGene entries that have these data and the mapping information in the Radiation Hybrid GB4 database flat files (Deloukas et al., 1998), where every mapped STS is associated with a distance measure (centirays) from the chromosome top. In the cases in which more than one STS was associated with a given UniGene entry, only one of them was used, since they normally map close to each other. The STS number is used to establish a direct link to GeneMap ’98 (Deloukas et al., 1998).

EGAD was similarly downloaded and used for EST matching. The classification of cellular roles for each individual entry was extracted from the EGAD Web site files. Active links to the EGAD Web site (www.tigr.org/tdb/egad/) are based on the HG numbers for each entry.

Analysis of EST and SAGE data. In the present study, two EST data sets and one SAGE data set from CGAP at NCBI were downloaded and analyzed. The EST data sets were analyzed with BLASTN against UniGene and EGAD. Similarity search matching was performed with BLASTN (version 2.0.10) and a default cut-off of 1 × e−30. The EST libraries originate from normal human prostate epithelium (NCI_CGAP_Pr1) and invasive prostate cancer (NCI_CGAP_Pr3) and contained 5689 and 5209 ESTs, respectively.

The SAGE data were from a primary colon tumor library SAGE_Tu102 containing 20,050 tags. As an option for short ESTs, e.g., from SAGE, exact text matching can be performed instead of BLASTN. Input EST sequences are then matched against a sequence stretch of defined length (10 bases in the case of SAGE) immediately 3′ of a given restriction enzyme site that has been preextracted out of each entry in the sequence database. NCBI has derived a mapping of UniGene with this information for the combination of the NlaIII restriction enzyme and a tag length of 10 bases (www.ncbi.nlm.nih.gov/CGAP/). The cross-references between a tag and a UniGene cluster are divided into “full” and “reliable” matches, depending on the degree of sequence consensus among the ESTs forming a given UniGene cluster. The reliable map forms the basis of SAGE-tag matching against UniGene in ExProView.
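The preextraction of tags for exact text matching can be sketched as follows. This is an illustrative Python sketch and not the ExProView or NCBI implementation. NlaIII recognizes the sequence CATG; the choice of the 3′-most site in each entry is an assumption based on common SAGE practice, since the text above only speaks of “a given restriction enzyme site.” The function names and the example sequences are hypothetical.

```python
NLAIII_SITE = "CATG"  # recognition sequence of NlaIII
TAG_LENGTH = 10       # tag length used for SAGE in this article


def extract_sage_tag(sequence, site=NLAIII_SITE, tag_length=TAG_LENGTH):
    """Return the tag immediately 3' of the 3'-most site, or None.

    None is returned when the sequence has no site or when fewer than
    tag_length bases follow the site.
    """
    seq = sequence.upper()
    pos = seq.rfind(site)  # assumption: use the 3'-most site
    if pos == -1:
        return None
    start = pos + len(site)
    tag = seq[start:start + tag_length]
    return tag if len(tag) == tag_length else None


def build_tag_index(entries):
    """Map each preextracted tag to the IDs of the entries that carry it."""
    index = {}
    for entry_id, seq in entries.items():
        tag = extract_sage_tag(seq)
        if tag is not None:
            index.setdefault(tag, []).append(entry_id)
    return index


# Made-up sequences with hypothetical entry IDs.
index = build_tag_index({
    "entry_1": "GGGCATGAAACCCGGGTTTAAAA",
    "entry_2": "ACATGAC",  # too short downstream of the site, so no tag
})
matches = index.get("AAACCCGGGT", [])
```

An observed tag is then looked up in the index by exact string comparison. A tag that maps to several entries is ambiguous, which is one reason a nonredundant tag-to-cluster map is useful.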

RESULTS

Overview of the ExProView System

Figure 1 shows the design of the ExProView system. The purpose of ExProView is to provide a visualization and comparison tool for large data sets from mRNA expression studies, which can be obtained from EST sequencing, including high-throughput methods such as SAGE (Velculescu et al., 1995) and pyrosequencing (Ronaghi et al., 1998).

A local relational database is used to store (i) the input sequence data, (ii) characterized gene data (e.g., from EGAD and UniGene), and (iii) results from the matching of mRNA transcription data to the characterized gene data. A set of server-sided application modules is used for the import and matching of ESTs to gene data. The ExProView software is implemented as a Java applet or application, which reads the EST expression profile results from the local database and displays this data set in various formats.

The central concept in ExProView is the virtual chip, in which dots representing each known gene or EST cluster in a gene database are displayed in arrays that can be designed to represent different conceptual systems for gene/protein classification and grouping, e.g., metabolic function, subcellular compartment, or chromosome localization. The expression level of each gene is color-coded, yielding a three-dimensional information matrix representing all available qualitative and quantitative data. Depending on the interest of the user, the same data set can thus be displayed according to different classification schemes in different virtual chip layouts, and the relative expression levels of individual entries (genes) in each group are then instantly observable.

The dots are clickable and selectable, with immediate access to corresponding information in the underlying database and through further WWW links to the origin of primary data, e.g., EGAD, UniGene, SWISS-PROT, and PubMed, as applicable. A link to the local laboratory data is also provided. This general software interface design offers the researcher a complete and flexible overview of complex data sets, while allowing rapid access to detailed gene-specific information when required.

Characterization of EST Data

Sequencing of an EST library yields data consisting of tag identity (quality) and abundance (quantity), i.e., the number of times the tag has occurred among the current total number of sequenced tags. The input to ExProView consists of information about the genes identified by ID number after searching in the current database and the abundance of each gene, i.e., the number of ESTs identified as that gene. These ESTs can have different origins: (i) ESTs consisting of sequences generated by conventional sequencing of cDNA library clones, usually longer than 100 bases; (ii) SAGE tags with shorter sequences of 10 bases; and (iii) pyrosequencing tags with sequences of 20 bases or more.

Matching of these types of sequence data to appropriate databases can be achieved either with sequence alignment programs like BLAST (Altschul et al., 1990) or by exact text matching. BLAST searches have the advantage of being capable of classifying the EST even if there are some minor sequence errors in the EST or the database, although the EST may sometimes be misclassified as a closely related gene in the same gene family. Due to the relatively short tag length, sequence data generated by SAGE (and pyrosequencing, in some instances) can instead be classified by performing exact text searches against the characterized gene data sources. This introduces a risk for misclassification in searches against databases that have not been verified by repeated sequencing and where the clone orientation is unknown. This is true for a subset of the EST clusters in UniGene, and data from text matches must be interpreted with caution in these cases. An algorithm for weighing the risk of sequence errors occurring in defined SAGE tags compared to UniGene has recently been described (http://www.ncbi.nlm.nih.gov/CGAP/). This results in two alternative mappings of tags to UniGene clusters, one with a full tag-to-cluster map and one reliable, which can be used to make the analysis with different degrees of uncertainty in the individual database matches. The set of unique tag-to-cluster matches in the reliable UniGene map allows a nonredundant SAGE-tag matching in ExProView.

After database matching as described above, a subset of the ESTs is likely to remain unidentified. These will be stored and displayed by ExProView in order of abundance. For the short, fixed-length ESTs from SAGE and pyrosequencing, identical tags are counted together to determine their abundance. For longer ESTs with varying sequence read-length, the comparison cannot be based on identity, but must involve a clustering of related ESTs into discrete groups. This can be accomplished by using an assembly program such as the TIGR Assembler (Sutton et al., 1995) or tlcluster (http://ratest.uiowa.edu/) and listing the output as the number of original ESTs that constitute each cluster. In the ExProView display of unknown sequences, each cluster will be represented separately.

FIG. 3. Matching of a SAGE data set (primary colon tumor) to the entire UniGene database according to the reliable and nonredundant SAGE tag mapping.

User Interface: The Virtual Chip

To be able to view the entire data set comprising a transcriptome, it would be desirable to project it on a representation of all possible transcripts, i.e., the transcribable genome. In ExProView, this is accomplished by an array of dots representing the known genes in a given database, where those that have been detected in the analyzed data set are color-coded according to abundance. The visual impression is similar to viewing a microarray or chip with probes to which a labeled sample has been hybridized, yielding a signal whose intensity is proportional to the amount of hybridized transcript. This display in ExProView is therefore called a virtual chip.

The virtual chip provides a global view of the complete data set with visual cues other than those available in conventional tables or two-axis scatter diagrams, while maintaining immediate point-and-click access to each individual record in the set. Since all genes in the database are represented by a dot, the display also shows the absence of gene expression, which may be as biologically relevant as presence. The array can be subdivided with frames according to any relevant structural and/or functional grouping of the genes, e.g., by chromosomal localization of the gene or metabolic function of the corresponding protein. Using different array layouts, the same transcriptome data set can thus be viewed projected on different gene classification schemes. With this representation, it is directly possible to extrapolate global functional information from the data set, e.g., how much of the current transcriptome is devoted to ribosomal proteins, whether all genes in a given signal pathway are expressed, or whether one region of a chromosome is more transcriptionally active than another.

FIG. 4. Schematic views of different virtual chip layouts. See Results for description.

For S. cerevisiae, the total set of approximately 6000 genes can be represented by an array of 80 × 80 dots, which is easily visible on a conventional computer screen at 7 × 7 pixels each. The corresponding complete human gene map is expected to be one order of magnitude larger (Fields et al., 1994). However, even this amount of information can be displayed in a matrix that is readily observable at 2 pixels per dot. ExProView has a zoom function that allows a choice of dot size down to 1 pixel. A future visualization of the complete set of human genes is thus feasible in the virtual chip format implemented in ExProView, and the individual dots in such a matrix will be clearly visible on a paper printout in A4 format.
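The screen-size arithmetic behind these figures can be checked with a short Python sketch. It assumes a plain square, row-major array without category frames, which is a simplification of the actual chip layouts; the function names are hypothetical.

```python
import math


def square_chip_side(n_genes):
    """Smallest side length of a square dot array that holds n_genes dots."""
    if n_genes <= 0:
        return 0
    return math.isqrt(n_genes - 1) + 1  # integer ceiling of sqrt(n_genes)


def chip_pixels(n_genes, pixels_per_dot):
    """Side length in pixels of the square array at a given dot size."""
    return square_chip_side(n_genes) * pixels_per_dot


def dot_position(index, side):
    """Row and column of the dot with a given 0-based index (row-major)."""
    return divmod(index, side)


yeast_side = square_chip_side(6000)      # 78, so an 80 x 80 array has room to spare
yeast_pixels = 80 * 7                    # 560 pixels per side at 7 pixels per dot
unigene_pixels = chip_pixels(83240, 2)   # 578 pixels per side at 2 pixels per dot
```

Under these assumptions, the 83,240 UniGene clusters of build 89 fit in a 289 × 289 array, which at 2 pixels per dot is within the 800 × 600 minimum resolution in width, although it needs scrolling or zooming in height.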

Display of a Single Data Set

A data set derived from a single sample can be viewed in a virtual chip for the purpose of visualizing the expression levels of the genes. The user selects a virtual chip layout and data set from those available in the “Expression Load” dialog box. Figure 2 shows an example of an EST data set from normal prostate tissue (NCI_CGAP_Pr1) displayed on the virtual chip for EGAD, and Fig. 3 shows a SAGE primary colon tumor data set (SAGE_Tu102) projected on a view of the total UniGene database.

The virtual chip, which may extend outside the panel, is displayed in a scrollable and zoomable window. The histogram shows a step-gradient of color corresponding to the level of expression for the genes displayed in the virtual chip. The gradient is scaled in seven steps to enhance the contrast in the comparison. ESTs that occur only once in the data set (singletons) are coded with gray, and the rest of the scale is by default divided in equal steps with a linear or logarithmic scale according to the user’s preference.
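One way to realize such a step-gradient is sketched below in Python. The sketch assumes that the gray singleton class is one of the seven steps and that the remaining six steps divide the range from 2 tags up to the largest count in the data set; the text does not state either point, and the function name is hypothetical.

```python
import math


def abundance_step(count, max_count, steps=6, log_scale=False):
    """Return 0 for singletons (gray), otherwise a color step in 1..steps.

    Counts from 2 to max_count are divided into equal-width steps on a
    linear or logarithmic scale (assumed layout, see the text above).
    """
    if count < 1 or max_count < count:
        raise ValueError("need 1 <= count <= max_count")
    if count == 1:
        return 0  # singleton, shown in gray
    lo, hi, x = 2.0, float(max_count), float(count)
    if log_scale:
        lo, hi, x = math.log(lo), math.log(hi), math.log(x)
    if hi == lo:
        return steps  # only one non-singleton abundance level exists
    fraction = (x - lo) / (hi - lo)
    return min(steps, int(fraction * steps) + 1)


# A gene with 10 tags in a data set whose most abundant gene has 65 tags
# (both numbers are made up for illustration).
linear_step = abundance_step(10, 65)
log_step = abundance_step(10, 65, log_scale=True)
```

In this example the gene falls in the lowest colored step on the linear scale and in the third step on the logarithmic scale, which shows how the logarithmic option spreads out the many low-abundance genes.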

When a spot in the virtual chip is selected, a text line appears in the selection panel with information for the gene. The amount of this transcript (number of tags and percentage of total) and its name are followed by links to PubMed, SWISS-PROT, and STS, if available. Clicking the name opens a link to the original database Web site (UniGene or EGAD). Selecting the tabs labeled “Dataset” and “VirtualChip” displays information about the current data set and virtual chip, respectively.

A miniature of the virtual chip (Fig. 2, lower right) is used for orientation when zooming and scrolling. Single-clicking a rectangle in the miniature chip displays the name of that category. Double-clicking it centers that part of the chip in the main display.

To facilitate the localization of particular genes on a virtual chip, ExProView provides a search dialog in which genes can be searched by name or ID. The results from an ExProView session can be printed or sent by e-mail using the report generator dialog (this function is disabled in applet mode because of security restrictions for applet access to the client computer’s resources).

Different Virtual Chips

It is possible to view the same EST data set projected on different virtual chips corresponding to different database classifications. The chips can be designed to correspond to a complete database or any subset that is suitable for the investigator.

Figure 4 shows four currently available virtual chips. The “EGAD-HG” chip (5232 dots) shows the cellular roles in two levels of subcategories. “UniGene total” contains the entire set of 83,240 clusters found in the UniGene build 89 database. “UniGene subcellular” (2596 dots) is a representation of UniGene clusters having similarities to SWISS-PROT (Bairoch and Apweiler, 1999) entries whose subcellular localization could be found by parsing the annotation. “UniGene GB4/STS” shows the order of chromosomal localization for the subset of UniGene clusters having an STS that is mapped in the GeneBridge 4 radiation hybrid maps, in which the ESTs are ordered from top to bottom according to their centiray distance. This chromosome location map is thus only schematic, and the current detailed map can be reached through a direct WWW link.

Differential Display of Two Data Sets

Figure 5 shows the program interface displaying two different data sets that have been searched against EGAD and are shown on the EGAD-HG chip. The data derived from normal prostate tissue shown in Fig. 2 have here been compared to a second EST data set from invasive prostate tumor tissue.

Genes in the virtual chip are displayed in red if they are more abundant, with selected sensitivity, in the result set that was first loaded (i.e., normal prostate), and green if more abundant in the second (i.e., invasive prostate tumor), while genes that are regarded as equally expressed are shown in yellow. Maintaining the virtual chip analogy, this color scheme is conceptually similar to the image data obtained from hybridizations to probe arrays of pairs of RNA samples labeled with red or green fluorescence, where similar expression levels for a given probe appear as yellow color, while differential expression produces shades of red or green. Specific fold intervals of interest can be selected by clicking the histogram categories, thus toggling between red/green and yellow color-coding. Alternatively, the yellow dots can be hidden from view on the virtual chip, thereby allowing easier observation of a defined subset of the data. It is possible to enter boundaries for the range of abundance (tag number) that will be displayed, e.g., to exclude the low-abundance range and show only the genes being represented by five or more copies in both data sets. As an initial screening for differential expression, the “fold increase,” i.e., the normalized ratio of transcript numbers, may be indicative of possible abundance differences, as shown in Fig. 5. The selected genes are displayed as for a single data set except that the tag counts from both samples are shown.

Statistical Analysis of Differential Expression

A stringent quantitative comparative analysis requires a test for statistical significance to separate real from random differences. Different approaches have been used for this type of assessment (Audic and Claverie, 1997; Madden et al., 1997; Chen et al., 1998).

ExProView currently implements the method of Audic and Claverie (1997), to score for significance. Each comparison will result in a significance value interval, e.g., a probability between 0.95 and 0.96, and the user can select the lowest level to be shown as being different (red or green) in the display. Figure 6 shows the result of applying a significance threshold of 0.95 to the same data sets as in Fig. 5. In this manner, the range of possibly differentially expressed genes can be interactively narrowed down to a desired number of candidate genes and accompanying risk of false-positives. By combining the settings in the tab-marked histograms, a range of fold differences can be selected, so as to be able to indicate, for example, those entries that are between three- and fivefold different and lie in the significance interval above 0.98. The transcripts that cannot be classified are displayed in a separate virtual chip, as described above, with the same options for display of abundance differences. This provides a basis for identifying differentially expressed genes that have no matches in the currently used database.
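A minimal sketch of a significance score in the spirit of Audic and Claverie (1997) is given below. It uses their published probability of observing y tags in a library of size N2 given x tags in a library of size N1. The function names, the one-sided tail, and the conversion to a significance value as one minus the tail probability are assumptions made here; the text does not specify how ExProView computes its intervals.

```python
from math import exp, lgamma, log


def ac_probability(x, y, n1, n2):
    """p(y | x) for tag counts x and y from libraries of n1 and n2 tags.

    p(y | x) = (n2/n1)^y * (x + y)! / (x! * y! * (1 + n2/n1)^(x + y + 1)),
    evaluated in log space so that large counts do not overflow.
    """
    ratio = n2 / n1
    log_p = (
        y * log(ratio)
        + lgamma(x + y + 1) - lgamma(x + 1) - lgamma(y + 1)
        - (x + y + 1) * log(1 + ratio)
    )
    return exp(log_p)


def ac_tail(x, y, n1, n2):
    """One-sided probability of a count at least as extreme as y, given x."""
    lower = sum(ac_probability(x, k, n1, n2) for k in range(y + 1))  # P(Y <= y)
    upper = 1.0 - lower + ac_probability(x, y, n1, n2)  # P(Y >= y)
    return min(lower, upper, 1.0)


def is_significant(x, y, n1, n2, threshold=0.95):
    """True if the significance value (1 - tail) reaches the chosen threshold."""
    return 1.0 - ac_tail(x, y, n1, n2) >= threshold
```

The log-gamma form is a design choice for numerical stability: the factorials in the formula overflow quickly for abundant transcripts if computed directly. A user-selected threshold such as 0.95 or 0.98 then maps onto the `threshold` argument.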

Observations in the Test Data Sets

The data sets shown in Figs. 2, 5, and 6 were selected to demonstrate the features of ExProView using publicly available EST data. Although a more profound discussion and biological interpretation of the results from these particular data sets are beyond the scope of this article, certain general observations are noteworthy.

In Fig. 2, the category containing the most abundant mRNAs is the group of ribosomal proteins. Overall, the impression is that the identified genes are distributed among all the functional categories. Only three genes are found in the highest tag count range (red color), which in this analysis corresponds to approximately 1% of the total tags. One of these encodes an expected tissue-associated factor, prostate secreted seminal plasma protein, which was selected in the figure from the “unclassified” category and is expressed at the 1.15% level. Figure 5 shows that this particular gene is expressed at a similar level in the invasive cancer sample, although in this comparative display, most dots are green or red since the stringent criterion “exactly equal” was set for the yellow color-coding. In the two data sets, 238 and 229 genes, respectively, have infinite ratios due to their complete absence from either of the samples, while 248 genes reside in the range of equal or up to 15-fold difference. In Fig. 6, the complexity of the comparison has been drastically reduced by applying a statistical significance criterion. In this example, only 27 genes remain scored as differentially expressed, and the 2 prostate-associated genes selected in Fig. 5 are now yellow-coded and hidden from view. The differentially expressed genes reside in all the major “cellular role” categories with some emphasis on ribosomal proteins. It should be noted that a smaller relative difference will score as statistically significant for genes that are more abundantly expressed, as were the ribosomal genes in this example.

From the input data of approximately 11,000 ESTs, the range of candidate genes in this comparison has thus been limited to fewer than 30 functionally classified genes in the well-characterized EGAD, with the aid of the ExProView software tool.

DISCUSSION

The current rapid increase in available data concerning global mRNA expression patterns leads to a need for software that can accept and process large data sets from various public and proprietary sources for descriptive, comparative, and explorative analyses.

The following design principles have governed the development of ExProView: (i) visualization of the entire data set in one image, (ii) the possibility to view the same data set with different layouts/classification schemes, (iii) the ability to import experimental data of different methodological origins, (iv) immediate access to relevant background information for individual data points, and (v) options for display of differences in expression levels.

The complex data sets that are to be analyzed in transcriptome research create a need for display formats at higher levels of organization, where related genes are grouped together according to different criteria. The virtual chip design in ExProView allows the researcher to view all qualitative and quantitative aspects of the data in one image, which may make it possible to perceive patterns that would not be obvious in a list printout or abundance scatter plot. Furthermore, the same data set can be projected on different virtual chip layouts representing various conceptual classification schemes and thereby indicate patterns of gene function suggested by the expression data.

FIG. 5. Display of two EST data sets, from normal prostate and invasive prostate tumor tissue, that have been searched against EGAD.

FIG. 6. Display of statistical significance. The same two data sets as in Fig. 5 are here shown after selection of a confidence interval threshold of 0.95 for the differences. The nonsignificant matches, indicated in yellow in the histogram, have been chosen not to be shown in the virtual chip display.

With the emergence of several different methods for analysis of global gene expression patterns, it is of importance to be able to study data from various sources with a common visualization tool. ExProView is designed to accept data from different types of cDNA sequencing and project them on the same comparison matrix. Thus, the laboratory may choose different experimental methods depending on the needs of a project and also compare results with data obtained from external sources.

Conventional cDNA sequencing yields an extended (>100 bp) sequence that allows a reliable identification of the clone, but with current sequencing technologies, it is difficult to obtain data from a large enough number of clones to allow a quantitative comparison between libraries. At least five to seven occurrences of a given mRNA sequence tag are needed to ensure a statistically significant difference from 0 (95 and 99% confidence intervals, respectively) in a comparison with a sample from another cDNA library (Audic and Claverie, 1997). This aspect can be overcome in the SAGE (Velculescu et al., 1995) and pyrosequencing (Ronaghi et al., 1998) approaches, in which shorter tags are used to extract information from a much larger number of mRNA molecules per sample, allowing a quantitative comparison for a large proportion of the expressed genes. The SAGE method is based on digestion of the EST library at a given restriction enzyme site (the anchoring enzyme, AE) and capture of the sequence immediately downstream of the AE site closest to the poly(A) tail. A certain loss of information is inherent in this approach, since a fraction of the mRNA pool will not contain the AE site or will have it too close to the poly(A) tail, and the short sequence tag may not always be unique, e.g., among members of a gene family. In pyrosequencing, the original clone is directly available for further sequencing, if additional sequence information is desired, while in the SAGE approach, longer clones may be recovered from the initial cDNA library by tag-directed PCR. These advantages and disadvantages of various experimental methods make it of interest to enable a direct comparison of data obtained from them, as is possible with ExProView. Provided that any systematic bias due to differences between the experimental methods for generating the cDNA libraries is considered to be minor, comparisons of expression levels may be performed at different levels of statistical stringency, depending on the proportion of false-positives that are acceptable. ExProView allows a simple representation of fold difference or a display based on computed statistical significance, as well as combinations of both.
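The requirement of five to seven tag occurrences quoted above can be checked against the distribution of Audic and Claverie (1997). The short derivation below assumes equal library sizes ($N_1 = N_2$) and a one-sided tail; both assumptions are made here for illustration and are not stated in the text.

```latex
p(y \mid x) = \left(\frac{N_2}{N_1}\right)^{y}
              \frac{(x+y)!}{x!\,y!\,\left(1+\frac{N_2}{N_1}\right)^{x+y+1}}
\quad\Longrightarrow\quad
p(y \mid 0) = 2^{-(y+1)} \quad (N_1 = N_2),
\qquad
P(Y \ge k \mid x = 0) = \sum_{y=k}^{\infty} 2^{-(y+1)} = 2^{-k}.
```

Under these assumptions, $2^{-5} \approx 0.031 < 0.05$ whereas $2^{-4} = 0.0625 > 0.05$, so five occurrences are the minimum at the 95% level; likewise $2^{-7} \approx 0.0078 < 0.01$ whereas $2^{-6} \approx 0.0156 > 0.01$, so seven are needed at the 99% level.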

ExProView can also be used as a tool for explorative visualization of other types of transcript data in which a group of mRNA sequences of interest has been identified, e.g., a set of transcripts isolated by differential display (Liang and Pardee, 1992) or representational difference analysis (Hubank and Schatz, 1994). Two additional areas of functionality will be added in the next version of ExProView: multisample comparisons based on cluster analysis (Wen et al., 1998) and import of digitized hybridization array data for qualitative or semiquantitative comparison to EST data.

In summary, ExProView allows a comprehensive, convenient, and customizable display of complex gene expression data sets and may serve as a starting point for explorative functional genomics studies with broad applicability.

ACKNOWLEDGMENTS

This work was supported by grants from the Swedish Research Council for Engineering Sciences (TFR), the Swedish Cancer Society, the Swedish Society for Medicine, the Swedish Radiation Protection Institute, the Research Funds of the Karolinska Institute, and EU Biomed 2, BMH4-CT-2284.

REFERENCES

Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C. (1994). Issues in searching molecular sequence databases. Nat. Genet. 6: 119–129.

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215: 403–410.

Audic, S., and Claverie, J. M. (1997). The significance of digital gene expression profiles. Genome Res. 7: 986–995.

Bairoch, A., and Apweiler, R. (1999). The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res. 27: 49–54.

Benson, D. A., Boguski, M. S., Lipman, D. J., Ostell, J., Ouellette, B. F. F., Rapp, B. A., and Wheeler, D. L. (1999). GenBank. Nucleic Acids Res. 27: 12–17.

Boguski, M. S., and Schuler, G. D. (1995). ESTablishing a human transcript map. Nat. Genet. 10: 369–371.

Bowtell, D. D. (1999). Options available—from start to finish—for obtaining expression data by microarray. Nat. Genet. 21: 25–32.

Burks, C. (1999). Molecular biology database list. Nucleic Acids Res. 27: 1–19.

Chen, H., Centola, M., Altschul, S. F., and Metzger, H. (1998). Characterization of gene expression in resting and activated mast cells. J. Exp. Med. 188: 1657–1668.

Deloukas, P., Schuler, G. D., Gyapay, G., Beasley, E. M., Soderlund, C., Rodriguez-Tome, P., Hui, L., Matise, T. C., McKusick, K. B., Beckmann, J. S., Bentolila, S., Bihoreau, M., Birren, B. B., Browne, J., Butler, A., Castle, A. B., Chiannilkulchai, N., Clee, C., Day, P. J., Dehejia, A., Dibling, T., Drouot, N., Duprat, S., Fizames, C., Bentley, D. R., et al. (1998). A physical map of 30,000 human genes. Science 282: 744–746.

DeRisi, J. L., Iyer, V. R., and Brown, P. O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278: 680–686.

Eckman, B. A., Aaronson, J. S., Borkowski, J. A., Bailey, W. J., Elliston, K. O., Williamson, A. R., and Blevins, R. A. (1998). The Merck Gene Index browser: An extensible data integration system for gene finding, gene characterization and EST data mining. Bioinformatics 14: 2–13.

Fields, C., Adams, M. D., White, O., and Venter, J. C. (1994). How many genes in the human genome? Nat. Genet. 7: 345–346.

Goffeau, A., Barrell, B. G., Bussey, H., Davis, R. W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J. D., Jacq, C., Johnston, M., Louis, E. J., Mewes, H. W., Murakami, Y., Philippsen, P., Tettelin, H., and Oliver, S. G. (1996). Life with 6000 genes. Science 274: 546.

Hieter, P., and Boguski, M. (1997). Functional genomics: It’s all how you read it. Science 278: 601–602.

Hubank, M., and Schatz, D. G. (1994). Identifying differences in mRNA expression by representational difference analysis of cDNA. Nucleic Acids Res. 22: 5640–5648.

Liang, P., and Pardee, A. B. (1992). Differential display of eukaryotic messenger RNA by means of the polymerase chain reaction. Science 257: 967–971.

Madden, S. L., Galella, E. A., Zhu, J., Bertelsen, A. H., and Beaudry, G. A. (1997). SAGE transcript profiles for p53-dependent growth regulation. Oncogene 15: 1079–1085.

Ronaghi, M., Uhlen, M., and Nyren, P. (1998). A sequencing method based on real-time pyrophosphate. Science 281: 363.

Strausberg, R. L., Dahl, C. A., and Klausner, R. D. (1997). New opportunities for uncovering the molecular basis of cancer. Nat. Genet. 15 (Spec. No.): 415–416.

Sutton, G. G., White, O., Adams, M. D., and Kerlavage, A. R. (1995). TIGR assembler: A new tool for assembling large shotgun sequencing projects. Genome Sci. Tech. 1: 9–19.

Velculescu, V. E., Zhang, L., Vogelstein, B., and Kinzler, K. W. (1995). Serial analysis of gene expression. Science 270: 484–487.

Velculescu, V. E., Zhang, L., Zhou, W., Vogelstein, J., Basrai, M. A., Bassett, D. E., Jr., Hieter, P., Vogelstein, B., and Kinzler, K. W. (1997). Characterization of the yeast transcriptome. Cell 88: 243–251.

Wen, X., Fuhrman, S., Michaels, G. S., Carr, D. B., Smith, S., Barker, J. L., and Somogyi, R. (1998). Large-scale temporal gene expression mapping of central nervous system development. Proc. Natl. Acad. Sci. USA 95: 334–339.

White, O., and Kerlavage, A. R. (1996). TDB: New databases for biological discovery. Methods Enzymol. 266: 27–40.