Systematic Management and Analysis of Yeast Gene Expression Data
John Aach, Wayne Rindone, and George M. Church1
Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115 USA, and the Lipper Center for Computational Functional Genomics, Boston, Massachusetts 02115 USA
We report steps toward the systematic management, standardization, and analysis of functional genomics data. We developed the ExpressDB database for yeast RNA expression data and loaded it with ∼17.5 million pieces of data reported by 11 studies with three different kinds of high-throughput RNA assays. A web-based tool supports queries across the data from these studies. We examined comparability of data by converting data from 9 studies (217 conditions) into mRNA relative abundance estimates (ERAs) and by clustering of conditions by ERAs. We report on generation of ERAs and condition clustering for non-microarray data (5 studies, 63 conditions) and describe initial attempts to generate microarray-based ERAs (4 studies, 154 conditions), which exhibit increased error, on our web site http://arep.med.harvard.edu/ExpressDB. We recommend standards for data reporting, suggest research into improving comparability of microarray data through quantifying and standardizing control condition RNA populations, and also suggest research into the calibration of different RNA assays. We introduce a model for a database that integrates different kinds of functional genomics data, Biomolecule Interaction, Growth and Expression Database (BIGED).
Ever-growing amounts of sequence data for numerous organisms, combined with readily available technology for large-scale expression studies on the basis of oligonucleotide arrays, DNA microarrays, Serial Analysis of Gene Expression (SAGE), and other techniques (Velculescu et al. 1995; Lockhart et al. 1996; DeRisi et al. 1997), have led to the rapid accumulation of large expression data sets and the development of the field of functional genomics. Functional genomics has been characterized as taking a systematic, genome-wide approach to the collection and analysis of biological data, in contrast to more traditional methods, which focus in depth on particular genes, proteins, or pathways (Hieter and Boguski 1997). But despite rapid strides, capped by a string of successful studies (see Table 1), functional genomics has yet to develop a highly integrated system of tools and methods such as those used in sequence analysis and structural genomics (Hieter and Boguski 1997); for that, we must await development of the three following components: general databases, data standards, and integrated general-purpose analysis tools. A comparison of the status of these fields with respect to these components is given in Table 2. Note that throughout this article sets of related experiments and conditions assayed within them will be denoted by series codes, which are defined in Table 1.
As these three components are dependent on each other and must coevolve, the right mixture will only come together with experience. We hope to jump-start the process by describing here the working prototypes of parts of the required systems and examples of what can be done with them. Specifically, we describe ExpressDB, a general database for RNA expression data that has been loaded with data from 11 yeast studies using three different kinds of high-throughput RNA level assays (see Table 1). We also describe EXD, an integrated web-based application that supports user queries of ExpressDB data. ExpressDB and EXD differ from existing research-specific databases (see web sites in Table 1) in that they represent and manage data from multiple studies, and complement databases such as ArrayDB (Ermolaeva et al. 1998) by managing data from multiple kinds of RNA level assays.
In each study whose data were loaded into ExpressDB, data were collected and prepared in ways appropriate to the study’s particular experimental design. Because designs and methods are not generally coordinated across studies, data from different studies are not always easily compared. This has no impact on the success of each study individually, but data comparability assumes increased importance in a database context in which comparability improvements can translate into simpler and more meaningful queries, more efficient database structures, and opportunities for more effective data mining. To gauge comparability of currently available data, we therefore formulated what we considered to be an attainable ideal and assessed what would be required for the data on ExpressDB to meet it. We propose specific recommendations on the basis of this assessment (see Discussion). We defined our ideal state of data comparability to be:
1Corresponding author. E-MAIL [email protected]; FAX (617) 432-7663.
Article
Genome Research 10:431–445 ©2000 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/00 $5.00; www.genome.org
Table 1. Yeast RNA Expression Data Sets Incorporated into Data Analysis

| Ref (a) | Method (b) | Series code (c) | Raw data series obtained | Norm. series produced | Data codes (d) | Experimental preparations (e) | Microarray control preparations (e) | Remarks |
|---|---|---|---|---|---|---|---|---|
| 1 | S | Vel | 3 | 3 | | YPH499 (MATa) in YPD rich medium at 30°C | N.A. | early log phase |
| | | | | | | as above + 100 mM hydroxyurea | | G1/S arrest |
| | | | | | | as above + 15 µg/ml nocodazole | | G2/M arrest |
| 2 | M | Der_diaux | 14 | 8 | T, I | DBY7286 (MATa) in YPD at 30°C | initial time point | diauxic shift |
| | | Der_tup | 2 | 2 | C | TUP1Δ (TUP1::LEU2) in rich medium + glucose | wild-type strain | |
| | | Der_yap | 2 | 2 | C | YAP1++ (plasmid, GAL1-10 promoter) in galactose | wild-type strain + control plasmid | |
| 3 | M | Chu_spo | 14 | 8 | T, I | YSC328 (MATa/α) in sporulation inducing medium | initial time point | sporulation |
| | | Chu_ndt80_del | 4 | 3 | T, I | YSC330 (MATa/α ndt80::LEU2) in sporulation inducing medium | initial time point | sporulation mutant |
| | | Chu_gal_ndt80 | 2 | 2 | C | YSC553 (MATa GAL1-10 promoter::NDT80) in galactose | GAL1-10 promoter::URA3 strain | sporulation control mutant |
| 4 | A | Cho | 17 | 17 | | K3445 (cdc28-13), K2994 (cdc15-2) synchronized at 37°C and then grown at 25°C | N.A. | synchronized cell culture |
| 5 | A | Rot | 4 | 4 | | FY4 (MATa) in 2% glucose at 30°C | N.A. | glucose |
| | | | | | | as above, then 39°C heat shock | | heat shock |
| | | | | | | as above in 2% galactose at 30°C | | galactose |
| | | | | | | FY5 (MATα) in 2% glucose at 30°C | | mating type |
| 6 | A | Hol | 42 | 42 | | strains: 11 experimental strains with mutant RNA polymerase subunit genes; 8 control strains with wild-type RNA polymerase subunit genes. conditions: YPD at 30°C to mid-log phase. For ts mutant/wild-type pairs, heat shock at restrictive temperature for 45 min before assaying | N.A. | comparison of RNA polymerase subunit mutants with isogenic wild-type strains grown under identical conditions; all experiments replicated and replicated data reported except for KIN28 |
| 7 | M | Spe_alpha | 36 | 36 | T, D | DBY8724 (MATa) cell cycle synchronized by α factor, which is then removed centrifugally | asynchronous culture | synchronized cell cultures |
| | | Spe_elut | 28 | 28 | T, D | DBY7286 (MATa) cell cycle synchronized by elutriation | asynchronous culture | |
| | | Spe_cdc15 | 48 | 46 (f) | T, D | DBY8728 (MATa cdc15-2ts) cell cycle synchronized by temperature manipulation | asynchronous culture | |
| | | Spe_cln3 | 4 | 3 | T, I | DBY8725 (MATa cln1::HIS3 cln2::TRP1 cdc34-2ts ura3-GAL–CLN3) at restrictive temperature in galactose | no galactose, entire control culture harvested at time 0 | experiment performed twice |
| | | Spe_clb2 | 4 | 3 | T, I | DBY8726 (MATa clb1::URA3 clb2::LEU2 clb3::TRP1 clb4::HIS3 GAL–CLB2) in DMSO and nocodazole in galactose | no galactose, entire control culture harvested at time 0 | experiment performed twice |
| | | Spe_wtgal | 2 | 2 | C | DBY8727 (MATa) in galactose | no galactose | control for cln3, clb2 sets |
| 8 | M | Mye | 4 | 4 | C | MG107 (MATa med2Δ1::TRP1) in galactose, under heat shock | MG106 (MATa MED2) under same conditions | mediator complex |
| 9 | A | Coh | 4 | 4 | | Yap1 vs. wild type, with and without peroxide | | oxidative stress |
| TOTAL | | | 234 | 217 | | | | |

Information provided includes the literature reference and corresponding web site, method used to assay RNA (Method), the number of raw data series obtained through public access or personal communication from the reference, number of normalized data series generated using methods described in the text, and summaries of strains and conditions used. Note that data obtained may only represent a selection of the total data collected and discussed by the reference, as some data may not have been reported. ExpressDB database also contains data from Eisen et al. (1998) and Marton et al. (1998).
(a) Reference for sets of experiments considered by this study and associated web site, where available. (1) Velculescu et al. (1997), ftp://genome-ftp.stanford.edu/pub/yeast/tables/SAGE_Data/; (2) DeRisi et al. (1997), http://cmgm.stanford.edu/pbrown/explore/index.html; (3) Chu et al. (1998), http://cmgm.stanford.edu/pbrown/sporulation/index.html; (4) Cho et al. (1998), http://genomics.stanford.edu/yeast/full_data.html; (5) Roth et al. (1998), http://arep.med.harvard.edu; (6) Holstege et al. (1998), http://www.wi.mit.edu/young/expression.html; (7) Spellman et al. (1998), http://genome-www.stanford.edu/cellcycle; (8) Myers et al. (1999), http://cmgm.stanford.edu/pbrown/med2; (9) B. Cohen, (unpubl.).
(b) (S) SAGE (Velculescu et al. 1995); (M) DNA microarray (DeRisi et al. 1997); (A) Affymetrix oligonucleotide array (Lockhart et al. 1996; Wodicka et al. 1997).
(c) Abbreviation used in text for a set of related experiments. Series codes are based on the first three letters of the primary author name of the source data literature reference.
(d) Properties of the data series and their handling: (C) A single control preparation (see below) described in the Microarray control preparation column is used with every experimental preparation. This control preparation is not an initial time point of a time series of experimental measurements. (D) Different control preparations are used with every experimental preparation for a microarray-based series of experimental measurements. (I) A single control preparation is used with every experimental preparation. This control preparation is an initial time point of a time series of experimental measurements. (T) Experimental preparations are a time series of a single culture growing under defined conditions. Microarray control data series coded as C and I are aggregated and their corresponding experimental measurements are calibrated as part of the procedure for computing microarray-derived estimated relative abundances described on our web site.
(e) Preparation is here defined as a sample of a culture of a defined yeast strain grown under defined environmental conditions for a defined period of time that has been extracted for purposes of an RNA expression level assay. Strains and conditions are described under Experimental preparations for every set of experiments except for Hol (see the web site associated with the reference for Hol). When using microarrays, measurements of RNA levels for experimental preparations are gathered with measurements of RNA levels for a control preparation, which is described under Microarray control preparations. Only relevant differences from the experimental preparation are described in the control preparation column. In these two columns details of the medium and genotype are omitted unless they are central to the experiment, e.g., common auxotrophic mutations such as ura3-52, lys2-801, ade2-101, leu2-Δ1, his3-Δ200, and trp1-Δ63 along with corresponding amino acid supplements to media are omitted. (N.A.) Not applicable.
(f) Two series reported for time points 120 and 160. The second series was used in each case in preference to the first for computing the microarray-derived estimated relative abundances described on our web site.
(1) All RNA expression data are provided in the form of estimated relative abundances (ERAs) of a defined set of functionally distinguishable RNA fragments (FDRs) which, in the present case, we take to be RNAs corresponding to ORFs. An ORF ERA represents the fractional abundance of the ORF’s RNA with respect to the total population of ORF RNAs in cells in a particular experimental condition (defined by cell strain and environmental history). (2) Analytical tools used to measure data comparability confirm that similar conditions have similar RNA expression profiles, regardless of which RNA assays are used for data collection. The rationale for (1) is that ERAs are intuitive and unambiguous measures of RNA level that are theoretically directly comparable across conditions regardless of experimental methodologies.
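To make the definition in (1) concrete, the following is a minimal Python sketch of a fractional abundance calculation. It is an illustration only and not the procedure described in Methods: it assumes non-negative per-ORF measurements for a single condition, and the function name `estimated_relative_abundances` and the example values are hypothetical.

```python
def estimated_relative_abundances(levels):
    """Convert per-ORF RNA level measurements for one condition into
    fractional abundances that sum to 1.

    ORFs with no usable value (None) stay None and are excluded from
    the total. Assumes all usable values are non-negative.
    """
    usable = {orf: v for orf, v in levels.items() if v is not None}
    total = sum(usable.values())
    if total <= 0:
        raise ValueError("total RNA level must be positive")
    return {
        orf: (v / total if v is not None else None)
        for orf, v in levels.items()
    }


# Hypothetical measurements for four ORFs in one condition.
example = {"YAL001C": 30.0, "YAL002W": 10.0, "YAL003W": 60.0, "YAL005C": None}
eras = estimated_relative_abundances(example)
```

Because each value is divided by the total for its own condition, the resulting fractions can be compared across conditions without reference to the units of the underlying assay.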
In assessing ExpressDB data against this ideal, we converted as much data on ExpressDB as possible to the form of ERAs, and explored clustering of conditions by ORF expression profiles as a tool for analytically investigating data comparability. We undertook these steps for data generated from oligonucleotide arrays (4 studies, 60 conditions), SAGE (1 study, 3 conditions), and microarrays (4 studies, 154 conditions), generating a set of ORF ERAs for 217 conditions. Issues with microarray data currently make it difficult to compute ERAs from them, and our best effort resulted in microarray ERA values that exhibit increased variability compared with corresponding ratios (coefficient of variation of ERAs = 3.3 times that of ratios) (see Results). We focus here on methods and results for Affymetrix and SAGE data. Readers interested in our microarray results may consult the supplemental material on our web site http://arep.med.harvard.edu/ExpressDB. A file containing ERAs for 213 conditions (all except for four not previously published, Coh) may be downloaded from the same web site. The ERA file has not been loaded into ExpressDB, which contains only original data, but has been loaded into a separate database on our web site to allow it to be queried easily.
RESULTS
Database
ExpressDB is a relational database for RNA expression data. We implemented it using Sybase SQL Server 11.0.3 on a shared DEC 3000 server running DEC Unix 4.0D. We conceive of ExpressDB as a generalized two-dimensional table that can subsume individual tables of expression data reported by researchers. We provide a high-level logical data model for the ExpressDB database and an example of how it operates as a generalized two-dimensional table in Figure 1. That figure also presents names of ExpressDB tables that will be used throughout this article. Note that names of these tables are always capitalized (e.g., Measure).
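The idea of a generalized two-dimensional table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ExpressDB schema: it uses the standard-library `sqlite3` module as a stand-in for Sybase, keeps only four tables named after those in the text, stores every value as a number in a single data point table, and all column names, the `load_table` helper, and the example data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ORF (orf_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE Expression_Data_Set (set_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Measure (
    measure_id INTEGER PRIMARY KEY,
    set_id INTEGER REFERENCES Expression_Data_Set(set_id),
    name TEXT);
CREATE TABLE Expression_Data_Point (
    orf_id INTEGER REFERENCES ORF(orf_id),
    measure_id INTEGER REFERENCES Measure(measure_id),
    value REAL);
""")


def load_table(set_name, columns, rows):
    """Load one researcher-style table: `columns` names the measurement
    columns and `rows` maps an ORF name to one value per column."""
    cur = conn.execute(
        "INSERT INTO Expression_Data_Set (name) VALUES (?)", (set_name,))
    set_id = cur.lastrowid
    measure_ids = []
    for col in columns:
        cur = conn.execute(
            "INSERT INTO Measure (set_id, name) VALUES (?, ?)", (set_id, col))
        measure_ids.append(cur.lastrowid)
    for orf, values in rows.items():
        # Each ORF is stored once, however many tables mention it.
        conn.execute("INSERT OR IGNORE INTO ORF (name) VALUES (?)", (orf,))
        orf_id = conn.execute(
            "SELECT orf_id FROM ORF WHERE name = ?", (orf,)).fetchone()[0]
        for measure_id, value in zip(measure_ids, values):
            conn.execute(
                "INSERT INTO Expression_Data_Point VALUES (?, ?, ?)",
                (orf_id, measure_id, value))


# Two hypothetical studies with different column layouts.
load_table("study_A", ["ratio_t1", "ratio_t2"],
           {"YAL001C": [1.2, 2.5], "YAL002W": [0.8, 0.9]})
load_table("study_B", ["avg_diff"], {"YAL001C": [310.0]})

# A single query now spans both studies, collated by ORF name.
rows = conn.execute("""
    SELECT o.name, m.name, p.value
    FROM Expression_Data_Point p
    JOIN ORF o ON o.orf_id = p.orf_id
    JOIN Measure m ON m.measure_id = p.measure_id
    WHERE o.name = 'YAL001C'
    ORDER BY m.measure_id
""").fetchall()
```

Storing one row per (ORF, Measure) pair is what lets tables with different numbers and kinds of columns coexist in the same structure.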
We developed a utility program, EDBUpdate, to load data from individual tab-delimited files (load files) into ExpressDB. Load files must present a systematically collected set of measurements or descriptive information for a series of ORFs, in which each line of the file presents information for an ORF and each column a particular measurement or information field. Examples of measurements are numerical values representing ORF mRNA abundance and data-quality indicators. An ORF description field would be an example of an information field. Measurement columns are represented by ExpressDB Measure records. Using EDBUpdate, we loaded into the database all available data files associated with the studies described in Table 1, as well as data from two others (Eisen et al. 1998; Marton et al. 1998). ExpressDB contains >17.5 million pieces of information. Other statistics on the database are given in Table 3.

Table 2. Comparison of Status of Functional Genomics

| | Functional genomics | Sequence analysis/structural genomics |
|---|---|---|
| Databases | flat files or databases developed by individual researchers for each study. Centrally managed databases in initial development. (a) | centrally managed, global sequence databases used by entire research communities |
| | “Straw man” database standards proposed (Bassett et al. 1999) | databases and standards established; e.g., GenBank, PDB, SwissProt (Bernstein et al. 1977; Bairoch and Apweiler 1999; Benson et al. 1999) |
| Data standards | no uniform standards for data reporting (Bassett et al. 1999) | uniform standards for sequence and structure submission, partially or completely automated (e.g., BankIt for GenBank, ADIT for PDB) |
| | different data reported by different methodologies (see text) | reported sequence data independent of sequencing methodologies |
| Analysis tools | clustering and fold change analysis emerging as standard tools, but tools only partly integrated with databases | standard search tools such as Blast (Altschul et al. 1990) fully integrated with databases |

Comparison with sequence analysis/structural genomics regarding integration of databases, data standards, and computer analysis tools.
(a) National Center for Biotechnology Information (1999); European Bioinformatics Institute (1999).

Figure 1 Logical data model of ExpressDB and its usage as a generalized two-dimensional table. (a) High-level logical data model. Rectangles represent tables in the database and connecting lines represent database table relationships as described in the legend and as per Teorey (1994). (b) Use of ExpressDB as a generalized two-dimensional table. Database load files are usually in the form of tables with each line dedicated to an ORF or control probe and each column to a measurement, computed value, or descriptive field. The ExpressDB ORF table contains records for all ORFs or controls used by any load file, and each measurement column from each load file is represented by a Measure record. Measure records associated with a related set of experiments, especially measures reported by a single literature reference, are all linked via a database relationship to a common Expression Data Set record, although multiple Expression Data Set records are used for some references. Measurement column values for each ORF for an experimental data series are given in Expression Data Point records that are linked to their ORF and Measure records via database relationships. To accommodate different formats of column values, five different Expression Data Point tables are available, each supporting a different format.

Load files downloaded from public sources often required minor editing to put them into the proper format for loading. For instance, files presenting data collected with Affymetrix oligonucleotide arrays often give the ORF and common gene name in the same column separated by a slash (/), and we had to separate these into distinct columns. More extensive work was required to load the SAGE-based expression data from Vel because data from SAGE are in the form of counts of tag sequences in cDNAs, whereas ExpressDB imposes as a structural requirement that data be indexed by ORF. A key issue for this indexing is that some SAGE tag sequences cannot be assigned to a unique ORF. As a result, for each SAGE condition, we computed and loaded into ExpressDB both a minimum and a maximum tag count for each ORF, in which the minimum count includes only counts of tags uniquely assignable to the ORF (which we call unambiguous tags) and the maximum count includes these plus the counts of tags shared with other ORFs (which we call ambiguous tags). Additional details may be found in the ExpressDB database in the Expression Data Set record for the Vel experiments.
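The minimum and maximum tag counts described above can be sketched as follows. This is a minimal Python illustration, assuming the tag-to-ORF assignments are already available as a dictionary; the function name `sage_tag_counts`, the tag sequences, and the counts are all hypothetical.

```python
from collections import defaultdict


def sage_tag_counts(tag_counts, tag_to_orfs):
    """Return {orf: (min_count, max_count)} for one SAGE condition.

    min_count sums counts of tags assigned to exactly one ORF
    (unambiguous tags); max_count adds the counts of tags that the
    ORF shares with other ORFs (ambiguous tags).
    """
    minimum = defaultdict(int)
    maximum = defaultdict(int)
    for tag, count in tag_counts.items():
        orfs = tag_to_orfs.get(tag, ())
        for orf in orfs:
            maximum[orf] += count
            if len(orfs) == 1:
                minimum[orf] += count
    return {orf: (minimum[orf], maximum[orf]) for orf in maximum}


# Hypothetical tag counts and tag-to-ORF assignments.
tag_counts = {"CATGAAAAAAAAAA": 12, "CATGCCCCCCCCCC": 5, "CATGGGGGGGGGGG": 7}
tag_to_orfs = {
    "CATGAAAAAAAAAA": ["YAL001C"],
    "CATGCCCCCCCCCC": ["YAL001C", "YAL002W"],  # shared, hence ambiguous
    "CATGGGGGGGGGGG": ["YAL003W"],
}
counts = sage_tag_counts(tag_counts, tag_to_orfs)
```

In this example YAL001C receives a minimum of 12 and a maximum of 17, whereas YAL002W, whose only tag is shared, has a minimum of 0 and a maximum of 5.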
Database Query Application
Our web-based query interface for ExpressDB, the EXD system, can be accessed at http://arep.med.harvard.edu/ExpressDB/. A JavaScript 2.1-supporting web browser such as Microsoft Internet Explorer 4.0+ or Netscape Navigator 4.0+ is required. The logical flow of the EXD system is depicted in Figure 2a. The main line of this logic is that the user is prompted for successively more detailed specifications concerning the query, starting with the Expression Data Sets (see Fig. 1) of interest, moving on to the Measures of interest within these Expression Data Sets, and finally to conditions that must be satisfied by the ORFs or their Measure values. ExpressDB allows Expression Data Set and Measure records to be marked private and these are not offered for user selection by EXD; this option has been used for the Coh set of experiments that are not yet published. The query conditions offered for user specification are sensitive to the data format of the Measures; thus the user is prompted for text matching specifications when a Measure has a character format, and with numerical equalities and inequalities for numerical formats. Statistical specifications may also be indicated for numerical measures; for example, it is possible to ask for all ORFs for which the value of a measurement is greater than two standard deviations from the mean. It is also possible to ask for only those ORFs that are either in or not in a group of ORFs defined in the ExpressDB ORF Group table. To demonstrate this capability, we loaded this table with 207 functional groupings of yeast ORFs defined on the Munich Information Center for Protein Sciences (MIPS) database (Mewes et al. 1999). The output of ExpressDB is given in either a tab-delimited or formatted form. Tab-delimited output can be copied from the screen and pasted into desktop applications like Excel (Microsoft, Redmond, WA) for further analysis. We consider this a rudimentary form of integration with and pipelining to downstream computer analysis tools. Additional information on querying the database can be found on the web site mentioned above.
Table 3. ExpressDB Statistics

| Statistic | Value |
|---|---|
| Expression Data Set records | 30 |
| References supplying load files | 11 |
| Measure records | 2503 |
| Expression Data Point records | 17515209 |
| ORF records | 7614 |
| ORFs with multiple values for at least one Measure | 2277 |
| ORF statistics | 7084 SGD-recognized ORFs; 530 others (including non-yeast controls) |
| Database size | 800.7 MB (570.8 MB data + 229.9 MB indices) |

Statistics are as of June 4, 1999. The references from which load files were obtained comprise those cited in Table 1 plus two others (Eisen et al. 1998; Marton et al. 1998). Several of these were recorded under multiple Expression Data Set records. ORF records comprise more than actual yeast ORFs: they also represent control features for which data are reported by RNA collection methods, as well as some non-ORF entities loaded with ORF Group and Saccharomyces Genome Database (SGD) information (Cherry et al. 1999).

Figure 2 Logical flow through the EXD query system and example of EXD query. (a) Logical flow through EXD query system. Boxes correspond to forms and web pages produced by the system. Arrows correspond to movements through the system available through hotlinks and buttons on the forms and pages. Numbers in circles indicate forms and web pages depicted in the example of an EXD query in b of this figure. (b) Illustration of an EXD query provided to give an impression of EXD system usage. (1) A user selects two data sets from the database, (2) selects an information field (SGDID) and nine Measures from these data sets, (3) specifies a query condition, and (4) obtains results. Forms and web pages shown in diagram are versions of actual EXD pages that have been reduced in size, abridged, and edited to highlight key features. Details pertaining to the example are given in the text.

Figure 2b provides an impression of what it is like to use EXD to perform a typical ExpressDB query. On entry to the system, the user is presented with a form listing data sets available on the database (Fig. 2b, step 1). The user selects one or more data sets; here the Der_diaux and Der_tup data sets have been selected (see Table 1). On clicking the Submit button, the user is brought to the next form (Fig. 2b, step 2), which presents information fields and Measures available on the database for the selected data sets, and the user chooses the ones he or she wishes to see. Here, the user has asked to see the information field SGDID (Saccharomyces Genome Database identifier) (Cherry et al. 1999) and all of the ratios from the two selected data sets (two from Der_tup and seven from Der_diaux) that represent an ORF’s fold change of mRNA abundance in an experimental condition relative to its control condition. On clicking the Submit button, the user is next brought to a form (Fig. 2b, step 3), which allows entry of query specifications. In this example, the only specification provided is that the microarray ratio from the last Der_diaux condition must be >1 S.D. above the mean for this Measure. This will cause EXD to display all selected information fields and Measures for only those ORFs meeting this specification. The biological meaning of this particular specification is that the user wishes to see data for only those ORFs that are at least moderately induced in ethanol, as the last Der_diaux condition represents the end of a diauxic shift time series during which the yeast cells have consumed all of the glucose originally available in the medium and are growing on ethanol at the end of the shift (DeRisi et al. 1997). This illustrates the level of knowledge that a user must have of the meaning of the data sets and Measures on the database to make effective use of the EXD system. The design of the database allows descriptive information about data sets and Measures to be stored so that users may obtain this knowledge directly from the database itself. When the user clicks Submit on the query specification form, the EXD system processes the query. This may take several minutes, during which processing messages are displayed by the system (not illustrated in Fig. 2b). When processing is complete, the user clicks on a hotlink to see the results (Fig. 2b, step 4).
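The statistical specification used in this example (a value more than 1 S.D. above the mean for a Measure) can be sketched in Python as follows. This is not EXD code: the function name `orfs_above_mean` and the example ratios are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator EXD applies.

```python
from statistics import mean, pstdev


def orfs_above_mean(values, n_sd=1.0):
    """Return the ORF names whose value for one Measure exceeds
    mean + n_sd * S.D., ignoring ORFs with no value (None).

    Assumption: population S.D. (pstdev); EXD may use another estimator.
    """
    usable = {orf: v for orf, v in values.items() if v is not None}
    data = list(usable.values())
    cutoff = mean(data) + n_sd * pstdev(data)
    return sorted(orf for orf, v in usable.items() if v > cutoff)


# Hypothetical ratios for one Measure.
ratios_last_timepoint = {
    "YAL001C": 1.0, "YAL002W": 1.0, "YAL003W": 1.0,
    "YAL005C": 5.0, "YAL007C": None,
}
induced = orfs_above_mean(ratios_last_timepoint, n_sd=1.0)
```

Here the mean is 2.0 and the population S.D. is about 1.73, so only YAL005C passes the cutoff of about 3.73.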
We believe the EXD system to be the first query system that allows users to query simultaneously any of the expression data reported by experiments associated with different literature references and return the results collated by ORF name. Fundamentally, this derives from the fact that all of the data have been collected in one database, but it is also supported by EXD’s ability to navigate ExpressDB’s generalized two-dimensional table structure. At this time, however, we recommend use of EXD only for relatively simple queries involving ∼10 or fewer Measures over all or a group of ORFs, partly because of performance issues with more complex queries and the database’s shared computer environment, and partly because we need to develop an interface that makes it easier for users to find data items of interest from a set of >2000 available Measures and then specify query conditions for them.
Generation of ERAs
Generation of ERAs is straightforward for data derived from Affymetrix oligonucleotide arrays and SAGE (see Methods), but microarray-derived data present a significant issue. Microarray-based experiments simultaneously collect intensity levels of fluorescently labeled cDNAs derived from an experimental condition, and intensity levels of cDNAs, labeled with a different fluorophore, derived from a control condition. The two cDNA preparations are hybridized in parallel to the same probe sequence spots on the array (DeRisi et al. 1997). Ratios between background-subtracted experimental and control condition intensities are used for data reporting and analysis because they compensate for several sources of bias and noise in intensity results, including ORF-to-ORF variations in labeled nucleotide incorporation (bias), ORF-to-ORF variations in efficiency of the PCR reactions used to generate the probe sequences spotted onto the arrays (bias and noise), and spot size and shape variations (noise). Microarray ratios differ from ERAs in that they are fold changes of an ORF RNA level in an experimental condition relative to a control condition, whereas ERAs are fractional abundances of an RNA in a single condition. Moreover, microarray ratios cannot be converted into ERAs; when total RNA levels are roughly constant, microarray ratios may be identified with ratios of experimental condition ERAs to control condition ERAs, neither of which can be determined from the ratios without prior knowledge of the other. Affymetrix- and SAGE-based experiments are not subject to this difficulty because neither requires measurements relative to a control condition. Experiments using Affymetrix arrays typically report ORF expression level measurements as averages of differences in background-subtracted intensities of fluorescently labeled sample cRNA or cDNA hybridized to probe sets of ∼20 perfect match (PM) and mismatch (MM) oligonucleotide probes for the ORF (average PM–MM). Sequence rules for probes and the precision of the oligonucleotide synthesis process control for noise and bias, reducing the need for reporting relative to a control condition RNA sample; mismatch probes, which differ from their perfect match counterparts by a single base, control for cross-hybridization (Lockhart et al. 1996). SAGE experiments rely on sequencing of cDNA tag sequences rather than hybridization, eliminating probe-, hybridization-, and label-based variation, and use procedures that control for tag sequence amplification biases (Velculescu et al. 1995, 1997). Again, only a single condition’s RNA sample is required. As noted above, although we computed ERAs for data sets derived using all three RNA assays, we only report here on generation of ERAs for Affymetrix- and SAGE-derived data. A discussion of microarray-derived results may be found on our web site.
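The point that ratios alone do not determine ERAs can be illustrated with a small Python example. The numbers are invented for illustration and assume only three ORFs: two different pairs of experimental and control ERA profiles, each summing to 1, yield identical ratios for every ORF. The function name `ratios` and the ORF labels are hypothetical.

```python
import math


def ratios(exp_eras, ctrl_eras):
    """Per-ORF ratio of experimental-condition ERA to control-condition ERA."""
    return {orf: exp_eras[orf] / ctrl_eras[orf] for orf in exp_eras}


# First hypothetical pair of ERA profiles (each sums to 1).
ctrl_1 = {"ORF_A": 0.4, "ORF_B": 0.4, "ORF_C": 0.2}
exp_1 = {"ORF_A": 0.2, "ORF_B": 0.4, "ORF_C": 0.4}

# Second hypothetical pair, with quite different abundances.
ctrl_2 = {"ORF_A": 0.2, "ORF_B": 0.7, "ORF_C": 0.1}
exp_2 = {"ORF_A": 0.1, "ORF_B": 0.7, "ORF_C": 0.2}

r1 = ratios(exp_1, ctrl_1)
r2 = ratios(exp_2, ctrl_2)

# Both pairs give ratios of 0.5, 1.0, and 2.0 for ORF_A, ORF_B, and ORF_C.
same_ratios = all(math.isclose(r1[orf], r2[orf]) for orf in r1)
```

Since `same_ratios` is true while the underlying profiles differ, an observer given only the ratios could not tell which pair of ERA profiles produced them without independent knowledge of one profile.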
Other issues that complicate both generation of ERAs and comparisons of data generally include (1) the frequent reporting (see Table 3) of multiple measurement values for an ORF from single experimental or control conditions derived from multiple spots for an ORF on a microarray or multiple probe sets for an ORF on an Affymetrix array, which raises the question of how values should be combined or selected for further analysis, and (2) use of different ORF names across different sets of experiments, making matching of ORF data across experiments difficult.
In Affymetrix-based experiments, multiple values for an ORF arise from distinct probe sets for the ORF that are distinguished by their probes being located to different exons or other general probe set characteristics. We found different types of probe sets to have different properties. For instance, we computed an aggregated measure (see Methods) of the ratio of average PM–MM values of exon 1 probe sets against those of exon 2 probe sets from values given in the Hol and Cho sets of experiments, and found that in both cases, exon 1 probe set values were, in aggregate, ∼0.6 of exon 2 probe set values (Hol: N = 90 ratios = 96 ratios − 6 outliers; Cho: N = 93 ratios = 100 ratios − 7 outliers; S.D. = 0.4 for each distribution). When presented with a choice of expression measurements for the same ORF with different values, one would ideally like to identify and use the measure of highest quality, but the fact that exon 1 probe sets have the property of yielding smaller measurement values than exon 2 probe sets does not imply that exon 1 probe sets are of lower quality. Our strategy for consolidating multiple Affymetrix probe set values therefore focused on consistency. Because most probes for ORFs with single probe sets are taken from the 3′ ends of ORF sequences, we decided to handle ORFs with probe sets for multiple exons by using the exon 2 values instead of exon 1 values. We also avoided probe sets with special feature set indicators where possible (see Methods). Affymetrix GeneChip software returns a “presence call” that describes when a gene product may be considered to be present, marginally present, or absent in an RNA sample (Lockhart et al. 1996). An alternative strategy would have been to consolidate multiple probe set values by taking averages of all probe set values called as present for an ORF. However, we found exon 1 probe sets to be called absent significantly more often than exon 2 probe sets: The average number of absence calls, ± the S.D. of the average, for an exon 1 probe set over 42 Hol conditions = 7.6 ± 1.1 versus 4.4 ± 1.0 for exon 2 probe sets. Our exon-based consolidation strategy therefore already, at least partially, takes presence calls into account while avoiding complications that would arise with a presence call-based strategy: (1) Not all Affymetrix-based experiments report presence calls, and (2) it often happens that multiple probe sets for an ORF are all marked absent whenever any one of them is (40% of 1594 multiple probe set values across the 42 Hol experiments).
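The exon-based consolidation rule can be sketched in Python as follows. This is a simplified illustration and not the procedure given in Methods: the record layout (`orf`, `exon`, `special`, `value`), the function name `consolidate_probe_sets`, and the example values are hypothetical, and choosing the highest-numbered exon is an assumed generalization of "exon 2 instead of exon 1".

```python
def consolidate_probe_sets(probe_sets):
    """Pick one value per ORF from a list of probe set records.

    Each record is a dict with keys 'orf', 'exon' (int, or None when the
    ORF has a single ordinary probe set), 'special' (bool, True for a
    probe set carrying a special feature set indicator), and 'value'.
    """
    by_orf = {}
    for ps in probe_sets:
        by_orf.setdefault(ps["orf"], []).append(ps)

    chosen = {}
    for orf, candidates in by_orf.items():
        # Avoid special probe sets where possible; fall back if none remain.
        ordinary = [ps for ps in candidates if not ps["special"]] or candidates
        # Prefer the later exon (exon 2 over exon 1).
        best = max(ordinary, key=lambda ps: ps["exon"] or 0)
        chosen[orf] = best["value"]
    return chosen


# Hypothetical probe set values for three ORFs.
probe_sets = [
    {"orf": "ORF_X", "exon": 1, "special": False, "value": 60.0},
    {"orf": "ORF_X", "exon": 2, "special": False, "value": 100.0},
    {"orf": "ORF_Y", "exon": None, "special": False, "value": 50.0},
    {"orf": "ORF_Z", "exon": 2, "special": True, "value": 10.0},
    {"orf": "ORF_Z", "exon": 1, "special": False, "value": 20.0},
]
consolidated = consolidate_probe_sets(probe_sets)
```

In this sketch the rule selects a single existing value rather than averaging, which keeps values for multi-exon ORFs consistent with those for ORFs that have only one probe set.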
In the case of SAGE, multiple measurements for an ORF arise from counts for distinct ORF SAGE tags. The sum of counts for unambiguous tags for an ORF, maintained as minimum tag counts on ExpressDB, can be safely attributed to RNA expression by that ORF, but counts for ambiguous tags included in maximum tag counts cannot be safely attributed to that ORF, as they may have come from the RNAs of different ORFs that happen to share the same tag (Velculescu et al. 1997). To assure the most accurate possible ERAs for SAGE conditions, we therefore only computed them for ORFs all of whose tags were unambiguous. The number of ORFs for which we computed ERAs in the three Vel SAGE conditions (2122, 2211, and 2218) is thus considerably smaller than the number we computed for other conditions (5803 ± 585 for all 217 conditions). The ultimate origin of tag ambiguity in SAGE, sequence similarity between genes, also affects oligonucleotide array and microarray measurement of gene expression through cross-hybridization of cRNAs and cDNAs to ORF probe sets or spots.
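The minimum and maximum tag counts described here can be illustrated with a minimal sketch. It assumes a tag-to-ORF assignment has already been made; the function names, the tag sequences, and the ORF labels in the usage note are hypothetical and are not taken from ExpressDB.

```python
from collections import defaultdict

def orf_tag_counts(tag_counts, tag_to_orfs):
    """Return {orf: (min_count, max_count)} for one SAGE condition.

    min_count sums only tags assigned to this ORF alone (unambiguous tags);
    max_count also includes tags shared with other ORFs (ambiguous tags).
    """
    min_c = defaultdict(int)
    max_c = defaultdict(int)
    for tag, count in tag_counts.items():
        orfs = tag_to_orfs.get(tag, ())
        for orf in orfs:
            max_c[orf] += count
            if len(orfs) == 1:
                min_c[orf] += count
    return {orf: (min_c[orf], max_c[orf]) for orf in max_c}

def unambiguous_orfs(counts):
    """Keep ORFs whose tags are all unambiguous (min == max) with a count of 1 or more."""
    return {orf: lo for orf, (lo, hi) in counts.items() if lo == hi and lo >= 1}
```

For example, with illustrative counts `{"TAG_A": 5, "TAG_B": 3, "TAG_C": 2}` where `TAG_B` is shared by `ORF1` and `ORF2`, `ORF1` gets (5, 8), `ORF2` gets (0, 3), and only an ORF such as `ORF3` with (2, 2) would be retained for ERA computation.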
In the end, we produced a file containing ERAs for all ORFs for which there were usable data for 217 conditions (60 Affymetrix, 3 SAGE, and 154 microarray). The number of ORFs (identified by name) for which data are provided in at least one condition is 6293. The process of generating ERAs included steps to resolve different names for the same ORF (see Methods), and of these 6293 all but 94 could be identified with SGDIDs. Wherever data for an ORF were not reported in a condition, or an ERA could not be computed for an ORF, a “null” value (empty field) is included in the table for that ORF and condition. Because the maximum number of ORFs for which ERAs are reported in a condition is 6221, all conditions contain some null values; some conditions, like the SAGE conditions noted above, contain large numbers of null values. As noted above, the version of the file that may be downloaded from our web site contains only 213 conditions (56 Affymetrix) because four conditions (Coh) have not been published previously.
Clustering of Experimental Conditions
Clustering of ORFs on the basis of expression levels over sets of conditions has often been reported (Cho et al. 1998; Eisen et al. 1998; Wen et al. 1998; Tavazoie et al. 1999); clustering of conditions is less common but of increasing interest, in part because of its potential for classifying tumors (Weinstein et al. 1997; Alon et al. 1999; Perou et al. 1999). Here we used condition clustering to investigate its potential as a measure of comparability of data. The motivation for considering condition clustering in this role is that cells of similar strains in similar environmental conditions should exhibit similar ORF RNA abundances, and therefore similar conditions should yield high correlation coefficients and small Euclidean distances between their ORF abundance profiles. Condition clustering provides a convenient way of seeing such relationships in correlations and distances for large numbers of conditions. We therefore hypothesized that we should be able to find instances in which clearly similar conditions were clustered together and clearly dissimilar conditions were separated into different clusters. We also used condition clustering to assess the preliminary ERA values we generated for microarray conditions and to compare them against microarray ratios. Clustering of microarray ERA data is affected by the increased variability of these preliminary values. However, we found indications that condition clustering of microarray ratio data may be subject to biases when clustering conditions from sets of experiments using different control conditions. These biases did not appear in the clustering of corresponding microarray ERAs. We discuss these results on our web site.
When clustering ERA data, we should generally expect that conditions will tend to segregate into clusters according to related series of experiments for two reasons: First, conditions in related series frequently use the same or similar strains and cell environments. Second, differences in technique and equipment used in different studies may have the effect of weighting individual ORF abundances from different series differently. Both of these factors will tend to make conditions in a related series more similar in ORF ERA profile than conditions from different series. A diagram depicting the 14 highest-level clusters of 217 conditions grouped by similarity of pairwise correlation coefficients over transformed ORF estimated relative abundances (see Methods) is shown in Figure 3. It is evident that conditions cluster mainly with other conditions in the same related sets of experiments. To confirm that this and other observations below are not simply artifacts of the clustering algorithm, we also performed clustering by an alternative method, the clustering of conditions directly by transformed ORF ERAs rather than pairwise correlation coefficients of conditions over their ERAs (see Fig. 4 in the supplemental materials on our web site). In both exercises, we clustered subsets of high-expressing ORFs that showed evidence of induction across conditions, rather than clustering over all ORFs, to reduce noise that might be introduced from large numbers of low-expressing ORFs (see Methods). Despite some shuffling of clusters at the highest levels, it remains true that conditions in the same related sets of experiments are found to be closer to each other than to conditions in other sets. Details may be found on our web site.
Figure 3 Results of clustering 217 conditions by Pearson correlation coefficients over 1078 ORFs, plus high-level dendrogram showing the 14 highest level condition clusters and their relationships in the clustering hierarchy. The 1078 ORFs exhibited high median relative abundance over all conditions and showed evidence of induction or repression (see Methods). Conditions are presented symmetrically as lines and columns of cells with each cell representing the correlation coefficient between two conditions over the log10 relative abundances of the ORFs, expressed in standard units for the ORF over all conditions. Red values indicate negative correlation coefficients and green values indicate positive correlation coefficients. Brighter red (green) values indicate more negative (positive) correlation. Diagonal entries all represent correlation coefficients = 1 and are bright green. Dendrogram branch heights from top of the tree indicate relative locations of the join creating the subcluster in the sequence of subcluster agglomerations that created the tree; thus, clusters at the end of longer branches may be considered more similar than clusters at the end of shorter branches from the perspective of the clustering algorithm. Arrows indicate branches that had to be truncated for this diagram. Cluster symbols are series codes (see Table 1) except for the following: Coh/R = Coh + Rot (R). Hol+ts3 = all Hol conditions involving temperature-sensitive mutants except for Hol_med6_ts_1, which is in the Hol∼ts3 group. (A replicate condition Hol_med6_ts_2, however, is in Hol+ts.) Hol∼ts3 = all Hol conditions that do not involve temperature-sensitive mutants except for Hol_med6_ts_1. This cluster contains all control series as well as all non-temperature-sensitive mutants. Other labels are for microarray-derived data sets. These are discussed on our web site.
To assess the ability of condition clustering to capture similarities and differences between experiments, we examined the Hol set of 42 conditions. This set comprises 21 experiments with RNA polymerase complex mutants and 21 corresponding wild-type controls. Within this set of experiments, (1) the 21 experiments contain 10 pairs of replicated experiments, and likewise the 21 controls contain 10 pairs of replicates, making a total of 20 pairs of replicated conditions. Nine of these 20 pairs of replicated conditions are clustered at the leaf level in Figure 3 ($P = 20/216 \times 19/214 \times \cdots \times 12/200 = 8.4 \times 10^{-11}$), and 12 of these 20 are clustered at the leaf level in the clustering depicted in Figure 4 ($P = 20/212 \times 19/210 \times \cdots \times 9/190 = 1.4 \times 10^{-14}$). The significance of two conditions clustering at the leaf level is that they are more similar to each other than to any other conditions. (2) The Hol series contains 13 conditions involving temperature-sensitive RNA polymerase complex mutants that were maintained at 37°C prior to RNA assay. All other mutants and all control conditions were maintained at 30°C (see Hol web site listed in Table 1). In the clustering of Figure 3, 12 of the temperature-sensitive mutant conditions segregate into their own cluster apart from the rest of the Hol series. In Figure 4 (on our web site), all 13 temperature-sensitive mutant conditions segregate into a cluster of 15 that also contains two control conditions. These clusters evidently reflect temperature effects. Together, these observations indicate that condition clustering is effective at identifying similar (here replicated) conditions, and that it is likewise effective at segregating conditions on the basis of important environmental variables such as temperature.
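The two P values quoted for replicate pairs clustering at the leaf level are products of a short run of fractions. The sketch below only evaluates those products as printed (numerators stepping down by 1, denominators stepping down by 2); it does not attempt to reconstruct the underlying probability model, and the function name is an assumption made for illustration.

```python
from fractions import Fraction

def leaf_pair_product(n_terms, first_num, first_den, den_step=2):
    """Multiply n_terms fractions: first_num/first_den, (first_num-1)/(first_den-den_step), ..."""
    p = Fraction(1)
    for i in range(n_terms):
        p *= Fraction(first_num - i, first_den - den_step * i)
    return float(p)

# Figure 3: nine pairs, 20/216 x 19/214 x ... x 12/200
p_fig3 = leaf_pair_product(9, 20, 216)
# Figure 4: twelve pairs, 20/212 x 19/210 x ... x 9/190
p_fig4 = leaf_pair_product(12, 20, 212)
print(f"{p_fig3:.1e} {p_fig4:.1e}")
```

Exact rational arithmetic is used until the final conversion so that no rounding accumulates across the terms.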
DISCUSSION
The database and query tool described here represent preliminary versions of tools required in an integrated tool kit for exploring expression data. They can be modified to make them more sophisticated and complete. Some improvements involve relatively simple technical fixes. The current version of ExpressDB is yeast specific, but the design changes required to generalize it are small and an organism-general version will soon be available. The key changes allow results to be recorded for FDRs other than ORF RNAs, such as ESTs, cDNAs, and noncoding RNAs, that are frequently reported for higher organisms, and allow different sets of FDRs to be registered for different organisms. The EXD query system can also be modified to automatically pipeline results to downstream analysis tools such as clustering by gene and condition. Other technical issues, such as system performance, will require ongoing management. Whereas database software and application tuning and equipment upgrades can improve ExpressDB’s current performance, its current 17.5 million records, resulting from only 11 sources, clearly only represent the tiny beginnings of an anticipated flood of expression data. Over time, more efficient database technologies and algorithms will need to be explored to ensure maintenance of performance levels. Standardization of data formats and contents (see below) will also help improve performance by providing opportunities to structure the database more efficiently.
More involved issues raised by data comparability concerns must be addressed through standards and additional research. Here we propose several directions for development on the basis of our results. Because these directions apply beyond the case of yeast, we phrase them in terms of FDRs rather than ORFs.
Develop Methods that Will Allow Sets of Microarray-Derived Expression Data to Be Directly Compared with Each Other and with Sets of Expression Data Obtained Using Other Methodologies
By dint of its flexibility, relatively low cost, and public availability, microarray technology has made a huge contribution to both the science of functional genomics as a whole and to the number of RNA expression data sets available for analysis; but the full potential of these data will not be realized until methods are developed that allow microarray-derived ratios of FDR levels in experimental conditions relative to control conditions to be easily and directly compared with microarray-derived ratios on the basis of different control conditions, and with the results of other high-throughput RNA assays. One possibility would be to encourage the development of standard microarray control conditions. If the RNA species in such standards are quantified for abundance, it would then also be possible to generate ERAs from microarray-derived data. Some ideas for this are discussed on our web site.
Test Different RNA Expression Assays on Common RNA Samples to Determine Whether They Produce Equivalent Results, and Develop Standard Calibrations Where They Do Not
It is not enough that expression data collected with different methodologies be expressible in a common form such as ERAs; the actual data values must be shown to be equivalent regardless of their methodology of origin. We foresee a research project in which RNA extracts from several test combinations of strains and conditions are assayed on all key RNA expression assays. Condition clustering of results by test sample regardless of assay may be one good indicator of comparability of results and of the correctness of calibrations. Protocols for sample preparation and labeling may also need to be considered, as these, too, may influence comparability of results. For instance, among Affymetrix-based experiments, Cho generated labeled double-stranded cDNAs, whereas Hol, Rot, and Coh generated labeled single-stranded cRNAs. With the cDNA protocol, cDNAs from nearby adjacent or overlapping ORFs, especially convergent ORFs with 3′-end overlaps, could hybridize to both ORF probe sets, causing signal from one ORF to be reflected in both, whereas this would not arise in the cRNA case. This could be sufficient for data gathered from the same sample RNAs using the two different protocols to segregate into different clusters.
Establish Standards for Reporting Data that Cover All RNA Expression Assays
On the basis of issues that arose in generating the ERA file, we propose that researchers publish versions of data files with the following characteristics:
1. Data are reported at the FDR level. Here this proved non-trivial for SAGE. Our ExpressDB representation of SAGE data in terms of minimum and maximum SAGE tag counts provides one example of how FDR level reporting may be accomplished for methodologies where the entities for which experimental data are collected may not be uniquely assignable to the functional RNA units chosen for data reporting (see Methods). Corresponding uncertainties mentioned above about cross-hybridization of paralogs for Affymetrix and microarray experiments affect the accuracy of FDR-assigned values and may ultimately be addressed by calibrating hybridization with known family members.
2. Expression data are reported by publicly recognized stable identifiers for FDRs rather than names (in this case SGDIDs vs. ORF names). This would avoid the need for name resolution when combining data from different sources.
3. Data values considered to be in error (frequently reported in microarray data) are excluded, and multiple nonerroneous values are consolidated into a best estimate. This makes data sets easier to compare and reduces the load on the database itself. Maintenance of multiple data values for an FDR in a condition incurs substantial database overhead and can lead to unexpected results in processing queries (see the hotlink for Multiple ORF Rows in the help document available from the EXD web query application). Ideally, experiments should be replicated and both central tendencies and variances of best estimates should be reported. This could have been done here for the Hol group of experiments.
4. Clear documentation is provided on all reported measures, describing their data sources, the formulas used to compute them, and their proper usage.
5. Clear documentation is provided on strains and environmental conditions for all data.
Some of these suggestions reaffirm the straw man standards of Bassett et al. (1999) and, we hope, provide some concrete directions for following them. We emphasize that we do not suggest that this should be the only version of data that is published, but that such a version be prepared to support easy comparison with other data sets. Availability of less processed forms will allow other researchers to explore error thresholds, characteristics of different probes for an FDR, etc.
As we noted previously, improvements in data comparability through establishment of standards for expression data collection, preparation, and reporting will make databases more useful. We emphasize that this pertains not just to ExpressDB but to any RNA expression database, as the fundamental issue concerns limitations on the ability to compare data meaningfully, not the computer structures by which the data may be stored and managed. Such improvements will also help streamline and focus databases. Taking ExpressDB as a case in point, in the absence of such standards, ExpressDB has both too much and too little data. On the too much side, a large number of the 2503 Measure records defined in ExpressDB and several million associated Expression Data Point (see Fig. 1) records are of little general scientific interest. They cannot be ignored because they are sometimes found to be essential to interpreting the data (e.g., microarray spot quality indicators); beyond that, a general lack of clarity about data fields and their potential use, plus often sketchy documentation that makes it hard to distinguish potentially important from likely unimportant fields, offers no practical alternative to loading all reported data. On the too little side, information that is critical to interpreting expression profiles, especially strain and condition descriptions, is maintained in ExpressDB only in unformatted text. As functional genomics develops, a database will be required that maintains strain and condition information in a structured form that can be queried precisely for such characteristics as the presence of particular alleles in the strain, certain compounds in the medium, or treatments of the cell culture (e.g., heat shock). Moreover, ExpressDB’s indexing of data at the RNA level, currently being generalized from ORFs to FDRs, will require further generalization. Not only are protein levels now being gathered on a high throughput basis (Link et al. 1997; Futcher et al. 1999; Gygi et al. 1999a,b; Page et al. 1999), but functional data on genomic features that do not generate RNA may need to be gathered, such as binding affinities of proteins to regulatory DNA (Wang and Church 1992; Tavazoie and Church 1998). Phenotype information such as growth rates of mutant strains will also be of interest. To record such data will require indexing by proteins (including modified proteins), regulatory DNA, and strains in addition to FDRs. We have developed a logical model and some prototype components of a Biomolecule Interaction Growth and Expression Database (BIGED) that will enable many of these data to be integrated (J. Aach and W. Rindone, unpubl.). The logical model may be downloaded from our web site. ExpressDB and BIGED represent different tradeoffs between flexibility and biological meaning. Whereas ExpressDB flexibly allows any Measure cited in a load file to be recorded on the database regardless of its relation to and comparability with any other Measure, at the cost of their unstructured proliferation, BIGED defines structures for particular measures and biological features and relationships between them that correspond to their biological meanings, but requires more standardization of data contents and formats. We believe that both kinds of databases will be required as functional genomics advances.
METHODS
Database Model and Definitions
The ExpressDB database model and definitions were generated using PowerDesigner 6.1 (PowerSoft, Concord, MA). The full model is available at our web site http://arep.med.harvard.edu/ExpressDB/.
Database Loading
EDBUpdate was written in Perl and accesses the database using the sybperl interface. (See http://www.mbay.net/~mpeppler/ for information on sybperl.) We edited load files where necessary using text editors and Perl scripts to put them into the required tab-delimited format with ORF names in a dedicated column. ORF names were converted to upper case. In some cases, we eliminated records from load files that could not be identified as representing either ORFs or controls. To make the data easier to query, we converted empty column positions in ORF rows (null values) to non-null default values where meaningful defaults could be clearly identified; otherwise, null values were loaded as null database fields. To load SAGE data from Vel, we located SAGE tag sequences in yeast genome sequence downloaded from SGD and assigned them to ORFs on the basis of SGD ORF tables (both sequence and tables downloaded February 15, 1999) following rules for ORF and strand matching from Velculescu et al. (1997). Because SAGE tags could not always be assigned uniquely to a single ORF, we computed and loaded minimum and maximum counts for each ORF as described in the text.
EXD Query System
EXD is a collection of Common Gateway Interface (CGI) (Gundavaram 1996) modules written in Perl with the sybperl interface. Some EXD CGI programs generate JavaScript 1.2 code.
Generation of ERA File
Here we report methods used for generation of Affymetrix and SAGE ERAs; we discuss generation of microarray ERAs on our web site. We extracted average PM–MM values (Affymetrix) and ORF tag counts (SAGE) from ExpressDB for all Affymetrix and SAGE data sets listed in Table 1, along with any relevant qualifiers (e.g., Affymetrix feature set identifiers). We used a specially written batch extract routine for all Affymetrix database extracts; for SAGE data, we used EXD to extract only those ORFs for which counts were entirely unambiguous (minimum count = maximum count) for all conditions and only considered counts of 1 or more. We applied a standard sequence of processing steps to each individual load file, with variations as appropriate, to handle name and SGDID resolution, multiple ORF value consolidation, and Affymetrix threshold processing.
We standardized ORF names and assigned SGDIDs with a program that matched load file ORF names against an extract of the Name table of a prototype version of BIGED (J. Aach and W. Rindone, unpubl.; see web site), which had been loaded with all primary and alternate ORF names and all associations between ORF names and SGDIDs published on the SGD database since August, 1998. Name resolution also detected and matched hyphenation variants for some ORF names. The name resolution program added additional columns to the output name-resolved files that preserve an audit trail of the different names consolidated to their target standardized names.
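A minimal sketch of this kind of name resolution follows. The lookup tables, the way hyphenation variants are matched (by comparing names with hyphens removed), the output field names, and the placeholder identifiers are all assumptions made for illustration; the actual program and the BIGED Name table are not described in enough detail to reproduce.

```python
def normalize(name):
    """Upper-case a name and drop hyphens so hyphenation variants compare equal."""
    return name.upper().replace("-", "")

def resolve_names(load_names, alias_to_standard, standard_to_id):
    """Map load-file names to standardized names and stable identifiers.

    Each output row keeps the original name, giving a simple audit trail.
    Names that cannot be resolved get None for both fields.
    """
    lookup = {normalize(alias): std for alias, std in alias_to_standard.items()}
    rows = []
    for raw in load_names:
        std = lookup.get(normalize(raw))
        rows.append({
            "original": raw,
            "standard": std,
            "stable_id": standard_to_id.get(std),
        })
    return rows
```

Called with hypothetical tables such as `{"NAME-A": "NAME-A", "ALT1": "NAME-A"}` and `{"NAME-A": "ID_PLACEHOLDER_1"}`, the inputs `"namea"` and `"alt1"` both resolve to `NAME-A`, while an unknown name is carried through with empty fields.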
For Affymetrix-derived files, the aggregate measure of the ratio of exon 1-based probe set values to exon 2-based probe set values mentioned in the text was the outlier-excluded average, over all ORFs with both exon 1 and exon 2 probe sets, of the ratio, for each such ORF, of the average over all conditions (n = 42 for Hol, n = 17 for Cho) of the average PM–MM values from an ORF’s exon 1 probe set to the corresponding average for the ORF’s exon 2 probe set. We consolidated Affymetrix-based multiple probe set values for an ORF by examining Affymetrix probe set names and averaging the values of whichever group of an ORF’s probe sets came first in the following sequence: (1) the probe set name is simply a gene name unqualified by exon or special feature set indicator (_i, _r, _f), (2) the probe set name is a gene name with an exon 2 designation with no special indicators, (3) the probe set name is a gene name with an exon 1 designation and no special indicators, (4) any other probe sets. Affymetrix probe set indicators such as _i, _r, and _f indicate that a probe set departs from desirable target rules for oligonucleotide probe sequence or probe set selection (Affymetrix Technical Help Desk, pers. comm.). The Rot Affymetrix-derived load file had already consolidated multiple ORF values and was exempted from this consolidation step.
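The four-step priority rule for consolidating probe sets can be sketched as follows. To avoid guessing at the Affymetrix naming syntax, probe sets are represented here as already-parsed records; the `ProbeSet` type and its field names are hypothetical.

```python
from statistics import mean
from typing import NamedTuple, Optional

class ProbeSet(NamedTuple):
    exon: Optional[int]   # None if the name carries no exon designation
    special: bool         # True if the name carries an indicator such as _i, _r, _f
    value: float          # average PM-MM value for one condition

def priority(ps):
    """Rank a probe set by the consolidation sequence (1 = preferred)."""
    if not ps.special and ps.exon is None:
        return 1
    if not ps.special and ps.exon == 2:
        return 2
    if not ps.special and ps.exon == 1:
        return 3
    return 4

def consolidate(probe_sets):
    """Average the values of the highest-priority group present for an ORF."""
    best = min(priority(ps) for ps in probe_sets)
    return mean(ps.value for ps in probe_sets if priority(ps) == best)
```

For an ORF with one exon 1 probe set (value 60.0) and one exon 2 probe set (value 100.0), `consolidate` returns 100.0, matching the stated preference for exon 2 values.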
Following multiple ORF value consolidation, Affymetrix-derived files with the exception of Rot were threshold adjusted to remove negative average PM–MM values. By and large, these correspond to ORFs considered absent by Affymetrix software (Lockhart et al. 1996); however, absence indicators were not available for all load files and we ignored them generally in favor of threshold adjustment. We performed threshold processing by computing, for each condition, the fifth percentile (P5) of all consolidated ORF average PM–MM values, and replacing any consolidated ORF average PM–MM values for the condition <P5 with the P5 value; however, except for Cho many of the P5 values were still negative and we used the fifth percentile of all positive values for them instead.
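The threshold adjustment can be sketched as below. The text does not say which percentile definition was used, so the nearest-rank method here is an assumption, as are the function names.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (assumed definition; the original method is not stated)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def threshold_adjust(values, pct=5):
    """Raise every value below the floor up to the floor.

    The floor is the pct-th percentile of all values, or of the positive
    values only when the first choice is negative.
    """
    floor = percentile_nearest_rank(values, pct)
    if floor < 0:
        positives = [v for v in values if v > 0]
        floor = percentile_nearest_rank(positives, pct)
    return [max(v, floor) for v in values]
```

With the illustrative input `[-50, -10, 5, 20, 40, 80, 100, 200, 300, 400]`, the fifth percentile of all values is negative, so the floor becomes the fifth percentile of the positive values (5), and the two negative entries are replaced by 5.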
We collated all individual normalized load files into a single file using the standardized ORF names, combining all name audit trail information from each individual file. This file contained cleansed intensity and SAGE count values for each ORF for each experiment, but intensities were still on different scales for each condition. Finally, we produced a consolidated ERA file by dividing each non-SAGE-derived ORF intensity value by the total intensity value computed for its column; for SAGE-derived values, we computed ERAs by dividing each ORF count by the total number of non-rejected SAGE tags counted for the condition (Velculescu et al. 1997).
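The final normalization step amounts to dividing each value in a condition by a per-condition total. A minimal sketch, with hypothetical function names and with `None` standing in for null fields:

```python
def era_from_intensities(column):
    """column: {orf: intensity or None} for one condition.

    Each non-null intensity is divided by the column total; nulls stay null.
    """
    total = sum(v for v in column.values() if v is not None)
    return {orf: (None if v is None else v / total) for orf, v in column.items()}

def era_from_sage(counts, total_tags):
    """counts: {orf: tag count}; total_tags: all non-rejected tags in the condition."""
    return {orf: c / total_tags for orf, c in counts.items()}
```

Note the difference in denominators: for intensities the total is computed over the ORFs in the column, whereas for SAGE it is the total number of non-rejected tags counted for the condition, which can include tags not attributed to any retained ORF.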
Condition Cluster Analysis
We transformed ERA data in preparation for clustering using a variant of procedures in Tavazoie and Church (1998). For the entire collection of 217 conditions for which ERAs were computed (including microarray ERAs), we converted non-null ERAs for each ORF in each condition to log10 values, first adding the small value 0.000005 (equal to half the minimum precision non-zero value in the file) to all non-null values to eliminate zero values. We used logarithms to assure that induction and repression were on the same scale in so far as these are assessable through fold changes across conditions. If an ORF with expression level $l$ in condition 1 is induced at fold change $f > 1$ in condition 2 and repressed at the same fold change level in condition 3, then the difference between levels in 2 and 1 versus 1 and 3 is $l(f - 1)$ versus $l(1 - 1/f)$, and whereas the former term grows linearly with $f$ and is unbounded, the latter term is hyperbolic and bounded within the interval $(0, l)$. Because both correlation coefficients and direct clustering are based on difference terms, direct use of estimated relative abundance values risks underrepresenting the effects of repression (see also Eisen et al. 1998).
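The argument for taking logarithms can be checked numerically. The sketch below applies the stated offset of 0.000005 before the log10 transform and compares difference terms for induction and repression at the same fold change; the expression level and fold change used are arbitrary illustrative values.

```python
import math

OFFSET = 0.000005  # small value added to non-null ERAs to eliminate zeros

def log10_era(era):
    """log10 of an ERA after adding the offset; null values stay null."""
    return None if era is None else math.log10(era + OFFSET)

level, fold = 0.001, 4.0          # illustrative values only
induced = level * fold            # condition 2
repressed = level / fold          # condition 3

# On the linear scale the two changes look very different in size.
linear_up = induced - level       # l(f - 1)
linear_down = level - repressed   # l(1 - 1/f)

# On the log scale both are log10(f), so they are treated symmetrically.
log_up = math.log10(induced) - math.log10(level)
log_down = math.log10(level) - math.log10(repressed)
```

Here `linear_up` is four times `linear_down`, while `log_up` and `log_down` agree, which is the symmetry the transformation is meant to provide.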
We then converted non-null log10 relative abundances for each ORF to standard units across all conditions to generate standard unit log10 relative abundances (SULRA). We performed the clustering of Figure 3 on Pearson correlation coefficients over ORF SULRAs between all pairs of our 217 conditions. However, many ORFs have relative abundances at or below the level of measurement noise, and variation of their SULRAs over conditions can be expected to reflect noise as much as change in expression level. Also, some ORFs with higher relative abundance varied so little across conditions that difference terms between condition levels could also reflect noise. We therefore considered subsets of ORFs exhibiting high relative abundance levels and evidence of significant induction or repression as defined by two criteria: (1) the median ERA of the ORF over all conditions ≥ a percentile threshold p of the median ERAs of all ORFs, and (2) at least 10% of ratios of ERAs for the ORF over all pairs of conditions ≥ a threshold r, in which this latter was evaluated by ensuring that the ratio of the kth largest and kth smallest relative abundance level for the ORF ≥ r, where $k = \lceil \sqrt{n(n-1)/20} \rceil$ and n = number of conditions with non-null relative abundance values. We looked for subsets with ∼1000 ORFs. We used a subset of 1078 ORFs selected with p = 60 and r = 3 for Figure 3. Correlation coefficients were clustered using trace clustering (Ward’s algorithm) in SPLUS 4.5 (MathSoft, Seattle, WA). We color coded the diagram using Excel 97 (Microsoft, Redmond, WA).
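The standard-unit conversion, the fold-change criterion, and the Pearson correlation between conditions can be sketched in plain Python. Several details are assumptions: the population standard deviation is used for standard units, the correlation is taken over ORFs that are non-null in both conditions, and the function names are hypothetical. The clustering step itself is not shown.

```python
import math
from statistics import mean, pstdev

def standard_units(values):
    """Convert one ORF's non-null values to standard units (z-scores) across conditions."""
    present = [v for v in values if v is not None]
    m, s = mean(present), pstdev(present)
    return [None if v is None else (v - m) / s for v in values]

def passes_fold_criterion(eras, r):
    """Criterion (2): kth largest / kth smallest >= r, with k = ceil(sqrt(n(n-1)/20)).

    If this holds, at least k*k >= n(n-1)/20 pairs (10% of all pairs) have ratio >= r.
    """
    present = sorted(v for v in eras if v is not None)
    n = len(present)
    k = math.ceil(math.sqrt(n * (n - 1) / 20))
    if k < 1:
        return False
    # Compare by multiplication to avoid dividing by a zero abundance.
    return present[-k] >= r * present[k - 1]

def pearson(x, y):
    """Pearson correlation over positions where both profiles are non-null."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    mx = mean(a for a, _ in pairs)
    my = mean(b for _, b in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)
```

For an ORF with 10 non-null conditions, k evaluates to 3, so the test compares the third largest abundance with the third smallest.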
ACKNOWLEDGMENTS
We thank Pat Brown, Paul Spellman, Vishwanath Iyer, Joseph DeRisi, Michael Eisen, and Rick Young for their assistance in helping us understand data and procedures from experiments included in this analysis. Within the Church Laboratory, we thank Barak Cohen for use of unpublished data, and Barak Cohen, Rob Mitra, Saeed Tavazoie, Martin Steffen, and others, for helpful discussions on data cleansing, analysis, clustering, and for critical comments on this manuscript. We also thank four anonymous reviewers for very helpful critical comments. Finally, we thank the Lipper Foundation, Hoechst Marion Roussel, DOE grant DE-FG02-87ER60565, and Howard Hughes Medical Institute for their funding of this work.
NOTE ADDED IN PROOF
The update of the design of ExpressDB to be organism-independent, mentioned above, is now complete. Details on the new design are on our web site.
REFERENCESAlon, U., N. Barkai, D.A. Notterman, K. Gish, S. Ybarra, D. Mack,
and A.J. Levine. 1999. Broad patterns of gene expression revealedby clustering analysis of tumor and normal colon tissues probedby oligonucleotide arrays. Proc. Natl. Acad. Sci. 96: 6745–6750.
Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman.1990. Basic local alignment search tool. J. Mol. Biol.215: 403–410.
Bairoch, A. and R. Apweiler. 1999. The SWISS-PROT proteinsequence data bank and its supplement TrEMBL in 1999. NucleicAcids Res. 27: 49–54.
Bassett, D.E., Jr., M.B. Eisen, and M.S. Boguski. 1999. Geneexpression informatics–it’s all in your mine. Nat. Genet.21: 51–55.
Benson, D.A., M.S. Boguski, D.J. Lipman, J. Ostell, B.F. Ouellette,B.A. Rapp, and D.L. Wheeler. 1999. GenBank. Nucleic Acids Res.27: 12–17.
Bernstein, F.C., T.F. Koetzle, G.J. Williams, E.F. Meyer, Jr., M.D.Brice, J.R. Rodgers, O. Kennard, T. Shimanouchi, and M. Tasumi.1977. The Protein Data Bank. A computer-based archival file formacromolecular structures. Eur. J. Biochem. 80: 319–324.
Cherry, J.M., C. Ball, S. Chervitz, K. Dolinski, S. Dwight, M. Harris,E. Hester, G. Juvik, A. Malekian, T. Roe, S. Weng, and D.Botstein. 1999. Saccharomyces Genome Database.http://genome-www.stanford.edu/Saccharomyces/.
Cho, R.J., M.J. Campbell, E.A. Winzeler, L. Steinmetz, A. Conway, L.Wodicka, T.G. Wolfsberg, A.E. Gabrielian, D. Landsman, D.J.Lockhart, and R.W. Davis. 1998. A genome-wide transcriptionalanalysis of the mitotic cell cycle. Mol. Cell 2: 65–73.
Chu, S., J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P.O. Brown,and I. Herskowitz. 1998. The transcriptional program ofsporulation in budding yeast. Science 282: 699–705.
DeRisi, J.L., V.R. Iyer, and P.O. Brown. 1997. Exploring themetabolic and genetic control of gene expression on a genomicscale. Science 278: 680–686.
Eisen, M.B., P.T. Spellman, P.O. Brown, and D. Botstein. 1998.Cluster analysis and display of genome-wide expression patterns.Proc. Natl. Acad. Sci. 95: 14863–14868.
Ermolaeva, O., M. Rastogi, K.D. Pruitt, G.D. Schuler, M.L. Bittner, Y.Chen, R. Simon, P. Meltzer, J.M. Trent, and M.S. Boguski. 1998.Data management and analysis for gene expression arrays. Nat.Genet. 20: 19–23.
European Bioinformatics Institute. 1999. ArrayExpress.http://www.ebi.ac.uk/arrayexpress/.
Futcher, B., G.I. Latter, P. Monardo, C.S. McLaughlin, and J.I.Garrels. 1999. A sampling of the yeast proteome. Mol. Cell. Biol.19: 7357–7368.
Gundavaram, S. 1996. CGI programming on the World Wide Web.O’Reilly & Associates, Inc., Cambridge, MA.
Gygi, S.P., B. Rist, S.A. Gerber, F. Turecek, M.H. Gelb, and R.Aebersold. 1999a. Quantitative analysis of complex proteinmixtures using isotope-coded affinity tags. Nat. Biotechnol.17: 994–999.
Gygi, S.P., Y. Rochon, B.R. Franza, and R. Aebersold. 1999b.Correlation between protein and mRNA abundance in yeast. Mol.Cell. Biol. 19: 1720–1730.
Hieter, P. and M. Boguski. 1997. Functional genomics: It’s all how you read it. Science 278: 601–602.
Holstege, F.C., E.G. Jennings, J.J. Wyrick, T.I. Lee, C.J. Hengartner, M.R. Green, T.R. Golub, E.S. Lander, and R.A. Young. 1998. Dissecting the regulatory circuitry of a eukaryotic genome. Cell 95: 717–728.
Link, A.J., K. Robison, and G.M. Church. 1997. Comparing the predicted and observed properties of proteins encoded in the genome of Escherichia coli K-12. Electrophoresis 18: 1259–1313.
Lockhart, D.J., H. Dong, M.C. Byrne, M.T. Follettie, M.V. Gallo, M.S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Horton, and E.L. Brown. 1996. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 14: 1675–1680.
Marton, M.J., J.L. DeRisi, H.A. Bennett, V.R. Iyer, M.R. Meyer, C.J. Roberts, R. Stoughton, J. Burchard, D. Slade, H. Dai et al. 1998. Drug target validation and identification of secondary drug target effects using DNA microarrays. Nat. Med. 4: 1293–1301.
Mewes, H., K. Heumann, A. Kaps, K. Mayer, F. Pfeiffer, S. Stocker, and D. Frishman. 1999. MIPS: A database for genomes and protein sequences. Nucleic Acids Res. 27: 44–48.
Myers, L.C., C.M. Gustafsson, K.C. Hayashibara, P.O. Brown, and R.D. Kornberg. 1999. Mediator protein mutations that selectively abolish activated transcription. Proc. Natl. Acad. Sci. 96: 67–72.
National Center for Biotechnology Information. 1999. Gene expression omnibus. http://www.ncbi.nlm.nih.gov/geo/.
Page, M.J., B. Amess, R.R. Townsend, R. Parekh, A. Herath, L. Brusten, M.J. Zvelebil, R.C. Stein, M.D. Waterfield, S.C. Davies, and M.J. O’Hare. 1999. Proteomic definition of normal human luminal and myoepithelial breast cells purified from reduction mammoplasties. Proc. Natl. Acad. Sci. 96: 12589–12594.
Perou, C.M., S.S. Jeffrey, M. van de Rijn, C.A. Rees, M.B. Eisen, D.T. Ross, A. Pergamenschikov, C.F. Williams, S.X. Zhu, J.C. Lee et al. 1999. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. 96: 9212–9217.
Roth, F.P., J.D. Hughes, P.W. Estep, and G.M. Church. 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 16: 939–945.
Spellman, P.T., G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen, P.O. Brown, D. Botstein, and B. Futcher. 1998. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9: 3273–3297.
Tavazoie, S. and G. Church. 1998. Quantitative whole-genome analysis of DNA-protein interactions by in vivo methylase protection in E. coli. Nat. Biotechnol. 16: 566–571.
Tavazoie, S., J.D. Hughes, M.J. Campbell, R.J. Cho, and G.M. Church. 1999. Systematic determination of genetic network architecture. Nat. Genet. 22: 281–285.
Teorey, T.J. 1994. Database modeling and design: The fundamental principles. Morgan Kaufmann Publishers, Inc., San Francisco, CA.
Velculescu, V.E., L. Zhang, B. Vogelstein, and K.W. Kinzler. 1995. Serial analysis of gene expression. Science 270: 484–487.
Velculescu, V.E., L. Zhang, W. Zhou, J. Vogelstein, M.A. Basrai, D.E. Bassett, Jr., P. Hieter, B. Vogelstein, and K.W. Kinzler. 1997. Characterization of the yeast transcriptome. Cell 88: 243–251.
Wang, M.X. and G.M. Church. 1992. A whole genome approach to in vivo DNA-protein interactions in E. coli. Nature 360: 606–610.
Weinstein, J.N., T.G. Myers, P.M. O’Connor, S.H. Friend, A.J. Fornace, Jr., K.W. Kohn, T. Fojo, S.E. Bates, L.V. Rubinstein, N.L. Anderson et al. 1997. An information-intensive approach to the molecular pharmacology of cancer. Science 275: 343–349.
Wen, X., S. Fuhrman, G.S. Michaels, D.B. Carr, S. Smith, J.L. Barker, and R. Somogyi. 1998. Large-scale temporal gene expression mapping of central nervous system development. Proc. Natl. Acad. Sci. 95: 334–339.
Wodicka, L., H. Dong, M. Mittmann, M.H. Ho, and D.J. Lockhart. 1997. Genome-wide expression monitoring in Saccharomyces cerevisiae. Nat. Biotechnol. 15: 1359–1367.
Received July 21, 1999; accepted in revised form February 16, 2000.