Systematic Evaluation of Normalization Methods for Gene Expression Data
Technical Report
Department of Computer Science
and Engineering
University of Minnesota
4-192 EECS Building
200 Union Street SE
Minneapolis, MN 55455-0159 USA
TR 07-015
Gaurav Pandey, Lakshmi Naarayanan Ramakrishnan, Michael
Steinbach, and Vipin Kumar
June 08, 2007
Department of Computer Science and Engineering
University of Minnesota, Twin Cities, USA
Contact: [email protected]
Abstract
Motivation: Even after an experimentally prepared gene expression data set has been pre-processed
“to adjust for effects which arise from variation in the microarray technology”, there may be
inconsistencies between the scales of measurements in different conditions. A variety of normalization
and transformation methods have been used for addressing these scale differences in different studies
on the analysis of gene expression data sets. However, a quantitative estimation of their relative
performance has been lacking in this domain.
Results: In this paper, we report an extensive evaluation of normalization and transformation methods
for their effectiveness, with respect to the important application of protein function prediction.
We consider several such commonly used methods for gene expression data, such as z-score and
quantile normalization, and diff transformation, and two novel normalization methods, sigmoid and
double sigmoid, that have not been used previously in this domain. Using a variety of evaluation
methodologies based on the prediction of protein function from gene expression data, we conduct an
extensive evaluation of these methods, and show that their performance can vary significantly across
different data sets. We also provide evidence that the two types of gene expression data, namely
temporal and non-temporal, need different types of analyses in order to use them effectively for
uncovering functional information.
Availability: All source code is available on request from the authors.
1 Introduction
Gene expression experiments are a method to quantitatively measure the transcription phase of protein
synthesis from several genes under a given condition [NAWC02]. Because these experiments provide
insight into the simultaneous activity of thousands of genes, the data they produce, also known as
microarray data, are used for a variety of biological studies [Slo02]. For instance, the similarity
of the expression values of a pair of genes over a set of related conditions, referred to as
their expression profiles, indicates that they are expressed or repressed under approximately the
same set of conditions, and thus are expected to be functionally related.
Figure 1: Visualization of Eisen et al [ESBB98]’s data set (x-axis: conditions; y-axis: genes; color
scale from −5 to 5), showing a set of conditions on a different scale
A necessary step for the effective analysis of gene expression data is its pre-processing [Qua02], where
the data is processed to reduce the ill-effects of various data corruption phenomena, such as noise and
missing values, so as “to adjust for effects which arise from variation in the microarray technology” [SS03]
used to generate the gene expression data sets. Indeed, several such pre-processing methods have been
devised for gene expression data that focus on removing the biases and variances coming
from different components of the microarray experimental procedure, such as LOWESS [YDL+02] and
SNOMAD [CHZP02]. Each of these methods is appropriate for different scenarios, and some studies
have evaluated them under several such scenarios [PYK+03, WXM+05].
Even after an experimentally prepared gene expression data set has been pre-processed using the
above methods, there may be inconsistencies between the scales of measurements in different conditions.
For example, consider Eisen et al [ESBB98]’s data set, which is composed of eight different temporal
expression measurements, and is one of the most widely used expression data sets for S. cerevisiae
(budding yeast). As can be seen in Figure 1, conditions 48-58 in this data set have an entirely
different scale of values compared to the rest. This inconsistency in scale is expected to affect the
analysis of this data, e.g., the correlation between two gene expression profiles, and thus needs to be
addressed via further processing. A variety of normalization and transformation methods have been used
for this task in different studies on the analysis of gene expression data sets [TPS+99, CCV+03, BIB03,
Bol01, LHM+03, BHWK04]. However, a quantitative estimation of their relative performance has been
lacking in this domain. In this paper, we evaluate several such methods for their effectiveness with
respect to the important application of protein function prediction, for which gene expression data
has been widely used [ESBB98, BDSY99, BGL+00, MDJ+02, PKS06].
1.1 Contributions
This paper makes the following contributions:
1. We consider several commonly used normalization and transformation methods for gene expression
data, such as z-score and quantile normalization, and diff transformation, and two novel
normalization methods, sigmoid and double sigmoid, that have not been used in this domain.
2. Using a variety of evaluation methodologies based on the prediction of protein function from gene
expression data, we conduct an extensive evaluation of these methods, and show that their
performance can vary significantly across different data sets.
3. We provide evidence that the two types of gene expression data, namely temporal and non-temporal,
need different types of analyses in order to use them effectively for uncovering functional information.
2 Materials and Methods
This section details the gene expression data sets used in this study, the normalization and transformation
methods applied to them, and the methodologies used to evaluate these methods.
2.1 Data Sets
Gene expression data can be represented as a data matrix where the rows represent genes and the columns
represent either individual conditions or the measurement under a particular condition at different times.
For the effective analysis and normalization of gene expression data sets, it is essential to distinguish
between these two types of data sets—temporal and non-temporal—which we now define more fully.
1. Temporal: The experiments in these data sets measure the expression behavior of genes that have
been exposed to a certain condition at different instances of time. Thus, there is a well defined
relationship between consecutive columns in these data sets.
2. Non-temporal: These expression data sets are prepared by combining results from experiments
that do not have a temporal relationship with each other. Although they may be related because
they provide a comprehensive view of the phenomenon they are designed to study, they can also be
analyzed independently.
Even though some studies have lately started addressing the temporal aspect of gene expression [BJ04,
HKSL01], most microarray data analysis studies ignore this distinction, and apply the same normalization
methods to all data sets. Since there is a fundamental difference between the nature of these data, we
believe that different analysis techniques need to be applied to them, and thus distinguish between them
in our study. We also provide evidence supporting this observation in Section 4.
In accordance with the above distinction, we selected several data sets of the two types for S. cerevisiae
(budding yeast), which are summarized in Tables 1 and 2. We chose this organism since substantial
information is available about the functions of its genes. Most of these data sets were obtained from
the webpage of Yona et al [YDRL06]’s study (http://biozon.org/tools/expression/instructions.html)
and SMD [S+01]. The KNNImpute program [TCS+01] was used to impute missing values. Also, it should be
noted that all the temporal data sets that we used consisted of several expression time series
experiments, as detailed by the last column of Table 2. Section 2.3 discusses how each of these time
series is processed, and the results combined to generate a processed version of the complete temporal
data set.

Reference                 #Genes   #Conditions
Gerber et al [GHB04]      6303     10
Hughes et al [H+00]       6316     300
Iyer et al [IHK+99]       6251     12
Saldanha et al [SBB04]    6314     24
Table 1: Summary of non-temporal gene expression data sets used in this study

Reference                 #Genes   #Time Points   #Time Series
Zhu et al [ZSV+00]        5714     26             2
Shapira et al [SSB04]     4771     70             4
Table 2: Summary of temporal gene expression data sets used in this study
2.2 Normalization Methods for Gene Expression Data
In what follows, we assume that all the data sets have been pre-processed using standard methods
to handle experimental errors, and will focus on normalization approaches that handle the problem of
differences in scale. This problem may manifest itself in a gene expression data set in forms like the
following:
• In the case of data sets constructed by concatenating several experiments, some components may
be at a different scale than the others. For instance, as mentioned earlier, conditions 48-58 of Eisen
et al [ESBB98]’s data set, shown in Figure 1 have a different scale than the others.
• There might be extreme or outlying values in the data set, which may affect the data analysis
adversely.
• Some genes may express more easily than others, thus creating a bias in the expression values. For
instance, Myers et al [MBH+06] showed that a significant fraction of the set of most co-expressed
genes in their data set belonged to the ribosome pathway, thus suggesting that genes belonging to
this pathway may express more easily than others.
Following is a brief description of normalization methods that attempt to reduce the effect of these scale
issues on the analysis of gene expression data. All these methods are applied to the columns of the gene
expression matrix, except for Quantile normalization, which does this implicitly.
2.2.1 Unitnorm Normalization
In linear algebra, a common way of bringing a set of vectors to the same scale is to transform them to
unit vectors by dividing each component of a vector X by its length as indicated in Equation 1.
$$\mathrm{Unitnorm}(X) = \frac{X}{\|X\|_2} \tag{1}$$
This normalization has been used in various domains, such as text mining, where this method is used
to reduce the effect of varying document lengths [SBM96].
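As an illustration of Equation 1 applied to the columns of an expression matrix, the following is a minimal Python sketch. The function name unitnorm_columns and the handling of all-zero columns are assumptions made for this example and are not taken from the report.

```python
import numpy as np

def unitnorm_columns(M):
    """Scale every column of M to unit Euclidean (L2) length."""
    M = np.asarray(M, dtype=float)
    norms = np.linalg.norm(M, axis=0)   # one L2 norm per column
    norms[norms == 0] = 1.0             # assumption: leave all-zero columns unchanged
    return M / norms

# Tiny example: rows are genes, columns are conditions.
example = unitnorm_columns([[3.0, 0.0], [4.0, 2.0]])
```

After this step every non-zero column has length 1, so columns measured on different scales become directly comparable.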
2.2.2 Z-score-based Normalization
In statistics, a common method of making different data vectors comparable is by shifting the values
in a vector by the mean of their values, and scaling them by their standard deviation. This converts
each value into a deviation from the mean. If the mean of a data vector is 0, then this normalization
is the same as the previous normalization, except for a constant. The mathematical formula for this
normalization, which we refer to as Znorm, is shown in Equation 2.
$$\mathrm{Znorm}(X) = \frac{X - \bar{X}}{\sigma_X} \tag{2}$$
This approach, which is known as standardization in statistics, also has many biological applications,
e.g., the evaluation of protein structure alignment scores [HP00]. Indeed, z-scores are also commonly
used for the normalization of gene expression data [TPS+99, CCV+03]. Also, some other normalizations,
such as those used by Bergmann et al [BIB03], are mathematically equivalent to Znorm.
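Equation 2 can be sketched for the columns of a matrix as follows. This is only an illustrative Python sketch: the name znorm_columns is hypothetical, and the use of the population standard deviation (ddof = 0) is an assumption, since the report does not state which estimator was used.

```python
import numpy as np

def znorm_columns(M):
    """Shift each column by its mean and scale by its standard deviation."""
    M = np.asarray(M, dtype=float)
    mean = M.mean(axis=0)
    std = M.std(axis=0)   # assumption: population standard deviation (ddof=0)
    return (M - mean) / std   # note: a constant column would divide by zero

# A zero-mean vector: Znorm and Unitnorm differ only by a constant factor.
x = np.array([[-1.0], [0.0], [1.0]])
z = znorm_columns(x)
u = x / np.linalg.norm(x, axis=0)
```

Under the ddof = 0 assumption, the constant relating the two normalizations for a zero-mean vector of length n is the square root of n, which is the sense in which they are "the same except for a constant".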
2.2.3 Quantile Normalization
This is another popular normalization method for gene expression data [Bol01, Bol04]. It attempts to
transform data from two different distributions to a common distribution by making the quantiles of the
distributions equal. This has the potential for adjusting for differences beyond those of location and scale.
Bolstad [Bol01] has suggested the following algorithm for using this method for normalizing a given set
of vectors, assuming that there are no missing values.
Quantile Normalization Algorithm
1. Given n vectors of length p, form $X$ of dimension p × n, where each vector is a column in $X$.
2. Sort each column of $X$ to give $X_{sort}$.
3. Take the means across rows of $X_{sort}$ and assign this mean to each element in the row to get a
quantile equalized $X'_{sort}$.
4. Get $X_{normalized}$ by rearranging each column of $X'_{sort}$ to have the same ordering as the original $X$.
Various other types of quantile-based data normalization have been proposed [WJJ+02]. For ease
of implementation, we used the quantilenorm function in the MATLAB Bioinformatics Toolbox
(http://www.mathworks.com/products/bioinfo/), which implements Bolstad’s algorithm detailed above.
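The four steps of the quantile normalization algorithm can be sketched in Python as below. This is not the MATLAB quantilenorm function used in the study; it is a minimal NumPy sketch that assumes no missing values and breaks ties arbitrarily, and the function name quantile_normalize is hypothetical.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns of a p x n matrix (steps 1-4 above)."""
    X = np.asarray(X, dtype=float)                     # step 1: vectors as columns
    order = np.argsort(X, axis=0)                      # sorting permutation per column
    X_sort = np.take_along_axis(X, order, axis=0)      # step 2: sorted columns
    row_means = X_sort.mean(axis=1)                    # step 3: mean of each quantile
    X_norm = np.empty_like(X)
    # step 4: write the k-th row mean back to where the k-th smallest value was
    np.put_along_axis(X_norm, order, row_means[:, None], axis=0)
    return X_norm

normalized = quantile_normalize([[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]])
```

After normalization every column contains exactly the same set of values, so the columns share one empirical distribution while each gene keeps its rank within its column.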
Figure 2: Normal-like distribution of expression values in column 5 of Gerber et al [GHB04]’s data set
(histogram; x-axis: expression values, y-axis: frequency)
2.2.4 Sigmoid Family of Normalizations
The final normalization considered takes a distribution-oriented approach, which has been reported by
several authors as being important for effective processing of the set of values obtained from a gene
expression experiment [LRTS04, YDRL06]. For instance, Figure 2 shows the distribution of expression
values in the 5th experiment (column) of Gerber et al [GHB04]’s data set. In addition to the very
normal-like distribution shown by this histogram, it can be observed that there are
several outlying values in this vector, such as those lying outside of the range [−2, 2]. It is important
to consider both the underlying distribution and the presence of outliers in order to develop an effective
normalization method.
Here, we first introduce a family of normalization methods that takes into account the consideration
that extreme or outlying values should not distort the data analysis significantly. This family is based
on the sigmoid function, defined in Equation 3, which is commonly used to introduce non-linearity in
neural networks [DHS00].
$$\mathrm{Sigmoid}(x) = \frac{1 - e^{-x}}{1 + e^{-x}} \tag{3}$$
An interesting property of this function is that its output is bounded by ±1, however extreme the
input values are. This implicitly eliminates the effects of extreme values in the data, thus making it
suitable for normalizing gene expression data. However, the Sigmoid(x) function has the significant
weakness that it takes only the value of x, and not the background distribution of x, into account when
determining the final normalized value. Another important point to consider is that in the case of gene
expression data, a value of 0 in the log-ratio matrix of the original red and green intensities is
defined as the expression value of a gene that is completely neutral under the given condition. However,
due to noise factors, this value may be distorted by a small amount. Thus, it may be useful for some
experiments that values in a small neighborhood of zero be treated effectively as zero. The standard
Sigmoid function does not do this.

Figure 3: Double sigmoid function (x-axis: x, from −10 to 10; y-axis: dsigmoid(x), from −1 to 1)
These considerations indicate that the Sigmoid function should be modified so that it takes the
distribution of values into account, and also does a local smoothing around the center of the distribution.
The second factor can be accounted for by breaking the sigmoid function into two ranges [−1, 0) and
(0, 1] and defining separate modifications of the Sigmoid function on them, thus converting the original
function into the double sigmoid (Dsigmoid) function. Though this function has been mathematically
formulated and used in other studies [JNR05], we chose the formulation of Equation 4
(http://en.wikipedia.org/wiki/Sigmoid_function), since it enables us to incorporate the first factor also:

$$\mathrm{Dsigmoid}(x) = \mathrm{sign}(x - d)\left(1 - \exp\left(-\left(\frac{x - d}{s}\right)^2\right)\right) \tag{4}$$

Here d and s are the centering and the steepness factors of the function respectively. A sample output of
the Dsigmoid function is shown in Figure 3. It can be observed from this plot and Equation 4 that our
formulation of the double sigmoid function is very close to the probability distribution function of the
normal distribution $N(\bar{X}, \sigma_X^2)$, if $d = \bar{X}$ and $s = \sigma_X^2$ for a vector X. This
observation further supports the use of Dsigmoid for normalizing gene expression data, since there is
evidence that the distribution of gene expression values is approximately normal [STG+01, SYK03]. This
perspective is further strengthened by Figure 2, which illustrates that the distribution of expression
values under a certain condition is close to normal. Thus, due to these considerations, we used the
double sigmoid normalization as one of the normalization methods in this study. The method was
implemented by transforming each value $X_i$ in a vector X to Dsigmoid($X_i$) using Equation 4.
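To make Equations 3 and 4 concrete, the following Python sketch implements both functions. The function names are hypothetical. Defaulting d to the mean and s to the variance of the input vector follows the choice suggested in the text, but whether the original implementation used exactly these defaults is an assumption.

```python
import numpy as np

def sigmoid(x):
    """Equation 3. (1 - e^-x) / (1 + e^-x) equals tanh(x / 2), which avoids overflow."""
    return np.tanh(np.asarray(x, dtype=float) / 2.0)

def dsigmoid(x, d=None, s=None):
    """Equation 4 with centering factor d and steepness factor s."""
    x = np.asarray(x, dtype=float)
    d = x.mean() if d is None else d   # assumption: default center is the mean
    s = x.var() if s is None else s    # assumption: default steepness is the variance
    z = (x - d) / s
    return np.sign(x - d) * (1.0 - np.exp(-z ** 2))

# Near the center, Dsigmoid stays much closer to zero than Sigmoid does.
near_zero_sigmoid = sigmoid(0.1)
near_zero_dsigmoid = dsigmoid(0.1, d=0.0, s=1.0)
```

Both functions map extreme inputs into [−1, 1]; the difference is that Dsigmoid is flat around its center d, which gives the local smoothing of near-zero values described above.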
2.3 Transformation Methods for Temporal Expression Data
As can be seen from the descriptions above, most normalization methods normalize each vector
independently of the others. This assumption is valid for non-temporal expression data sets, since there is no
relationship between different columns of the data matrix. However, in temporal expression data sets, the
expression of genes is measured at consecutive time points in an experiment, and thus there are inherent
relationships among the different columns of a temporal expression data matrix. The interrelationships
between them need to be factored into the normalization process.
A common method for this task is to use the values at a set of consecutive time points to derive a new
time series that is better suited for the desired application. We refer to this process as a transformation,
and investigated the following methods for this task:
1. Smoothing by moving average: A popular method of analyzing time series data is to smooth
the time series values in a sliding window of duration k by averaging them. This is known as the
moving average (MA) method [Ham94]. This method has not been used for gene expression data,
but is a popular method in many domains for analyzing time series data, e.g., climate data [PTS+03].
The transformation from $X_{1 \ldots n}$ to $X'_{1 \ldots n-2}$, both being time series, is defined by
Equation 5 (the output length n − 2 corresponds to k = 3; in general it is n − k + 1). This
method provides several advantages, such as smoothing, the reduction of the effect of outliers and
noise, and compensation for small phase lags between two time series.

$$X'_i = \frac{1}{k} \sum_{t=i}^{i+k-1} X_t \tag{5}$$

We used the commonly used value of k = 3 in our implementation.
2. Differences between consecutive points: Another method of taking the structure of time series
into account is to define new features whose values are the differences between the values at two
consecutive points in a time series. Doing this removes the effect of the magnitude of the time
series, but preserves the magnitude of the changes from one time period to another.
This method transforms the original time series vector $X_{1 \ldots n}$ into a new vector
$X'_{1 \ldots (n-1)}$ using the simple formula $X'_i = X_{i+1} - X_i$, and thus takes only the trend of
change between the time points into account, and not the absolute values. This helps reduce the effect
of offsets in the values that may be due to experimental error or other factors, and has been used for
the functional classification of temporal gene expression data [HKSL01, LHM+03]. We refer to this
method as the Diff method in the subsequent discussion.
3. Z-score: In many instances, time series are best compared by considering only deviations from the
average. This can be accomplished by using the z-score, which was given in Equation 2. Indeed,
this method has been shown to be useful in several time series applications, such as the mining
of Earth science data [ZSK+05], and has also been used for the transformation of temporal gene
expression data for subsequent analysis [TPS+99, BHWK04]. Thus, we used this method in our
analysis, and it is referred to as the Ztrans method in the following discussion.
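The three transformations listed above can be sketched for a single time series as follows. This is an illustrative Python sketch with hypothetical function names; the use of the population standard deviation in ztrans is an assumption.

```python
import numpy as np

def moving_average(x, k=3):
    """Equation 5: mean over a sliding window of k points (output length n - k + 1)."""
    x = np.asarray(x, dtype=float)
    return np.convolve(x, np.ones(k) / k, mode="valid")

def diff_transform(x):
    """Diff method: X'_i = X_{i+1} - X_i (output length n - 1)."""
    return np.diff(np.asarray(x, dtype=float))

def ztrans(x):
    """Ztrans method: z-score of one time series (Equation 2)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()   # assumption: population standard deviation

series = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = moving_average(series)   # three window means
changes = diff_transform(series)    # four consecutive differences
```

Note that the MA and Diff methods shorten the series, whereas Ztrans preserves its length, which matters when transformed series are later concatenated.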
Finally, it should be noted that the temporal expression data sets used in this study contained multiple
time series, as detailed in Table 2. Since each of the above transformations is defined only for a single
time series, each transformation is applied separately to each time series, and the final transformed
gene expression matrix is obtained by concatenating the transformed vectors for each gene individually.
This process makes sense from a data analysis perspective since each of the time series may follow
a different distribution of values and thus needs to be analyzed separately, followed by a combination
of the results. Notably, this is in contrast to the approach adopted by most analyses of time series
expression data [SSB04, GSK+00], which normalize the entire expression profile of a gene as a whole.
In summary, the temporal gene expression data sets, say M , are normalized using the following
two-step algorithm:
1. Transform M using the above three transformations.
2. Apply the normalization methods discussed in Section 2.2 to the columns of each of the three
matrices produced above, as well as to the raw data matrix.
This algorithm produces 3 × 6 = 18 normalized matrices corresponding to each transformation-normalization
combination. In addition, we also included all normalized matrices produced by applying only the
normalization methods described in Section 2.2 to M.
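The two-step procedure can be sketched as below: each gene's profile is split into its constituent time series, each piece is transformed on its own, the pieces are concatenated, and a column-wise normalization is then applied. This is a minimal Python sketch. The names transform_temporal, series_lengths and znorm_columns are hypothetical, and the column z-score stands in for any of the normalizations of Section 2.2.

```python
import numpy as np

def transform_temporal(M, series_lengths, transform):
    """Step 1: apply `transform` to every time series of every gene, then concatenate."""
    M = np.asarray(M, dtype=float)
    bounds = np.cumsum(series_lengths)[:-1]    # column indices where a new series starts
    blocks = np.split(M, bounds, axis=1)       # one block of columns per time series
    transformed = [np.apply_along_axis(transform, 1, b) for b in blocks]
    return np.hstack(transformed)

def znorm_columns(M):
    """Step 2 (one example normalization): z-score each column."""
    return (M - M.mean(axis=0)) / M.std(axis=0)

# Two genes, two time series of lengths 3 and 2, using the Diff transformation.
M = [[1.0, 2.0, 4.0, 10.0, 13.0],
     [0.0, 0.0, 1.0, 5.0, 5.0]]
T = transform_temporal(M, series_lengths=[3, 2], transform=np.diff)
N = znorm_columns(T)
```

Splitting before transforming keeps the Diff and MA windows from spanning the boundary between two unrelated time series.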
2.4 Similarity Measure
An important component of our study is the measure used to estimate the proximity, i.e., similarity or
distance, between the expression profiles of two genes. For this purpose, we used Pearson’s correlation
coefficient, which has been commonly used for gene expression data [DP05]. In our experiments as well,
correlation produced consistently better results than other measures such as cosine similarity and
Euclidean distance. We also considered the recently proposed mass-distance measure [YDRL06], which has
shown promising results for some expression data sets. However, due to the lack of a readily usable
implementation of this algorithm, we have left its use for future work.
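As an illustration of how Pearson's correlation can be used to rank gene pairs by the similarity of their expression profiles, consider the following Python sketch. The function name ranked_gene_pairs is hypothetical, and the dense all-pairs computation is a simplification that would need care for matrices with thousands of genes.

```python
import numpy as np

def ranked_gene_pairs(M):
    """Return (gene_i, gene_j, correlation) tuples sorted by decreasing correlation."""
    M = np.asarray(M, dtype=float)
    C = np.corrcoef(M)                          # rows (genes) are treated as variables
    i, j = np.triu_indices(C.shape[0], k=1)     # each unordered pair exactly once
    order = np.argsort(-C[i, j], kind="stable")
    return [(int(i[t]), int(j[t]), float(C[i[t], j[t]])) for t in order]

profiles = [[1.0, 2.0, 3.0],    # gene 0
            [2.0, 4.0, 6.5],    # gene 1: rises with gene 0
            [3.0, 2.0, 1.0]]    # gene 2: mirrors gene 0
pairs = ranked_gene_pairs(profiles)
```

The resulting ordering is the input to the evaluation methodologies of Section 3, which examine the most highly ranked pairs first.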
3 Evaluation Methodology
Suppose an expression matrix M has been normalized using a method A (with or without the row
transformation step) to produce $M_A$. In order to evaluate method A in terms of its effectiveness
in magnifying the available functional information in M, we examined the pairwise links between genes
ranked by their expression similarity in $M_A$ to see if the most highly ranked links tend to connect
genes with similar function. The functional evidence is derived from three sources, namely Yona et
al [YDRL06]’s data set of relationships, SwissProt keywords [BBA+03] and the FunCat classification
scheme [R+04], each of which represents a different form of the definition of protein function. We chose
to perform our evaluation with this variety of sources of functional relationships since protein function
is not a precisely defined concept. This evaluation process is applied to each of the normalized versions
of M , and the results are compiled in order to judge the relative performances of different normalization
methods. The following subsections provide the details of the three types of functional information and
how they are used for the evaluation.
3.1 Recovery of Observed Functional Relationships (ObservedFuncRels)
This knowledge consists of experimentally observed interactions between pairs of proteins in an organism.
We used a set of 41902 such pairwise interactions used by Yona et al [YDRL06] in their evaluation study
of the most appropriate similarity measure for gene expression profiles. This set is constructed using
four types of interactions, namely (i) physical protein-protein interactions, (ii) metabolic pathway co-
membership of enzymes, (iii) co-regulation by the same promoter, and (iv) co-membership in sequence
homology clusters.
For this set of functional relationships, the evaluation methodology used was also the same as that
adopted by Yona et al [YDRL06]. Here, the pairwise similarities are calculated among all expression
profiles in the given data set, and the protein pairs are sorted in descending order according to their
corresponding expression similarities. Then, starting from the most similar gene pair, the number of
pairs that are known to be functionally related according to the above set is accumulated.
These counts can then be used to produce a plot of the number of true functional relationships recovered
versus the number of protein pairs analyzed in the order of decreasing similarity.
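The cumulative counting described above can be sketched as follows. This is a minimal Python sketch with hypothetical names; it assumes the gene pairs have already been sorted by decreasing expression similarity and that known relationships are unordered pairs.

```python
def recovery_curve(ranked_pairs, known_links):
    """Cumulative number of known functional relationships among the top-ranked pairs."""
    known = {frozenset(pair) for pair in known_links}   # unordered pairs
    hits = 0
    curve = []
    for a, b in ranked_pairs:
        if frozenset((a, b)) in known:
            hits += 1
        curve.append(hits)   # y-value after analyzing this many pairs
    return curve

ranked = [("A", "B"), ("A", "C"), ("B", "C")]   # most similar pair first
known = [("B", "A"), ("C", "B")]                # hypothetical known relationships
curve = recovery_curve(ranked, known)
```

Plotting curve against the pair index gives one line of the ObservedFuncRels plots; a normalization method is better when its curve rises faster.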
3.2 SwissProt Keyword Recovery (KeywordRecovery)
The next category of functional information we used was derived from keywords assigned to protein
entries in the SwissProt database [BBA+03]. Each protein in this database is assigned a set of keywords,
such as hydrolase, DNA binding and repressor, that indicate a form of functional information about it.
We considered only 179 of these keywords that had an annotation frequency of between 10 and 600 among
the proteins of S. cerevisiae. This selection eliminates the keywords that are over-abundant or very rare
and may distort a frequency-based analysis.
In this methodology, the well-known keyword recovery (KR) metric is used for evaluation. This is a
commonly used metric when the results of the prediction algorithm are pairwise functional links between
proteins [MPT+99, SMP+03]. Given a query protein P, and the set of proteins directly linked to it,
Nbd(P), the keyword recovery performance of an algorithm is computed using Equation 6.
$$KR = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{x} \frac{F_{ij}}{F_i} \tag{6}$$
Here, Fi is the total frequency of all the keywords appearing in Nbd(P ), Fij is the frequency of the jth
keyword of P in Nbd(P ), x is the total number of keywords of P , and N is the total number of proteins in
the network. Thus, KR measures the recall of the original keyword annotations in the resulting protein
relationship graph.
The use of this method for evaluating normalization methods is straightforward. Once the gene
expression matrix M has been normalized using a method A, the k most similar gene pairs in terms of their
normalized expression profiles are treated as edges between two nodes, and a graph $G_{A,k}$ is constructed
by combining all the edges so found. The quality of this graph is evaluated using the KR metric defined
in Equation 6. Increasing the value of k from $numedges_{min}$ to $numedges_{max}$ then produces a set of
keyword recovery values for method A, and this process is repeated for all the normalization methods
considered. Finally, the different methods are compared by plotting these sets of KR values together.
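One way to compute the KR metric of Equation 6 on a graph of gene pairs is sketched below in Python. The function name and data layout are hypothetical. Two points are assumptions of this sketch rather than statements from the report: N is taken to be the number of proteins that appear in at least one edge, and a protein whose neighbors carry no keywords contributes 0.

```python
from collections import Counter, defaultdict

def keyword_recovery(edges, keywords):
    """Equation 6: mean fraction of neighbor keyword frequency matching each protein's keywords."""
    nbd = defaultdict(set)
    for a, b in edges:            # build Nbd(P) for every protein in the graph
        nbd[a].add(b)
        nbd[b].add(a)
    if not nbd:
        return 0.0
    total = 0.0
    for p, neighbors in nbd.items():
        # frequency of every keyword among the neighbors of p
        freq = Counter(kw for q in neighbors for kw in keywords.get(q, ()))
        F_i = sum(freq.values())
        if F_i:
            total += sum(freq[kw] for kw in keywords.get(p, ())) / F_i
    return total / len(nbd)       # assumption: N = proteins with at least one edge

edges = [("A", "B"), ("B", "C")]
keywords = {"A": {"hydrolase"}, "B": {"hydrolase", "repressor"}, "C": {"repressor"}}
kr = keyword_recovery(edges, keywords)
```

In this toy graph, A and C each recover half of their neighbor's keyword frequency and B recovers all of it, giving KR = (0.5 + 1 + 0.5) / 3.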
3.3 Similarity of Functional Labels (SimFuncLabels)
The third type of functional relationships we used are derived from the annotations of yeast proteins
using the FunCat functional classification scheme [R+04]. Since FunCat is a hierarchical scheme, we used
the 81 classes at a depth of two from the root of the hierarchy.
For the proteins annotated with these classes, an annotation vector is constructed for each
protein, containing 1 for the classes that the protein is annotated with, and 0 for the others. Once this
set of vectors has been constructed for all the proteins, it is partitioned using the CLUTO clustering
toolkit [Kar02]. Then, functional links can be derived from these clusters by simply treating each cluster
as a clique, and two proteins are considered functionally related if they are part of the same clique. In
order to avoid spurious relationships, we required a cluster to contain at least 20 proteins for
it to be used for extracting functional relationships, since we observed that most of the small clusters
had a low internal similarity, and thus were not very tight. The remaining clusters had a high internal
similarity (usually over 0.8), which implies that a large majority of the proteins in these clusters had an
identical set of labels, thus justifying their use for extracting functional relationships.
Finally, once the set of relationships has been derived, the evaluation methodology is completely
identical to that adopted in the ObservedFuncRels methodology, since the experimentally observed rela-
tionships there could be conveniently replaced by these functional label-based relationships.
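The step of turning clusters into functional links can be sketched as follows. This Python sketch does not call CLUTO; it assumes the cluster assignment of each protein is already available as a dictionary, and the function name links_from_clusters is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def links_from_clusters(cluster_of, min_size=20):
    """Treat every sufficiently large cluster as a clique of functional links."""
    members = defaultdict(list)
    for protein, cluster in cluster_of.items():
        members[cluster].append(protein)
    links = set()
    for proteins in members.values():
        if len(proteins) >= min_size:            # skip small, loosely knit clusters
            links.update(frozenset(p) for p in combinations(proteins, 2))
    return links

# Toy assignment with a lowered size threshold so that the example stays small.
assignment = {"A": 0, "B": 0, "C": 0, "D": 1}
links = links_from_clusters(assignment, min_size=3)
```

The returned set of unordered pairs can be used in place of the experimentally observed relationships when computing the recovery curve of the ObservedFuncRels methodology.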
4 Results
In this section, we discuss the application of the overall evaluation methodology to several gene expression
data sets. We present results for the two types of expression data, namely non-temporal and temporal
separately due to the conceptual difference between the two types discussed in Section 2.1.
We only show the portions of the complete plots that correspond to the most highly ranked gene
pairs in terms of their expression similarity, as they are the ones expected to include the most
functional information. The plots in this section are best viewed in color. In addition to the results of
the normalization methods, we also plot the results obtained from the unnormalized data set, and those
obtained by a random selection of gene pairs. For the ObservedFuncRels and SimFuncLabels evaluation
methods, the random selection results were approximated by the following line:

$$y = \left( \frac{\text{total \# functional links}}{\text{\# possible links for evaluation methodology}} \right) \times x$$
On the other hand, results for random selection with the KeywordRecovery method are obtained by
randomly selecting k gene pairs for various values of k using a uniform distribution, and evaluating the
KR metric for each of the graphs so constructed.
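Both random baselines can be sketched in Python as below. The function names are hypothetical, and the numbers in the usage lines are made up purely for illustration; they are not values from this study.

```python
import random

def expected_random_hits(n_functional_links, n_possible_links, x):
    """Baseline line: expected number of functional links among x randomly chosen pairs."""
    return n_functional_links / n_possible_links * x

def random_gene_pairs(genes, k, seed=0):
    """Draw k distinct unordered gene pairs uniformly at random."""
    genes = list(genes)
    max_pairs = len(genes) * (len(genes) - 1) // 2
    if k > max_pairs:
        raise ValueError("k exceeds the number of distinct gene pairs")
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < k:
        a, b = rng.sample(genes, 2)      # two different genes
        pairs.add(frozenset((a, b)))     # duplicates are ignored by the set
    return pairs

baseline = expected_random_hits(100, 1000, 50)     # illustrative numbers only
sample = random_gene_pairs(["g1", "g2", "g3", "g4", "g5"], k=4)
```

The sampled pairs can be treated as graph edges and scored with the KR metric to obtain the random curve for the KeywordRecovery plots.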
4.1 Evaluation of normalization methods for non-temporal gene expression data
For this evaluation, we applied the following normalizations listed in Section 2.2 to the columns of these
data sets: Unitnorm, Znorm, Sigmoid, Dsigmoid, and Quantile. Figures 4-7 show the results of evalu-
11
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
x 104
0
100
200
300
400
500
600
Gene pairs ordered by similarity
Num
ber
of fu
nctio
nal r
elat
ions
hips
unc
over
ed
RawUnitnormZnormQuantileSigmoidDsigmoidRandom
(a) Results using ObservedFuncRels
0 200 400 600 800 1000 1200 1400 1600 1800 20000
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Gene pairs ordered by similarity
Key
wor
d re
cove
ry
RawUnitnormZnormQuantileSigmoidDsigmoidRandom
(b) Results using KeywordRecovery
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
x 104
0
500
1000
1500
2000
2500
3000
3500
4000
Gene pairs ordered by similarity
Num
ber
of fu
nctio
nal r
elat
ions
hips
unc
over
ed
RawUnitnormZnormQuantileSigmoidDsigmoidRandom
(c) Results using SimFuncLabels
Figure 4: Evaluation of normalization methods on Gerber et al [GHB04]’s non-temporal data set
12
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
x 104
0
500
1000
1500
Gene pairs ordered by similarity
Num
ber
of fu
nctio
nal r
elat
ions
hips
unc
over
ed
RawUnitnormZnormQuantileSigmoidDsigmoidRandom
(a) Results using ObservedFuncRels
0 200 400 600 800 1000 1200 1400 1600 1800 20000
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Gene pairs ordered by similarity
Key
wor
d re
cove
ry
RawUnitnormZnormQuantileSigmoidDsigmoidRandom
(b) Results using KeywordRecovery
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
x 104
0
500
1000
1500
2000
2500
3000
3500
4000
Gene pairs ordered by similarity
Num
ber
of fu
nctio
nal r
elat
ions
hips
unc
over
ed
RawUnitnormZnormQuantileSigmoidDsigmoidRandom
(c) Results using SimFuncLabels
Figure 5: Evaluation of normalization methods on Hughes et al [H+00]’s non-temporal data set
13
[Figure 6: Evaluation of normalization methods on Iyer et al [IHK+99]’s non-temporal data set. Panels:
(a) results using ObservedFuncRels, (b) results using KeywordRecovery, (c) results using SimFuncLabels.
x-axis: gene pairs ordered by similarity. y-axis: number of functional relationships uncovered in (a) and (c),
keyword recovery in (b). Curves: Raw, Unitnorm, Znorm, Quantile, Sigmoid, Dsigmoid, Random.]
[Figure 7: Evaluation of normalization methods on Saldanha et al [SBB04]’s non-temporal data set. Panels:
(a) results using ObservedFuncRels, (b) results using KeywordRecovery, (c) results using SimFuncLabels.
x-axis: gene pairs ordered by similarity. y-axis: number of functional relationships uncovered in (a) and (c),
keyword recovery in (b). Curves: Raw, Unitnorm, Znorm, Quantile, Sigmoid, Dsigmoid, Random.]
ation according to the (a) ObservedFuncRels, (b) KeywordRecovery and (c) SimFuncLabels evaluation
methodologies for four non-temporal expression data sets. The following general observations can be
made from these results. First, for almost all of these data sets, almost all the normalizations are able to
extract more accurate functional relationships than those extracted from the raw, unnormalized version
of the data sets. This indicates that normalization, even using simple methods, is able to enhance the
functional content of most non-temporal gene expression data sets.
Examining the results more closely, we observe from Figures 4 and 5 that the Dsigmoid method performs well
for the corresponding data sets. For instance, when applied to the columns of the data matrix of Gerber
et al [GHB04]’s data set, it substantially outperforms almost all other methods in Figures 4(b)
and 4(c), and is close to the top performer in Figure 4(a). In Figures 5(a–c), the Dsigmoid method is
among the top performers.
In another set of results, Figures 6 and 7 show that for some non-temporal expression data sets,
the Unitnorm normalization produces the best results. This observation is supported most strongly by
Figures 6(c) and 7(c), which show the performance of various normalization methods on Iyer
et al [IHK+99]’s and Saldanha et al [SBB04]’s data sets, according to the SimFuncLabels evaluation
methodology. For these data sets, Dsigmoid also produces good results, as
shown by Figures 6(a) and 7(a), which indicates its utility for normalizing various types of non-temporal
expression data sets. We believe that Unitnorm performs better than Dsigmoid on
these data sets because their columns contain a relatively small fraction of extreme values, so that
the column norms are not adversely affected.
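The effect of extreme values on a column norm can be illustrated with a minimal Python sketch. It assumes that Unitnorm scales each column to unit Euclidean (L2) length, which may differ in detail from the definition given earlier in the report; the function name unit_norm and the toy values are hypothetical and are not data from the study.

```python
import math

def unit_norm(column):
    """Scale a column to unit Euclidean (L2) norm (assumed Unitnorm definition)."""
    norm = math.sqrt(sum(v * v for v in column))
    if norm == 0.0:
        return list(column)  # all-zero column: nothing to scale
    return [v / norm for v in column]

# Toy column of expression values (made up for illustration).
clean = [0.5, -0.4, 0.3, -0.2, 0.1]
# The same column with its last entry replaced by one extreme value.
spiked = clean[:-1] + [8.0]

clean_scaled = unit_norm(clean)
spiked_scaled = unit_norm(spiked)

# The extreme value dominates the norm, so every other entry of the
# spiked column is pushed much closer to zero than in the clean column.
print(clean_scaled[:4])
print(spiked_scaled[:4])
```

In the spiked column a single entry accounts for almost all of the norm, which compresses the remaining entries. This is consistent with the explanation above that Unitnorm is better suited to columns containing few extreme values.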
Last but not least, we observe that the Quantile, Znorm and Sigmoid normalization methods
also generally produce functionally richer matrices than the raw gene expression data. Thus, although
they do not produce significantly better performance than the other methods for the variety of non-temporal
data sets considered, it is possible that they outperform the other methods for data sets whose
characteristics differ from the ones considered. In summary, the normalization of non-temporal gene expression
data is a desirable pre-processing step that enables the extraction of more accurate information about
protein function than that obtained from the raw data set that may have been processed only to remove
experimental biases and errors.
4.2 Evaluation of normalization methods for temporal gene expression data
We now present the results of our evaluation of normalization methods for temporal expression data sets,
which are applied using the algorithm in Section 2.3. We tried all combinations of row transformation
methods (no transformation (Raw), MA, Diff and Ztrans) and column normalization methods (no
normalization (Raw), Unitnorm, Znorm, Sigmoid, Dsigmoid and Quantile). However, to simplify presentation,
we show results only for the best column normalization for each row transformation method. The best
methods are identified using the area under the curve of the plots produced by the respective evaluation
methodology.
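The selection step described above can be sketched as follows. This is a minimal illustration, not the procedure used to produce the reported plots: it assumes that each evaluation curve is sampled at equally spaced positions and that the area is computed with the trapezoidal rule, neither of which is specified in the text. The function names and the curve heights are hypothetical.

```python
def area_under_curve(ys, dx=1.0):
    """Trapezoidal area under a curve sampled at equally spaced x positions."""
    return sum((a + b) / 2.0 * dx for a, b in zip(ys, ys[1:]))

def best_normalization_per_transform(curves):
    """Pick, for each row transformation, the column normalization with the largest area.

    curves maps (transformation, normalization) -> list of curve heights.
    Returns {transformation: (normalization, area)}.
    """
    best = {}
    for (transform, normalization), ys in curves.items():
        auc = area_under_curve(ys)
        if transform not in best or auc > best[transform][1]:
            best[transform] = (normalization, auc)
    return best

# Made-up curve heights, for illustration only (not results from this report).
toy_curves = {
    ("Ztrans", "Raw"): [0, 3, 5, 6],
    ("Ztrans", "Dsigmoid"): [0, 4, 6, 7],
    ("Diff", "Raw"): [0, 1, 2, 3],
    ("Diff", "Unitnorm"): [0, 2, 3, 3],
}

selected = best_normalization_per_transform(toy_curves)
print(selected)
```

With these toy values, the sketch keeps Dsigmoid for Ztrans and Unitnorm for Diff, mirroring how only one combination per row transformation appears in each plot legend.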
[Figure 8: Evaluation of normalization methods on Zhu et al [ZSV+00]’s temporal data set. x-axis: gene pairs
ordered by similarity. Panels: (a) results using ObservedFuncRels (y-axis: number of functional relationships
uncovered; curves: Raw, MA_Znorm, Diff_Unitnorm, Ztrans_Dsigmoid, Raw_Dsigmoid, Random); (b) results using
KeywordRecovery (y-axis: keyword recovery; curves: Raw, MA_Quantile, Diff_Raw, Ztrans_Raw, Raw_Sigmoid, Random);
(c) results using SimFuncLabels (y-axis: number of functional relationships uncovered; curves: Raw, MA_Znorm,
Diff_Raw, Ztrans_Raw, Raw_Sigmoid, Random).]
[Figure 9: Evaluation of normalization methods on Shapira et al [SSB04]’s temporal data set. x-axis: gene pairs
ordered by similarity. Panels: (a) results using ObservedFuncRels (y-axis: number of functional relationships
uncovered; curves: Raw, MA_Raw, Diff_Unitnorm, Ztrans_Unitnorm, Raw_Unitnorm, Random); (b) results using
KeywordRecovery (y-axis: keyword recovery; curves: Raw, MA_Sigmoid, Diff_Unitnorm, Ztrans_Unitnorm, Raw_Sigmoid,
Random); (c) results using SimFuncLabels (y-axis: number of functional relationships uncovered; curves: Raw,
MA_Sigmoid, Diff_Unitnorm, Ztrans_Unitnorm, Raw_Sigmoid, Random).]
Figures 8 and 9 show the results produced by the (a) ObservedFuncRels, (b) KeywordRecovery
and (c) SimFuncLabels evaluation methodologies on Zhu et al [ZSV+00]’s and Shapira et al [SSB04]’s
data sets respectively. The following observations can be made from these results:
1. In all the plots, there is at least one transformed and/or normalized version of the raw data set
that produces better results than the raw data set. This shows that normalization can be useful
for enhancing the functional content of temporal data sets.
2. For all the evaluations on both data sets, some combination of a transformation followed by
normalization produces significantly better results than just applying a normalization method to the
columns of the data matrix. For instance, the Ztrans method (followed by Dsigmoid normalization)
produces the best results in Figure 8(a), while the Diff method (followed by Unitnorm normalization)
performs the best in Figure 9(a).
3. Almost all the results show that the Ztrans method of time series transformation produces the
best results among all transformation methods, such as in Figures 9(b) and 9(c). Interestingly, the
commonly used Diff and MA transformations produce worse results than the raw data set in most
of the evaluations.
4. Finally, Figures 8(b) and 8(c) show that the best performing version of the data set is that produced
by applying the Ztrans method to the original matrix, with no subsequent normalization. These
plots show that this method produces a small gain over the raw data matrix, which in turn produces results
equivalent to those produced by applying a simple z-score-based transformation to an entire
gene expression profile, since correlation is a scale- and shift-invariant measure. These small but
consistent gains, which are also observed for other data sets (results omitted for brevity), indicate
that transforming each constituent time series in a data set separately, and combining the resultant
series, is indeed important for getting more accurate functional information.
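The invariance argument in observation 4 can be checked numerically with a small Python sketch. It assumes that Ztrans applies a z-score to each constituent time series separately before the series are concatenated (the precise definition is given in Section 2.3) and that the population standard deviation is used. The profiles, the series lengths and the function names are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def zscore(values):
    """Z-score with the population standard deviation (an assumption)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def zscore_per_series(profile, lengths):
    """Z-score each constituent time series separately, then concatenate."""
    out, start = [], 0
    for length in lengths:
        out.extend(zscore(profile[start:start + length]))
        start += length
    return out

# Two toy profiles, each built from two time series of three points.
gene_a = [1.0, 2.0, 4.0, 10.0, 14.0, 12.0]
gene_b = [2.0, 1.0, 3.0, 11.0, 12.0, 16.0]
lengths = [3, 3]

r_raw = pearson(gene_a, gene_b)
r_whole = pearson(zscore(gene_a), zscore(gene_b))
r_split = pearson(zscore_per_series(gene_a, lengths),
                  zscore_per_series(gene_b, lengths))
print(r_raw, r_whole, r_split)
```

Z-scoring the whole profile leaves the correlation unchanged, whereas z-scoring each series separately removes the offset between the two series and therefore changes the similarity between the genes. Whether that change is an improvement depends on the data, as the evaluation above shows.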
In summary, transformation and normalization do indeed enhance
the functional content of temporal gene expression data sets. However, the fact that only one or a few
of the simple transformation and normalization methods are able to outperform the raw data set, and
the lack of a consensus on the best normalization method to be used after the Ztrans transformation,
indicate that the simple methods adopted in this study, though useful, may not be able to extract all
possible information from the temporal data sets. Hence, it may be fruitful to use more sophisticated
techniques such as normalized B-splines [BJGG+03] for such analyses.
4.3 Difference between non-temporal and temporal expression data
In the results presented earlier, we handled non-temporal and temporal gene expression data sets dif-
ferently – on non-temporal data, we performed column normalization only, while on temporal data, we
performed row transformation before column normalization. The hypothesis underlying this distinction
is that the set of experiments in a time series expression measurement are related to each other. As
noted in Section 4.2, the best results for temporal data sets are always obtained when a transformation is
applied to the time series (rows) before any of the normalization methods is applied to the experiments
(columns). Furthermore, column normalization methods, used by themselves (i.e., without row transfor-
mation), generally produce comparable or worse results than the raw expression data set. This shows
that using a row transformation is useful for temporal expression data sets, and skipping this step, as in
non-temporal data sets, may lead to loss of information.
[Figure 10: Comparison of the best normalizations for non-temporal and temporal expression data sets. Panels:
(a) Gerber et al [GHB04]’s non-temporal data set normalized both as a non-temporal and a temporal data set
(curves: Raw, Ztrans_Quantile, Dsigmoid); (b) Saldanha et al [SBB04]’s non-temporal data set normalized both
as a non-temporal and a temporal data set (curves: Raw, Ztrans_Unitnorm, Unitnorm). x-axis: gene pairs ordered
by similarity. y-axis: number of functional relationships uncovered.]
Now, in order to show that row transformation is not useful for non-temporal expression data sets,
we analyzed the effect of using a row transformation followed by column normalization of a non-temporal
data set, and compared it to the best column normalization method for it. In accordance with the
definitions of the transformation methods used, we treated the entire expression profile of a gene as a
time series and the individual measurements as consecutive time points.
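To make the role of column ordering concrete, the following Python sketch applies two row transformations to a toy profile. It assumes that Diff takes first differences between consecutive measurements and that MA is a simple moving average over a sliding window; the definitions used in the study appear in an earlier section and may differ in detail (the window width below is a hypothetical choice). The profile values are made up.

```python
def diff_transform(profile):
    """First differences between consecutive measurements (assumed Diff)."""
    return [b - a for a, b in zip(profile, profile[1:])]

def moving_average(profile, window=2):
    """Simple moving average over a sliding window (assumed MA; width is hypothetical)."""
    if window < 1 or window > len(profile):
        raise ValueError("window must be between 1 and len(profile)")
    return [sum(profile[i:i + window]) / window
            for i in range(len(profile) - window + 1)]

# A toy expression profile and the same values with the experiments reordered.
profile = [1.0, 4.0, 2.0, 7.0]
shuffled = [profile[i] for i in (2, 0, 3, 1)]

print(diff_transform(profile))    # differences along the original order
print(diff_transform(shuffled))   # differences along the reordered columns
print(moving_average(profile, window=2))
```

Both transformations depend on which measurements are adjacent. For a time series this adjacency is meaningful, while for a non-temporal data set the column order is arbitrary, so the transformed values reflect that arbitrary order.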
Figure 10 illustrates the results of this analysis for two data sets, namely those of Gerber et al [GHB04]
and Saldanha et al [SBB04], performed using the SimFuncLabels evaluation methodology. For brevity, we
only show the results of the best normalization and the best transformation-normalization combination
for these data sets. Both the plots show that for these non-temporal data sets, using only a normalization
method produces significantly better results than those produced by applying a row transformation before
column normalization, particularly in Figure 10(a).
[Figure 11: Illustration of learning an appropriate normalization method for a gene expression data set in
a training-testing setting. Panels: (a) comparison of normalization methods on a random half of Gerber et al
[GHB04]’s data set (curves: Raw, Unitnorm, Znorm, Quantile, Sigmoid, Dsigmoid, Random); (b) comparison of
transformation and normalization methods on a random half of Shapira et al [SSB04]’s data set (curves: Raw,
MA_Sigmoid, Diff_Unitnorm, Ztrans_Unitnorm, Raw_Unitnorm, Random). x-axis: gene pairs ordered by similarity.
y-axis: number of functional relationships uncovered.]
4.4 Learning the best normalization
The previous sections showed that while normalization is a useful pre-processing operation for gene
expression data sets, the most appropriate normalization method may vary for different data sets. This
raises the important question of how to determine in advance the best normalization method for a data
set that is to be analyzed. A possible approach for this is to use data for which the ground truth is
currently known to determine the best normalization method, and use it for the rest of the data. To
test this hypothesis, we applied the ObservedFuncRels evaluation methodology on a subset of the data
set, and tested the consistency of these results with those obtained from the entire data set. We used
the ObservedFuncRels methodology, since the ground truth here consists of pre-specified protein-protein
interactions, and these do not depend on the data set for which normalizations are being evaluated.
Figures 11(a) and 11(b) show the results of evaluating different normalizations for random halves of
Gerber et al [GHB04]’s non-temporal and Shapira et al [SSB04]’s temporal data sets. These plots show
that even using only half of the data set, the relative ordering of the performance of different normalization
methods is almost identical to that obtained from the entire data sets, shown in Figures 4(a) and 9(a)
respectively. Similar results were obtained for other random trials on these and other data sets. This
indicates that a good normalization method can be accurately learned from a subset of the data.
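The training-testing idea can be sketched in Python as follows. This is a simplified stand-in for the ObservedFuncRels methodology, not the implementation used in the study: it assumes that similarity is Pearson correlation, that a method is scored by the number of known related pairs among the top-k most similar gene pairs, and that the "half" is a random half of the genes. All names, the random toy matrix and the set of known pairs are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def raw(column):
    return list(column)

def unit_norm(column):
    norm = math.sqrt(sum(v * v for v in column))
    return [v / norm for v in column] if norm else list(column)

def z_norm(column):
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column] if std else [0.0] * len(column)

def normalize_columns(matrix, column_fn):
    """Apply column_fn to every column of a genes-by-conditions matrix."""
    columns = [column_fn(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]

def hits_in_top_pairs(matrix, genes, known_pairs, top_k):
    """Count known related pairs among the top_k most correlated gene pairs."""
    scored = []
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            pair = frozenset((genes[i], genes[j]))
            scored.append((pearson(matrix[i], matrix[j]), pair))
    scored.sort(key=lambda item: item[0], reverse=True)
    return sum(1 for _, pair in scored[:top_k] if pair in known_pairs)

def rank_methods(matrix, genes, known_pairs, methods, top_k):
    """Order normalization methods from best to worst score."""
    scores = {}
    for name, column_fn in methods.items():
        normalized = normalize_columns(matrix, column_fn)
        scores[name] = hits_in_top_pairs(normalized, genes, known_pairs, top_k)
    return sorted(scores, key=lambda name: scores[name], reverse=True)

METHODS = {"Raw": raw, "Unitnorm": unit_norm, "Znorm": z_norm}

# Random toy data: 20 genes, 8 conditions, 30 draws of "known" pairs.
rng = random.Random(0)
genes = ["g%d" % i for i in range(20)]
matrix = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in genes]
known_pairs = {frozenset(rng.sample(genes, 2)) for _ in range(30)}

# Rank the methods on a random half of the genes and on the full set.
half_idx = sorted(rng.sample(range(len(genes)), len(genes) // 2))
half_genes = [genes[i] for i in half_idx]
half_matrix = [matrix[i] for i in half_idx]

full_ranking = rank_methods(matrix, genes, known_pairs, METHODS, top_k=40)
half_ranking = rank_methods(half_matrix, half_genes, known_pairs, METHODS, top_k=10)
print("full:", full_ranking)
print("half:", half_ranking)
```

On random toy data the two rankings need not agree, since there is no real signal to recover. On real data, the comparison of half_ranking with full_ranking corresponds to the consistency check reported above.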
5 Conclusions and Future Work
In this paper, we reported an extensive evaluation of normalization and transformation methods for
gene expression data, using three novel methodologies for evaluating their effectiveness at the task of
discovering functional information from these data sets. Following are the main results that were derived
from this evaluation study:
1. The performance of different normalization (and transformation) schemes may vary significantly
over data sets and types of functional information being predicted.
2. For non-temporal data, most of the commonly used normalizations improve the performance, but
some improve the performance much more than others. In particular, the double sigmoid function-based
Dsigmoid method performs significantly better than others for several combinations of data
sets and functional information. This is interesting, as this normalization has not been used by the
bioinformatics community for gene expression analysis.
3. While some transformations (followed by normalizations), such as Ztrans, do improve the prediction
performance for temporal data, some commonly used transformations (followed by any normalization)
perform worse than raw data. This indicates that great care needs to be taken in the selection
of the right transformation, particularly for temporal data.
4. For temporal data where the expression profile of a gene is constructed by combining several time
series expression measurements, it is more advantageous to transform each constituent time series
separately than to transform the entire collection of series combined.
5. The transformation/normalization combinations that work best for temporal data tend to be
different from those for non-temporal data. For temporal data, it is often useful to first perform
a row transformation and then a column normalization, while for non-temporal data, it is best to
just perform a normalization. More generally, this indicates that temporal and non-temporal data
may need different types of analyses.
6. The best transformations and normalizations for a data set-functional evidence combination can
be learned from a training set for which the ground truth is known. In our study, this hypothesis
is validated by illustrating that the relative performance of normalizations on half of the available
labeled data is similar to their relative performance on the entire data set.
Evaluation of different normalization methods was done in the context of protein function prediction.
Similar evaluation can be done for other contexts, such as the identification of genes involved in cancer,
in order to obtain a better understanding of the characteristics of data sets that make them suitable for
specific normalization and transformation methods.
Acknowledgement
This work was supported by NSF grants #CRI-0551551, #IIS-0308264, and #ITR-0325949. Access to
computing facilities was provided by the Minnesota Supercomputing Institute.
We thank Chad Myers, Fumiaki Katagiri and Judith Berman for their insightful comments on the
paper. We also thank Golan Yona for making their protein-protein interaction data available to us.
References
[BBA+03] Brigitte Boeckmann, Amos Bairoch, Rolf Apweiler, Marie-Claude Blatter, Anne Estreicher,
Elisabeth Gasteiger, Maria J. Martin, Karine Michoud, Claire O’Donovan, Isabelle Phan,
Sandrine Pilbout, and Michel Schneider. The SWISS-PROT protein knowledgebase and its
supplement TrEMBL in 2003. Nucleic Acids Research, 31(1):365–370, 2003.
[BDSY99] Amir Ben-Dor, Ron Shamir, and Zohar Yakhini. Clustering gene expression patterns. J.
Computational Biology, 6(3-4):281–297, 1999.
[BGL+00] Michael P.S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Walsh
Sugnet, Terrence S. Furey, Manuel Ares, Jr., and David Haussler. Knowledge based analysis
of microarray gene expression data by using support vector machines. Proc Natl Acad Sci,
97(1):262–267, 2000.
[BHWK04] Rajarajeswari Balasubramaniyan, Eyke Hullermeier, Nils Weskamp, and Jorg Kamper. Clus-
tering of gene expression data using a local shape-based similarity measure. Bioinformatics,
21(7):1069–1077, 2004.
[BIB03] Sven Bergmann, Jan Ihmels, and Naama Barkai. Iterative signature algorithm for the anal-
ysis of large-scale gene expression data. Phys Rev E, 67:031902, 2003.
[BJ04] Z. Bar-Joseph. Analyzing time series gene expression data. Bioinformatics, 20(16):2493–
2503, 2004.
[BJGG+03] Ziv Bar-Joseph, Georg Gerber, David K. Gifford, Tommi S. Jaakkola, and Itamar Simon.
Continuous representations of time series gene expression data. J Comput Biol., 10(3–4):341–
356, 2003.
[Bol01] Benjamin M. Bolstad. Probe level quantile normalization of high density oligonucleotide
array data. Unpublished. Available at http://bmbolstad.com/stuff/qnorm.pdf, 2001.
[Bol04] Benjamin M. Bolstad. Low Level Analysis of High-density Oligonucleotide Array Data: Back-
ground, Normalization and Summarization. PhD thesis, University of California, Berkeley,
2004.
[CCV+03] Chris Cheadle, Marquis P. Vawter, William J. Freed, and Kevin G. Becker. Analysis of
Microarray Data Using Z Score Transformation. J Mol Diagn, 5(2):73–81, 2003.
[CHZP02] Carlo Colantuoni, George Henry, Scott Zeger, and Jonathan Pevsner. SNOMAD (standardization
and normalization of microarray data): web-accessible gene expression data analysis.
Bioinformatics, 18(11):1540–1541, 2002.
[DHS00] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (2nd Edition).
Wiley-Interscience, 2000.
[DP05] Patrik D’haeseleer. How does gene expression clustering work? Nat Biotech, 23:1499–
1501, 2005.
[ESBB98] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster
analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U.S.A.,
95(25):14863–14868, 1998.
[GHB04] Andre P. Gerber, Daniel Herschlag, and Patrick O. Brown. Extensive association of functionally
and cytotopically related mRNAs with Puf family RNA-binding proteins in yeast. PLoS
Biology, 2(3):E79, 2004.
[GSK+00] Audrey P. Gasch, Paul T. Spellman, Camilla M. Kao, Orna Carmel-Harel, Michael B. Eisen,
Gisela Storz, David Botstein, and Patrick O. Brown. Genomic Expression Programs in the
Response of Yeast Cells to Environmental Changes. Mol. Biol. Cell, 11(12):4241–4257, 2000.
[H+00] Timothy R. Hughes et al. Functional discovery via a compendium of expression profiles.
Cell, 102(1):109–126, 2000.
[Ham94] James Douglas Hamilton. Time Series Analysis. Princeton University Press, 1994.
[HKSL01] T.R. Hvidsten, J. Komorowski, A.K. Sandvik, and A. Laegreid. Predicting gene function
from gene expressions and ontologies. In Proc. Pacific Symposium on Biocomputing (PSB),
pages 299–310, 2001.
[HP00] Liisa Holm and Jong Park. DaliLite workbench for protein structure comparison. Bioinfor-
matics, 16(6):566–567, 2000.
[IHK+99] Vishwanath R. Iyer, Christine Horak, Laurent Kuras, Peter Kosa, Charles Scafe, David
Botstein, Kevin Struhl, Michael Snyder, and Patrick O. Brown. Genome-wide maps of
DNA-protein interactions using a yeast ORF and intergenic microarray. Nature Genetics,
23:53, 1999.
[JNR05] Anil K. Jain, Karthik Nandakumar, and Arun Ross. Score normalization in multimodal
biometric systems. Pattern Recognition, 38(12):2270–2285, 2005.
[Kar02] George Karypis. CLUTO - a clustering toolkit. Technical Report #02-017, University of
Minnesota, Department of Computer Science, 2002.
[LHM+03] Astrid Laegreid, Torgeir R. Hvidsten, Herman Midelfart, Jan Komorowski, and Arne K.
Sandvik. Predicting gene ontology biological process from temporal gene expression patterns.
Genome Research, 13(5):965–979, 2003.
[LRTS04] Jorge Lepre, John Jeremy Rice, Yuhai Tu, and Gustavo Stolovitzky. Genes@work: an efficient
algorithm for pattern discovery and multivariate feature selection in gene expression data.
Bioinformatics, 20(7):1033–1044, 2004.
[MBH+06] Chad L Myers, Daniel R Barrett, Matthew A Hibbs, Curtis Huttenhower, and Olga G
Troyanskaya. Finding function: evaluation methods for functional genomic data. BMC
Genomics, 7:187, 2006.
[MDJ+02] Alvaro Mateos, Joaquin Dopazo, Ronald Jansen, Yuhai Tu, Mark Gerstein, and Gustavo
Stolovitzky. Systematic learning of gene functional classes from DNA array expression data
by using multilayer perceptrons. Genome Research, 12(11):1703–1715, 2002.
[MPT+99] Edward M. Marcotte, Matteo Pellegrini, Michael J. Thompson, Todd O. Yeates, and David
Eisenberg. A combined algorithm for genome-wide prediction of protein function. Nature,
402(6757):83–86, 1999.
[NAWC02] Danh V. Nguyen, A. Bulak Arpat, Naisyin Wang, and Raymond J. Carroll. DNA microarray
experiments: biological and technological aspects. Biometrics, 58(4):701–717, 2002.
[PKS06] Gaurav Pandey, Vipin Kumar, and Michael Steinbach. Computational approaches for protein
function prediction: A survey. Technical Report 06-028, Department of Computer Science
and Engineering, University of Minnesota, October 2006.
[PTS+03] Christopher Potter, Pang-Ning Tan, Michael Steinbach, Steven Klooster, Vipin Kumar,
Ranga Myeni, and Vanessa Genovese. Major disturbance events in terrestrial ecosystems
detected using global satellite data sets. Global Change Biology, 9(7):1005–1021, 2003.
[PYK+03] Taesung Park, Sung-Gon Yi, Sung-Hyun Kang, SeungYeoun Lee, Yong-Sung Lee, and
Richard Simon. Evaluation of normalization methods for microarray data. BMC Bioin-
formatics, 4:33, 2003.
[Qua02] John Quackenbush. Microarray data normalization and transformation. Nature Genetics,
32:496–501, 2002.
[R+04] Andreas Ruepp et al. The FunCat, a functional annotation scheme for systematic classifica-
tion of proteins from whole genomes. Nucleic Acids Research, 32(18):5539–5545, 2004.
[S+01] Gavin Sherlock et al. The Stanford Microarray Database. Nucleic Acids Research, 29(1):152–
155, 2001.
[SBB04] Alok J. Saldanha, Matthew J. Brauer, and David Botstein. Nutritional Homeostasis in Batch
and Steady-State Culture of Yeast. Mol. Biol. Cell, 15(9):4089–4104, 2004.
[SBM96] Amit Singhal, Chris Buckley, and Mandar Mitra. Pivoted document length normalization.
In Proc. 19th ACM SIGIR Conference, pages 21–29, 1996.
[Slo02] Donna K. Slonim. From patterns to pathways: gene expression data analysis comes of age.
Nature Genetics, 32(Suppl):502–508, 2002.
[SMP+03] Michael Strong, Parag Mallick, Matteo Pellegrini, Michael J. Thompson, and David Eisen-
berg. Inference of protein function and protein linkages in Mycobacterium tuberculosis based
on prokaryotic genome organization: a combined computational approach. Genome Biology,
4(9):R59, 2003.
[SS03] Gordon K. Smyth and Terry Speed. Normalization of cDNA microarray data. Methods,
31(4):265–273, 2003.
[SSB04] Michael Shapira, Eran Segal, and David Botstein. Disruption of Yeast Forkhead-associated
Cell Cycle Transcription by Oxidative Stress. Mol. Biol. Cell, 15(12):5659–5669, 2004.
[STG+01] Eran Segal, Benjamin Taskar, Audrey Gasch, Nir Friedman, and Daphne Koller. Rich
probabilistic models for gene expression. In ISMB (Supplement of Bioinformatics), pages
243–252, 2001.
[SYK03] Eran Segal, R. Yelensky, and Daphne Koller. Genome-wide discovery of transcriptional
modules from DNA sequence and gene expression. In ISMB (Supplement of Bioinformatics),
pages 273–282, 2003.
[TCS+01] Olga G. Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert
Tibshirani, David Botstein, and Russ B. Altman. Missing value estimation methods for DNA
microarrays. Bioinformatics, 17(6):520–525, 2001.
[TPS+99] Pablo Tamayo, Donna Slonim, Jill Mesirov, Qing Zhu, Sutisak Kitareewan, Ethan Dmitrovsky,
Eric S. Lander, and Todd R. Golub. Interpreting patterns of gene expression with
self-organizing maps: Methods and application to hematopoietic differentiation. PNAS,
96(6):2907–2912, 1999.
[WJJ+02] Christopher Workman, Lars Jensen, Hanne Jarmer, Randy Berka, Laurent Gautier, Henrik
Nielser, Hans-Henrik Saxild, Claus Nielsen, Soren Brunak, and Steen Knudsen. A new
non-linear normalization method for reducing variability in DNA microarray experiments. Genome
Biology, 3(9):research0048.1–research0048.16, 2002.
[WXM+05] Wei Wu, Eric P. Xing, Connie Myers, I. Saira Mian, and Mina J. Bissell. Evaluation of
normalization methods for cDNA microarray data by k-NN classification. BMC Bioinformatics,
6:191, 2005.
[YDL+02] Yee Hwa Yang, Sandrine Dudoit, Percy Luu, David M. Lin, Vivian Peng, John Ngai, and
Terence P. Speed. Normalization for cDNA microarray data: a robust composite method
addressing single and multiple slide systematic variation. Nucleic Acids Research, 30(4):e15,
2002.
[YDRL06] Golan Yona, William Dirks, Shafquat Rahman, and David M. Lin. Effective similarity
measures for expression profiles. Bioinformatics, 22(13):1616–1622, 2006.
[ZSK+05] Pusheng Zhang, Michael Steinbach, Vipin Kumar, Shashi Shekhar, Pang-Ning Tan, Steve
Klooster, and Chris Potter. Discovery of patterns of Earth science data using data mining.
In Jozef Zurada and Medo Kantardzic, editors, Next Generation of Data Mining Applications.
Wiley-IEEE Press, 2005.
[ZSV+00] G. Zhu, P. T. Spellman, T. Volpe, P. O. Brown, D. Botstein, T. N. Davis, and B. Futcher.
Two yeast forkhead genes regulate the cell cycle and pseudohyphal growth. Nature, 406:90–
94, 2000.