technical report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · technical report...

29
Systematic Evaluation of Normalization Methods for Gene Expression Data Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building 200 Union Street SE Minneapolis, MN 55455-0159 USA TR 07-015 Systematic Evaluation of Normalization Methods for Gene Expression Data Gaurav Pandey, Lakshmi Naarayanan Ramakrishnan, Michael Steinbach, and Vipin Kumar June 08, 2007

Upload: others

Post on 18-Jan-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

Systematic Evaluation of Normalization Methods for Gene Expression Data

Technical Report

Department of Computer Science

and Engineering

University of Minnesota

4-192 EECS Building

200 Union Street SE

Minneapolis, MN 55455-0159 USA

TR 07-015

Systematic Evaluation of Normalization Methods for Gene Expression

Data

Gaurav Pandey, Lakshmi Naarayanan Ramakrishnan, Michael

Steinbach, and Vipin Kumar

June 08, 2007

Page 2: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building
Page 3: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

Systematic Evaluation of Normalization Methods for Gene

Expression Data

Gaurav Pandey, Lakshmi Naarayanan Ramakrishnan, Michael Steinbach and Vipin Kumar

Department of Computer Science and Engineering

University of Minnesota, Twin Cities, USA

Contact: [email protected]

Abstract

Motivation: Even after an experimentally prepared gene expression data set has been pre-processed

”to adjust for effects which arise from variationin the microarray technology”, there may be incon-

sistencies between the scales of measurements in different conditions. A variety of normalization and

transformation methods have been used for addressing these scale differences in different studies on

the analysis gene expression data sets. However, a quantitative estimation of their relative perfor-

mance has been lacking in this domain.

Results: In this paper, we report an extensive evaluation of normalization and transformation meth-

ods for their effectiveness, with respect to the important application of protein function prediction.

We consider several such commonly used methods for gene expression data, such as z-score and

quantile normalization, and diff transformation, and two novel normalization methods, sigmoid and

double sigmoid, that have not been used previously in this domain. Using a variety of evaluation

methodologies based on the prediction of protein function from gene expression data, we conduct an

extensive evaluation of these methods, and show that their performance can vary significantly across

different data sets. We also provide evidence that the two types of gene expression data, namely

temporal and non-temporal, need different types of analyses in order to use them effectively for un-

covering functional information.

Availability: All source code is available on request from the authors.

1 Introduction

Gene expression experiments are a method to quantitatively measure the transcription phase of protein

synthesis from several genes under a given condition [NAWC02]. Due to the ability of these experiments

to provide an insight into the simulatenous activity of thousand of genes, the data produced by these

experiments, also known as microarray data, is used for a variety of biological studies [Slo02]. For instance,

the similarity of the expression values of a pair of genes over a set of related conditions, referred to as

their expression profiles, indicates that they are expressed or repressed under the same set of conditions

approximately, and thus are expected to be functionally related.

1

Page 4: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

Conditions

Gen

es

10 20 30 40 50 60 70

200

400

600

800

1000

1200

1400

1600

1800

2000 −5

0

5

Figure 1: Visualization of Eisen et al [ESBB98]’s data set shows a set of conditions on different scale

A necessary step for the effective analysis of gene expression data is its pre-processing [Qua02], where

the data is processed to reduce the ill-effects of various data corruption phenomena, such as noise and

missing values as ”to adjust for effects which arise from variation in the microarray technology” [SS03]

used to generate the gene expression data sets. Indeed, several such pre-processing methods have been

devised for gene expression data, that are focused towards removing the biases and variances coming

from different components of the microarray experimental procedure, such as LOWESS [YDL+02] and

SNOMAD [CHZP02]. Each of these methods are appropriate for different scenarios, and some studies

have evaluated them under several such scenarios [PYK+03, WXM+05].

Even after an experimentally prepared gene expression data set has been pre-processed using the

above methods, there may be inconsistencies between the scales of measurements in different conditions.

For example, consider [ESBB98]’s data set, which is composed of eight different temporal expression

measurements, and is one of the widely used expression data sets for S. cerevisiae (budding yeast). As

can be seen in Figure 1, in this data set, conditions 48-58 have an entirely different scale of values as

compared to the rest. This inconsistency in scale is expected to affect the analysis of this data, e.g., the

correlation between two gene expression profiles, and thus needs to be addressed via further processing. A

variety of normalization and transformation methods have been used for this task in different studies on

the analysis gene expression data sets [TPS+99, CCV+03, BIB03, Bol01, LHM+03, BHWK04]. However,

a quantitative estimation of their relative performance has been lacking in this domain. In this paper, we

evaluate several such methods for their effectiveness with respect to the important application of protein

function prediction, for which gene expression data has been widely used [ESBB98, BDSY99, BGL+00,

MDJ+02, PKS06].

2

Page 5: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

1.1 Contributions

This paper makes the following contributions:

1. We consider several commonly used normalization and transformation methods for gene expres-

sion data, such as z-score and quantile normalization, and diff transformation, and two novel

normalization methods, sigmoid and double sigmoid, that have not been used in this domain.

2. Using a variety of evaluation methodologies based on the prediction of protein function from gene

expression data, we conduct an extensive evaluation of these methods, and show that their perfor-

mance can vary significantly across different data sets.

3. We provide evidence that the two types of gene expression data, namely temporal and non-temporal,

need different types of analyses in order to use them effectively for uncovering functional information.

2 Materials and Methods

This section details the gene expression data sets used in this study, the normalization and transformation

methods applied to them, and the methodologies used to evaluate these methods.

2.1 Data Sets

Gene expression data can be represented as a data matrix where the rows represent genes and the columns

represent either individual conditions or the measurement under a particular condition at different times.

For the effective analysis and normalization of gene expression data sets, it is essential to distinguish

between these two types of data sets—temporal and non-temporal—which we now define more fully.

1. Temporal: The experiments in these data sets measure the expression behavior of genes that have

been exposed to a certain condition at different instances of time. Thus, there is a well defined

relationship between consecutive columns in these data sets.

2. Non-temporal: These expression data sets are prepared by combining results from experiments

that do not have a temporal relationship with each other. Although they may be related because

they provide a comprehensive view of the phenomenon they are designed to study, they can also be

analyzed independently.

Even though some studies have lately started addressing the temporal aspect of gene expression [BJ04,

HKSL01], most microarray data analysis studies ignore this distinction, and apply the same normalization

methods to all data sets. Since there is a fundamental difference between the nature of these data, we

believe that different analysis techniques need to be applied to them, and thus distinguish between them

in our study. We also provide evidence supporting this observation in Section 4.

In accordance with the above distinction, we selected several data sets of the two types for S. cerevisiae

(budding Yeast), which are summarized in Tables 1 and 2. We chose this organism since substantial

information is available about the functions of its genes. Most of these data sets were obtained from

the webpage of Yona et al [YDRL06]’s study 1 and SMD [S+01]. The KNNImpute program [TCS+01]

1http://biozon.org/tools/expression/instructions.html

3

Page 6: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

Reference #Genes #Conditions

Gerber et al [GHB04] 6303 10Hughes et al [H+00] 6316 300Iyer et al [IHK+99] 6251 12

Saldanha et al [SBB04] 6314 24

Table 1: Summary of non-temporal gene expression data sets used in this study

Reference #Genes #Time Points #Time Series

Zhu et al [ZSV+00] 5714 26 2Shapira et al [SSB04] 4771 70 4

Table 2: Summary of temporal gene expression data sets used in this study

was used to impute missing values. Also, it should be noted that all the temporal data sets that we

used consisted of several expression time series experiments, as detailed by the last column of Table 2.

Section 2.3 discusses how each of these time series is processed, and the results combined to generate a

processed version of the complete temporal data set.

2.2 Normalization Methods for Gene Expression Data

In what follows, we assume that all the data sets have been pre-processed using standard methods

to handle experimental errors, and will focus on normalization approaches that handle the problem of

differences in scale. This problem may manifest itself in a gene expression data set in forms like the

following:

• In the case of data sets constructed by concatenating several experiments, some components may

be at a different scale than the others. For instance, as mentioned earlier, conditions 48-58 of Eisen

et al [ESBB98]’s data set, shown in Figure 1 have a different scale than the others.

• There might be extreme or outlying values in the data set, which may affect the data analysis

adversely.

• Some genes may express more easily than others, thus creating a bias in the expression values. For

instance, Myers et al [MBH+06] showed that a significant fraction of the set of most co-expressed

genes in their data set belonged to the ribosome pathway, thus suggesting that genes belonging to

this pathway may express more easily than others.

Following is a brief description of normalization methods that attempt to reduce the effect of these scale

issues on the analysis of gene expression data. All these methods are applied to the columns of the gene

expression matrix, except for Quantile normalization, which does this implicitly.

2.2.1 Unitnorm Normalization

In linear algebra, a common way of bringing a set of vectors to the same scale is to transform them to

unit vectors by dividing each component of a vector X by its length as indicated in Equation 1.

4

Page 7: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

Unitnorm(X) =X

||X||2(1)

This normalization has been used in various domains, such as text mining, where this method is used

to reduce the effect of varying document lengths [SBM96].

2.2.2 Z-score-based Normalization

In statistics, a common method of making different data vectors comparable is by shifting the values

in a vector by the mean of their values, and scaling them by their standard deviation. This converts

each value into a deviation from the mean. If the mean of a data vector is 0, then this normalization

is the same as the previous normalization, except for a constant. The mathematical formula for this

normalization, which we refer to as Znorm, is shown in Equation 2.

Znorm(X) =X − X̄

σX

(2)

This approach, which is known as standardization in statistics, also has many biological applications,

e.g., the evaluation of protein structure alignment scores [HP00]. Indeed, z-scores are also commonly

used for the normalization of gene expression data [TPS+99, CCV+03]. Also, some other normalizations,

such as those used by Bergmann et al [BIB03], are mathematically equivalent to Znorm.

2.2.3 Quantile Normalization

This is another popular normalization method for gene expression data [Bol01, Bol04]. It attempts to

transform data from two different distributions to a common distribution by making the quantiles of the

distributions equal. This has the potential for adjusting for differences beyond those of location and scale.

Bolstad [Bol01] has suggested the following algorithm for using this method for normalizing a given set

of vectors, assuming that there are no missing values.

Quantile Normalization Algorithm

1. Given n vectors of length p, form X of dimension p × n, where each vector is a column in X .

2. Sort each column of X to give Xsort.

3. Take the means across rows of Xsort and assign this mean to each element in the row to get a

quantile equalized X′

sort.

4. Get Xnormalized by rearranging each column of X′

sort to have the same ordering as the original X .

Various other types of quantile-based data normalization have been proposed [WJJ+02]. For ease

of implementation, we used the quantilenorm function in the MATLAB bioinformatics toolbox2, which

implements Bolstad’s algorithm detailed above.

2http://www.mathworks.com/products/bioinfo/

5

Page 8: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

−4 −2 0 2 4 6 80

200

400

600

800

1000

1200

Expression values

Fre

quen

cy

Figure 2: Normal-like distributions of expression values in column 5 of Gerber et al [GHB04]’s data set

2.2.4 Sigmoid Family of Normalizations

The final normalization considered, takes a distribution-oriented approach, which has been reported by

several authors as being important for effective processing of the set of values obtained from a gene

expression experiment [LRTS04, YDRL06]. For instance, Figure 2 shows the distribution of expression

values in the 5th experiment (column) of Gerber et al [GHB04]’s data set (matrices) respectively. In

addition to the very normal-like distribution shown by this histograms, it can be observed that there are

several outlying values in this vector, such as those lying outside of the range [−2, 2]. It is important

to consider both the underlying distribution and the presence of outliers in order to develop an effective

normalization method.

Here, we first introduce a normalization method that take into account the consideration that ex-

treme or outlying values should not distort the data analysis significantly. This family is based on the

sigmoid function, defined in Equation 3, which is commonly used to introduce non-linearity in neural

networks [DHS00].

Sigmoid(x) =1 − e−x

1 + e−x(3)

An interesting property of this function is that the extreme values in the input are bounded by ±1. This

implicitly eliminates the effects of extreme values in the data, thus making it suitable for normalizing gene

expression data. However, the Sigmoid(x) function has the significant weakness that it takes only the

value of x, and not the background distribution of x, into account when determining the final normalized

value. Another important point to consider is that in the case of gene expression data, a value of 0 in the

log-ratio matrix of the original red and green intensities is defined as the expression value of a gene that is

6

Page 9: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

−10 −5 0 5 10−1

−0.5

0

0.5

1

x

dsig

moi

d(x)

Figure 3: Double sigmoid function

completely neutral under the given condition. However, due to noise factors, this value may be distorted

by a small amount. Thus, it may be useful for some experiments that values in a small neighborhood of

zero be treated effectively as zero. The standard Sigmoid function doesn’t do this.

These considerations indicate that the Sigmoid function should be modified so that it takes the

distribution of values into account, and also does a local smoothing around the center of the distribution.

The second factor can be accounted for by breaking the sigmoid function into two ranges [−1, 0) and

(0, 1] and defining separate modifications of the Sigmoid function on them, thus converting the original

function into the double sigmoid (Dsigmoid) function. Though this function has been mathematically

formulated and used in other studies [JNR05], we chose the formulation of Equation 43, since it enables

us to incorporate the first factor also:

Dsigmoid(x) = sign(x − d)(1 − exp(−(x − d

s)2))) (4)

Here d and s are the centering and the steepness factors of the function respectively. A sample output of

the Dsigmoid function is shown in Figure 3. It can be observed from this plot and Equation 4 that our

formulation of the double sigmoid function is very close to the probability distribution function of the

normal distribution N(X̄, σ2X

), if d = X̄ and s = σ2X

for a vector X. This observation further supports the

use of Dsigmoid for normalizing gene expression data, since there is evidence that the distribution of gene

expression values is approximately normal [STG+01, SYK03]. This perspective is further strengthened

by Figure 2, which illustrates that the distribution of expression values under a certain condition is close

to normal. Thus, due to these considerations, we used the double sigmoid normalization as one of the

normalization methods in this study. The method was implemented by transforming each value Xi in a

vector X to Dsigmoid(Xi) using Equation 4.

3http://en.wikipedia.org/wiki/Sigmoid function

7

Page 10: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

2.3 Transformation Methods for Temporal Expression Data

As can be seen from the descriptions above, most normalization methods normalize each vector indepen-

dent of every other. This assumption is valid for non-temporal expression data sets, since there is no

relationship between different columns of the data matrix. However, in temporal expression data sets, the

expression of genes is measured at consecutive time points in an experiment, and thus there are inherent

relationships among the different columns of a temporal expression data matrix. The interrelationships

between them need to be factored into the normalization process.

A common method for this task is to use the values at a set of consecutive time points to derive a new

time series that is better suited for the desired application. We refer to this process as a transformation,

and investigated the following methods for this task:

1. Smoothing by moving average: A popular method of analyzing time series data is by smoothing

the time series values in a sliding window of duration k by averaging them. This is known as the

moving average (MA) method [Ham94]. This method has not been used for gene expression data,

but is popular method in many domains for analyzing time series data, e.g., climate data [PTS+03].

The transformation from X1...n to X′

1...n−2, both being time series, is defined by Equation 5. This

method provides several advantages, such as smoothing, the reduction of the effect of outliers and

noise, and compensation for small phase lags between two time series.

X ′

i =1

k

t=i+k−1∑

t=i

Xt (5)

We used the commonly used value of k = 3 in our implementation.

2. Differences between consecutive points: Another method of taking the structure of time series

into account is to define new features whose values are the differences between the values at two

consecutive points in a time series. Doing this removes the effect of the magnitude of the time

series,but preserves the magnitude of the changes from one time period to another.

This method transforms the original time series vector X1...n into a new vector X′

1...(n−1) using the

simple formula X ′

i = Xi+1 − Xi, and thus takes only the trend of change between the time points

into account, and not the absolute values. This helps reduce the effect of offsets in the values that

may be due to experimental error or other factors, and has been used for the functional classification

of temporal gene expression data [HKSL01, LHM+03]. We refer to this method as the Diff method

in the subsequent discussion.

3. Z-score: In many instances, time series are best compared by considering only deviations from the

average. This can be accomplished by using the Z − score, which was given in Equation 2. Indeed,

this method has been shown to be useful in several time series applications, such as the mining

of Earth science data [ZSK+05], and has also been used for the transformation of temporal gene

expression data for subsequent analysis [TPS+99, BHWK04]. Thus, we used this method in our

analysis, and it is referred to as the Ztrans method in the following discussion.

Finally, it should be noted that the temporal expression data sets used in this study contained multiple

time series as detailed in Table 2. Since each of the above transformations is defined only for a single

8

Page 11: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

time series, each transformation is applied separately to each time series, and the final transformed

gene expression matrix is obtained by concatenating the transformed vector for each gene individually.

This process makes sense from a data analysis perspective since each of the time series may follow

different distributions of values and thus need to be analyzed separately, followed by a combination of the

results. Notably, this is in contrast to the approach adopted by most analyses of time series expression

data [SSB04, GSK+00], which normalize the entire expression profile of a gene as a whole.

In summary, the temporal gene expression data sets, say M , are normalized using the following

two-step algorithm:

1. Transform M using the above three transformations.

2. Apply the normalization methods discussed in Section 2.2 to the columns of each of the three

matrices produced above, as well as to the raw data matrix.

This algorithm produces 3×6 = 18 normalized matrices corresponding to each transformation-normalization

combination. In addition, we also included all normalized matrices produced by applying only the nor-

malization methods described in Section 2.2 to M .

2.4 Similarity Measure

An important component of our study is the measure used to estimate the proximity, i.e., similarity or

distance, between the expression profiles of two genes. For this purpose, we used Pearson’s correlation

coefficient, which has been commonly used for gene expression data [DP05]. In our experiments also,

correlation produced consistently better results than other measures such as cosine similarity and Eu-

clidean distance. We also considered the recently proposed mass-distance measure [YDRL06], which has

shown promising results for some expression data sets. However, due to the lack of a readily usable

implementation of this algorithm, we have left its use for future work.

3 Evaluation Methodology

Suppose an expression matrix M has been normalized using a method A (with or without the row

transformation step) to produce MA. In order to evaluate the method A in terms of their effectiveness

in magnifying the available functional information in M , we examined the pairwise links between genes

ranked by their expression similarity in MA to see if the most highly ranked links tend to connect

genes with similar function. The functional evidence is derived from three sources, namely Yona et

al [YDRL06]’s data set of relationships, SwissProt keywords [BBA+03] and the FunCat classification

scheme [R+04], each of which represents a different form of the definition of protein function. We chose

to perform our evaluation with this variety of sources of functional relationships since protein function

is not a precisely defined concept. This evaluation process is applied to each of the normalized versions

of M , and the results are compiled in order to judge the relative performances of different normalization

methods. The following subsections provide the details of the three types of functional information and

how they are used for the evaluation.

9

Page 12: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

3.1 Recovery of Observed Functional Relationships (ObservedFuncRels)

This knowledge consists of experimentally observed interactions between pairs of proteins in an organism.

We used a set of 41902 such pairwise interactions used by Yona et al [YDRL06] in their evaluation study

of the most appropriate similarity measure for gene expression profiles. This set is constructed using

four types of interactions, namely (i) physical protein-protein interactions, (ii) metabolic pathway co-

membership of enzymes, (iii) co-regulation by the same promoter, and (iv) co-membership in sequence

homology clusters.

For this set of functional relationships, the evaluation methodology used was also the same as that

adopted by Yona et al [YDRL06]. Here, the pairwise similarities are calculated among all expression

profiles in the given data set, and the protein pairs are sorted in descending order according to their

corresponding expression similarities. Now, starting from the most similar gene pair, the total number

of pairs which are known to be functionally related according to the above set, are cumulatively added.

These numbers can then be used to produce a plot of the number of true functional relationships recovered

versus the number of protein pairs analyzed in the order of decreasing similarity.

3.2 SwissProt Keyword Recovery (KeywordRecovery)

The next category of functional information we used was derived from keywords assigned to protein

entries in the SwissProt database [BBA+03]. Each protein in this database is assigned a set of keywords,

such as hydrolase, DNA binding and repressor, that indicate a form of functional information about it.

We considered only 179 of these keywords that had an annotation frequency of between 10 and 600 among

the proteins of S. cerevisiae. This selection eliminates the keywords that are over-abundant or very rare

and may distort a frequency-based analysis.

In this methodology, the well-known keyword recovery (KR) metric is used for evaluation. This is a

commonly used metric when the result of the prediction algorithm are pairwise functional links between

proteins [MPT+99, SMP+03]. Given a query protein P , and the set of proteins directly linked to it

Nbd(P ), the keyword recovery performance of an algorithm is computed using Equation 6.

KR =1

N

N∑

i=1

x∑

j=1

Fij

Fi

(6)

Here, Fi is the total frequency of all the keywords appearing in Nbd(P ), Fij is the frequency of the jth

keyword of P in Nbd(P ), x is the total number of keywords of P , and N is the total number of proteins in

the network. Thus, KR measures the recall of the original keyword annotations in the resulting protein

relationship graph.

The use of this method for evaluating normalization methods is straightforward. Once the gene

expression matrix M has been normalized using a method A, the k most similar gene pairs in terms of their

normalized expression profiles are treated as edges between two nodes, and a graph GA,k is constructed

by combining all the edges so found. The quality of this graph is evaluated using the KR metric defined

in Equation 6. Now, increasing the value of k from numedgesmin to numedgesmax produces a set of

keyword recovery values for method A, and this process is repeated for all the normalization methods

considered. Finally, the different methods are compared by plotting these sets of KR values together.

10

Page 13: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

3.3 Similarity of Functional Labels (SimFuncLabels)

The third type of functional relationships we used are derived from the annotations of yeast proteins

using the FunCat functional classification scheme [R+04]. Since FunCat is a hierarchical scheme, we used

the 81 classes at a depth of two from the root of the hierarchy.

Now, for the proteins annotated with these classes, an annotation vector is constructed for each

protein, containing 1 for the classes that the protein is annotated with, and 0 for the others. Once this

set of vectors has been constructed for all the proteins, it is partitioned using the CLUTO clustering

toolkit [Kar02]. Then, functional links can be derived from these clusters by simply treating each cluster

as a clique, and two proteins are considered functionally related if they are part of the same clique. In

order to avoid spurious relationships, we chose the minimum size of each cluster to be 20 in order for

it to be used for extracting functional relationships, since we observed that most of the small data sets

had a low internal similarity, and thus were not very tight. The remaining clusters had a high internal

similarity (usually over 0.8), which implies that a large majority of the protein in these clusters had an

identical set of labels, thus justifying their use for extracting functional relationships.

Finally, once the set of relationships has been derived, the evaluation methodology is completely

identical to that adopted in the ObservedFuncRels methodology, since the experimentally observed rela-

tionships there could be conveniently replaced by these functional label-based relationships.

4 Results

In this section, we discuss the application of the overall evaluation methodology to several gene expression

data sets. We present results for the two types of expression data, namely non-temporal and temporal

separately due to the conceptual difference between the two types discussed in Section 2.1.

We only show the portions of the complete plots that correspond to the most highly ranked gene

pairs in terms of their expression similarity, as they are the ones expected to include the most function

information. These plots in this section are best viewed in color. Also, in addition to the results of

the normalization methods, we also plot the results obtained from the unnormalized data set, and those

obtained by a random selection of gene pairs. For the ObservedFuncRels and SimFuncLabels evaluation

methods, the random selection results were approximated by the following line:

y = (total # functional links

# possible links for evaluation methodology) × x

On the other hand, results for random selection with the KeywordRecovery method are obtained by

randomly selecting k gene pairs for various values of k using a uniform distribution, and evaluating the

KR metric for each of the graphs so constructed.

4.1 Evaluation of normalization methods for non-temporal gene expression

data

For this evaluation, we applied the following normalizations listed in Section 2.2 to the columns of these

data sets: Unitnorm, Znorm, Sigmoid, Dsigmoid, and Quantile. Figures 4-7 show the results of evalu-

11

Page 14: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

100

200

300

400

500

600

Gene pairs ordered by similarity

Num

ber

of fu

nctio

nal r

elat

ions

hips

unc

over

ed

RawUnitnormZnormQuantileSigmoidDsigmoidRandom

(a) Results using ObservedFuncRels

0 200 400 600 800 1000 1200 1400 1600 1800 20000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Gene pairs ordered by similarity

Key

wor

d re

cove

ry

RawUnitnormZnormQuantileSigmoidDsigmoidRandom

(b) Results using KeywordRecovery

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

500

1000

1500

2000

2500

3000

3500

4000

Gene pairs ordered by similarity

Num

ber

of fu

nctio

nal r

elat

ions

hips

unc

over

ed

RawUnitnormZnormQuantileSigmoidDsigmoidRandom

(c) Results using SimFuncLabels

Figure 4: Evaluation of normalization methods on Gerber et al [GHB04]’s non-temporal data set

12

Page 15: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

500

1000

1500

Gene pairs ordered by similarity

Num

ber

of fu

nctio

nal r

elat

ions

hips

unc

over

ed

RawUnitnormZnormQuantileSigmoidDsigmoidRandom

(a) Results using ObservedFuncRels

0 200 400 600 800 1000 1200 1400 1600 1800 20000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Gene pairs ordered by similarity

Key

wor

d re

cove

ry

RawUnitnormZnormQuantileSigmoidDsigmoidRandom

(b) Results using KeywordRecovery

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

500

1000

1500

2000

2500

3000

3500

4000

Gene pairs ordered by similarity

Num

ber

of fu

nctio

nal r

elat

ions

hips

unc

over

ed

RawUnitnormZnormQuantileSigmoidDsigmoidRandom

(c) Results using SimFuncLabels

Figure 5: Evaluation of normalization methods on Hughes et al [H+00]’s non-temporal data set

13

Page 16: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

50

100

150

200

250

300

350

400

Gene pairs ordered by similarity

Num

ber

of fu

nctio

nal r

elat

ions

hips

unc

over

ed

RawUnitnormZnormQuantileSigmoidDsigmoidRandom

(a) Results using ObservedFuncRels

20 40 60 80 100 120 140 160 180 2000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Gene pairs ordered by similarity

Key

wor

d re

cove

ry

RawUnitnormZnormQuantileSigmoidDsigmoidRandom

(b) Results using KeywordRecovery

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

100

200

300

400

500

600

700

800

900

Gene pairs ordered by similarity

Num

ber

of fu

nctio

nal r

elat

ions

hips

unc

over

ed

RawUnitnormZnormQuantileSigmoidDsigmoidRandom

(c) Results using SimFuncLabels

Figure 6: Evaluation of normalization methods on Iyer et al [IHK+99]’s non-temporal data set

14

Page 17: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

100

200

300

400

500

600

700

800

900

1000

Gene pairs ordered by similarity

Num

ber

of fu

nctio

nal r

elat

ions

hips

unc

over

ed

RawUnitnormZnormQuantileSigmoidDsigmoidRandom

(a) Results using ObservedFuncRels

20 40 60 80 100 120 140 160 180 2000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Gene pairs ordered by similarity

Key

wor

d re

cove

ry

RawUnitnormZnormQuantileSigmoidDsigmoidRandom

(b) Results using KeywordRecovery

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

200

400

600

800

1000

1200

Gene pairs ordered by similarity

Num

ber

of fu

nctio

nal r

elat

ions

hips

unc

over

ed

RawUnitnormZnormQuantileSigmoidDsigmoidRandom

(c) Results using SimFuncLabels

Figure 7: Evaluation of normalization methods on Saldanha et al [SBB04]’s non-temporal data set

15

Page 18: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

ation according to the (a) ObservedFuncRels, (b) KeywordRecovery and (c) SimFuncLabels evaluation

methodologies for four non-temporal expression data sets. The following general observations that can be

made from these results. Firstly, for almost all these data sets, almost all the normalizations are able to

extract more accurate functional relationships than those extracted from the raw unnormalization version

of the data sets. This indicates that normalization, even using simple methods, is able to enhance the

functional content of most non-temporal gene expression data sets.

Examining the results more closely, we observe from Figure 4 that the Dsigmoid method performs well

for the corresponding data sets. For instance, when applied to the columns of the data matrix of Gerber

et al [GHB04]’s data set, it is able to outperform substantially almost all other methods in Figures 4(b)

and 4(c), and is close to the top performer in Figure 4(a). In Figures 5(a–c), the Dsigmoid method is

among the top performers.

In another set of results, Figures 6 and 7 show that for some non-temporal expression data sets,

the Unitnorm normalization produces the best results. This observation is supported most strongly by

Figures 6(c) and Figure 7(c), which show the performance of various normalization methods on Iyer

et al [IHK+99]’s and Saldanha et al [SBB04]’s data sets, according to the SimFuncLabels evaluation

methodology. It can also be noticed that for these data sets, Dsigmoid also produces good results, as

shown by Figures 6(a) and 7(a), thus indicating its utility for normalizing various types of non-temporal

expression data sets. We believe that the better performance of Unitnorm as compared to Dsigmoid for

these data sets is because of the relatively smaller fraction of extreme values in their columns, due to

which their norms are not affected adversely.

Last but not the least, we observe that the Quantile, Znorm and Sigmoid normalization methods

also generally produce functionally richer matrices than the raw gene expression data. Thus, although

they do not produce significantly better performance for the variety of non-temporal data sets consid-

ered, it is possible that they are able to outperform the other methods for data sets that have different

characteristics than the ones considered. In summary, the normalization of non-temporal gene expression

data is a desirable pre-processing step that enables the extraction of more accurate information about

protein function than that obtained from the raw data set that may have been processed only to remove

experimental biases and errors.

4.2 Evaluation of normalization methods for temporal gene expression data

We now present the results of our evaluation of normalization methods for temporal expression data sets,

which are applied using the algorithm in Section 2.3. We tried all combinations of row transformation

methods (no transformation (Raw), MA, Diff and Ztrans) and column normalization methods (no nor-

malization Raw, Unitnorm, Znorm, Sigmoid, Dsigmoid and Quantile). However, to simplify presentation,

we show results only for the best column normalization for each row transformation method. The best

methods are identified using the area under the curve of the plots produced by the respective evaluation

methodology.

Figures 8 and 9 show the results produced by the (a) ObservedFuncRels and (b) KeywordRecovery

and (c) SimFuncLabels evaluation methodologies on Zhu et al [ZSV+00]’s and Shapira et al [SSB04]’s

data sets respectively. The following observations can be made from these results:

1. In all the plots, there is at least one transformed and/or normalized version of the raw data set

16

Page 19: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

100

200

300

400

500

600

700

800

Gene pairs ordered by similarity

Num

ber

of fu

nctio

nal r

elat

ions

hips

unc

over

ed

RawMA_ZnormDiff_UnitnormZtrans_DsigmoidRaw_DsigmoidRandom

(a) Results using ObservedFuncRels

0 200 400 600 800 1000 1200 1400 1600 1800 20000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Gene pairs ordered by similarity

Key

wor

d re

cove

ry

RawMA_QuantileDiff_RawZtrans_RawRaw_SigmoidRandom

(b) Results using KeywordRecovery

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

1000

2000

3000

4000

5000

6000

Gene pairs ordered by similarity

Num

ber

of fu

nctio

nal r

elat

ions

hips

unc

over

ed

RawMA_ZnormDiff_RawZtrans_RawRaw_SigmoidRandom

(c) Results using SimFuncLabels

Figure 8: Evaluation of normalization methods on Zhu et al [ZSV+00]’s temporal data set

17

Page 20: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

100

200

300

400

500

600

Gene pairs ordered by similarity

Num

ber

of fu

nctio

nal r

elat

ions

hips

unc

over

ed

RawMA_RawDiff_UnitnormZtrans_UnitnormRaw_UnitnormRandom

(a) Results using ObservedFuncRels

0 200 400 600 800 1000 1200 1400 1600 1800 20000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Gene pairs ordered by similarity

Key

wor

d re

cove

ry

RawMA_SigmoidDiff_UnitnormZtrans_UnitnormRaw_SigmoidRandom

(b) Results using KeywordRecovery

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

1000

2000

3000

4000

5000

6000

7000

8000

Gene pairs ordered by similarity

Num

ber

of fu

nctio

nal r

elat

ions

hips

unc

over

ed

RawMA_SigmoidDiff_UnitnormZtrans_UnitnormRaw_SigmoidRandom

(c) Results using SimFuncLabels

Figure 9: Evaluation of normalization methods on Shapira et al [SSB04]’s temporal data set

18

Page 21: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

that produces better results than the raw data set. This shows that normalization can be useful

for enhancing the functional content of temporal data sets.

2. For all the evaluations on both the data sets, some combination of a transformation before nor-

malization produces significantly better results than just applying a normalization method to the

columns of the data matrix. For instance, the Ztrans method (followed by Dsigmoid normalization)

produces the best results in Figure 8(a), while the Diff method (followed by Unitnorm normaliza-

tion) performs the best in Figure 9(a).

3. Almost all the results show that the Ztrans method of time series transformation produces the

best results among all transformation methods, such as in Figures 9(b) and 9(c). Interestingly, the

commonly used Diff, and MA transformations produce worse results than the raw data set in most

of the evaluations.

4. Finally, Figures 8(b) and 8(c) show that the best performing version of the data set is that produced

by applying the Ztrans method to the original matrix, and no normalization subsequently. These

plots show this method produces a small gain over the raw data matrix, which produced equiv-

alent results as those produced by applying a simple z − score-based transformation to an entire

gene expression profile, since correlation is a scale- and shift-invariant measure. These small but

consistent gains, which are also observed for other data sets (results omitted for brevity), indicate

that transforming each constituent time series in a data set separately, and combining the resultant

series, is indeed important for getting more accurate functional information.

In summary, transformation and normalization of temporal gene expression data does indeed enhance

the functional content of temporal gene expression data sets. However, the fact that only one or a few

more simple transformation and normalization methods are able to outperform the raw data set, and

the lack of a consensus on the best normalization method to be used after the Ztrans transformation,

indicates that the simple methods adopted in this study, though useful, may not be able to extract all

possible information from the temporal data sets. Hence, it may be fruitful to use more sophisticated

techniques such as normalized B-splines [BJGG+03] for such analyses.

4.3 Difference between non-temporal and temporal expression data

In the results presented earlier, we handled non-temporal and temporal gene expression data sets dif-

ferently – on non-temporal data, we performed column normalization only, while on temporal data, we

performed row transformation before column normalization. The hypothesis underlying this distinction

is that the set of experiments in a time series expression measurement are related to each other. As

noted in Section 4.2, the best results for temporal data sets are always obtained when a transformation is

applied to the time series (rows) before any of the normalization methods is applied to the experiments

(columns). Furthermore, column normalization methods, used by themselves (i.e., without row transfor-

mation), generally produce comparable or worse results than the raw expression data set. This shows

that using a row transformation is useful for temporal expression data sets, and skipping this step, as in

non-temporal data sets, may lead to loss of information.

19

Page 22: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

500

1000

1500

2000

2500

Gene pairs ordered by similarity

Num

ber

of fu

nctio

nal r

elat

ions

hips

unc

over

ed

RawZtrans_QuantileDsigmoid

(a) Gerber et al [GHB04]’s non-temporal data set normalizedboth as a non-temporal and a temporal data set

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

200

400

600

800

1000

1200

Num

ber

of fu

nctio

nal r

elat

ions

hips

rec

over

ed

RawZtrans_UnitnormUnitnorm

(b) Saldanha et al [SBB04]’s non-temporal data set normalizedboth as a non-temporal and a temporal data set

Figure 10: Comparison of the best normalizations for non-temporal and temporal expression data sets

Now, in order to show that row transformation is not useful for non-temporal expression data sets,

we analyzed the effect of using a row transformation followed by column normalization of a non-temporal

data set, and compared it to the best column normalization method for it. In accordance with the

definitions of the transformation methods used, we treated the entire expression profile of a gene as a

time series and the individual measurements as consecutive time points.

Figure 10 illustrates the results of this analysis for two data sets, namely those of Gerber et al [GHB04]

and Saldanha et al [SBB04], performed using the SimFuncLabels evaluation methodology. For brevity, we

only show the results of the best normalization and the best transformation-normalization combination

for these data sets. Both the plots show that for these non-temporal data sets, using only a normalization

method produces significantly better results than those produced by applying a row transformation before

20

Page 23: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

50

100

150

200

250

300

350

Gene pairs ordered by similarity

Num

ber

of fu

nctio

nal r

elat

ions

hips

unc

over

ed

RawUnitnormZnormQuantileSigmoidDsigmoidRandom

(a) Comparison of normalization methods on a random half ofGerber et al [GHB04]’s data set

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

x 104

0

50

100

150

200

250

300

350

Gene pairs ordered by similarity

Num

ber

of fu

nctio

nal r

elat

ions

hips

unc

over

ed

RawMA_SigmoidDiff_UnitnormZtrans_UnitnormRaw_UnitnormRandom

(b) Comparison of transformation and normalization methodson a random half of Shapira et al [SSB04]’s data set

Figure 11: Illustration of learning an appropriate normalization method for a gene expression data set ina training-testing setting

column normalization, particularly in Figure 10(a).

4.4 Learning the best normalization

The previous sections showed that while normalization is a useful pre-processing operation for gene

expression data sets, the most appropriate normalization method may vary for different data sets. This

raises the important question of how does one pre-determine the best normalization method for a data

set that is to be analyzed. A possible approach for this is to use data for which the ground truth is

currently known to determine the best normalization method, and use it for the rest of the data. To

test this hypothesis, we applied the ObservedFuncRels evaluation methodology on a subset of the data

21

Page 24: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

set, and tested the consistency of these results with those obtained from the entire data set. We used

the ObservedFuncRels methodology, since the ground truth here consists of pre-specified protein-protein

interactions, and these do not depend on the data set for which normalizations are being evaluated.

Figures 11(a) and 11(b) show the results of evaluating different normalizations for random halves of

Gerber et al [GHB04]’s non-temporal and Shapira et al [SSB04]’s temporal data sets. These plots show

that even using only half of the data set, the relative ordering of the performance of different normalization

methods is almost identical to that obtained from the entire data sets, shown in Figures 4(a) and 9(a)

respectively. Similar results were obtained for other random trials on these and other data sets. This

indicates that a good normalization method can be accurately learned from a subset of the data.

5 Conclusions and Future Work

In this paper, we reported an extensive evaluation of normalization and transformation methods for

gene expression data, using three novel methodologies for evaluating their effectiveness at the task of

discovering functional information from these data sets. Following are the main results that were derived

from this evaluation study:

1. The performance of different normalization (and transformation) schemes may vary significantly

over data sets and types of functional information being predicted.

2. For non-temporal data, most of the commonly used normalizations improve the performance, but

some improve the performance a lot more than others. In particular, the double sigmoid function-

based Dsigmoid method performs significantly better than others for several combinations of data

sets and functional information. This is interesting, as this normalization has not been used by the

bioinformatics community for gene expression analysis.

3. While some transformations (followed by normalizations), such as Ztrans, do improve the prediction

performance for temporal data, some commonly used transformations (followed by any normaliza-

tions) perform worse than raw data. This indicates that great care needs to be taken in the selection

of the right transformation - particularly for temporal data.

4. For temporal data where the expression profile of a gene is constructed by combining several time

series expression measurements, it is advantageous to transform each constituent time series sepa-

rately than to transform the entire collection of series combined.

5. The transformations/normalization combinations that work best for temporal data tend to be

different than those for non-temporal data. For temporal data, it is often useful to first perform

a row transformation and then a column normalization, while for non-temporal data, it is best to

just perform a normalization. More generally, this indicates that temporal and non-temporal data

may need different types of analyses.

6. The best transformations and normalizations for a data set-functional evidence combination can

be learned from a training set for which the ground truth is known. In our study, this hypothesis

is validated by illustrating that the relative performance of normalizations on half of the available

labeled data is similar to their relative performance on the entire data set.

22

Page 25: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

Evaluation of different normalziation methods was done in the context of protein function prediction.

Similar evaluation can be done for other contexts, such as the identification of genes involved in cancer,

in order to obtain a better understanding of the characteristics of data sets that make them suitable for

specific normalization and transformation methods.

Acknowledgement

This work was supported by NSF grants #CRI-0551551, #IIS-0308264, and #ITR-0325949. Access to

computing facilities was provided by the Minnesota Supercomputing Institute.

We thank Chad Myers, Fumiaki Katagiri and Judith Berman for their insightful comments on the

paper. We also thank Golan Yona for making their protein-protein interaction data available to us.

References

[BBA+03] Brigitte Boeckmann, Amos Bairoch, Rolf Apweiler, Marie-Claude Blatter, Anne Estreicher,

Elisabeth Gasteiger, Maria J. Martin, Karine Michoud, Claire O’Donovan, Isabelle Phan,

Sandrine Pilbout, and Michel Schneider. The SWISS-PROT protein knowledgebase and its

supplement TrEMBL in 2003. Nucleic Acids Research, 31(1):365–370, 2003.

[BDSY99] Amir Ben-Dor, Ron Shamir, and Zohar Yakhini. Clustering gene expression patterns. J.

Computational Biology, 6(3-4):281–297, 1999.

[BGL+00] Michael P.S. Brown, William Noble Grundy, David Lin, Nello Cristiniani, Charles Walsh

Sugnet, Terrence S. Furey, Manuel Ares, Jr., and David Haussler. Knowledge based analysis

of microarray gene expression data by using support vector machines. Proc Natl Acad Sci,

97(1):262–267, 2000.

[BHWK04] Rajarajeswari Balasubramaniyan, Eyke Hullermeier, Nils Weskamp, and Jorg Kamper. Clus-

tering of gene expression data using a local shape-based similarity measure. Bioinformatics,

21(7):1069–1077, 2004.

[BIB03] Sven Bergmann, Jan Ihmels, and Naama Barkai. Iterative signature algorithm for the anal-

ysis of large-scale gene expression data. Phys Rev E, 67:031902, 2003.

[BJ04] Z. Bar-Joseph. Analyzing time series gene expression data. Bioinformatics, 20(16):2493–

2503, 2004.

[BJGG+03] Ziv Bar-Joseph, Georg Gerber, David K. Gifford, Tommi S. Jaakkola, and Itamar Simon.

Continuous representations of time series gene expression data. J Comput Biol., 10(3–4):341–

356, 2003.

[Bol01] Benjamin M. Bolstad. Probe level quantile normalization of high density oligonucleotide

array data. Unpublished. Available at http://bmbolstad.com/stuff/qnorm.pdf, 2001.

23

Page 26: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

[Bol04] Benjamin M. Bolstad. Low Level Analysis of High-density Oligonucleotide Array Data: Back-

ground, Normalization and Summarization. PhD thesis, University of California, Berkeley,

2004.

[CCV+03] Cheadle, Chris, Vawter, Marquis P., Freed, William J., Becker, and Kevin G. Analysis of

Microarray Data Using Z Score Transformation. J Mol Diagn, 5(2):73–81, 2003.

[CHZP02] Carlo Colantuoni, George Henry, Scott Zeger, and Jonathan Pevsner. Snomad (standard-

ization and normalization of microarray data): web-accessible gene expression data analysis.

Bioinformatics, 18(11):1540–1541, 2002.

[DHS00] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (2nd Edition).

Wiley-Interscience, 2000.

[DP05] D’haeseleer and Patrik. How does gene expression clustering work? Nat Biotech, 23:1499–

1501, 2005.

[ESBB98] Michael B. Eisen, Paul T. Spellman, Patrick O. Browndagger, and David Botstein. Clus-

ter analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U.S.A.,

95(25):14863–14868, 1998.

[GHB04] Andre P. Gerber, Daniel Herschlag, and Patrick O. Brown. Extensive association of func-

tionally and cytotopically related mrnas with puf family rna-binding proteins in yeast. PLoS

Biology, 2(3):E79, 2004.

[GSK+00] Audrey P. Gasch, Paul T. Spellman, Camilla M. Kao, Orna Carmel-Harel, Michael B. Eisen,

Gisela Storz, David Botstein, and Patrick O. Brown. Genomic Expression Programs in the

Response of Yeast Cells to Environmental Changes. Mol. Biol. Cell, 11(12):4241–4257, 2000.

[H+00] Timothy R. Hughes et al. Functional discovery via a compendium of expression profiles.

Cell, 102(1):109–126, 2000.

[Ham94] James Douglas Hamilton. Time Series Analysis. Princeton University Press, 1994.

[HKSL01] T.R. Hvidsten, J. Komorowski, A.K. Sandvik, and A. Laegreid. Predicting gene function

from gene expressions and ontologies. In Proc. Pacific Symposium on Biocomputing (PSB),

pages 299–310, 2001.

[HP00] Liisa Holm and Jong Park. DaliLite workbench for protein structure comparison. Bioinfor-

matics, 16(6):566–567, 2000.

[IHK+99] Vishwanath R. Iyer, Christine Horak, Laurent Kuras, Peter Kosa, Charles Scafe, David

Botstein, Kevin Struhl, Michael Snyder, and Patrick O. Brown. Genome-wide maps of

DNA-protein interactions using a yeast ORF and intergenic microarray. Nature Genetics,

23:53, 1999.

[JNR05] Anil K. Jain, Karthik Nandakumar, and Arun Ross. Score normalization in multimodal

biometric systems. Pattern Recognition, 38(12):2270–2285, 2005.

24

Page 27: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

[Kar02] George Karypis. CLUTO - a clustering toolkit. Technical Report #02-017, University of

Minnesota, Department of Computer Science, 2002.

[LHM+03] Astrid Laegreid, Torgeir R. Hvidsten, Herman Midelfart, Jan Komorowski, and Arne K.

Sandvik. Predicting gene ontology biological process from temporal gene expression patterns.

Genome Research, 13(5):965–979, 2003.

[LRTS04] Jorge Lepre, John Jeremy Rice, Yuhai Tu, and Gustavo Stolovitzky. Genes@work: an efficient

algorithm for pattern discovery and multivariate feature selection in gene expression data.

Bioinformatics, 20(7):1033–1044, 2004.

[MBH+06] Chad L Myers, Daniel R Barrett, Matthew A Hibbs, Curtis Huttenhower, and Olga G

Troyanskaya. Finding function: evaluation methods for functional genomic data. BMC

Genomics, 7:187, 2006.

[MDJ+02] Alvaro Mateos, Joaquin Dopazo, Ronald Jansen, Yuhai Tu, Mark Gerstein, and Gustavo

Stolovitzky. Systematic learning of gene functional classes from dna array expression data

by using multilayer perceptrons. Genome Research, 12(11):1703–1715, 2002.

[MPT+99] Edward M. Marcotte, Matteo Pellegrini, Michael J. Thompson, Todd O. Yeates, and David

Eisenberg. A combined algorithm for genome-wide prediction of protein function. Nature,

402(6757):83–86, 1999.

[NAWC02] Danh V. Nguyen, A. Bulak Arpat, Naisyin Wang, and Raymond J. Carroll. DNA microarray

experiments: biological and technological aspects. Biometrics, 58(4):701–717, 2002.

[PKS06] Gaurav Pandey, Vipin Kumar, and Michael Steinbach. Computational approaches for protein

function prediction: A survey. Technical Report 06-028, Department of Computer Science

and University of Minnesota, October 2006.

[PTS+03] Christopher Potter, Pang-Ning Tan, Michael Steinbach, Steven Klooster, Vipin Kumar,

Ranga Myeni, and Vanessa Genovese. Major disturbance events in terrestrial ecosystems

detected using global satellite data sets. Global Change Biology, 9(7):1005–1021, 2003.

[PYK+03] Taesung Park, Sung-Gon Yi, Sung-Hyun Kang, SeungYeoun Lee, Yong-Sung Lee, and

Richard Simon. Evaluation of normalization methods for microarray data. BMC Bioin-

formatics, 4:33, 2003.

[Qua02] John Quackenbush. Microarray data normalization and transformation. Nature Genetics,

32:496–501, 2002.

[R+04] Andreas Ruepp et al. The FunCat, a functional annotation scheme for systematic classifica-

tion of proteins from whole genomes. Nucleic Acids Research, 32(18):5539–5545, 2004.

[S+01] Gavin Sherlock et al. The Stanford Microarray Database. Nucleic Acids Research, 29(1):152–

155, 2001.

[SBB04] Alok J. Saldanha, Matthew J. Brauer, and David Botstein. Nutritional Homeostasis in Batch

and Steady-State Culture of Yeast. Mol. Biol. Cell, 15(9):4089–4104, 2004.

25

Page 28: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

[SBM96] Amit Singhal, Chris Buckley, and Mandar Mitra. Pivoted document length normalization.

In Proc. 19th ACM SIGIR Conference, pages 21–29, 1996.

[Slo02] Donna K. Slonim. From patterns to pathways: gene expression data analysis comes of age.

Nature Genetics, 32(Suppl):502–508, 2002.

[SMP+03] Michael Strong, Parag Mallick, Matteo Pellegrini, Michael J. Thompson, and David Eisen-

berg. Inference of protein function and protein linkages in Mycobacterium tuberculosis based

on prokaryotic genome organization: a combined computational approach. Genome Biology,

4(9):R59, 2003.

[SS03] Gordon K. Smyth and Terry Speed. Normalization of cDNA microarray data. Methods,

31(4):265–273, 2003.

[SSB04] Michael Shapira, Eran Segal, and David Botstein. Disruption of Yeast Forkhead-associated

Cell Cycle Transcription by Oxidative Stress. Mol. Biol. Cell, 15(12):5659–5669, 2004.

[STG+01] Eran Segal, Benjamin Taskar, Audrey Gasch, Nir Friedman, and Daphne Koller. Rich

probabilistic models for gene expression. In ISMB (Supplement of Bioinformatics), pages

243–252, 2001.

[SYK03] Eran Segal, R. Yelensky, and Daphne Koller. Genome-wide discovery of transcriptional

modules from dna sequence and gene expression. In ISMB (Supplement of Bioinformatics),

pages 273–282, 2003.

[TCS+01] Olga G. Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert

Tibshirani, David Botstein, and Russ B. Altman. Missing value estimation methods for dna

microarrays. Bioinformatics, 17(6):520–525, 2001.

[TPS+99] Tamayo, Pablo, Slonim, Donna, Mesirov, Jill, Zhu, Qing, Kitareewan, Sutisak, Dmitrovsky,

Ethan, Lander, Eric S., Golub, and Todd R. Interpreting patterns of gene expression with

self-organizing maps: Methods and application to hematopoietic differentiation. PNAS,

96(6):2907–2912, 1999.

[WJJ+02] Christopher Workman, Lars Jensen, Hanne Jarmer, Randy Berka, Laurent Gautier, Henrik

Nielser, Hans-Henrik Saxild, Claus Nielsen, Soren Brunak, and Steen Knudsen. A new non-

linear normalization method for reducing variability in dna microarray experiments. Genome

Biology, 3(9):research0048.1–research0048.16, 2002.

[WXM+05] Wei Wu, Eric P. Xing, Connie Myers, I. Saira Mian, and Mina J. Bissell. Evaluation of

normalization methods for cdna microarray data by k-nn classification. BMC Bioinformatics,

6:191, 2005.

[YDL+02] Yee Hwa Yang, Sandrine Dudoit, Percy Luu, David M. Lin, Vivian Peng, John Ngai, and

Terence P. Speed. Normalization for cDNA microarray data: a robust composite method

addressing single and multiple slide systematic variation. Nucleic Acids Research, 30(4):e15,

2002.

26

Page 29: Technical Report - vk.cs.umn.eduvk.cs.umn.edu/gaurav/normalization.pdf · Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building

[YDRL06] Golan Yona, William Dirks, Shafquat Rahman, and David M. Lin. Effective similarity

measures for expression profiles. Bioinformatics, 22(13):1616–1622, 2006.

[ZSK+05] Pusheng Zhang, Michael Steinbach, Vipin Kumar, Shashi Shekhar, Pang-Ning Tan, Steve

Klooster, , and Chris Potter. Discovery of patterns of Earth science data using data mining.

In Jozef Zurada and Medo Kantardzic, editors, Next Generation of Data Mining Applications.

Wiley-IEEE Press, 2005.

[ZSV+00] G. Zhu, P. T. Spellman, T. Volpe, P. O. Brown, D. Botstein, T. N. Davis, and B. Futcher.

Two yeast forkhead genes regulate the cell cycle and pseudohyphal growth. Nature, 406:90–

94, 2000.

27