JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 15, Number 3, 2008
© Mary Ann Liebert, Inc.
Pp. 269–277
DOI: 10.1089/cmb.2008.0002
Pathway Analysis of Microarray Data via Regression
A.J. ADEWALE,1,* I. DINU,2 J.D. POTTER,3 Q. LIU,2 and Y. YASUI2
ABSTRACT
Pathway analysis of microarray data evaluates gene expression profiles of a priori defined
biological pathways in association with a phenotype of interest. We propose a unified
pathway-analysis method that can be used for diverse phenotypes including binary, multi-
class, continuous, count, rate, and censored survival phenotypes. The proposed method also
allows covariate adjustments and correlation in the phenotype variable that is encountered
in longitudinal, cluster-sampled, and paired designs. These are accomplished by combining
the regression-based test statistic for each individual gene in a pathway of interest into a
pathway-level test statistic. Applications of the proposed method are illustrated with two
real pathway-analysis examples: one evaluating relapse-associated gene expression involving
a matched-pair binary phenotype in children with acute lymphoblastic leukemia; and the
other investigating gene expression in breast cancer tissues in relation to patients’ survival
(a censored survival phenotype). Implementations for various phenotypes are available in
R. Additionally, an Excel Add-in for a user-friendly interface is currently being developed.
Key words: gene clusters, gene expression, statistics.
1. INTRODUCTION
ANALYSIS OF MICROARRAY DATA was focused initially on identifying individual genes that are
differentially expressed between two classes of a phenotype. The focus has since expanded to include
other kinds of phenotypes, such as censored survival and continuous phenotypes, and also to identify
biological pathways (i.e., sets of genes) that are differentially expressed according to a phenotype. The
goal of this paper is to propose and illustrate a unified general analysis method of microarray data for
identifying pathways (or gene sets) whose expressions are associated with a phenotype of any kind. To
accommodate various analysis settings, our method also allows correlation among samples (e.g., paired,
clustered, or longitudinal data) and adjustments for covariates that may be associated with the phenotype
(e.g., age, sex, race), in addition to handling any type of phenotype, censored or uncensored.
1 Merck & Co., Inc., 351 N. Sumneytown Pike, UGIC-36 North Wales, Pennsylvania 19454.
2 Department of Public Health Sciences, University of Alberta, Edmonton, Alberta, Canada.
3 Cancer Prevention Program, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, Washington.
* This research was conducted when Dr. Adewale was a Postdoctoral Fellow at the University of Alberta, Edmonton, Canada.
Many authors (including Mootha et al., 2003; Goeman et al., 2004, 2005; Mansmann and Meister,
2005; and Dinu et al., 2007) have proposed methods for pathway analysis for microarray data for
different types of phenotype. In particular, Goeman et al. (2004) proposed a score test that is based on
random-effects modeling of parameters corresponding to the coefficients of the individual genes in the
pathway. Their method addressed binary and continuous phenotypes as follows. Denoting the biologic
outcome of interest (phenotype) by $Y$ and the pathway expressions by $(x_1, \ldots, x_m)$, they proposed the test
statistic
$$Q = \frac{(Y - \mu)^T R (Y - \mu)}{\sigma^2}$$
where $R = (1/m) X X^T$, $X = (x_1, x_2, \ldots, x_m)$ is a matrix with columns of gene expression vectors, $Y$ is
the vector of outcomes, $\mu = E_{\text{null}}(Y)$ is the mean outcome under the null hypothesis of no association, and
$\sigma^2 = \mathrm{Var}_{\text{null}}(Y)$ is the variance of the outcome under the null hypothesis of no association. Later, Goeman
et al. (2005) extended the method to the censored-survival phenotype with use of a modeling framework
incorporating random effects and Cox proportional hazards. In Dinu et al. (2007), we proposed a test
called SAM-GS for assessing differential expression of pathways between two classes of a phenotype. The
SAM-GS approach addressed the issue of the low-variability characteristics of microarray data, using an
adjustment introduced in a popular individual-gene analysis method, significance analysis of microarray
(SAM) (Tusher et al., 2001). The SAMGS statistic for pathway analysis with a binary phenotype is given by:
$$\mathrm{SAMGS} = \sum_{p=1}^{m} d_p^2$$
where $m$ is the number of genes in the pathway of interest, $d_p = \frac{\bar{x}_p(1) - \bar{x}_p(2)}{s_p + s_0}$, $\bar{x}_p(k)$ is the average
expression for the pth gene in the pathway for the kth class of a binary phenotype ($k = 1, 2$), and $s_p$ is
the pooled standard deviation. The constant $s_0$ was added to adjust for the small variability characteristics
of microarray data (Tusher et al., 2001).
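To make the SAMGS definition concrete, the following Python sketch computes $d_p$ for each gene and sums the squares. It is an illustration only: the function names and toy numbers are hypothetical, $s_0$ is passed in as a user-supplied constant rather than estimated from the data, and $s_p$ is assumed to follow the gene-specific scatter of Tusher et al. (2001), which may differ in detail from the authors' implementation.

```python
from math import sqrt

def sam_d(group1, group2, s0):
    """SAM-style d statistic for one gene: mean difference over (s_p + s0)."""
    n1, n2 = len(group1), len(group2)
    mean1 = sum(group1) / n1
    mean2 = sum(group2) / n2
    # Within-class sum of squares, pooled over the two classes.
    ss = sum((v - mean1) ** 2 for v in group1) + sum((v - mean2) ** 2 for v in group2)
    # Assumption: s_p is the SAM gene-specific scatter of Tusher et al. (2001).
    s_p = sqrt((1.0 / n1 + 1.0 / n2) * ss / (n1 + n2 - 2))
    return (mean1 - mean2) / (s_p + s0)

def samgs(class1, class2, s0):
    """Sum of squared d statistics over the genes of one pathway.

    class1[p] and class2[p] hold the expressions of gene p in each class.
    """
    return sum(sam_d(g1, g2, s0) ** 2 for g1, g2 in zip(class1, class2))

# Toy pathway with two genes and three samples per class (made-up numbers).
class1 = [[1.0, 2.0, 3.0], [5.0, 5.5, 6.0]]
class2 = [[2.0, 3.0, 4.0], [5.0, 5.5, 6.0]]
stat = samgs(class1, class2, s0=0.1)
print(stat)
```

In this toy example only the first gene differs between classes, so it alone contributes to the statistic.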
In this paper, we provide a particular view of SAM-GS and Goeman et al.’s global test that permits
pathway analysis of diverse phenotypes, including multi-class, continuous, and censored-survival pheno-
types, while allowing covariate adjustments and correlated data. The generality of the proposed method
is achieved by use of regression methods. The proposed approach is a “self-contained hypothesis testing”
method (Goeman and Bühlmann, 2007), which evaluates the association of a pathway of interest with a
phenotype using the expressions of genes in the pathway exclusively: the expressions of genes outside of
the pathway of interest do not influence the testing of the association.
2. PROPOSED METHOD
2.1. Overview
The SAM-GS statistic of Dinu et al. (2007), $\mathrm{SAMGS} = \sum_{p=1}^{m} d_p^2$, is a sum of the squared t-like test statistics
for the individual genes in the pathway:
$$d_p = \frac{\bar{x}_p(1) - \bar{x}_p(2)}{s_p + s_0}.$$
The idea of summing univariate test statistics as a basis of testing a multivariate hypothesis was in line
with the work of Dempster (1958, 1960) on the two-sample multivariate mean comparison problem with
small samples where the traditional Hotelling’s T -square test fails.
Goeman et al. (2004) also described the global test’s Q statistic as an average of the m test statistics
calculated as though each of the m individual genes constitutes a pathway by itself. That is,
$$Q = \frac{1}{m} \sum_{p=1}^{m} \frac{1}{\sigma^2} \left[ x_p^T (Y - \mu) \right]^2$$
where $Q_p = \frac{1}{\sigma^2} [x_p^T (Y - \mu)]^2$ is the test statistic for a pathway consisting of just the pth gene.
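The equivalence between the quadratic form and the average of single-gene statistics can be checked numerically. The Python sketch below is a minimal illustration with made-up data; it assumes a common scalar null mean for all samples, and the function names are hypothetical.

```python
def global_q(X, Y, mu, sigma2):
    """Quadratic form (Y - mu)^T R (Y - mu) / sigma^2 with R = X X^T / m.

    X is a list of n rows (samples), each holding m gene expressions.
    """
    n, m = len(X), len(X[0])
    resid = [Y[i] - mu for i in range(n)]
    total = 0.0
    for i in range(n):
        for k in range(n):
            # Entry (i, k) of R = (1/m) X X^T.
            r_ik = sum(X[i][p] * X[k][p] for p in range(m)) / m
            total += resid[i] * r_ik * resid[k]
    return total / sigma2

def per_gene_q(X, Y, mu, sigma2):
    """Single-gene statistics Q_p = [x_p^T (Y - mu)]^2 / sigma^2."""
    n, m = len(X), len(X[0])
    return [
        sum(X[i][p] * (Y[i] - mu) for i in range(n)) ** 2 / sigma2
        for p in range(m)
    ]

# Four samples, three genes, binary outcome (made-up numbers).
X = [[0.5, 1.2, -0.3], [1.5, 0.1, 0.8], [-0.7, 0.4, 1.1], [0.2, -0.9, 0.6]]
Y = [1, 0, 1, 0]
mu, sigma2 = 0.5, 0.25  # null mean and variance of a balanced binary outcome

q = global_q(X, Y, mu, sigma2)
parts = per_gene_q(X, Y, mu, sigma2)
print(q, sum(parts) / len(parts))  # the two values agree up to rounding
```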
It is this particular view of combining component-wise test statistics for testing a multivariate hypothesis
that motivated our approach to pathway analysis for various phenotypes. The proposed pathway statistic
is defined as follows:
$$W = \sum_{p=1}^{m} \left( \frac{r_p}{s_p} \right)^2$$
where $r_p$ is any appropriate measure of association between the phenotype $Y$ and the expression $x_p$ for the
pth gene in the pathway, and $s_p$ is the standard error of $r_p$. We propose taking $r_p$ as the regression coefficient
from modeling the pth gene as a predictor of the phenotype Y in an appropriate regression framework.
The form of the test statistic is a sum of squares of the Wald statistics for individual genes constituting
the pathway.
Thus, the unified pathway analysis method we propose derives rp and sp from the framework of
regression methods for the outcome variable Y of various types. We describe below the proposed test
statistic by grouping various analysis settings into three categories: (1) uncensored independent phenotype;
(2) uncensored correlated phenotype; and (3) censored phenotype.
2.2. Uncensored independent phenotype
Consider data of the form $\{(x_{i1}, \ldots, x_{im}), z_i, Y_i\}_{i=1}^{n}$ where, for the ith individual, $(x_{i1}, \ldots, x_{im})$ are the
gene expressions in the pathway of interest, $z_i$ is the covariate vector for which adjustment is desired,
and $Y_i$ is the phenotype of interest. In order to assess the association of the pathway with an uncensored
independent phenotype, we adopt a regression framework that accommodates diverse uncensored outcome
measures—generalized linear models (GLMs). GLMs (Nelder and Wedderburn, 1972; McCullagh and
Nelder, 1989) are a family of regression models for diverse outcome types using an exponential family of
distributions: distributions in this family include Gaussian, gamma, Poisson, binomial, and inverse Gaussian
distributions. Thus, the GLM framework provides a unified approach to modeling independent binary,
count, rate, and non-Gaussian continuous outcomes as well as classical Gaussian continuous outcomes.
The linear predictor is a component of a GLM through which the influences of the predictors are specified.
Our proposed approach is to fit each gene in the pathway, one at a time, along with covariates of interest
to the analysis. That is, we fit a GLM with the linear predictor $\eta_i = \beta_0 + \beta_1^{(p)} x_{ip} + z_i^T \alpha$. In computing
the test statistic, $W = \sum_{p=1}^{m} r_p^2 / s_p^2$, the measure of association, $r_p$, between the phenotype and the expression
$x_p$ for the pth gene in the pathway is the estimated regression coefficient $\hat{\beta}_1^{(p)}$ of the above GLM and $s_p$ is its
corresponding standard error. Then, the statistic $W$ is a sum of $m$ squared Wald statistics from $m$
GLM fits, where each of the $m$ genes in the pathway of interest provides a Wald statistic that is derived with
an adjustment for the covariates.
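As a minimal sketch of this construction, consider the simplest special case: a Gaussian GLM with identity link and no covariates, which is ordinary simple linear regression of the phenotype on one gene at a time. The Python code below fits each gene separately, takes the slope and its standard error as $r_p$ and $s_p$, and sums the squared ratios. The function names and data are hypothetical, and covariate adjustment and other GLM families are deliberately left out.

```python
from math import sqrt

def slope_and_se(x, y):
    """Least-squares slope of y on x and its standard error."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    # Residual variance uses n - 2 degrees of freedom (intercept and slope).
    se = sqrt(rss / (n - 2) / sxx)
    return slope, se

def pathway_w(genes, y):
    """W = sum over genes of (r_p / s_p)^2, one regression fit per gene."""
    total = 0.0
    for x in genes:
        r_p, s_p = slope_and_se(x, y)
        total += (r_p / s_p) ** 2
    return total

# Two genes measured on four samples, continuous phenotype (made-up numbers).
genes = [[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 3.0, 2.0]]
y = [0.0, 1.0, 1.0, 2.0]
print(pathway_w(genes, y))
```

A real analysis would replace `slope_and_se` with the coefficient and standard error reported by whatever regression routine suits the phenotype.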
2.3. Uncensored correlated phenotype
Suppose we have clustered data, $\{x_{ij} = (x_{ij1}, \ldots, x_{ijm}), z_{ij}, Y_{ij}\}_{i=1}^{n}$, where $x_{ij}$ is the vector of
gene expressions for the jth observation from cluster $i$, $z_{ij}$ is the vector of covariates, and $Y_{ij}$ denotes the
corresponding phenotype. The cluster might correspond to an individual where the observations within a
cluster are measurements taken longitudinally over time or under varying experimental conditions.
The cluster might also consist of related subjects, for example, members of a family, clinic, or community.
Under this setting, the outcome data are no longer independent because observations within a cluster tend
to be more alike (or unalike under some circumstances) compared to observations from different clusters.
Thus, an appropriate model should account for the lack of statistical independence due to clustering.
Generalized linear mixed models (GLMMs) are a natural extension of GLMs that accommodate clus-
tering via use of random effects in the linear predictor. In the simplest form, the linear predictor includes
just a random intercept: $\eta_{ij} = \beta_0 + \beta_1^{(p)} x_{ijp} + z_{ij}^T \alpha + u_i$, where $u_i$ is the random intercept, which is
usually assumed to follow a mean-zero Gaussian distribution with an unknown variance. As in the GLM
framework, the measure of association of the pth gene with the phenotype is the estimated regression
coefficient, $r_p = \hat{\beta}_1^{(p)}$, and $s_p$ is its standard error.
Alternative frameworks for accommodating correlated phenotypes include the generalized estimating
equations (GEE). It is a quasi-likelihood approach which requires the specification of mean response (i.e.,
the specification of the linear predictor as in GLMs and GLMMs), variance function, and a pairwise corre-
lation pattern among observations from the same cluster without fully specifying a particular multivariate
distribution.
Binary matched-pair data are a special case of clustered data where the phenotype is binary and each
cluster is a pair of observations. The conditional logistic regression model can be used for such data (see
an example in Section 4).
2.4. Censored-survival phenotype
Consider censored survival data, $\{(x_{i1}, \ldots, x_{im}), z_i, Y_i = (T_i, c_i)\}$, where $c_i = 1$ and $0$ denote an
occurrence of the event of interest and censoring, respectively, and $T_i$ is the survival time if $c_i = 1$
or the censoring time if $c_i = 0$. Regression models for censored survival data provide the necessary
elements of $W$. For example, Cox proportional hazards models specify the hazard of the event at time $t$
by $h(t \mid x_p, z) = h_0(t) \exp(\eta(x_p, z))$, where the function $h_0(t)$ is an unspecified baseline hazard function
and the function $\eta(x_{ip}, z_i) = \beta^{(p)} x_{ip} + z_i^T \alpha$ captures the influence of the pth gene and covariates on the
hazard function at time $t$ (Cox, 1972). The association measure for $W$ can be taken as $r_p = \hat{\beta}^{(p)}$, the
parameter estimate of the log-hazard ratio associated with expression of the pth gene and $s_p$ being its
corresponding standard error.
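For illustration, the single-gene Cox fit without covariates can be sketched in a few lines of Python by maximizing the partial likelihood with Newton-Raphson. This is a simplified sketch under stated assumptions, not the authors' implementation: it handles one predictor only, forms risk sets as all subjects with time at least the event time (a Breslow-type treatment of ties), uses no step-size control, and all names and data are hypothetical.

```python
from math import exp, sqrt

def cox_score_info(beta, times, events, x):
    """Score and observed information of the Cox partial log-likelihood."""
    score, info = 0.0, 0.0
    for i in range(len(times)):
        if events[i] != 1:
            continue  # censored subjects contribute only through risk sets
        s0 = s1 = s2 = 0.0
        for j in range(len(times)):
            if times[j] >= times[i]:  # subject j is still at risk
                w = exp(beta * x[j])
                s0 += w
                s1 += w * x[j]
                s2 += w * x[j] * x[j]
        mean = s1 / s0
        score += x[i] - mean
        info += s2 / s0 - mean * mean
    return score, info

def fit_cox_one_gene(times, events, x, n_iter=25):
    """Return (r_p, s_p): log-hazard ratio estimate and its standard error."""
    beta = 0.0
    for _ in range(n_iter):
        score, info = cox_score_info(beta, times, events, x)
        beta += score / info  # Newton-Raphson update
    _, info = cox_score_info(beta, times, events, x)
    return beta, 1.0 / sqrt(info)

# Eight subjects, two of them censored (made-up numbers).
times = [3.0, 5.0, 7.0, 8.0, 9.0, 11.0, 12.0, 15.0]
events = [1, 1, 1, 1, 1, 1, 0, 0]
x = [0.5, -0.8, 1.0, -0.3, 0.7, -1.2, 0.1, -0.4]

r_p, s_p = fit_cox_one_gene(times, events, x)
print(r_p, s_p, (r_p / s_p) ** 2)  # last value is this gene's term in W
```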
We note that other suitable models for censored data can be used. In situations where the proportional
hazards assumption is untenable, for example, a piecewise exponential model which assumes proportional
hazards in a series of consecutive time intervals can be used. For a correlated censored-survival phenotype,
the clustering introduces dependency in the data and frailty models that incorporate random effects into
the linear component of a proportional hazards model or an accelerated failure time model can be applied
(Aalen, 1998; Hougaard, 1995).
2.5. Significance testing
An approximation of the null distribution of the proposed test statistic W , by a scaled chi-squared
distribution, may be plausible, but this option was not pursued here. In fact, distributional approximations
in the context of pathway analysis of microarray data may not be satisfactory: see, for example, Mansmann
and Meister (2005) and Liu et al. (2007). Rather, statistical significance for the association between gene expression
in a pathway and a phenotype is assessed by a permutation test. The permutation adopts the approach
of Braun and Feng (2001), which consists of permuting the indices of the (phenotype, covariates) set,
$\{Y_i, z_i\}_{i=1}^{n}$, while holding all regression parameters fixed (at the estimates obtained from the original
unpermuted data) in each permutation, except the parameter corresponding to the gene effect. We note
that fixing all other parameters constant in each permutation is required to guarantee the invariance of
the test statistic under the null hypothesis of no gene effect (i.e., no association between the pathway and
phenotype after adjusting for covariate effects).
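The permutation idea can be sketched in Python for the covariate-free case, where permuting the phenotype labels alone is enough. The sketch does not implement the refinement of holding nuisance parameters at their original estimates, which matters only when covariates are present. The statistic is the simple-linear-regression version of $W$, the p-value uses the common add-one convention, and all names, defaults, and data are hypothetical.

```python
import random

def squared_wald(x, y):
    """Squared Wald statistic for the slope of y on x (simple linear regression)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    slope = sxy / sxx
    rss = sum((b - ybar - slope * (a - xbar)) ** 2 for a, b in zip(x, y))
    return slope ** 2 / (rss / (n - 2) / sxx)

def pathway_w(genes, y):
    """W for a pathway: sum of squared Wald statistics over its genes."""
    return sum(squared_wald(x, y) for x in genes)

def permutation_p_value(genes, y, n_perm=199, seed=0):
    """Share of permuted W values at least as large as the observed W."""
    rng = random.Random(seed)
    observed = pathway_w(genes, y)
    exceed = 0
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break the link between phenotype and expression
        if pathway_w(genes, y_perm) >= observed:
            exceed += 1
    # Add-one convention keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)

# Ten samples, two genes; the first gene tracks the phenotype closely.
genes = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0],
]
y = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 8.1, 8.8, 10.2]
p = permutation_p_value(genes, y, n_perm=199, seed=1)
print(p)
```

Fixing the seed makes the result reproducible; in practice the number of permutations would be chosen to give the resolution the analysis requires.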
When the data are clustered, one must distinguish two cases: the case where the association of the path-
way with the phenotype is assessed within clusters (e.g., paired design, cross-over design, or longitudinal
design with time defining the phenotype of interest) and the case where the association is assessed across
clusters (e.g., repeated measures of subjects with exposure status/level as the phenotype). Within-cluster
permutation is suitable for assessing association within clusters, while block-permutation with clusters taken
as “blocks” is appropriate when clusters are independent and each cluster is indexed with a phenotype
label (Good, 1994).
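The two permutation schemes differ only in which labels are allowed to move. The Python sketch below illustrates both with hypothetical helper names: the first shuffles phenotype labels among observations of the same cluster, and the second shuffles one label per cluster across clusters. It is a sketch of the label-shuffling step only and assumes the test statistic is recomputed separately after each shuffle.

```python
import random

def permute_within_clusters(labels, clusters, rng):
    """Shuffle labels only among observations that share a cluster id."""
    by_cluster = {}
    for idx, c in enumerate(clusters):
        by_cluster.setdefault(c, []).append(idx)
    permuted = list(labels)
    for idxs in by_cluster.values():
        vals = [labels[i] for i in idxs]
        rng.shuffle(vals)
        for i, v in zip(idxs, vals):
            permuted[i] = v
    return permuted

def permute_cluster_blocks(cluster_labels, rng):
    """Shuffle cluster-level labels across clusters, keeping each cluster whole.

    cluster_labels maps a cluster id to the single label shared by its members.
    """
    ids = list(cluster_labels)
    vals = [cluster_labels[c] for c in ids]
    rng.shuffle(vals)
    return dict(zip(ids, vals))

rng = random.Random(0)

# Paired design: each subject has one diagnosis (0) and one relapse (1) sample.
labels = [0, 1, 0, 1, 0, 1]
clusters = ["s1", "s1", "s2", "s2", "s3", "s3"]
within = permute_within_clusters(labels, clusters, rng)

# Repeated measures: exposure status is constant within each subject.
exposure = {"s1": 1, "s2": 0, "s3": 1, "s4": 0}
blocked = permute_cluster_blocks(exposure, rng)
print(within, blocked)
```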
The issue of multiple comparisons is a problem that must be addressed when multiple pathways are to be
assessed for their associations with a phenotype. However, we consider it separately from our development
of a pathway-analysis method because it is not necessary when a single pathway is of interest in pathway
analysis (e.g., a case where the pathway analysis is confirmatory in nature). When the interest is indeed in
multiple pathways, which is often the case when the pathway analysis is exploratory with many pathways
of potential interest, the q-value approach (Storey, 2002; Storey and Tibshirani, 2003; Storey et al., 2004)
or other methods that control for false-discovery rates (FDRs) in multiple comparisons can be employed.
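As one example of the "other methods" mentioned above, the Benjamini and Hochberg (1995) step-up adjustment can be sketched in a few lines of Python. Note that this is not the q-value estimator of Storey (2002), which additionally estimates the proportion of true null hypotheses. The function name and p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Pathway-level permutation p-values (made-up numbers).
pathway_p = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(pathway_p)
significant = [i for i, a in enumerate(adjusted) if a <= 0.05]
print(adjusted, significant)
```

Pathways whose adjusted value falls below the chosen FDR level are declared significant.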
3. APPLICATIONS
3.1. Correlated binary phenotype: original versus relapsed childhood acute lymphoblastic
leukemia data
Bhojwani et al. (2006) reported on a dataset consisting of 35 children with childhood acute lymphoblastic
leukemia (ALL) who relapsed after diagnosis and therapy. The gene expression profiles of 35 matched
diagnosis/relapse pairs were analyzed. The objective of the analysis of Bhojwani and colleagues was to
identify biological pathways associated with relapse in childhood ALL. The sample consisted of 23 children
who relapsed early (less than 36 months from diagnosis) and 12 children who relapsed late (after 36 months
from diagnosis and therapy). Bhojwani et al. (2006) addressed their objective by conducting a paired t-test
on individual genes. They then adjusted for multiple testing using Benjamini and Hochberg’s false discovery
rate (FDR) and Hochberg’s Bonferroni p-value adjustment (Hochberg, 1988; Benjamini and Hochberg,
1995). Genes meeting a specified FDR criterion were selected for manual classification into biological
functional groups (pathways). They concluded that their analyses revealed significant differences between
diagnosis and early relapse in the expression of genes involved in cell-cycle regulation, DNA repair,
and apoptosis. Although the conclusion of Bhojwani et al. (2006) may be correct, pathway analysis via
regression offers a more systematic assessment of differential pathway expression between the diagnosis
and relapse samples. We re-analyzed their data with the aim of directly identifying biologic pathways that
are associated with relapse using the proposed method.
The Affymetrix gene identifiers were mapped into a gene ontology database—Genebank. Of the 22,283
probes, 11,401 were successfully mapped to 839 distinct pathways using the C1 and C2 pathway databases
on “Molecular Signature Database,” provided by the Broad Institute (www.broad.mit.edu/gsea). In Subra-
manian et al. (2005), Catalog C1 included 24 sets, one for each of the 24 human chromosomes, and 295 sets
corresponding to cytogenetic bands; and Catalog C2 consisted of 472 sets containing gene sets reported
in manually curated databases and 50 sets containing genes reported in various experimental papers.
As the data were paired diagnosis and relapse samples, we employed conditional logistic modeling.
Conditional logistic modeling of binary matched pairs entails fitting a no-intercept unconditional logistic
model to discordant pairs using the difference of matched covariates as predictors (Breslow and Powers, 1978;
Chamberlain, 1980). The artificial response for use in the unconditional logistic model is $y_i^* = 1$ when
$(y_{i1} = -1, y_{i2} = 1)$ and $y_i^* = 0$ when $(y_{i1} = 1, y_{i2} = -1)$, and the model takes the following form:
$$\mathrm{logit}[p(y_i^* = 1 \mid x_{pi}^* = x_{pi2} - x_{pi1})] = \beta^{(p)} x_{pi}^*.$$
The association measures of each gene in the pathway of
interest that are to be entered into the $W$ statistic are $r_p = \hat{\beta}^{(p)}$ and $s_p = \mathrm{se}(\hat{\beta}^{(p)})$.
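The no-intercept logistic fit on within-pair differences can be sketched in Python with a one-dimensional Newton-Raphson iteration. This is an illustration under stated assumptions rather than the authors' code: the names and data are hypothetical, every pair is taken to be discordant with artificial response 1 (as in a diagnosis/relapse design), and the iteration assumes the differences have mixed signs so that the estimate is finite.

```python
from math import exp, sqrt

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + exp(-z))

def score_info(beta, diffs, y_star):
    """Score and information of the no-intercept logistic log-likelihood."""
    score = sum(d * (y - expit(beta * d)) for d, y in zip(diffs, y_star))
    info = sum(d * d * expit(beta * d) * (1.0 - expit(beta * d)) for d in diffs)
    return score, info

def fit_paired_logistic(diffs, y_star, n_iter=25):
    """Return (r_p, s_p) for one gene from within-pair expression differences."""
    beta = 0.0
    for _ in range(n_iter):
        score, info = score_info(beta, diffs, y_star)
        beta += score / info  # Newton-Raphson update
    _, info = score_info(beta, diffs, y_star)
    return beta, 1.0 / sqrt(info)

# Relapse minus diagnosis expression of one gene in six pairs (made-up numbers).
diffs = [0.8, -0.2, 1.1, 0.4, -0.5, 0.9]
y_star = [1, 1, 1, 1, 1, 1]  # every pair runs from diagnosis to relapse

r_p, s_p = fit_paired_logistic(diffs, y_star)
print(r_p, s_p, (r_p / s_p) ** 2)  # last value is this gene's term in W
```

If all differences share one sign, the likelihood has no finite maximum and this plain iteration should not be used as written.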
One hundred sixty-eight of the 839 pathways have q-values less than 0.01, where q-values were obtained
by the method of Storey (Storey, 2002; Storey et al., 2004; Storey and Tibshirani, 2003). Bhojwani et al.
(2006) conducted a stratified analysis for early versus late relapse cases. We found no pathway that was
differentially expressed with q-value less than 0.01 between diagnosis and relapse using the late-relapsed
cases. On the other hand, 45 pathways had q-values less than 0.01 in the early-relapse cases. These
45 pathways are listed in Table 1. Most of the 45 pathways contain a mixture of down- and up-regulated
genes. Note that, in agreement with Bhojwani et al., we found that, in the early-relapse cases, pathways
associated with cell-cycle regulation, DNA repair, and apoptosis are differentially expressed. Moreover,
the pathway analysis via regression provided many other pathways that are differentially expressed.
3.2. Censored survival phenotype: breast cancer survival data
The data for this analysis came from 295 women with primary invasive breast cancer reported in Van de
Vijver et al. (2002). Using the gene-expression profiles of the previously determined 70 marker genes, Van
de Vijver et al. (2002) classified the 295 tumors into good-prognosis and poor-prognosis categories. The
predictive ability of this categorization on time to distant metastases was examined. Here we mapped all
4919 genes into pathways using Genebank’s gene ontology database. We identified 728 pathways (numbers
of genes ranged from 2 to 268 in a pathway). Our objective in this pathway analysis is to identify pathways
that are associated with time to death (overall survival). We fit a proportional hazards model
$$h(t \mid x_{pi}) = h_0(t) \exp(\beta^{(p)} x_{pi}), \quad i = 1, \ldots, n,$$
TABLE 1. LIST OF SIGNIFICANT PATHWAYS FROM CONDITIONAL-LOGISTIC REGRESSION-BASED PATHWAY ANALYSIS
Pathway name  Set size  No. of genes with rp/sp < 0  No. of genes with rp/sp > 0  p-value  q-value
chr1p22 28 13 15 0.001 0.006
chr2q12 8 3 5 0.001 0.006
aktPathway 14 11 3 0.001 0.006
cacamPathway 15 8 7 0.001 0.006
cdc25Pathway 9 3 6 0.001 0.006
gcrPathway 16 11 5 0.001 0.006
mrpPathway 4 2 2 0.001 0.006
no1Pathway 35 20 15 0.001 0.006
pepiPathway 7 2 5 0.001 0.006
plk3Pathway 8 4 4 0.001 0.006
rbPathway 12 4 8 0.001 0.006
relaPathway 12 7 5 0.001 0.006
MYC_MUT 4 1 3 0.001 0.006
EMT_DOWN 47 29 18 0.001 0.006
atmPathway 17 9 8 0.002 0.007
cd40Pathway 13 10 3 0.002 0.007
eea1Pathway 10 3 7 0.002 0.007
freePathway 10 4 6 0.002 0.007
g2Pathway 20 11 9 0.002 0.007
il1rPathway 24 14 10 0.002 0.007
MAP00670_One_carbon_pool_by_folate 14 4 10 0.002 0.007
MAP00680_Methane_metabolism 8 3 5 0.002 0.007
nfkbPathway 16 9 7 0.002 0.007
ST_Tumor_Necrosis_Factor_Pathway 31 21 10 0.002 0.007
P53_UP 34 21 13 0.002 0.007
Chr13q12 31 17 14 0.004 0.009
Chr19q12 1 0 1 0.004 0.009
Chr6p25 11 6 5 0.004 0.009
Chr16q11 2 0 2 0.004 0.009
cptPathway 2 0 2 0.004 0.009
fibrinolysisPathway 10 7 3 0.003 0.009
fosbPathway 6 5 1 0.004 0.009
MAP00195_Photosynthesis 2 0 2 0.003 0.009
MAP00310_Lysine_degradation 19 4 15 0.004 0.009
p53hypoxiaPathway 12 7 5 0.003 0.009
tnf_&_fas_network 27 12 15 0.004 0.009
tnfr2Pathway 15 12 3 0.003 0.009
GO_ROS 22 4 18 0.004 0.009
FRASOR_ER_UP 30 16 14 0.004 0.009
chr10q26 29 11 18 0.005 0.010
chr11q22 24 10 14 0.005 0.010
atrbrcaPathway 19 3 16 0.005 0.010
dnafragmentPathway 7 1 6 0.005 0.010
longevityPathway 11 7 4 0.005 0.010
LEU_DOWN 166 48 118 0.005 0.010
TABLE 2. UNIVARIATE COX REGRESSION OF OUTCOMES WITH DEMOGRAPHIC AND CLINICAL CHARACTERISTICS
Characteristics  Overall mortality hazard ratio (95% CI)  p-value
Age 0.94 (0.91, 0.98) 0.004
Diameter 1.04 (1.01, 1.06) 0.001
Lymph-node status 1.07 (0.97, 1.17) 0.170
Mastectomy (yes vs. no) 1.20 (0.77, 1.87) 0.410
Estrogen receptor status (positive vs. negative) 0.30 (0.20, 0.48) <0.001
Tumor grade
Intermediate (vs. good) 4.65 (1.61, 13.40) <0.001
Poor (vs. good) 10.22 (3.69, 28.30)
Chemotherapy (yes vs. no) 0.79 (0.49, 1.26) 0.330
Hormonal therapy (yes vs. no) 0.61 (0.26, 1.40) 0.240
where $\beta^{(p)}$ is the regression coefficient corresponding to the pth gene in a particular pathway. The estimated
coefficient from the Cox model for each individual gene was taken as the association measure between the
gene and survival, and entered into W along with its standard error estimate.
Of the 728 pathways examined, 635 were found to be significantly (q-value < 0.01) associated with
overall survival. Further, we adjusted for known demographic and clinical covariates of the overall survival
and re-examined the association of the 728 pathways with overall survival. Demographic and clinical
covariates available in the data included age, diameter of the tumor, lymph-node status (positive, coded 1;
or negative, coded 0), mastectomy (vs. no mastectomy), estrogen-receptor status (positive, coded 1; or
negative, coded 0), tumor histological grade, chemotherapy (vs. no chemotherapy) and hormonal therapy
(vs. no hormonal therapy). We first examined the association of each covariate with overall survival in
a univariate Cox regression model. The results of these univariate analyses are presented in Table 2.
Covariates with p-value less than 0.2 were earmarked to be adjusted for in the pathway analysis. There is
a substantial reduction in the number of pathways that were identified as having an association with overall
survival (q-value < 0.01), after adjustment for known demographics and clinical information. Specifically,
after the covariate adjustment, only four pathways were significantly (q-value < 0.01) associated with
overall survival. These pathways are listed in Table 3. All four were among the 635 pathways found to
be statistically significant in the analysis without covariate adjustment. Of all four pathways (Table 3),
three are tightly related: Glycine serine and threonine metabolism; Cyanoamino acid metabolism; and
Methane metabolism. Further, a key component of One_carbon_pool_by_folate pathway is coupled to
methionine synthase.
TABLE 3. LIST OF SIGNIFICANT PATHWAYS AFTER COVARIATE ADJUSTMENT (p-VALUE AND q-VALUE BOTH <0.001)
Pathway name  Set size  No. of genes with rp/sp < 0  No. of genes with rp/sp > 0
MAP00260_Glycine_serine_and_threonine_metabolism 7 3 4
MAP00460_Cyanoamino_acid_metabolism 4 2 2
MAP00670_One_carbon_pool_by_folate 5 1 4
MAP00680_Methane_metabolism 2 0 2
4. DISCUSSION
We have proposed a unified method of pathway analysis for diverse phenotypes via a regression framework.
The term “pathway” has been used loosely to include groups of genes based on their chromosomal locations,
since the proposed approach is also relevant in situations where the objective is to discover whether the phenotype
is associated with expression of a group of genes that are located closely on the same chromosome.
The proposed W statistic is a sum of squares of Wald statistics from regression models assessing the
associations of individual genes in the pathway of interest with a phenotype.
There is an apparent similarity between the proposed W statistic and the SAMGS statistic. However,
SAMGS is based on linear modeling of individual gene expression in the pathway with the binary phenotype
label as a predictor. Each component $d_p$ of the SAMGS statistic is the ratio of the estimated phenotype effect to its adjusted
standard error. The issue of small variability of gene expression in microarray data necessitated an adjustment
to the standard error in the SAMGS statistic in order to mitigate the chance of false discovery (Tusher
et al., 2001; Wright and Simon, 2003). In our proposed method, however, we modeled the phenotype,
not the gene expression, as the outcome variable. The proposed test statistic, therefore, does not require a
variance estimate of each gene’s expression which can lead to extremely small standard errors and high
chance of false discovery. Thus, the adjustment for the small variability used in the SAMGS statistic is not
applicable to the proposed test statistic.
The approach presented offers a significant addition to the literature in two ways. First, the proposed
method accommodates pathway analysis with respect to any phenotype that can be handled within the
existing regression framework. Second, the use of a regression framework enables the assessment of
pathway-phenotype associations accounting for other known prognostic factors (covariates) and in study designs
that are subject to correlation or clustering in data. This pathway-analysis method, by incorporating these
features, should be a useful addition to the available tools.
In particular, for a censored survival phenotype, the global test of Goeman et al. (2005) is the only
currently applicable method. Our proposal renders pathway analysis accessible for any phenotype with
an existing classical regression method. An important advantage of the unified approach to pathway analysis
presented here is that the method is situated within the context of well-known classical regression methods.
It thus puts a statistically sound yet simple tool within the reach of biologists and clinicians with
an interest in understanding the association of diverse clinical phenotypes and biological pathways. Also, the
method readily lends itself to easy implementation using existing programmable software with capability
for handling analysis involving basic regression methods with a minimal programming effort from the
analyst. R implementations for various phenotypes are available and can be requested from the authors.
We are currently preparing an Excel Add-in for a user-friendly interface.
ACKNOWLEDGMENTS
Support for this research was provided by the Alberta Heritage Foundation for Medical Research
(postdoctoral fellowship to A.J.A. and I.D., senior health investigator award to Y.Y.) and the Canada
Research Chair Program, Canadian Institute of Health Research, (to Y.Y.).
DISCLOSURE STATEMENT
No competing financial interests exist.
REFERENCES
Aalen, O.O. 1998. Frailty models. In: Everitt, B.S., and Dunn, G., eds., Statistical Analysis of Medical Data: New
Developments, Arnold, London, pp. 59–74.
Agresti, A. 2002. Categorical Data Analysis. Wiley, New York.
Benjamini, Y., and Hochberg, Y. 1995. Controlling the false discovery rate: a practical and powerful approach to
multiple testing. J. R. Statist. Soc. B 57, 289–300.
Bhojwani, D., et al. 2006. Biologic pathways associated with relapse in childhood acute lymphoblastic leukemia: a
Children’s Oncology Group study. Blood 108, 711–717.
Braun, T.M., and Feng, Z. 2001. Optimal permutation tests for the analysis of group randomized trials. J. Am. Statist.
Assoc. 96, 1424–1432.
Breslow, N., and Powers, W. 1978. Are there two logistic regressions for retrospective studies? Biometrics 34, 100–105.
Chamberlain, G. 1980. Analysis of covariance with qualitative data. Rev. Econ. Stud. 47, 225–238.
Cox, D.R. 1972. Regression models and life-tables. J. R. Statist. Soc. Ser. B 34, 187–220.
Dempster, A.P. 1958. A high dimensional two sample significance test. Ann. Math. Statist. 29, 995–1010.
Dempster, A.P. 1960. A significance test for the separation of two highly multivariate small samples. Biometrics 16,
41–50.
Dinu, I., et al. 2007. Improving gene set analysis of microarray data by SAM-GS. BMC Bioinform. 8, 242.
Goeman, J.J., and Bühlmann, P. 2007. Analyzing gene expression data in terms of gene sets: methodological issues.
Bioinformatics 23, 980–987.
Goeman, J.J., et al. 2004. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics
20, 93–99.
Goeman, J.J., et al. 2005. Testing association of a pathway with survival using gene expression data. Bioinformatics
21, 1950–1957.
Good, P. 1994. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer, New
York.
Hochberg, Y. 1988. A sharper Bonferroni for multiple significance testing. Biometrika 75, 800–803.
Hougaard, P. 1995. Frailty models for survival data. Lifetime Data Anal. 1, 255–273.
Liu, Q., et al. 2007. Comparative evaluation of gene-set analysis methods. BMC Bioinform. 8, 431.
Mansmann, U., and Meister, R. 2005. Testing differential gene expression in functional groups. Goeman’s global test
versus an ANCOVA approach. Methods Inf. Med. 44, 449–453.
McCullagh, P., and Nelder, J.A. 1989. Generalized Linear Models. Chapman & Hall/CRC, New York.
McCulloch, C.E., and Shayle, R.S. 2000. Generalized, Linear, and Mixed Models. Wiley, New York.
Molenberghs, G., and Verbeke, G. 2005. Models for Discrete Longitudinal Data. Springer, New York.
Mootha, V.K., et al. 2003. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately
downregulated in human diabetes. Nat. Genet. 34, 267–273.
Nelder, J.A., and Wedderburn, R.W.M. 1972. Generalized linear models. J. R. Statist. Soc. Ser. A 135, 370–384.
Storey, J.D. 2002. A direct approach to false discovery rates. J. R. Statist. Soc. Ser. B 64, 479–498.
Storey, J.D., et al. 2004. Strong control, conservative point estimation and simultaneous conservative consistency of
false discovery rates: a unified approach. J. R. Statist. Soc. Ser. B 66, 187–205.
Storey, J.D., and Tibshirani, R. 2003. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. USA 100,
9440–9445.
Subramanian, A., et al. 2005. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide
expression profiles. Proc. Natl. Acad. Sci. USA 102, 15545–15550.
Tusher, V.G., et al. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl.
Acad. Sci. USA 98, 5116–5121.
van de Vijver, M.J., et al. 2002. A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J.
Med. 347, 1999–2009.
Wright, G.W., and Simon, R.M. 2003. A random variance model for detection of differential gene expression in small
microarray experiments. Bioinformatics 19, 2448–2455.
Address reprint requests to:
Dr. Y. Yasui
Department of Public Health Sciences
University of Alberta
13-106A Clinical Sciences Building
Edmonton, Alberta T6G 2G3 Canada
E-mail: [email protected]