

Network-Based Methods to Identify Highly Discriminating Subsets of Biomarkers

Seyed Javad Sajjadi, Xiaoning Qian, Bo Zeng, and Amin Ahmadi Adl

Abstract—Complex diseases such as various types of cancer and diabetes are conjectured to be triggered and influenced by a combination of genetic and environmental factors. To integrate potential effects from interplay among underlying candidate factors, we propose a new network-based framework to identify effective biomarkers by searching for groups of synergistic risk factors with high predictive power for disease outcome. An interaction network is constructed with node weights representing the individual predictive power of candidate factors and edge weights capturing pairwise synergistic interactions among factors. We then formulate this network-based biomarker identification problem as a novel graph optimization model to search for multiple cliques with maximum overall weight, which we denote as the Maximum Weighted Multiple Clique Problem (MWMCP). To achieve optimal or near-optimal solutions, both an analytical algorithm based on the column generation method and a fast heuristic for large-scale networks have been derived. Our algorithms for MWMCP have been implemented to analyze two biomedical data sets: a Type 1 Diabetes (T1D) data set from the Diabetes Prevention Trial-Type 1 (DPT-1) study, and a breast cancer genomics data set for metastasis prognosis. The results demonstrate that our network-based methods can identify important biomarkers with better prediction accuracy than conventional feature selection that only considers individual effects.

Index Terms—Maximum weighted multiple clique problem, discriminating biomarkers, column generation


1 INTRODUCTION

Modern high-throughput technologies have generated unprecedented amounts of large-scale, high-dimensional “-omics” data for better understanding complex diseases, which are commonly believed to result from complicated interactions between genetic risk factors and environmental exposures [1]. Analyzing these high-dimensional heterogeneous data to identify effective biomarkers for better disease prognosis and diagnosis has been a critical challenge in computational biology [2], [3], [4]. Previous methods have focused on either greedy or penalized feature selection, including LASSO [3], [4], [5], [6], which typically do not explicitly consider interactions among different candidate risk factors. These methods have shown limited power to identify stable and effective biomarkers with high predictive power in complex disease studies, as interactive effects may be essential to understanding these systems impairments, including cancer and diabetes [1], [7], [8], [9].

To bridge this gap, a previous study [10] developed an interaction network representation scheme that captures both the individual effects of candidate risk factors and the pairwise interactive effects among them. In this interaction network, each node represents a candidate risk factor and its assigned node weight captures its individual predictive power for the outcome of interest. An edge between any pair of nodes also has an assigned edge weight corresponding to the synergistic power of the interaction between the two corresponding factors. There are different ways to estimate synergy between two risk factors; the estimation we present in this paper is based on regression models.

In this interaction network framework, we then formulate the biomarker identification problem as a network optimization problem to search for a Maximum Weighted Clique (MWC) that has the maximum total weight from both constituent nodes and induced edges. The identified MWC is a complete subnetwork whose selected risk factors have the highest predictive power with the most synergistic interactions among them. Interactive effects among risk factors are therefore integrated with individual effects for the most effective biomarker identification. It is known that complex diseases may be triggered and affected by multiple factors (genetic as well as environmental) [1], [11], [12], [13], which indicates that a single clique may not be sufficient to fully explain the cause or development of disease. A more comprehensive model should therefore be developed and employed to identify a set of highly synergistic cliques in a systematic way. However, such a task poses a major computational challenge, given that computing a single MWC is already NP-hard [14], [15]. To the best of our knowledge, there has been no analytical study on selecting a set of cliques whose total weight for both nodes and edges is maximized.

To achieve the goal of identifying effective biomarkers for accurate disease prognosis and diagnosis, in this paper we address this challenge by first developing advanced mathematical models and algorithms to identify multiple cliques from our interaction network representation. Specifically, we construct a discrete optimization model, which seeks a collection of non-overlapping (disjoint) cliques with maximum total weight, and its top-K extension, which restricts the total number of nodes in that collection to at most K. We observe that although those formulations are compact, state-of-the-art professional solvers cannot deal even with very small instances with tens of candidate risk factors. So, a sophisticated computational strategy, i.e., the column generation (CG) method, is adopted and customized to identify those disjoint cliques simultaneously. Column generation is a two-level master-subproblem computing framework [16], [17], [18] that is suitable for solving large-scale optimization problems. In the master problem (MP), which is a computationally friendly linear program (LP), a solution is derived based on a restricted set of feasible solutions. Then, the dual information of the LP is used to populate the subproblem, which generates high-quality potential solutions that augment the feasible solution set of the master problem. By iteratively solving the master problem and the subproblem, a globally optimal or near-optimal solution of a large-scale instance can be obtained in a reasonable time. Recently, this computational method has been adopted in studies of HIV-1 drug resistance prediction [19] and protein fold prediction [20]. To further improve our solution capability for large-scale problems with -omics data, a fast heuristic method is also designed to sequentially identify highly weighted cliques from the network. These two algorithms allow us to handle networks at different scales in a reasonable time with the desired quality.

We have performed a set of experiments to demonstrate the significance of considering synergistic interactions for biomarker identification, as well as the effectiveness of the identified biomarkers for disease prognosis for both Type 1 Diabetes (T1D) and breast cancer. Our experimental results with both randomly generated networks and interaction networks constructed from the T1D and breast cancer data sets show that our network-based biomarker identification methods can effectively identify critical biomarkers for better prediction accuracy.

• S.J. Sajjadi and B. Zeng are with the Department of Industrial and Management Systems Engineering, University of South Florida, Tampa, FL 33620. E-mail: [email protected], [email protected].

• X. Qian is with the Department of Electrical & Computer Engineering, Texas A&M University, College Station, TX 77843, and the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620, and the Department of Pediatrics, University of South Florida, Tampa, FL 33620. E-mail: [email protected].

• A.A. Adl is with the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620. E-mail: [email protected].

Manuscript received 14 Mar. 2013; revised 16 Apr. 2014; accepted 10 May 2014. Date of publication 15 May 2014; date of current version 4 Dec. 2014. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TCBB.2014.2325014

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 11, NO. 6, NOVEMBER/DECEMBER 2014 1029

1545-5963 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

2 METHODS

For a given data set with disease outcomes and collected measurements of candidate risk factors, we construct a network with weighted nodes and edges. Nodes represent potential biomarkers, with node weights corresponding to their individual power to predict disease outcome, and each edge weight estimates the synergistic effect of the two biomarkers connected by that edge. In order to identify highly synergistic biomarkers with high predictive power for disease outcome, we formulate a network optimization problem to search for a group of cliques whose total weight is maximized. We denote this problem as the Maximum Weighted Multiple Clique Problem (MWMCP). To solve this problem, which has not been studied in the literature, we have designed two algorithms that are described in the following sections. As both algorithms require solving a series of Maximum Weighted Clique Problems (MWCP), we begin by discussing the MWCP, then introduce our optimization model for the MWMCP, and finally describe our solution methods for the MWMCP.

2.1 Exact Algorithm for MWCP

In a weighted undirected network $G(V, E)$, where $V = \{v_1, \ldots, v_n\}$ is a set of nodes and $E \subseteq V \times V$ is a set of edges, a clique is a subset of $V$ in which every pair of nodes is directly connected by an edge in $E$. Each node is assigned a node weight, computed by a function $p : V \to \mathbb{R}$ that estimates the individual predictive power for biomarker identification in our applications. Similarly, the edge between two nodes is assigned an edge weight according to a function $w : E \to \mathbb{R}$ that estimates the synergistic power between the two nodes. Let $e_{ij}$ denote the edge between $v_i$ and $v_j$, and let $E_C \subseteq E$ denote the set of edges induced by a clique $C$. Then the weight of clique $C$, denoted by $w_C$, can be written as

$$w_C = \sum_{v_i \in C} p(v_i) + \sum_{e_{ij} \in E_C} w(e_{ij}).$$

Solving this MWCP yields a clique $C$ with the maximum weight $w_C$.
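As a minimal sketch of the clique weight defined above, the following Python function sums the node weights and the induced edge weights of a candidate clique. The dictionary-based graph representation, the function name `clique_weight`, and the toy weights are assumptions made for illustration; they are not taken from the authors' C++ implementation.

```python
from itertools import combinations


def clique_weight(clique, node_w, edge_w):
    """Return w_C: the sum of node weights plus the sum of induced edge weights.

    node_w maps node -> p(v); edge_w maps frozenset({u, v}) -> w(e).
    Raises ValueError if some pair in `clique` is not joined by an edge.
    """
    nodes = sorted(clique)
    total = sum(node_w[v] for v in nodes)
    for u, v in combinations(nodes, 2):
        key = frozenset((u, v))
        if key not in edge_w:
            raise ValueError(f"{u} and {v} are not adjacent, so this is not a clique")
        total += edge_w[key]
    return total


# Hypothetical toy network (weights chosen only for illustration).
node_w = {"a": 1.0, "b": 0.5, "c": -0.25}
edge_w = {
    frozenset(("a", "b")): 0.25,
    frozenset(("a", "c")): -0.25,
    frozenset(("b", "c")): 0.75,
}
print(clique_weight({"a", "b", "c"}, node_w, edge_w))  # 1.25 from nodes + 0.75 from edges
```

Note that a node with negative weight can still belong to the best clique when its edges contribute enough synergy, which is the situation the combined node-and-edge objective is meant to capture.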

Notice that the classical Maximum Clique Problem (MCP) is a special case of the MWCP with $p(\cdot) = 1$ and $w(\cdot) = 0$. Consequently, the objective of the MCP is to find a clique $C$ of largest size $|C|$. The MCP is well known to be NP-hard [14] and is hard even to approximate [21]. Hence, the MWCP is also NP-hard. The MCP is one of the central graph-theoretical problems, with applications in coding theory [22], computer vision [23], geometric tiling [24], fault diagnosis [25], computational chemistry [26], and especially computational biology and bioinformatics [27]. Accordingly, tremendous effort has been devoted over the years to devising solution algorithms for both weighted and unweighted maximum clique problems (see [15] for a thorough survey). Furthermore, because the MCP is computationally equivalent to the Maximum Independent Set Problem (MISP) and the Minimum Vertex Cover Problem, its study is important for a broad range of real-world applications.

The generalized case of the MWCP, in which both nodes and edges may have unrestricted real-valued weights, has not received much attention in the existing literature. Thus, we first develop an exact algorithm [10] for this general case. It can be considered an extension of the algorithm in [28] for unweighted cases, which has also been modified recently to solve node-weighted cases [29], [30]. Our algorithm adopts a branch-and-bound framework to obtain the exact solution. It maintains three sets: $S$, which denotes the clique currently being formed; $U$, which is the working set and stores the prospective members of $S$; and $P$, which stores the updated weights of nodes in $S$. Initially, at the root node of the branch-and-bound tree (level 0), $S_0 = \emptyset$, $U_0 = V$ and $P_0 = \{p(v_j) \mid v_j \in U_0\}$. A branch at level $d+1$ is created by selecting a node $v_{new}$ from $U_d$. This new member is added to the forming clique $S_{d+1}$ and removed from the previous working set. All nodes in $U_d$ that are adjacent to $v_{new}$ form $U_{d+1}$, and node weights are updated by shifting the respective weights of the edges connecting these nodes to $v_{new}$. Equivalently, $U_d = U_d \setminus \{v_{new}\}$, $S_{d+1} = S_d \cup \{v_{new}\}$, $U_{d+1} = U_d \cap N(v_{new})$ and $P_{d+1} = \{p(v_j) + \sum_{v_k \in S_{d+1}} w(e_{jk}) \mid v_j \in U_{d+1}\}$ are updated, where $N(v_{new})$ is the set of all neighbors of $v_{new}$. The algorithm adopts a depth-first search for branch-and-bound: it descends the search tree first, and whenever $U_d = \emptyset$ it steps back to level $d-1$ to branch again. This procedure keeps track of the updated total weight of the forming clique while forming all possible cliques recursively. Eventually, the clique with the maximum weight is obtained when $U_0 = \emptyset$, i.e., when the search tree has been traversed completely. Note that the updated node weights in $P$ are used for all weight calculations.

To enhance the search process, we embed a pruning strategy in the depth-first branch-and-bound. Specifically, at level $d$, before expanding the forming clique $S_d$, we calculate an upper bound for the weight of the best clique in $U_d$. If this upper bound together with the weight $w_{S_d}$ of the forming clique is lower than the weight of the current best clique, we do not need to explore $U_d$; the current branch is pruned and the algorithm steps directly back to level $d-1$. To obtain strong bounds on the updated weight estimate for fast pruning, we adopt a quick heuristic coloring technique [31], which assigns distinct colors to any two adjacent nodes. Given that all nodes in a clique have different colors, we can obtain an upper bound for the weight of the maximum clique in a working set $U_d$ by adding the maximum node weight in each color class to the sum of all edge weights induced by $U_d$. We have adopted a further strategy to improve efficiency: we employ an ordered set of vertices $V$, given that the node ordering affects only the efficiency of the algorithm, through the induced branching order, and not the correctness of the solution. In our implementation, we order nodes increasingly by degree for more efficient pruning, because intuitively fewer branches on lower levels of the search tree may help to prune higher-degree nodes at higher levels. Experimental observations have confirmed the effectiveness of this ordering strategy.
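The depth-first branch-and-bound described above can be sketched in Python as follows. This is an illustrative reading, not the authors' C++ code: the names are hypothetical, only non-empty cliques are considered, and the pruning bound is a deliberately simpler one (positive updated node weights plus positive edge weights inside the working set) swapped in for the coloring-based bound of [31].

```python
def max_weight_clique(node_w, edge_w):
    """Return (weight, clique) of a maximum weighted non-empty clique.

    node_w maps node -> p(v); edge_w maps frozenset({u, v}) -> w(e).
    """
    adj = {v: set() for v in node_w}
    for e in edge_w:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)

    best = {"w": float("-inf"), "C": frozenset()}

    def upper_bound(U, P):
        # Simple valid bound (an assumption, NOT the coloring bound):
        # no extension of S can gain more than all positive updated node
        # weights plus all positive edge weights inside the working set.
        cand = set(U)
        bound = sum(max(P[v], 0.0) for v in cand)
        bound += sum(max(w, 0.0) for e, w in edge_w.items() if e <= cand)
        return bound

    def expand(S, w_S, U, P):
        # S: forming clique; U: ordered working set; P[v]: p(v) plus the
        # weights of the edges joining v to every member of S.
        if S and w_S > best["w"]:
            best["w"], best["C"] = w_S, frozenset(S)
        U = list(U)
        while U:
            if w_S + upper_bound(U, P) <= best["w"]:
                return  # prune: nothing left in U can beat the incumbent
            v_new = U.pop(0)
            new_U = [u for u in U if u in adj[v_new]]
            new_P = {u: P[u] + edge_w[frozenset((u, v_new))] for u in new_U}
            expand(S | {v_new}, w_S + P[v_new], new_U, new_P)

    # Branch on nodes in increasing order of degree, as in the text.
    order = sorted(node_w, key=lambda v: (len(adj[v]), str(v)))
    expand(frozenset(), 0.0, order, dict(node_w))
    return best["w"], best["C"]


# Hypothetical toy network: {a, b, c} is a triangle, d hangs off c.
node_w = {"a": 1.0, "b": 0.5, "c": -0.25, "d": 1.5}
edge_w = {
    frozenset(("a", "b")): 0.25,
    frozenset(("a", "c")): -0.25,
    frozenset(("b", "c")): 0.75,
    frozenset(("c", "d")): -3.0,
}
print(max_weight_clique(node_w, edge_w))
```

Because weights may be negative, the incumbent is updated at every node of the search tree rather than only at maximal cliques; a tighter bound such as the coloring bound would prune more branches but would not change the result.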

2.2 Mathematical Model for Maximum Weighted Multiple Clique Problem

As explained earlier, we formulate our final biomarker identification problem as a Maximum Weighted Multiple Clique Problem to find a set of disjoint cliques with maximum total weight. The explicit integer programming form, assuming all cliques are given, is presented as follows, while the compact form without that assumption can be found in the appendix:

$$\text{MWMCP-IP}: \quad \max \Big\{ \sum_{C \in \mathcal{C}} w_C X_C \;\Big|\; \sum_{\{C \in \mathcal{C} \mid i \in C\}} X_C \le 1 \;\; \forall i \in V; \quad X_C \in \{0, 1\} \;\; \forall C \in \mathcal{C} \Big\},$$

where $\mathcal{C}$ denotes the collection of all cliques in $G$ and $X_C$ is an indicator variable that equals 1 if clique $C$ is selected in the solution, and 0 otherwise. The objective is to maximize the total weight of selected cliques. The first constraint ensures that the selected cliques are disjoint, i.e., no node can be assigned to more than one selected clique. The second constraint requires that no clique can be partially selected. Because the number of cliques is potentially exponential in the number of nodes in $G$, it is not practical to enumerate all possible cliques in $\mathcal{C}$ to obtain this explicit form. To address this challenge, we employ a column generation algorithm [18], [32] to sequentially generate quality cliques; then, we identify a subset of them with maximum total weight. In addition to this analytical algorithm, we design a fast heuristic procedure that can deal with large-scale instances, for example in the analysis of microarray data in genomics. In the following, we describe our column generation and heuristic procedures to solve MWMCP-IP.
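To make the explicit formulation concrete, the sketch below solves MWMCP-IP by brute force on a six-node network: it enumerates every clique, keeps those with positive weight, and searches for the best collection of pairwise disjoint cliques. This is exactly the exponential enumeration that the text rules out for realistic sizes, so it is only meant for checking tiny instances. All names are hypothetical, and the example network is an assumption reconstructed from the numbers quoted later for Fig. 2 (a triangle with edge weight 0.7, each corner attached to an outside node by an edge of weight 2, all node weights zero).

```python
from itertools import combinations


def weight(C, node_w, edge_w):
    """Clique weight: node weights plus weights of all induced edges."""
    return (sum(node_w[v] for v in C)
            + sum(edge_w[frozenset(p)] for p in combinations(sorted(C), 2)))


def all_cliques(node_w, edge_w):
    """Enumerate every non-empty clique (exponential; tiny graphs only)."""
    nodes = sorted(node_w)
    cliques = []
    for r in range(1, len(nodes) + 1):
        for sub in combinations(nodes, r):
            if all(frozenset(p) in edge_w for p in combinations(sub, 2)):
                cliques.append(frozenset(sub))
    return cliques


def solve_mwmcp_bruteforce(node_w, edge_w):
    """Return (total weight, list of disjoint cliques) maximizing total weight."""
    weighted = [(weight(C, node_w, edge_w), C) for C in all_cliques(node_w, edge_w)]
    # A clique with non-positive weight never improves a feasible selection.
    weighted = [(w, C) for w, C in weighted if w > 0]
    best = {"w": 0.0, "sel": []}

    def search(start, used, total, chosen):
        if total > best["w"]:
            best["w"], best["sel"] = total, list(chosen)
        for j in range(start, len(weighted)):
            w, C = weighted[j]
            if not (C & used):  # disjointness: each node in at most one clique
                chosen.append(C)
                search(j + 1, used | C, total + w, chosen)
                chosen.pop()

    search(0, frozenset(), 0.0, [])
    return best["w"], best["sel"]


# Assumed reconstruction of the Fig. 2 network.
node_w = {v: 0.0 for v in ("t1", "t2", "t3", "o1", "o2", "o3")}
edge_w = {frozenset(e): 0.7 for e in (("t1", "t2"), ("t1", "t3"), ("t2", "t3"))}
edge_w.update({frozenset(e): 2.0 for e in (("t1", "o1"), ("t2", "o2"), ("t3", "o3"))})

total, selection = solve_mwmcp_bruteforce(node_w, edge_w)
print(total, sorted(sorted(C) for C in selection))
```

On this instance the best selection is the three outer pairs rather than the central triangle, which illustrates why a single maximum weighted clique need not belong to the optimal multi-clique solution.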

2.3 The Column Generation Method to Solve MWMCP

Generally, a column generation algorithm [18], [32] has two parts in implementation: a master problem and a subproblem. Using the aforementioned algorithm for MWCP as a subroutine to identify a desired MWC for the subproblem, we can now implement the CG algorithm to solve MWMCP-IP and obtain a collection of disjoint cliques whose total weight is maximized. Assume that we have an initial collection $\bar{\mathcal{C}} \subseteq \mathcal{C}$ of cliques. The master problem is as follows:

$$\text{MP}: \quad \max \Big\{ \sum_{C \in \bar{\mathcal{C}}} w_C X_C \;\Big|\; \sum_{\{C \in \bar{\mathcal{C}} \mid i \in C\}} X_C \le 1 \;\; \forall i \in V; \quad 0 \le X_C \le 1 \;\; \forall C \in \bar{\mathcal{C}} \Big\}.$$

Note that MP is a linear program, and by solving it we obtain the dual solution for each constraint. Such dual information provides an improving direction in which to seek (price out) a new clique, i.e., a column denoting the nodes belonging to the clique. Specifically, let $\pi_i$ denote the dual value corresponding to the $i$th constraint in MP. The weight of $v_i$ in the subproblem is updated from the original node weight to $p(v_i) - \pi_i$; see Fig. 1 for an illustration. On this updated network, we solve the subproblem, i.e., an MWCP, to identify the most weighted clique $C'$ by using the aforementioned algorithm for MWCP. If $w_{C'}$ is positive, we expand $\bar{\mathcal{C}}$ to $\bar{\mathcal{C}} \cup \{C'\}$, i.e., include one more column in MP, and repeat this procedure to solve MP again. Otherwise, we stop the procedure.
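The reason the subproblem only needs shifted node weights can be written out in one step. The derivation below is a sketch under the standard linear programming convention that a candidate column for clique $C'$ has objective coefficient $w_{C'}$ and a coefficient of 1 in the constraint of every node it contains; the symbol $\bar{w}_{C'}$ for the reduced cost is introduced here for illustration and does not appear in the original text.

```latex
% Reduced cost of a candidate clique C' given the dual values \pi_i of MP
\begin{align*}
\bar{w}_{C'} &= w_{C'} - \sum_{v_i \in C'} \pi_i \\
             &= \sum_{v_i \in C'} p(v_i) + \sum_{e_{ij} \in E_{C'}} w(e_{ij}) - \sum_{v_i \in C'} \pi_i \\
             &= \sum_{v_i \in C'} \bigl( p(v_i) - \pi_i \bigr) + \sum_{e_{ij} \in E_{C'}} w(e_{ij}).
\end{align*}
```

The last line is the weight of $C'$ in the network whose node weights have been replaced by $p(v_i) - \pi_i$, so maximizing the reduced cost over all cliques is itself an MWCP. A positive optimum means the new column can improve the LP; a non-positive optimum means no clique outside the current collection can, which is the stopping condition.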

If the column generation procedure terminates with an integer solution to the master problem, we obtain an exact solution to MWMCP-IP. If this is not the case, we include the integrality restriction in MP and solve it again to obtain a feasible integer solution. Because the column generation procedure only generates the cliques that are needed most in order to optimize MP, that feasible solution is of high quality in general.

2.3.1 Top-K-Node Model

It is common nowadays to collect high-dimensional measurements by including all candidate risk factors that may contribute to disease development, especially due to the advancement of high-throughput -omic profiling technologies [33]. By analyzing these high-dimensional data, we hope to identify critical risk factors as biomarkers to better understand the disease of interest. It is often the case that only a limited number of measured variables are associated with disease outcome. Motivated by identifying a small number of biomarkers that are most effective, we consider a variant of the MWMCP-IP model that finds a collection of cliques with maximum total weight that contains up to $K$ nodes. To this end, we extend the MWMCP-IP model by adding a cardinality constraint $\sum_{C \in \mathcal{C}} |C| \, X_C \le K$. Parameter $K$ can be set based on the percentage of nodes in $G$ that we want to include for prediction. For example, if we take 25 percent of the nodes in $G$, then $K = \lceil 0.25\,n \rceil$.
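A small sketch of the cardinality constraint, assuming hypothetical helper names: `node_budget` computes K as the ceiling of a percentage of n using integer arithmetic (to avoid floating-point surprises for percentages that are not exactly representable), and `within_budget` checks whether a selection of cliques uses at most K nodes.

```python
def node_budget(n, percent=25):
    """Return K = ceil(percent/100 * n), computed with integers only."""
    return (n * percent + 99) // 100


def within_budget(selected_cliques, K):
    """Check the cardinality constraint: total number of selected nodes <= K."""
    return sum(len(C) for C in selected_cliques) <= K


K = node_budget(10)                          # 25 percent of 10 nodes, rounded up
print(K)
print(within_budget([{1, 2}, {3}], K))       # 3 nodes selected
print(within_budget([{1, 2}, {3, 4}], K))    # 4 nodes selected
```

Since the selected cliques are disjoint, the sum of clique sizes equals the number of distinct selected nodes, so the check matches the constraint as stated.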

2.4 Heuristic Sequential Method

For very large-scale problems with more than tens of thousands of nodes, we observe that it may be challenging to solve MWMCP-IP or its top-K extension by our CG method in a reasonable time. Clearly, the computational complexity of this problem necessitates the development of a faster method to obtain high-quality solutions within a shorter time frame. Hence, we develop a fast greedy heuristic method to handle such problems. We mention that this heuristic procedure can also be applied to solve the top-K-node variant with minor modifications.

In the greedy method, we sequentially solve the MWCP in $G$ to find a single maximum weighted clique $C$, update $G$ by removing the selected clique from $G$, and repeat those operations until there exists no clique with a positive weight. Obviously, the collection of obtained cliques is a feasible solution to MWMCP-IP, as those cliques are disjoint. To further improve the solution quality, we design and implement a perturbation procedure that avoids removing a clique when doing so would hurt other potentially highly weighted cliques. Specifically, before removing $C$ from $G$, we check whether there exists a node $v_i \in C$ that could be removed from $C$ to form a separate clique $C'$ (consisting of $v_i$ only, or $v_i$ with one of its neighbors outside $C$) such that the total weight of the resulting two cliques ($C \setminus \{v_i\}$ and $C'$) is greater than the weight of $C$. If such a node exists, we perturb $C$ by removing $v_i$. The idea of perturbation is demonstrated in Fig. 2, where all node weights are assumed to be zero. The solution to the MWCP for the network given in Fig. 2 is the triangle in the middle, with a total weight of 2.1. If we remove this clique, there is no clique with a positive weight in the remaining network. Hence, this triangle clique would be the solution to MWMCP under the greedy sequential procedure, although it is not optimal, as there are three pairwise cliques with a greater total weight of 6. With perturbation, we can perturb any node in the triangle clique $C$ by removing it from $C$, forming another clique with a weight of 2 by connecting it to its adjacent node, and obtaining two cliques with a total weight of 2.7, greater than $w_C = 2.1$. By repeating those steps, we obtain the three cliques shown in Fig. 2 (right), which constitute the optimal solution for the MWMCP in the given network.
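The greedy procedure and its perturbation step can be sketched as follows on a network consistent with the Fig. 2 discussion above. Several points are assumptions rather than the authors' method: the network is reconstructed from the quoted weights (triangle edges 0.7, outer edges 2, node weights 0); a brute-force search stands in for the exact MWCP solver; and, as one reading of "repeating those steps", a successful perturbation commits the split-off clique $C'$ and re-solves the MWCP on the remaining nodes.

```python
from itertools import combinations


def weight(C, node_w, edge_w):
    """Clique weight: node weights plus weights of all induced edges."""
    return (sum(node_w[v] for v in C)
            + sum(edge_w[frozenset(p)] for p in combinations(sorted(C), 2)))


def best_clique(nodes, node_w, edge_w):
    """Brute-force MWCP restricted to `nodes` (stand-in solver; tiny graphs only)."""
    best_w, best_C = 0.0, frozenset()
    for r in range(1, len(nodes) + 1):
        for sub in combinations(sorted(nodes), r):
            if all(frozenset(p) in edge_w for p in combinations(sub, 2)):
                w = weight(sub, node_w, edge_w)
                if w > best_w:
                    best_w, best_C = w, frozenset(sub)
    return best_w, best_C


def split_off(C, nodes, node_w, edge_w):
    """Perturbation test: return the best C' = {v} or {v, u}, with v in C and u a
    neighbor of v outside C, such that w(C - {v}) + w(C') > w(C); else None."""
    w_C, best = weight(C, node_w, edge_w), None
    for v in sorted(C):
        rest_w = weight(C - {v}, node_w, edge_w)
        outside = [u for u in sorted(nodes - C) if frozenset((u, v)) in edge_w]
        for Cp in [frozenset([v])] + [frozenset([v, u]) for u in outside]:
            total = rest_w + weight(Cp, node_w, edge_w)
            if total > w_C + 1e-12 and (best is None or total > best[0]):
                best = (total, Cp)
    return None if best is None else best[1]


def greedy_mwmcp(node_w, edge_w, perturb=True):
    """Sequentially pick cliques until no positive-weight clique remains."""
    remaining, solution, total = set(node_w), [], 0.0
    while True:
        w, C = best_clique(remaining, node_w, edge_w)
        if w <= 0:
            break
        Cp = split_off(C, remaining, node_w, edge_w) if perturb else None
        chosen = Cp if Cp is not None else C  # commit C' when perturbation helps
        solution.append(chosen)
        total += weight(chosen, node_w, edge_w)
        remaining -= chosen
    return total, solution


# Assumed reconstruction of the Fig. 2 network.
node_w = {v: 0.0 for v in ("t1", "t2", "t3", "o1", "o2", "o3")}
edge_w = {frozenset(e): 0.7 for e in (("t1", "t2"), ("t1", "t3"), ("t2", "t3"))}
edge_w.update({frozenset(e): 2.0 for e in (("t1", "o1"), ("t2", "o2"), ("t3", "o3"))})

print(greedy_mwmcp(node_w, edge_w, perturb=False)[0])  # plain greedy: triangle only
print(greedy_mwmcp(node_w, edge_w, perturb=True)[0])   # with perturbation: three pairs
```

On this instance plain greedy stops after the triangle, while the perturbed variant recovers the three outer pairs. In general the perturbed greedy remains a heuristic and carries no optimality guarantee.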

3 EXPERIMENTS AND DISCUSSIONS

The performance of our methods is evaluated on both Erdős-Rényi (ER) random networks [34] and interaction networks constructed from data collected in the DPT-1 study for T1D as well as from the breast cancer microarray data set. All algorithms are implemented in C++ and run on a standard PC with a 2.2 GHz CPU and 2 GB of RAM. The state-of-the-art integer programming solver IBM ILOG CPLEX 12.1 is adopted to solve MP within the CG method, as well as the compact integer programming formulation (see the appendix). Results of the latter can be used to benchmark the developed CG and heuristic sequential methods.

3.1 ER Random Networks and Algorithm Performance

We generate a set of random networks to evaluate and compare the performance of the proposed methods in terms of solution quality and time efficiency. To generate an ER random network with a given number of edges (or, equivalently, a given density), a pair of distinct nodes is randomly chosen and connected by adding an edge; this process is repeated until we reach the desired number of edges. Node and edge weights are also assigned randomly. Node weights are drawn from a uniform distribution between −1 and 1, while edge weights are uniformly distributed between −0.5 and 0.5. For the top-K extensions, we set $K$ such that the solution involves 25 percent of the nodes, i.e., $K = \lceil 0.25\,n \rceil$.
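The generation procedure described above can be sketched in a few lines of Python. The function name, the seed parameter, and the choice to skip pairs that are already connected are assumptions for illustration; the weight ranges follow the text.

```python
import random


def random_network(n, m, seed=0):
    """Generate an ER-style random network with n nodes and exactly m edges.

    Node weights are uniform on [-1, 1]; edge weights are uniform on [-0.5, 0.5].
    Returns (node_w, edge_w) with edge keys of the form frozenset({u, v}).
    """
    max_edges = n * (n - 1) // 2
    if m > max_edges:
        raise ValueError("too many edges requested for a simple graph")
    rng = random.Random(seed)
    node_w = {v: rng.uniform(-1.0, 1.0) for v in range(n)}
    edge_w = {}
    while len(edge_w) < m:
        u, v = rng.sample(range(n), 2)  # two distinct nodes
        key = frozenset((u, v))
        if key not in edge_w:           # assumed: skip pairs already connected
            edge_w[key] = rng.uniform(-0.5, 0.5)
    return node_w, edge_w


def edges_for_density(n, density):
    """Number of edges corresponding to a target density in [0, 1]."""
    return round(density * n * (n - 1) / 2)


node_w, edge_w = random_network(20, edges_for_density(20, 0.2), seed=1)
print(len(node_w), len(edge_w))
```

For dense targets the rejection loop slows down as fewer unconnected pairs remain; sampling directly from the list of all pairs would avoid that, at the cost of materializing the list.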

Experimental results on ER networks are presented in Table 1. Solving the compact integer programming formulation by CPLEX is not practically feasible, as it is extremely hard to solve even instances with only tens of nodes. In contrast, the CG method can solve instances of up to a thousand nodes in a reasonable time, which drastically extends our solution capability to larger instances. Indeed, for instances whose optimal solutions can be computed by CPLEX, CG solves them to optimality with negligible computational time. For those instances for which CPLEX fails to derive optimal solutions within two hours, the CG method always generates significantly better solutions. This observation confirms the capability of the CG method to obtain optimal or near-optimal solutions. The experiments on random networks have also shown empirically that, for the CG method, the gap between the integral solution obtained after adding the integrality constraint and the optimal integral solution is no more than 10 percent.

Fig. 1. Illustration of the subproblem in CG in three iterations. Each time, the dual values (represented as vectors) are updated by solving the master problem and selected cliques are added to the master problem. We stop when there is no clique with a positive total weight.

Fig. 2. The solution to MWMCP without perturbation (left) may differ dramatically from the solution with perturbation (right).

We further note for instances with thousands of nodesthat CGmethod may take a long time to compute. Neverthe-less, the heuristic sequential method is much less sensitiveto the size of instances. It can quickly generate high-qualitysolutions for large-scale networks. Actually, it generallygenerates solutions of high quality and sometimes the differ-ence is marginal, compared to those derived by CG method.Hence, we believe that these two algorithms allow us to han-dle networks at different scales in a reasonable time withdesired quality.

Finally, for the top-K model, both the CG and the sequential methods complete within a shorter time. One explanation is that the cardinality constraint makes the problems easier to solve by cutting out a significant number of feasible solutions.

3.2 Network-Based Biomarker Identification

3.2.1 Construction of Interaction Networks

In order to evaluate our network-based biomarker identification methods, we first construct a weighted network for all candidate risk factors included in the analysis. We define the node weight $p(v_i) = -\log p_i$, in which $p_i$ is the coefficient p-value for $\beta_1$ obtained by fitting a logistic regression model $\log(\gamma/(1-\gamma)) = \beta_0 + \beta_1 v_i$ with the corresponding candidate factor $v_i$. Here, $\gamma$ denotes the posterior probability of a certain disease outcome $y$ given the measurement of $v_i$, that is, $\gamma = \Pr(y \mid v_i)$, and $\log(\gamma/(1-\gamma))$ is the link function of the logistic regression model. Similarly, we define the edge weight $w(e_{ij})$ between candidate factors $v_i$ and $v_j$ as $w(e_{ij}) = -\log p_{ij}$, based on the coefficient p-value $p_{ij}$ for $\beta_3$ in the logistic regression model that includes the interaction term between $v_i$ and $v_j$: $\log(\gamma/(1-\gamma)) = \beta_0 + \beta_1 v_i + \beta_2 v_j + \beta_3 v_i v_j$, in which $\gamma = \Pr(y \mid v_i, v_j)$. In the constructed network, we remove low-weighted edges to focus on strong interactions, which also makes the problem easier to solve at large scales.
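As a minimal sketch of this construction, the snippet below turns already-computed p-values into node and edge weights and drops weak edges. It assumes the p-values come from separately fitted logistic regression models, that the logarithm is natural (the text does not state the base), and that pruning uses a simple cutoff; the function name, the `min_edge_weight` parameter, and the example p-values are hypothetical.

```python
import math

def build_interaction_network(node_pvalues, edge_pvalues, min_edge_weight):
    """Convert p-values into weights via weight = -log(p) and prune weak edges."""
    node_w = {v: -math.log(p) for v, p in node_pvalues.items()}
    edge_w = {}
    for (u, v), p in edge_pvalues.items():
        w = -math.log(p)
        if w >= min_edge_weight:  # keep only strong interactions (assumed cutoff rule)
            edge_w[frozenset((u, v))] = w
    return node_w, edge_w

# Made-up p-values for three hypothetical factors and their pairwise interactions.
node_w, edge_w = build_interaction_network(
    {"a": 0.01, "b": 0.2, "c": 0.5},
    {("a", "b"): 0.001, ("a", "c"): 0.6, ("b", "c"): 0.04},
    min_edge_weight=-math.log(0.05),
)
```

With this cutoff, only interactions whose p-value is at most 0.05 survive, so the edge between "a" and "c" is removed.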

3.2.2 Network-Based Biomarker Identification and Performance Evaluation

We implement our network-based methods and compare them with a traditional forward feature selection algorithm [3] that only considers the discriminating power of individual candidate biomarkers. Such a comparison demonstrates that our network-based biomarker identification approach can achieve better prediction accuracy due to the integration of interactive effects among candidate factors. We first apply both the column generation and heuristic sequential algorithms to solve MWMCP on the interaction networks. As a result, we obtain multiple cliques that capture both individual and interactive effects among candidate factors. To evaluate and compare the performance of biomarker identification, we adopt a Support Vector Machine (SVM) with a polynomial kernel of degree two as our classifier. This choice of kernel ensures that interactions among biomarkers are considered for classification while controlling model complexity at the same time. In our experiments, we have adopted the LIBSVM [35] implementation of SVM in Matlab. For both network-based and individual biomarker identification, the same forward feature selection procedure has been applied to select the best group of biomarkers with the highest classification accuracy.
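To illustrate why a degree-two polynomial kernel accounts for pairwise interactions, the following sketch compares the kernel value with the inner product of an explicit feature map that contains the product term $x_1 x_2$. LIBSVM's polynomial kernel has the form (gamma * u'v + coef0)^degree; the values gamma = 1 and coef0 = 1 used here are assumptions, since the kernel parameters are not reported in the text, and the function names are illustrative.

```python
import math

def poly2_kernel(x, z, gamma=1.0, coef0=1.0):
    """Degree-2 polynomial kernel: (gamma * <x, z> + coef0) ** 2."""
    dot = sum(a * b for a, b in zip(x, z))
    return (gamma * dot + coef0) ** 2

def explicit_features(x):
    """Explicit feature map for two inputs, matching gamma = coef0 = 1."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    # The last entry is the pairwise interaction term x1 * x2.
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

x, z = (0.5, -1.0), (2.0, 3.0)
k_implicit = poly2_kernel(x, z)
k_explicit = sum(a * b for a, b in zip(explicit_features(x), explicit_features(z)))
```

Both quantities agree, which shows that the classifier implicitly works with squared and cross-product terms of the selected biomarkers.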

As there are several steps in our classifier training stage, we perform the following “embedded” cross-validation to appropriately estimate the classification performance for both network-based and individual biomarker identification. In this cross-validation procedure, we first randomly divide the data set into five folds. Then, four

TABLE 1. Experimental Results for Random Graphs

| \|V\| | % | \|C\| | CG obj | CG t | CG top-K obj | CG top-K t | Seq obj | Seq t | Seq top-K obj | Seq top-K t | CPLEX IP obj | CPLEX IP t | CPLEX IP gap % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 50 | 27 | 4.65 | 0.02 | 3.6 | 0.01 | 3.97 | 0.01 | 2.99 | 0.01 | 4.65 | 11.3 | 0 |
| 20 | 80 | 35 | 13.02 | 0.02 | 6.6 | 0.02 | 12.29 | 0.01 | 6.3 | 0.01 | 13.02 | 67.3 | 0 |
| 50 | 5 | 58 | 13.73 | 0.03 | 9.65 | 0.02 | 13.51 | 0.01 | 8.62 | 0.01 | 13.73 | 31.1 | 0 |
| 50 | 20 | 75 | 14.15 | 0.03 | 8.9 | 0.02 | 13.48 | 0.01 | 8.12 | 0.01 | 14.15 | 7200 | 12.8 |
| 50 | 30 | 91 | 22.61 | 0.05 | 13.22 | 0.04 | 21.12 | 0.01 | 11.26 | 0.01 | 22.03 | 7200 | 42.3 |
| 50 | 40 | 86 | 17.86 | 0.06 | 12.29 | 0.05 | 16.78 | 0.01 | 10.2 | 0.01 | 11.58 | 7200 | 193.3 |
| 50 | 60 | 129 | 29.81 | 0.34 | 15.16 | 0.26 | 26.92 | 0.01 | 14.33 | 0.01 | 19.59 | 7200 | 261.4 |
| 50 | 80 | 162 | 36.11 | 5.43 | 19.53 | 5.37 | 32.2 | 0.03 | 17.8 | 0.02 | 17.24 | 7200 | 453.0 |
| 100 | 40 | 295 | 173.1 | 7.88 | 64.47 | 1.42 | 169.6 | 0.03 | 65.36 | 0.01 | - | - | - |
| 200 | 40 | 525 | 120.0 | 66.8 | 56.75 | 44.8 | 108.4 | 0.59 | 51.97 | 0.37 | - | - | - |
| 500 | 25 | 1261 | 270.5 | 578.2 | 132.2 | 563.2 | 247.2 | 7.76 | 125.4 | 4.8 | - | - | - |
| 1000 | 10 | 2873 | 1535.8 | 1385.9 | 554.2 | 1026.1 | 1520.7 | 11.8 | 523.6 | 8.15 | - | - | - |
| 2000 | 5 | 5730 | 2355.0 | 4674.1 | 888.2 | 4630.1 | 2213.3 | 57.0 | 849.3 | 33.4 | - | - | - |

(%: graph density; CG: column generation algorithm; Seq: heuristic sequential algorithm; CPLEX IP: compact formulation solved by solver; obj: the total weight of selected cliques; t: computing time in seconds; gap %: the relative gap between the best integer solution and the best upper bound.)

SAJJADI ET AL.: NETWORK-BASED METHODS TO IDENTIFY HIGHLY DISCRIMINATING SUBSETS OF BIOMARKERS 1033


folds of data are used as the training set to select biomarkers and build the classifier, and the remaining fold is used as the testing set to estimate the classification accuracy of the selected biomarkers. This procedure is repeated five times, with each fold used once as the testing set.

For each training set, we perform a feature selection algorithm. For individual feature selection, we rank candidate biomarkers in descending order of their individual discriminating power, measured by the coefficient p-values from the fitted regression models. For network-based methods, we rank the identified cliques in descending order of their total weights. Then, the same forward selection procedure is implemented to sequentially add individual biomarkers or cliques, in the ranked order, to the set of selected biomarkers. If adding a new individual factor or clique improves the estimated classification accuracy, it is selected into the final biomarker set. Otherwise, we move on to the next ranked factor or clique and repeat the same procedure.
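The forward selection step can be sketched as below. This is only an illustration under stated assumptions: `estimate_accuracy` stands in for the cross-validated SVM accuracy, the first ranked group is always accepted because the baseline for an empty set is not specified in the text, and the toy scoring function and feature names are made up.

```python
def forward_select(ranked_groups, estimate_accuracy):
    """Greedy forward selection over ranked groups (single markers or cliques)."""
    selected = []
    best = float("-inf")  # assumed baseline: the first group is always accepted
    for group in ranked_groups:
        candidate = selected + [f for f in group if f not in selected]
        acc = estimate_accuracy(candidate)
        if acc > best:  # keep the group only if the estimated accuracy improves
            selected, best = candidate, acc
    return selected, best

def toy_accuracy(features):
    """Hypothetical stand-in for the cross-validated classifier accuracy."""
    useful = {"g1": 0.10, "g2": 0.05, "g4": 0.08}
    return 0.5 + sum(useful.get(f, -0.02) for f in features)

# A clique is passed as a list of its members; a single marker is a one-element list.
ranked = [["g1", "g2"], ["g3"], ["g4"]]
selected, acc = forward_select(ranked, toy_accuracy)
```

In this toy run the group containing "g3" is skipped because it lowers the score, while "g4" is added afterwards.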

The classification accuracy for feature selection is estimated by traditional three-fold cross-validation on the training set, in which two folds of the training set are used to train the SVM classifier and one fold is used for testing. The procedure is repeated three times to estimate the accuracy based on the currently selected biomarkers. Finally, the testing set is used to estimate the testing classification accuracy based on the selected final biomarker set. The overall evaluation procedure is repeated 100 times, and the average accuracy is reported as the estimated classification accuracy.
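The index bookkeeping of this embedded procedure can be sketched as follows: an outer five-fold split provides the testing sets, and each outer training set is split again into three folds for accuracy estimation during feature selection. The function names, the fixed seed, and the plain (non-stratified) shuffling are assumptions made for illustration.

```python
import random

def k_fold_indices(indices, k, rng):
    """Shuffle the indices and deal them into k nearly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def embedded_cv_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (train, test, inner_splits) for one pass of embedded cross-validation."""
    rng = random.Random(seed)
    outer = k_fold_indices(range(n_samples), outer_k, rng)
    for i, test in enumerate(outer):
        train = [s for j, fold in enumerate(outer) if j != i for s in fold]
        inner = k_fold_indices(train, inner_k, rng)
        inner_splits = []
        for m, val in enumerate(inner):
            fit = [s for j, fold in enumerate(inner) if j != m for s in fold]
            inner_splits.append((fit, val))  # used only during feature selection
        yield train, test, inner_splits

splits = list(embedded_cv_splits(n_samples=30))
```

Repeating the whole pass with different seeds would correspond to the 100 repetitions described above; the testing indices never enter the inner splits.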

3.2.3 DPT-1

We first test and compare the performance of different biomarker identification methods using a relatively small data set studying Type 1 Diabetes (T1D). We study baseline characteristics, including immunologic and metabolic indices, with respect to T1D development in high-risk subjects using the data collected in the Diabetes Prevention Trial-Type 1 (DPT-1) study. In DPT-1, there are 3,483 subjects positive for islet cell autoantibodies (ICA) among the total 103,391 screened subjects. The projected five-year risk of diabetes for these subjects is estimated according to genetic susceptibility; age; immunologic indices from different autoantibodies, including ICA, insulin autoantibodies (IAA), glutamic acid decarboxylase (GAD), ICA512 (insulinoma-associated protein 2), and micro-insulin autoantibodies (MIAA); and metabolic indices, including 2-hour glucose, fasting glucose, glycated hemoglobin (HbA1c), fasting insulin, first-phase insulin response (FPIR), and C-peptide measurements in the fasting state and then 30, 60, 90, and 120 minutes after oral glucose. As in the previous univariate analysis [36], we compute the homeostasis model assessment of insulin resistance (HOMA-IR = fasting insulin (mm/l) × fasting glucose (mmol/l) / 22.5), the FPIR-to-HOMA-IR ratio, peak C-peptide as the maximum of all measurements, and AUC (area under the curve) C-peptide using the trapezoid rule based on the given metabolic indices. Furthermore, we include age and Body Mass Index (BMI) in our network-based multivariate analysis as important confounding factors.
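The derived metabolic indices can be computed as in the following sketch. The measurement values are made up, the function names are illustrative, and the inputs are assumed to be expressed in the units required by the HOMA-IR formula above.

```python
def homa_ir(fasting_insulin, fasting_glucose):
    """HOMA-IR = fasting insulin * fasting glucose / 22.5."""
    return fasting_insulin * fasting_glucose / 22.5

def trapezoid_auc(times, values):
    """Area under the curve by the trapezoid rule over consecutive time points."""
    return sum((t1 - t0) * (v0 + v1) / 2.0
               for t0, t1, v0, v1 in zip(times, times[1:], values, values[1:]))

# Made-up C-peptide readings at fasting and 30, 60, 90, 120 minutes after oral glucose.
times = [0, 30, 60, 90, 120]
c_peptide = [1.0, 3.0, 4.0, 3.5, 2.5]

peak_c_peptide = max(c_peptide)                  # maximum over all measurements
auc_c_peptide = trapezoid_auc(times, c_peptide)  # 367.5 for these values
insulin_resistance = homa_ir(10.0, 4.5)          # 2.0 for these values
fpir_to_homa_ir = 100.0 / insulin_resistance     # hypothetical FPIR value of 100
```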

In this paper, we focus on DPT-1 study subjects staged to the “high risk” group [36], [37], [38], which contains 339 subjects in total. Within this high-risk group, 169 subjects received parenteral insulin supplement, and we refer to this set as the “Treatment” group, while the other 170 subjects received placebo as a control group, which is referred to as “Placebo” in this paper. We are interested in identifying the most predictive group of biomarkers from the previously described candidates to predict the outcome $y$, the development of T1D at the end of the DPT-1 study. Within both the “Treatment” and “Placebo” groups, there are 80 subjects diagnosed with T1D at the end of the study, with $y = 1$. We have tested both the individual and network-based methods using both groups of data. We have computed the classification accuracies of the different biomarker identification methods based on the previous cross-validation procedure. These estimated classification accuracies are reported in Table 2.

Comparing both the column generation and sequential network-based methods with the individual-based feature selection, the reported results clearly show that both network-based biomarker identification methods perform significantly better (with p-values < 1e-6) than the traditional individual-based feature selection. These results verify our expectation that network-based biomarker selection methods are able to find biomarkers with higher predictive accuracies by integrating interactive effects among biomarkers.

In the previous cross-validation experiments, we have performed 500 feature selections (100 repetitions of five-fold “embedded” cross validation), each time based on a randomly sampled training subset. As a result, we have obtained 500 final biomarker sets. To ensure that we have obtained reliable results without overfitting, we provide in Figs. 3A and 3B the lists of frequently selected final biomarkers that appeared in at least 70 percent of the 500 feature selection runs from different biomarker identification methods for the “Treatment” and “Placebo” groups, respectively. When comparing the features selected by our CG and sequential methods, we find that the additional features selected by CG are fasting glucose levels from either OGTT or IVGTT. As discussed in the recent position statement of the American Diabetes Association (ADA) [39], these indices are main diagnostic criteria for clinical diabetes. We have also tested the performance of those final biomarkers based on 100 repeated five-fold cross validation (without feature selection), and their corresponding estimated testing accuracies are given in Table 3. The results further verify that, within these commonly conjectured important biomarkers for T1D [36], [37], [38], network-based biomarker selection can provide better biomarkers with higher predictive power for T1D, which may lead to better prognosis models.

TABLE 2. The Classification Accuracies for Different Methods Based on T1D Data Sets

| Dataset | Ind | Seq | CG |
|---|---|---|---|
| T1D Treatment | 62.39 | 65.69 | 65.60 |
| T1D Placebo | 59.74 | 62.57 | 62.45 |

(Ind: individual predictive power based feature selection; CG: column generation algorithm; Seq: heuristic sequential algorithm.)


3.2.4 Breast Cancer

We further evaluate our proposed network-based biomarker identification methods on a large genomic data set from a breast cancer metastasis study [40], which is referred to as the “USA” data set, as in the literature [8], [9]. The USA data set contains the gene expression profiles of 22,283 genes for 286 breast cancer patients, of whom 107 were detected with metastasis and the remaining 179 are metastasis-free. An extremely large amount of time would be needed, especially for the CG method, if we applied the previous embedded cross-validation procedure with 500 repeats of network construction and clique finding to such a large number of candidate genes. In order to compare CG with the other methods in a reasonable time, we have adopted a preprocessing step to filter out a large number of genes. To obtain a smaller set of important genes as potential biomarkers, genes are ranked by their individual predictive power, again based on the coefficient p-value in logistic regression using all the samples. Then, the top 1 percent of genes (222 genes in total) with the highest individual predictive power are kept for performance comparison on the USA data set. Table 4 provides the estimated classification accuracies for all the methods. The results clearly show that our network-based biomarker identification methods, which incorporate the interactions among candidate genes, select markers with significantly (p-values < 1e-7) better classification accuracy than the traditional feature selection based only on individual power.
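The preprocessing filter can be sketched as a simple ranking by p-value. Rounding the 1 percent cutoff down is an assumption inferred from the reported count (1 percent of 22,283 is 222.83, and 222 genes are kept); the function name and the synthetic p-values are hypothetical.

```python
import math

def top_fraction_by_pvalue(pvalues, fraction=0.01):
    """Keep the given fraction of candidates with the smallest p-values."""
    n_keep = math.floor(len(pvalues) * fraction)  # rounding down is an assumption
    ranked = sorted(pvalues, key=pvalues.get)     # smallest p-value first
    return ranked[:n_keep]

# Synthetic p-values for 300 hypothetical genes; gene0 has the smallest p-value.
toy_pvalues = {f"gene{i}": (i + 1) / 301 for i in range(300)}
kept = top_fraction_by_pvalue(toy_pvalues)
```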

To check the consistency of the selected genes among the 500 repeated runs in cross validation, we draw a frequency curve for the selected genes. Each gene may appear from 0 to 500 times among the 500 final biomarker sets. We compute the ratio of the number of genes that have appeared at least $f$ times ($1 \le f \le 500$) to the total number of genes that are selected at least once. As illustrated in Fig. 4, the ratio of repeatedly selected genes for our network-based methods is consistently higher than the corresponding ratio for the individual-based feature selection method. This demonstrates that the genes selected by network-based methods are more stable across different training sets.
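The frequency curve described above can be computed as in this small sketch, where each run contributes one set of selected genes. The function name and the four toy runs are illustrative only.

```python
from collections import Counter

def stability_curve(selected_sets):
    """For f = 1..number of runs, return the fraction of ever-selected genes
    that were selected in at least f runs."""
    counts = Counter(gene for run in selected_sets for gene in set(run))
    total = len(counts)  # genes selected at least once
    if total == 0:
        return []
    runs = len(selected_sets)
    return [sum(1 for c in counts.values() if c >= f) / total
            for f in range(1, runs + 1)]

# Four toy runs: gene A is chosen every time, B twice, C once.
toy_runs = [{"A", "B"}, {"A", "C"}, {"A", "B"}, {"A"}]
curve = stability_curve(toy_runs)
```

The curve always starts at 1 and is non-increasing; a curve that decays more slowly indicates a more stable selection.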

Finally, we provide in Fig. 5 the list of frequently selected genes, namely final biomarkers that have been selected in at least 30 percent of the 500 repeated runs from different biomarker identification methods. According to a recent study [41], protein RNF19A has been identified as a differentially expressed marker for breast cancer. The authors in [41] have shown that RNF19A is one of the functional molecules in cancer-associated fibroblasts. As shown by the feature selection results in Fig. 5, the CG method has successfully selected this marker, which demonstrates its promising potential for accurate identification of discriminating biomarkers. We have also tested the performance of those final biomarkers based on 100 repeated five-fold cross validation (without feature selection), and the corresponding estimated testing accuracies are given in Table 3. Although the difference of the obtained accuracies by different feature

TABLE 3. The Estimated Testing Classification Accuracies of Final Biomarkers Based on 100 Repeated Five-Fold Cross Validation for Different Methods

| Dataset | Ind | Seq | CG |
|---|---|---|---|
| T1D Treatment | 62.1 | 68.21 | 68.02 |
| T1D Placebo | 57.51 | 65.36 | 65.18 |
| Breast Cancer | 74.56 | 75.26 | 76.43 |

(Ind: individual predictive power based feature selection; CG: column generation algorithm; Seq: heuristic sequential algorithm.)

TABLE 4. The Classification Accuracies of Different Methods Based on Breast Cancer (USA) Data Set

| Dataset | Ind | Seq | Seq top-K | CG | CG top-K |
|---|---|---|---|---|---|
| Breast Cancer | 65.54 | 70.89 | 68.65 | 71.02 | 67.82 |

(Ind: individual predictive power based feature selection; CG: column generation algorithm; Seq: heuristic sequential algorithm.)

Fig. 4. Stability curves for breast cancer (USA) data set.

Fig. 3. Features that appeared in at least 70 percent of 500 feature selections done in experiments for Treatment (A) and Placebo (B) groups in T1D data set.

Fig. 5. Features that appeared in at least 30 percent of 500 feature selections done in experiments for breast cancer data set.


selection methods is relatively small, the improvement by our new methods, which consider interactions among biomarkers in the synergy network, is consistent and indeed statistically significant, with p-values smaller than 0.01 based on two-sample t-tests.

4 CONCLUSION

We formulate an MWMCP for biomarker identification to identify a group of disjoint cliques whose total weight is maximized. To solve MWMCP, we develop a column generation method and a fast heuristic algorithm. The preliminary results show that the developed algorithms can handle instances of different scales with high-quality solutions, and that the network-based methods are capable of identifying more accurate biomarkers by capturing extra interactive effects in comparison to individual-based feature ranking.

APPENDIX A
COMPACT FORMULATION FOR MAXIMUM WEIGHTED MULTIPLE CLIQUE PROBLEM

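\[
\begin{aligned}
\max \quad & \sum_{i} \sum_{k} p(v_i)\, X_{ik} + \sum_{i} \sum_{j>i} \sum_{k} w(e_{ij})\, Z_{ijk} \\
\text{s.t.} \quad & \sum_{k} X_{ik} \le 1 && \forall i \\
& X_{ik} + X_{jk} \le 1 && \forall i, j, k : j > i,\ e_{ij} \notin E \\
& Z_{ijk} \le \tfrac{1}{2}\,(X_{ik} + X_{jk}) && \forall i, j, k : j > i \\
& Z_{ijk} \ge X_{ik} + X_{jk} - 1 && \forall i, j, k : j > i \\
& X_{ik},\ Z_{ijk} \in \{0, 1\} && \forall i, j, k \\
& i = 1, \ldots, n;\quad j = 1, \ldots, n;\quad k = 1, \ldots, K,
\end{aligned}
\]

where variable $X_{ik}$ equals 1 if node $i$ is selected in clique $k$ and 0 otherwise, variable $Z_{ijk}$ equals 1 if edge $i$-$j$ is selected in clique $k$ and 0 otherwise, and parameter $K$ is an upper bound on the number of cliques to be selected, e.g., $K = n$.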
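To make the objective and constraints concrete, the sketch below checks that a candidate solution consists of node-disjoint cliques and returns its total node-plus-edge weight. It is a feasibility and objective checker only, not one of the solution algorithms; the function name and the toy weights are made up.

```python
from itertools import combinations

def mwmcp_objective(cliques, node_w, edge_w):
    """Return the total weight of node-disjoint cliques, or raise if infeasible."""
    seen = set()
    total = 0.0
    for clique in cliques:
        if seen & set(clique):  # each node may belong to at most one clique
            raise ValueError("cliques must be node-disjoint")
        seen |= set(clique)
        total += sum(node_w[v] for v in clique)
        for u, v in combinations(clique, 2):
            key = frozenset((u, v))
            if key not in edge_w:  # two non-adjacent nodes cannot share a clique
                raise ValueError(f"{u}-{v} is not an edge")
            total += edge_w[key]
    return total

# Toy network with four nodes; the pairs 1-4 and 2-4 are not edges.
node_w = {1: 1.0, 2: 2.0, 3: 0.5, 4: 1.5}
edge_w = {frozenset((1, 2)): 3.0, frozenset((1, 3)): 1.0,
          frozenset((2, 3)): 0.5, frozenset((3, 4)): 2.0}

two_pairs = mwmcp_objective([[1, 2], [3, 4]], node_w, edge_w)         # 10.0
triangle_plus_one = mwmcp_objective([[1, 2, 3], [4]], node_w, edge_w)  # 9.5
```

In this toy instance, splitting the nodes into two edges gives a larger total weight than taking the triangle plus an isolated node, which illustrates why several cliques are selected jointly rather than one at a time.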

ACKNOWLEDGMENTS

The project was supported in part by Award R21DK092845 from the National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health.

REFERENCES

[1] D. Thomas, “Gene–environment-wide association studies: Emerging approaches,” Nat. Rev. Genetics, vol. 11, no. 4, pp. 259–272, 2010.

[2] U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine, “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” Proc. Nat. Acad. Sci. USA, vol. 96, no. 12, pp. 6745–6750, 1999.

[3] Y. Saeys, I. Inza, and P. Larrañaga, “A review of feature selection techniques in bioinformatics,” Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.

[4] Z. Q. Tang, L. Y. Han, H. H. Lin, J. Cui, J. Jia, B. C. Low, B. W. Li, and Y. Z. Chen, “Derivation of stable microarray cancer-differentiating signatures using consensus scoring of multiple random sampling and gene-ranking consistency evaluation,” Cancer Res., vol. 67, no. 20, pp. 9996–10003, 2007.

[5] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Royal Statist. Soc. Ser. B (Methodol.), vol. 58, no. 1, pp. 267–288, 1996.

[6] S. Ma and J. Huang, “Penalized feature selection and classification in bioinformatics,” Briefings Bioinformat., vol. 9, no. 5, pp. 392–403, 2008.

[7] J. Watkinson, X. Wang, T. Zheng, and D. Anastassiou, “Identification of gene interactions associated with disease from gene expression data using synergy networks,” BMC Syst. Biol., vol. 2, no. 1, p. 10, 2008.

[8] H.-Y. Chuang, E. Lee, Y.-T. Liu, D. Lee, and T. Ideker, “Network-based classification of breast cancer metastasis,” Mol. Syst. Biol., vol. 3, no. 1, pp. 140–149, 2007.

[9] J. Su, B.-J. Yoon, and E. R. Dougherty, “Accurate and reliable cancer classification based on probabilistic inference of pathway activity,” PLoS One, vol. 4, no. 12, p. e8161, 2009.

[10] S. J. Sajjadi, A. A. Adl, B. Zeng, and X. Qian, “Finding the most discriminating sets of biomarkers by maximum weighted clique,” presented at the 6th INFORMS Workshop Data Mining Health Informatics, Charlotte, NC, USA, vol. 1500, 2011.

[11] W. F. Symmans, J. Liu, D. M. Knowles, and G. Inghirami, “Breast cancer heterogeneity: Evaluation of clonality in primary and metastatic lesions,” Human Pathol., vol. 26, no. 2, pp. 210–216, 1995.

[12] L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany, “Outcome signature genes in breast cancer: Is there a unique set?” Bioinformatics, vol. 21, no. 2, pp. 171–178, 2005.

[13] A. H. Bild, G. Yao, J. T. Chang, Q. Wang, A. Potti, D. Chasse, M.-B. Joshi, D. Harpole, J. M. Lancaster, A. Berchuck, J. A. Olson, J. R. Marks, H. K. Dressman, M. West, and J. R. Nevins, “Oncogenic pathway signatures in human cancers as a guide to targeted therapies,” Nature, vol. 439, no. 7074, pp. 353–357, 2005.

[14] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, CA, USA: Freeman, 1979.

[15] I. M. Bomze, M. Budinich, P. M. Pardalos, and M. Pelillo, “The maximum clique problem,” in Handbook of Combinatorial Optimization. New York, NY, USA: Springer, 1999, pp. 1–74.

[16] F. Vanderbeck and L. A. Wolsey, “An exact algorithm for IP column generation,” Oper. Res. Lett., vol. 19, no. 4, pp. 151–159, 1996.

[17] C. Barnhart, E. L. Johnson, G. L. Nemhauser, M. W. Savelsbergh, and P. H. Vance, “Branch-and-price: Column generation for solving huge integer programs,” Oper. Res., vol. 46, no. 3, pp. 316–329, 1998.

[18] M. E. Lübbecke and J. Desrosiers, “Selected topics in column generation,” Oper. Res., vol. 53, no. 6, pp. 1007–1023, 2005.

[19] H. Saigo, T. Uno, and K. Tsuda, “Mining complex genotypic features for predicting HIV-1 drug resistance,” Bioinformatics, vol. 23, no. 18, pp. 2455–2462, 2007.

[20] Y. Ying, K. Huang, and C. Campbell, “Enhanced protein fold recognition through a novel data integration approach,” BMC Bioinformatics, vol. 10, no. 1, p. 267, 2009.

[21] J. Håstad, “Clique is hard to approximate within $n^{1-\varepsilon}$,” Acta Math., vol. 182, no. 1, pp. 105–142, 1999.

[22] N. Sloane, “Unsolved problems in graph theory arising from the study of codes,” Graph Theory Notes New York, vol. 18, pp. 11–20, 1989.

[23] R. Horaud and T. Skordas, “Stereo correspondence through feature grouping and maximal cliques,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 11, no. 11, pp. 1168–1180, Nov. 1989.

[24] K. Corrádi and S. Szabó, “A combinatorial approach for Keller’s conjecture,” Periodica Math. Hungarica, vol. 21, no. 2, pp. 95–100, 1990.

[25] P. Berman and A. Pelc, “Distributed probabilistic fault diagnosis for multiprocessor systems,” in Proc. 20th Int. Symp. Fault-Tolerant Comput., 1990, pp. 340–346.

[26] F. S. Kuhl, G. M. Crippen, and D. K. Friesen, “A combinatorial algorithm for calculating ligand binding,” J. Comput. Chem., vol. 5, no. 1, pp. 24–34, 1984.


[27] A. Ben-Dor, R. Shamir, and Z. Yakhini, “Clustering gene expression patterns,” J. Comput. Biol., vol. 6, no. 3-4, pp. 281–297, 1999.

[28] R. Carraghan and P. M. Pardalos, “An exact algorithm for the maximum clique problem,” Oper. Res. Lett., vol. 9, no. 6, pp. 375–382, 1990.

[29] P. R. Östergård, “A new algorithm for the maximum-weight clique problem,” Nordic J. Comput., vol. 8, no. 4, pp. 424–436, 2001.

[30] D. Kumlander, “A new exact algorithm for the maximum-weight clique problem based on a heuristic vertex-coloring and a backtrack search,” in Proc. 5th Int. Conf. Model., Comput. Optim. Inf. Syst. Manage. Sci., 2004, pp. 202–208.

[31] D. Brélaz, “New methods to color the vertices of a graph,” Commun. ACM, vol. 22, no. 4, pp. 251–256, 1979.

[32] A. Mehrotra and M. A. Trick, “Cliques and clustering: A combinatorial approach,” Oper. Res. Lett., vol. 22, no. 1, pp. 1–12, 1998.

[33] R. Chen, G. I. Mias, J. Li-Pook-Than, L. Jiang, H. Y. Lam, R. Chen, E. Miriami, K. J. Karczewski, M. Hariharan, F. E. Dewey, Y. Cheng, M. J. Clark, H. Im, L. Habegger, S. Balasubramanian, M. O’Huallachain, J. T. Dudley, S. Hillenmeyer, R. Haraksingh, D. Sharon, G. Euskirchen, P. Lacroute, K. Bettinger, A. P. Boyle, M. Kasowski, F. Grubert, S. Seki, M. Garcia, M. Whirl-Carrillo, M. Gallardo, M. A. Blasco, P. L. Greenberg, P. Snyder, T. E. Klein, R. B. Altman, A. J. Butte, E. A. Ashley, M. Gerstein, K. C. Nadeau, H. Tang, and M. Snyder, “Personal omics profiling reveals dynamic molecular and medical phenotypes,” Cell, vol. 148, no. 6, pp. 1293–1307, 2012.

[34] P. Erdős and A. Rényi, “On the evolution of random graphs,” Publ. Math. Inst. Hungarian Acad. Sci., vol. 5, pp. 17–61, 1960.

[35] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, p. 27, 2011.

[36] P. Xu, Y. Wu, Y. Zhu, G. Dagne, G. Johnson, D. Cuthbertson, J. P. Krischer, J. M. Sosenko, J. S. Skyler, and on behalf of the Diabetes Prevention Trial-Type 1 (DPT-1) Study Group, “Prognostic performance of metabolic indexes in predicting onset of type 1 diabetes,” Diabetes Care, vol. 33, no. 12, pp. 2508–2513, 2010.

[37] J. P. Krischer, D. D. Cuthbertson, L. Yu, T. Orban, N. Maclaren, R. Jackson, W. E. Winter, D. A. Schatz, J. P. Palmer, G. S. Eisenbarth, and the Diabetes Prevention Trial-Type 1 Study Group, “Screening strategies for the identification of multiple antibody-positive relatives of individuals with type 1 diabetes,” J. Clin. Endocrinol. Metabolism, vol. 88, no. 1, pp. 103–108, 2003.

[38] J. M. Sosenko, J. P. Palmer, C. J. Greenbaum, J. Mahon, C. Cowie, J. P. Krischer, H. P. Chase, N. H. White, B. Buckingham, K. C. Herold, D. Cuthbertson, J. S. Skyler, and the Diabetes Prevention Trial-Type 1 Study Group, “Increasing the accuracy of oral glucose tolerance testing and extending its application to individuals with normal glucose tolerance for the prediction of type 1 diabetes: The diabetes prevention trial-type 1,” Diabetes Care, vol. 30, no. 1, pp. 38–42, 2007.

[39] American Diabetes Association, “Diagnosis and classification of diabetes mellitus,” Diabetes Care, vol. 36, no. Suppl 1, pp. S67–S74, 2013.

[40] Y. Wang, J. G. Klijn, Y. Zhang, A. M. Sieuwerts, M. P. Look, F. Yang, D. Talantov, M. Timmermans, M. E. Meijer-van Gelder, J. Yu, T. Jatkoe, E. M. Berns, D. Atkins, and J. A. Foekens, “Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer,” The Lancet, vol. 365, no. 9460, pp. 671–679, 2005.

[41] B. Bozóky, A. Savchenko, P. Csermely, T. Korcsmáros, Z. Dúl, F. Pontén, L. Székely, and G. Klein, “Novel signatures of cancer-associated fibroblasts,” Int. J. Cancer, vol. 133, pp. 286–293, 2013.

Seyed Javad Sajjadi received the BS degree in statistics from the University of Tehran, and the MS degree in industrial engineering from Sharif University of Technology. He is currently working toward the PhD degree in industrial engineering at the University of South Florida. His research interests include operations research applications in data mining and data analysis, large-scale network optimization in computational biology and bioinformatics, and integer programming and combinatorial optimization.

Xiaoning Qian (S’01-M’07) received the PhD degree in electrical engineering from Yale University, New Haven, CT, in 2005. Currently, he is an assistant professor with the Department of Electrical & Computer Engineering, Texas A&M University, College Station, TX. He also is a courtesy assistant professor in the Department of Computer Science & Engineering and the Department of Pediatrics at the University of South Florida, Tampa, FL, in which he spent four years before joining Texas A&M. He was with the Bioinformatics Training Program at Texas A&M University, sponsored by the National Cancer Institute (NCI). His current research interests include computational network biology, genomic signal processing, and biomedical image analysis. He is a member of the IEEE.

Bo Zeng (M’11) received the PhD degree in industrial engineering with an emphasis on operations research from Purdue University, West Lafayette, IN, in 2007. Currently, he is an assistant professor in the Department of Industrial and Management Systems Engineering at the University of South Florida, Tampa, FL. His research interests are algorithm development for large-scale discrete optimization problems, and stochastic and robust optimization, coupled with applications in health informatics and power systems. He is a member of the IEEE, SIAM and INFORMS.

Amin Ahmadi Adl received the BS and MS degrees in computer science from the University of Tehran. He is currently working toward the PhD degree in computer science and engineering at the University of South Florida. His research interests include computational network biology, genomic signal processing, and combinatorial graph algorithms.