Pacific Symposium on Biocomputing 2016

A FRAMEWORK FOR ATTRIBUTE-BASED COMMUNITY DETECTION WITH APPLICATIONS TO INTEGRATED FUNCTIONAL GENOMICS

HAN YU

Biostatistics, University at Buffalo, Buffalo, NY 14220/EST, USA

E-mail: hyu9@buffalo.edu

RACHAEL HAGEMAN BLAIR

Biostatistics, University at Buffalo, Buffalo, NY 14220/EST, USA

E-mail: hageman@buffalo.edu

Understanding community structure in networks has received considerable attention in recent years. Detecting and leveraging community structure holds promise for understanding and potentially intervening with the spread of influence. Network features of this type have important implications in a number of research areas, including marketing, social networks, and biology. However, an overwhelming majority of traditional approaches to community detection cannot readily incorporate information of node attributes. Integrating structural and attribute information is a major challenge. We propose a flexible iterative method, inverse regularized Markov Clustering (irMCL), for network clustering via the manipulation of the transition probability matrix (aka stochastic flow) corresponding to a graph. Similar to traditional Markov Clustering, irMCL iterates between “expand” and “inflate” operations, which aim to strengthen the intra-cluster flow while weakening the inter-cluster flow. Attribute information is directly incorporated into the iterative method through a sigmoid (logistic function) that naturally dampens attribute influence that is contradictory to the stochastic flow through the network. We demonstrate advantages and the flexibility of our approach using simulations and real data. We highlight an application that integrates a breast cancer gene expression data set and a functional network defined via KEGG pathways to reveal significant modules for survival.

Keywords: KEGG pathways, logistic regression, community detection, Markov clustering, omics, survival

1. Introduction

Community structure occurs when nodes exhibit a high degree of connectivity to each other, and a lower degree of connectivity to other groups and nodes in the network.1,2 The community detection problem has been studied extensively in Social Network Analysis (SNA). In the areas of bioinformatics and computational biology, the problem is also referred to as module detection or graph clustering.3,4

In a general sense, the community detection problem can be viewed as the clustering of a network. Classical graph clustering methods include the Kernighan-Lin algorithm,5 hierarchical clustering methods,6 spectral clustering,7,8 and the Newman and Girvan algorithm.9,10 Modularity-based algorithms comprise an important class of community detection methods.11–13 Classical approaches to community detection cannot readily incorporate information of node attributes and rely solely on network structures. The simultaneous use of attribute and connectivity information can yield more accurate results and can be leveraged in downstream analysis for prediction under attribute or network perturbations. Hanisch et al. derive the distance matrix by combining the structural and gene profile distances, but require prior domain knowledge.14 Zhou et al. represent attributes as additional nodes.15 In this setting, attributes are restricted to discrete values; consequently, the size and complexity of the graph grow, and the different types of nodes and edges must be accounted for.16 Instead of graph partitioning, the algorithms of CoPaM17 and DME18 introduce the problem of identifying cohesive patterns or subnetworks satisfying a density threshold and cohesive constraints.

We have developed a novel community detection method that relies on stochastic flow in networks. Leveraging robust statistical classification methods, we bridge and simultaneously model the attribute and structural space. The methods that we propose are highly generalizable and flexible in their implementation. We showcase their flexibility through simulation and an application that integrates a breast cancer gene expression data set with KEGG ontologies and survival data.

2. Materials and Methods

Briefly, we begin by outlining the Markov CLustering (MCL) and regularized Markov CLustering (rMCL) frameworks, which set the foundation of our approaches.19,20 MCL is based on the notion that if a group of nodes belongs to the same community, then the stochastic flow from these nodes will be concentrated towards nodes in that community.19 Performing random walks on a graph may reveal where flows gather, which suggests potential communities. In this setting, our focus is on undirected graphs, which have a symmetric adjacency matrix and have edge interpretations of association (not causation).

MCL algorithms depend on the iteration between two operators, expand and inflate, until convergence, in order to identify communities in the network. Markov clustering utilizes a stochastic matrix that is initially derived from the adjacency matrix, $A_{adj} \in \mathbb{R}^{n \times n}$, of the graph. The stochastic matrix is defined as the matrix product $M = A_0 \cdot D^{-1}$, where $A_0 = A_{adj} + I$, and $D \in \mathbb{R}^{n \times n}$ is the diagonal matrix containing the degree information for each node, $D(k,k) = \sum_{i=1}^{n} A_0(i,k)$. The operations in MCL and rMCL utilize the stochastic matrix, $M$, which has columns that can be interpreted as transition probabilities. In the classic MCL, the expand step at the $(j+1)$th iteration requires a matrix product $M_{j+1} = M_j \cdot M_j$. The inflate operator, $M^{inf}_{j+1} = \mathrm{Inflate}(M_{j+1}, r)$, can be understood as the component-wise exponentiation $m(i,j)^r$, $\forall\, i,j = 1,\dots,n$, where the inflation parameter, $r$, is a constant. Following inflation, $M^{inf}_{j+1}$ is converted to a stochastic matrix, $M_{j+1}$, and a new iteration is started. Importantly, the expand operator alone would give rise to a Markov Chain via a random walk on the graph. However, due to the inflation operator the process cannot be regarded as a Markov Chain. Inflation is critical to accentuate strong ties and paths, and deemphasize weak ones. The inflation constant, $r$, controls the degree to which this strengthening and weakening is enforced, and has a direct impact on the cluster formation. Upon convergence of MCL to steady-state, the stochastic matrix can be understood in terms of attractors. The matrix is sparse, and the attractors have at least one positive value in their row. The indices of these positive values, together with the attractor, form the community.
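The expand and inflate iteration described above can be sketched in a few lines of pure Python. This is a minimal illustration rather than the authors' implementation: the function names, the default `r = 2`, the stopping tolerance, and the threshold `eps` used to read clusters off the attractor rows are all assumptions made for the example.

```python
def normalize_columns(m):
    """Rescale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][k] for i in range(n)) for k in range(n)]
    return [[m[i][k] / sums[k] for k in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two n x n lists of lists."""
    n = len(a)
    return [[sum(a[i][t] * b[t][k] for t in range(n)) for k in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Component-wise power followed by column re-normalization."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(adj, r=2.0, max_iter=100, tol=1e-9):
    """Classic MCL: M = A_0 D^{-1}, then alternate expand and inflate."""
    n = len(adj)
    # A_0 = A_adj + I (self loops), as in the text.
    a0 = [[adj[i][k] + (1.0 if i == k else 0.0) for k in range(n)]
          for i in range(n)]
    m = normalize_columns(a0)
    for _ in range(max_iter):
        new = inflate(matmul(m, m), r)  # expand, then inflate
        diff = max(abs(new[i][k] - m[i][k])
                   for i in range(n) for k in range(n))
        m = new
        if diff < tol:  # assumed stopping rule for this sketch
            break
    return m


def clusters_from_attractors(m, eps=1e-6):
    """Rows with positive entries are attractors; their columns form a cluster."""
    n = len(m)
    found = set()
    for i in range(n):
        members = frozenset(k for k in range(n) if m[i][k] > eps)
        if members:
            found.add(members)
    return sorted(sorted(c) for c in found)
```

For two disconnected triangles, `clusters_from_attractors(mcl(adj))` returns the two triangles as separate communities, because no flow can cross between the components.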

A regularized version of Markov Clustering, rMCL, was proposed and has been shown to overcome some fragmentation issues in the communities. The rMCL algorithm follows the same iterative approach, with the expand step replaced by a regularization operation, $M_{j+1} = M_j \cdot M_0$, where $M_0$ is the initial stochastic matrix formed from the network adjacency matrix.20 The regularize step ensures that the original structural information is still utilized for the graph clustering process after the first iteration. Unfortunately, the regularized MCL does not naturally converge to a steady state with the same desirable interpretations in terms of community membership. In order to achieve this, a prune step is added at each iteration that forces some smaller entries of the stochastic matrix to zero using a heuristic threshold. The pruning aims to eliminate entries that are small relative to other entries in the matrix.20

2.1. inverse regularized Markov Clustering (irMCL)

We propose a flexible method, inverse regularized Markov CLustering (irMCL), which utilizes the expand and inflate operators, but relies on an alternative concept of community that emphasizes the spreading of influence or information in a non-exclusive manner. Our approach relies on the following modeling assumptions:

(A1) Spreading of information/influence from Node $i$ to Node $j$ will not affect that from Node $i$ to other nodes, $k \neq j$.
(A2) Nodes in the same community are influenced by, or share information from, a similar group of nodes.
(A3) Nodes with larger degrees tend to be more influential.
(A4) If an individual is highly influenced by a group of nodes, such influence tends to be self-amplified.
(A5) Spread of information between nodes with similar attributes is easier, and thus should be a function of the attribute similarity measures between nodes.

In this model, the community membership of a node is measured by the information that flows into the node, as opposed to MCL and rMCL, where the feature is the stochastic flow that exits the node. Accordingly, we term this procedure “inverse regularized Markov Clustering” (irMCL). These assumptions naturally give higher weights to nodes in the network with high degrees and naturally incorporate attribute information in a flexible manner. Similar to MCL, we denote $A_{adj} \in \mathbb{R}^{n \times n}$ as the adjacency matrix of graph $G$. We define a symmetric spread matrix as $A = A_{adj} + I$, which defines the graph with the addition of self loops.

Algorithm 1 shows the full details of the irMCL approach. At each iteration, the initial spread matrix is used to regularize. Repeated use of the spread matrix naturally puts more weight on the high degree nodes in the network (A3), and is unique to our approach. The same inflation operator as in MCL is used according to assumption (A4). Convergence is tracked empirically by examining the mean squared difference between $M_j$ and $M_{j-1}$, defined as $\sum_{i=1}^{n}\sum_{k=1}^{n}\left(m^{(j)}_{ik} - m^{(j-1)}_{ik}\right)^2 / n$, where $m^{(j)}_{ik}$ is the $(i,k)$ entry of $M_j$.

The output of this iterative method is a stochastic matrix, where the rows with high similarity are likely to belong to the same community. In our applications, we utilize hierarchical clustering with complete linkage, and estimate the similarity using a Euclidean distance. Silhouette plots are utilized for the determination of the number of clusters via average silhouette width.21


Algorithm 2.1 Feature derivation for inverse Regularized Markov Clustering (irMCL)

Initialize:
    $A_{adj} \in \mathbb{R}^{n \times n}$ (adjacency matrix)
    $A_0 = A_{adj} + I$
    for $k = 1$ to $n$ do
        $D_0(k,k) = \sum_{i=1}^{n} A_0(i,k)$
    end for
    set: $r > 1$

Repeat until stopping criteria is met:
    for $j = 1$ to $m$ do
        $M_j \leftarrow M_{j-1} \cdot A_0$
        $M^{infl}_j = \mathrm{Inflate}(M_j, r)$
        for $k = 1$ to $n$ do
            $D_j(k,k) = \sum_{i=1}^{n} M^{infl}_j(i,k)$
        end for
        $M_j = M^{infl}_j \cdot D_j^{-1}$
    end for

Output: $M_j$ for row clustering
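The feature-derivation loop of the algorithm can be sketched in pure Python as follows. This is an illustrative reading, not the authors' R code. In particular, the starting matrix $M_0 = A_0 \cdot D_0^{-1}$ is an assumption (the algorithm computes $D_0$ but does not state $M_0$ explicitly), and the function name, tolerance, and iteration cap are hypothetical.

```python
def col_normalize(m):
    """Divide each column by its sum, i.e. multiply by D^{-1} on the right."""
    n = len(m)
    sums = [sum(m[i][k] for i in range(n)) for k in range(n)]
    return [[m[i][k] / sums[k] for k in range(n)] for i in range(n)]


def irmcl(adj, r=2.0, tol=1e-8, max_iter=200):
    """Return (M, iterations) after regularize/inflate/normalize steps."""
    n = len(adj)
    # Spread matrix with self loops: A_0 = A_adj + I.
    a0 = [[adj[i][k] + (1.0 if i == k else 0.0) for k in range(n)]
          for i in range(n)]
    # Assumption: M_0 = A_0 D_0^{-1}, the initial stochastic matrix.
    m = col_normalize(a0)
    iterations = 0
    for j in range(1, max_iter + 1):
        iterations = j
        # Regularize with the initial spread matrix: M_j <- M_{j-1} A_0.
        prod = [[sum(m[i][t] * a0[t][k] for t in range(n)) for k in range(n)]
                for i in range(n)]
        # Inflate component-wise, then rescale the columns to sum to 1.
        new = col_normalize([[x ** r for x in row] for row in prod])
        # Mean squared difference between successive iterates (divided by n).
        msd = sum((new[i][k] - m[i][k]) ** 2
                  for i in range(n) for k in range(n)) / n
        m = new
        if msd < tol:
            break
    return m, iterations
```

The rows of the returned matrix are the features that would then be passed to a row clustering step, such as hierarchical clustering on Euclidean distances.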

2.2. attribute inverse regularized Markov Clustering (airMCL)

The irMCL algorithm is based solely on network connectivity. We propose a natural extension for clustering of networks that contain nodes with heterogeneous attributes. In this setting, we use the term attribute loosely to define features of the nodes. In the biological context, this could include, for example, a measurement of a phenotype, gene expression, or demographic information. The term heterogeneous is used to describe the set of attributes defined on the network, which can be continuous or categorical. We call this method attribute inverse regularized Markov CLustering (airMCL), because it connects the inverse regularized Markov Clustering (irMCL) approach with statistical classification methods, for the purpose of community detection in attributed networks.

The link between irMCL and statistical classification is achieved through the use of multiple logistic regression, in which the vectorized structure of the network is regressed on the attribute information.22 This approach gives rise to a probabilistic estimate of the association between network structure and attributes directly, which is embedded into the weights for edges in the spread matrix for Algorithm 1. Specifically, airMCL relies on vectorized versions of distance matrices, which reflect the similarity (or lack thereof) between individuals for an attribute or set of attributes. The distance matrix, $D \in \mathbb{R}^{n \times n}$, is symmetric, and the entries $d(i,j) = d(j,i)$ convey the similarity between nodes $i$ and $j$ for a given set of attributes. Consequently, vectorizing the strict upper triangular portion (not including the diagonal) of these matrices maps the pairwise information between nodes and attributes into a vectorized space. This set of vectors forms the set of predictors for the logistic regression modeling.

More formally, let $Z_k$ be the vectorized strict upper triangular region of $D_k$, obtained in the same way as the vectorization of $A_{adj}$. The logistic model is defined as:

$$\log\left(\frac{\Pr(Y = 1 \mid Z)}{1 - \Pr(Y = 1 \mid Z)}\right) = \beta_0 + \sum_{k=1}^{p} \beta_k Z_k, \qquad (1)$$

where $Y$ is the vectorized adjacency matrix (the response), $\beta_0$ is an intercept term, and $\beta_1, \dots, \beta_p$ are the regression coefficients for the vectorized attributes. The left hand side of Equation 1 is the log-odds. We can directly estimate the odds ratio using the estimated coefficients $\beta$ for each pairwise relationship: $w = \exp\left(\sum_{k=1}^{p} \beta_k Z_k\right)$, which is embedded into the weights for edges in the spread matrix for Algorithm 1.

Implementations of rMCL and airMCL are performed in the R programming language (https://www.r-project.org/). A library airMCL that implements these algorithms will be made available in the CRAN repository upon publication.
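As a small worked example of the vectorization and of the weight $w = \exp\left(\sum_k \beta_k Z_k\right)$, the Python sketch below computes one weight per node pair. The coefficients passed in are hypothetical placeholders; fitting the logistic regression itself and the exact way the weights enter the spread matrix are not shown here and are not specified in more detail in the text.

```python
import math


def vectorize_upper(mat):
    """Strict upper triangle (no diagonal), read row by row."""
    n = len(mat)
    return [mat[i][j] for i in range(n) for j in range(i + 1, n)]


def pair_weights(distance_mats, betas):
    """w = exp(sum_k beta_k * Z_k) for every node pair.

    distance_mats: list of p symmetric n x n distance matrices D_k.
    betas: list of p coefficients (beta_1..beta_p); hypothetical values here,
           in practice they would come from the fitted logistic model.
    """
    zs = [vectorize_upper(d) for d in distance_mats]
    n_pairs = len(zs[0])
    return [math.exp(sum(b * z[p] for b, z in zip(betas, zs)))
            for p in range(n_pairs)]


# Toy example: one attribute, three nodes, assumed coefficient beta_1 = -1.
d1 = [[0.0, 0.0, 2.0],
      [0.0, 0.0, 1.0],
      [2.0, 1.0, 0.0]]
weights = pair_weights([d1], [-1.0])  # pairs (0,1), (0,2), (1,2)
```

With a negative coefficient, pairs that are far apart in attribute space receive weights below 1, while pairs with identical attributes keep a weight of exactly 1.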

2.3. Simulations

We examine the performance of irMCL and airMCL using a variety of network simulations following the general framework proposed by Girvan and Newman.9 In our simulations, we consider networks containing 128 nodes that are divided into four communities of 32 nodes each. Vertices are connected independently and randomly with a probability $P_{in}$ for those within the same community, and $P_{out}$ for vertices in different communities ($P_{out} < P_{in}$). The probabilities are selected such that the average degree of a vertex is 16. The expected number of links from a vertex to vertices in other communities is defined as $z_{out}$, while the expected number of links from a vertex to vertices in its own community is defined as $z_{in}$. Note that the community structure is less defined (weak) when $z_{out}$ is larger.
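A network of this kind can be generated with a short planted-partition sketch. The conversion from $z_{out}$ to link probabilities assumes $z_{in} = 16 - z_{out}$, $P_{in} = z_{in}/31$, and $P_{out} = z_{out}/96$, which follows from the stated average degree and community sizes; the function name and the seed are illustrative.

```python
import random


def planted_partition(z_out, n=128, n_groups=4, avg_degree=16, seed=0):
    """Return (adjacency, labels, p_in, p_out) for a four-community benchmark."""
    rng = random.Random(seed)
    size = n // n_groups                 # 32 nodes per community
    z_in = avg_degree - z_out            # expected within-community links
    p_in = z_in / (size - 1)             # 31 possible partners inside
    p_out = z_out / (n - size)           # 96 possible partners outside
    labels = [i // size for i in range(n)]
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1  # undirected, symmetric
    return adj, labels, p_in, p_out


adj, labels, p_in, p_out = planted_partition(z_out=6)
mean_degree = sum(map(sum, adj)) / len(adj)  # should be close to 16
```

Increasing `z_out` towards 8 makes within-community and between-community degrees comparable, which is the weak-structure regime discussed in the text.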

Within simulations of different connectivity patterns, we examined single continuous and categorical attributes, as well as their combination. Categorical attributes in the $i$th group were generated from a multinomial distribution:

$$p(X = x) = \begin{cases} p, & x = i \\ \dfrac{1 - p}{3}, & x \in \{1, 2, 3, 4\} \setminus \{i\} \end{cases}$$

The values of $p$ were set to 0.9, 0.6, and 0.3 to mimic strong, moderate, and weak associations to the network structure, respectively. Note that when $p$ takes a large value (0.9), the attribute $X$ is highly homogeneous within communities. When $p$ is small, however, $X$ has high variability within each group, and will be less informative for the purpose of community detection.

A normal distribution, $N(\mu_i, 1)$, was used for continuous attributes of group $i$. The difference between means of consecutive groups, $\Delta\mu = \mu_{i+1} - \mu_i$, was set at 4, 2, or 0.5, to convey strong, moderate, and weak levels of association, respectively, between structural and attribute information. Within the simulation framework, we also set out to determine how sensitive our methods are to noise in the network in the form of missing links. For each scenario, we performed community detection on the full network, and on networks with up to 30% of their links missing at random. We compared our methods, airMCL and irMCL, with rMCL and a fast-greedy method.11 We also examined an irMCL-adhoc method, which can only be applied to networks with a single categorical attribute. In this setting, irMCL-adhoc assigns a fixed weight of 0.5 when the two nodes have different attribute values, regardless of the structural relevance.
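The attribute generation and the removal of links at random can be sketched as below. Group labels are coded 0 to 3 instead of 1 to 4, the first group mean is set to 0 (only the differences $\Delta\mu$ are given in the text), and all function names are illustrative assumptions.

```python
import random


def categorical_attribute(labels, p, n_levels=4, rng=None):
    """Own group label with probability p, otherwise one of the other levels."""
    if rng is None:
        rng = random.Random(0)
    values = []
    for g in labels:
        if rng.random() < p:
            values.append(g)
        else:
            # Remaining mass (1 - p) is split evenly over the other levels.
            values.append(rng.choice([x for x in range(n_levels) if x != g]))
    return values


def continuous_attribute(labels, delta_mu, rng=None):
    """Draw from N(mu_g, 1) with mu_g = g * delta_mu (assumes mu_0 = 0)."""
    if rng is None:
        rng = random.Random(0)
    return [rng.gauss(g * delta_mu, 1.0) for g in labels]


def drop_links(adj, fraction, rng=None):
    """Remove the given fraction of existing links uniformly at random."""
    if rng is None:
        rng = random.Random(0)
    n = len(adj)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if adj[i][j]]
    removed = rng.sample(edges, int(round(fraction * len(edges))))
    thinned = [row[:] for row in adj]
    for i, j in removed:
        thinned[i][j] = thinned[j][i] = 0
    return thinned
```

For example, `drop_links(adj, 0.3)` corresponds to the most severe missing-link setting considered in the simulations.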

Mixed attributes were also explored for different combinations of continuous and categorical levels of association. The mixed attribute simulations described previously were also carried out to explore performance for networks varying from well defined communities (small $z_{out}$) to poorly defined communities (large $z_{out}$). Clustering by attribute information alone is also performed. For continuous attributes, Euclidean distance and hierarchical clustering with complete linkage are used. For a categorical attribute, the attribute value is directly used as the cluster label. For the combination of two heterogeneous attributes, the larger average performance between continuous and categorical is used, because they cannot be combined for clustering.

Performance is assessed using the Adjusted Rand Index (ARI) as a measure of agreement between two data clusterings.23,24 Let $S$ be a set of $n$ elements and consider two partitions of $S$ to compare, $X = \{X_1, \dots, X_r\}$ and $Y = \{Y_1, \dots, Y_s\}$. The ARI assumes the generalized hypergeometric distribution as the model of randomness, where the two partitions are picked at random such that the number of classes and clusters are fixed.24 Specifically, letting $n_{ij}$ denote the number of objects in common between $X_i$ and $Y_j$, $a_i = \sum_j n_{ij}$, and $b_j = \sum_i n_{ij}$, the ARI is defined as:24

$$\mathrm{ARI} = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}}.$$

For each parameter setting, 100 simulated networks are tested and the standard error is calculated.
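The ARI formula translates directly into code. The sketch below takes two label vectors of equal length and builds the contingency counts $n_{ij}$, $a_i$, and $b_j$ with counters; the function name is illustrative, and the degenerate case in which the denominator is zero (for example, both partitions consisting of a single cluster) is deliberately left unhandled.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(x, y):
    """ARI between two clusterings given as label sequences."""
    n = len(x)
    n_ij = Counter(zip(x, y))   # contingency table entries
    a = Counter(x)              # row sums a_i
    b = Counter(y)              # column sums b_j
    sum_ij = sum(comb(c, 2) for c in n_ij.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])     # 1.0
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # -0.5
```

The first call shows that relabeling the clusters does not change the index; the second shows that the ARI can be negative when agreement is worse than expected by chance.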

2.4. Application to functional genomics

We applied the airMCL method to a breast cancer microarray dataset by Van Der Vijver et al.25 The data was obtained from the package seventyGenesData available in Bioconductor (https://www.bioconductor.org/). Our objective was to infer communities using airMCL and identify those which relate to survival. Briefly, the data consists of 295 tumor samples from 295 women with breast cancer. Survival data was also made available for each patient in this population. The duration for survival analysis in this study is Time To Metastasis (TTM). In this study, there were 101 metastasis events and 194 censored data points.

The input to airMCL requires specification of an adjacency matrix for a corresponding network and a set of attributes. In our application, we define the network using the KEGG database.26 The 24,496 transcripts in the dataset were mapped to KEGG pathways using Entrez gene identifiers with the Bioconductor annotation package KEGG.db. In order to obtain a 1:1 mapping, when several transcripts mapped to a gene, the one with the most variation across the samples was retained for the modeling. After mapping, the data set consisted of 295 samples and 4,715 genes that represent nodes in the network. Transcript abundance was represented by the log10 of the ratio between each sample and the reference RNA.25 The adjacency matrix (input) was determined through a pathway-based gene network that was formed by placing links between genes when they are present in the same KEGG pathway. The functional network consists of 4,715 nodes (genes) and 883,557 edges.
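The two preprocessing steps described above, keeping one transcript per gene and linking genes that share a pathway, can be sketched with toy inputs. The data structures, function names, and the use of variance as the measure of "most variation" are assumptions for illustration; the gene and pathway identifiers are placeholders, not entries from KEGG.

```python
from itertools import combinations
from statistics import pvariance


def pick_most_variable(transcripts_by_gene, expr):
    """For each gene keep the transcript with the largest variance."""
    return {gene: max(ts, key=lambda t: pvariance(expr[t]))
            for gene, ts in transcripts_by_gene.items()}


def pathway_adjacency(pathways):
    """Link two genes whenever they appear together in at least one pathway."""
    genes = sorted({g for members in pathways.values() for g in members})
    index = {g: i for i, g in enumerate(genes)}
    n = len(genes)
    adj = [[0] * n for _ in range(n)]
    for members in pathways.values():
        for g, h in combinations(sorted(set(members)), 2):
            adj[index[g]][index[h]] = adj[index[h]][index[g]] = 1
    return genes, adj


# Toy inputs (placeholder identifiers).
expr = {"t1": [1.0, 1.0, 1.0], "t2": [0.0, 5.0, 10.0]}
chosen = pick_most_variable({"geneG": ["t1", "t2"]}, expr)
genes, adj = pathway_adjacency({"pw1": ["A", "B", "C"], "pw2": ["C", "D"]})
n_edges = sum(map(sum, adj)) // 2
```

In the toy example gene C belongs to both pathways, so it is linked to A, B, and D, while A and D remain unlinked.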

Node attributes for the airMCL are defined through a measure of dissimilarity of the gene expression data. Several dissimilarity options are feasible and we expand on this point in the discussion. The dissimilarity measure is defined as $d_{i,j} = 1 - |r_{i,j}|$, where $r_{i,j}$ is the Pearson correlation coefficient between the $i$th and $j$th genes. Logistic regression models are fit using the vectorized pairwise dissimilarity as the predictor, and the vectorized adjacency matrix (1 for linked, 0 for unlinked pairs) as the response variable. However, the gene network has 4,715 nodes, implying more than 11 million observations in the regression. Moreover, the sparsity of the network gives rise to a severe class imbalance. To alleviate the computational complexity and address the imbalance, we randomly selected the unlinked node pairs so as to have the same number as that of the edges.
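The dissimilarity $d_{i,j} = 1 - |r_{i,j}|$ and the balanced selection of unlinked pairs can be sketched as follows. The function names and the fixed seed are illustrative, and the logistic regression fit that would consume these pairs is not shown.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def dissimilarity(x, y):
    """d = 1 - |r|: small when two expression profiles are strongly correlated."""
    return 1.0 - abs(pearson(x, y))


def balanced_pairs(adj, seed=0):
    """All linked pairs plus an equally sized random sample of unlinked pairs."""
    rng = random.Random(seed)
    n = len(adj)
    linked, unlinked = [], []
    for i in range(n):
        for j in range(i + 1, n):
            (linked if adj[i][j] else unlinked).append((i, j))
    sampled = rng.sample(unlinked, min(len(linked), len(unlinked)))
    return linked, sampled
```

Because the absolute value of the correlation is used, strongly anti-correlated genes are treated as similar, just like strongly co-expressed genes.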

Survival analysis is performed on TTM using a Cox proportional hazard model.27 The Benjamini and Hochberg method was used to control the false discovery rate.28 A threshold of P-value < 0.05 was used to identify modules whose overall expression levels are significantly associated with the time to metastasis. Kaplan-Meier estimates were calculated for each significant module based on stratification of the 295 patients into two groups, using the median overall expression levels of the module. Specifically, $w_{kl} = \frac{1}{m_l} \sum_{i \in c_l} z_{ik}$, where $w_{kl}$ is the average expression level of the $l$th module for the $k$th patient, $z_{ik}$ is the expression level of gene $i$ for patient $k$, $c_l$ is the set of node indices of the $l$th module, and $m_l$ is the number of nodes in this module.
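The module score $w_{kl}$, the median split, and the Benjamini and Hochberg adjustment can be sketched as below. The Cox model and the Kaplan-Meier estimator themselves are not reproduced; the function names and the rule that scores equal to the median fall into the low group are assumptions of this example.

```python
from statistics import median


def module_scores(expr, module):
    """w_kl: mean expression over the genes of one module, per patient.

    expr[i][k] is the expression z_ik of gene i for patient k;
    module is the list of gene indices c_l.
    """
    n_patients = len(expr[module[0]])
    return [sum(expr[i][k] for i in module) / len(module)
            for k in range(n_patients)]


def median_split(scores):
    """Label each patient 'high' or 'low' relative to the median score."""
    cut = median(scores)
    return ["high" if s > cut else "low" for s in scores]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted


expr = [[1.0, 2.0, 3.0, 4.0],
        [3.0, 4.0, 5.0, 6.0],
        [0.0, 0.0, 0.0, 0.0]]
scores = module_scores(expr, [0, 1])   # module made of genes 0 and 1
groups = median_split(scores)
```

The two resulting patient groups are what a log-rank test or Kaplan-Meier plot would then compare for each module.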

3. Results

Each simulation was run to convergence. Some general trends persisted for the different parameter and attribute simulations (Figure 1). The overall performance of rMCL was poor, but relatively stable across missing links and different levels of association between structure and attribute. This was the case for categorical, continuous, and mixed attribute settings. When the attribute associations are moderate and weak, fast-greedy shows advantages over the other methods when the proportion of missing links is larger (Figure 1B-C,E-F).

When a categorical attribute is highly relevant to the true groups ($p = 0.9$), the inclusion of attribute information significantly improved the performance (Figure 1A). In this case, the airMCL and post-hoc weighting were both useful in boosting performance. The performance of post-hoc weighting degrades as the attribute association weakens (Figures 1B-C). For continuous attributes, the airMCL is superior for strong associations across all levels of missing links (Figure 1D), and is the top performer for moderate association with fewer missing links (Figure 1E). When the associations are weak for continuous attributes, airMCL is competitive with irMCL for scenarios with few missing links (Figure 1F). In simulations with multiple heterogeneous attributes (Figure 1G-I), the airMCL successfully extracts the structurally relevant information and improves the performance over clustering using structural information only (irMCL).

Tuning the parameter $z_{out}$ in the simulations enables us to test the performance of our approaches in scenarios where the communities are not well defined. The performance of irMCL is comparable to the fast-greedy algorithm, and actually slightly outperforms fast-greedy when $z_{out}$ ranges from 1 to 6 (Figure 2A-C). In our simulations, large $z_{out}$ represents networks in which there is poor community structure. The airMCL’s use of attributes offsets this poor structure, and airMCL is the top-performing method in these extreme scenarios.

[Figure 1: Performance on Simulated Data. Rows of panels: Categorical Attribute, A) p=0.9 (strong), B) p=0.6 (moderate), C) p=0.3 (weak); Continuous Attribute, D) Δμ=4 (strong), E) Δμ=2 (moderate), F) Δμ=0.5 (weak); Mixed Attributes, G) p=0.9 (strong) and Δμ=4 (strong), H) p=0.9 (strong) and Δμ=0.5 (weak), I) p=0.3 (weak) and Δμ=4 (strong). x-axis: unobserved links (0.0 to 0.3); y-axis: adjusted Rand index (0.0 to 0.9). Methods in legend: airMCL, fast greedy, irMCL, irMCL-post hoc, rMCL.]

Fig. 1. Simulation results for community detection for a categorical attribute (top row), a continuous attribute (second row), and a mixture of continuous and categorical attributes (third row). Relationships between categorical attributes and community structure were simulated to be (A) strong, (B) moderate, and (C) weak, respectively. Likewise for continuous attributes (D-F). For the mixed attribute simulation, the categorical/continuous relationships between attribute and structure considered were (G) strong/strong, (H) strong/weak, and (I) weak/strong.

We applied the airMCL method to a breast cancer dataset using a KEGG pathway-based network and gene expression attributes.25 A correlation-based similarity was utilized for the attributes, and the estimated coefficient for the logistic regression was −0.7624 and significant. Convergence was observed after 15 iterations. The clustering of the rows of the stochastic matrix was determined using the maximum average silhouette, which was 0.85, and yielded 434 clusters. Note that the rule of thumb for strong structure is an average silhouette between 0.71 and 1.0 (ref. 21).

Only modules with size ≥ 8 were selected for survival analysis, and the overall activation status of each module was used as the covariate (see Materials and Methods) for predicting TTM. A Cox proportional hazard model was used and a multiple testing adjustment was made. With a threshold criterion of P-value < 0.05, both methods yield six modules whose overall expression levels are significantly associated with the time to metastasis. Table 1 shows the summary of modules detected, and a full listing of module members is available in the Supplement (posted on https://sphhp.buffalo.edu/biostatistics/news-events/workshops/). The adjusted p-values in Table 1 are from Cox regression.

[Figure 2: Performance for varying strength of community structure. Panels: A) Mixed Attributes, p=0.9 (strong) and Δμ=4 (strong); B) Mixed Attributes, p=0.9 (strong) and Δμ=0.5 (weak); C) Mixed Attributes, p=0.3 (weak) and Δμ=4 (strong). x-axis: expected number of links between communities (z.out), 2 to 8; y-axis: adjusted Rand index (0.0 to 0.9). Methods in legend: airMCL, Fast greedy, irMCL, rMCL.]

Fig. 2. Comparison of the performance of airMCL/irMCL (with/without attributes) with rMCL and the fast greedy method in synthetic networks using the adjusted Rand index against $z_{out}$. The attributes are mixed, and include (A) high-relevance categorical ($p = 0.9$) and high-relevance continuous ($\Delta\mu = 4$), (B) high-relevance categorical ($p = 0.9$) and weak-relevance continuous ($\Delta\mu = 0.5$), and (C) weak-relevance categorical ($p = 0.3$) and high-relevance continuous ($\Delta\mu = 4$). The horizontal black dashed line indicates the average ARI using attribute information alone.

Table 1: Breast Cancer Module Summarization

Module  Size  Pathways represented                           P-value
1       8     Hedgehog signaling pathway (hsa04340)          0.02195
2       27    Pathway in cancers (hsa05200)                  0.02195
              MAPK signaling pathway (hsa04010)
              Adherens junction (hsa04520)
              Regulation of actin cytoskeleton (hsa04810)
              Melanoma (hsa05218)
              Prostate cancer (hsa05215)
              Oocyte meiosis (hsa04114)
3       82    Ribosome pathway (hsa03010)                    0.02195
4       25    Cell cycle pathway (hsa04110)                  0.02195
              Non-homologous end-joining (hsa03450)
5       19    Pathway in cancers (hsa05200)                  0.03541
              Mismatch repair (hsa03430)
              Colorectal cancer (hsa05210)
              Small cell lung cancer (hsa05222)
              Pancreatic cancer (hsa05212)
              Thyroid cancer (hsa05216)
6       35    Proteasome pathway (hsa03050)                  0.03614

In order to utilize the Kaplan-Meier product limit estimator, for each of the six modules, the 295 patients were split into two groups (low-expression and high-expression) using the median of overall expression levels as the cut-off. The survival curves are shown in Figure 3. Log-rank tests were used to test the difference between survival curves of the high- and low-expression groups. The unadjusted p-values of the log-rank tests are shown in Figure 3.

4. Discussion

The design of airMCL is such that the impact of the attributes on community formationdepends on the strength of the association between attributes and network structure. Conse-quently, those weak associations are naturally dampened. Our approach is similar to spirit tothe weighting that is done in neural network via an activation function (usually a sigmoid),

Pacific Symposium on Biocomputing 2016

77

[Figure 3: Survival Plots for Significant Modules. Six Kaplan-Meier panels: A) Module 1, B) Module 2, C) Module 3, D) Module 4, E) Module 5, F) Module 6. x-axis: Survival time (years), 0 to 15; y-axis: Survival probability, 0.0 to 1.0; curves: high expression and low expression. P-values shown in the panels (A-F): 0.00359, 0.0115, 0.000483, 0.001625, 0.000172, 0.000027.]

Fig. 3. (A-F) Kaplan-Meier survival plots for modules 1-6. Estimates are based on partitioning the sample into two groups using median values of overall expression for each module (see methods). Red indicates higher expression, blue indicates lower expression, and the unadjusted P-values for the log-rank tests are shown.

which weights the features in the input layer. When the associations are very weak, airMCL operates like irMCL. A challenge is that attribute information may be irrelevant to, or even contradict, the structure of the network. In our simulations, bringing in attributes with weak signals did not derail performance (Figure 1C,F,G-I). This is important because the user is not required to specify which attributes are important by weighting, or even eliminating, them. In contrast, in the categorical case, we observed that ad-hoc weighting can derail performance, especially when attribute associations are weak (Figure 1C).

The fit of the logistic model itself reveals the strength of the relationship between attribute similarity and network structure. Examining the regression coefficients (Equation 1) of the model can guide model development, e.g., the choice of similarity measure or of feature subsets. For example, hypothesis testing on the coefficients (e.g., H0 : βj = 0) can reveal the significance of the attribute similarity as a predictor of structure. We have found this useful as a way of selecting a similarity measure for the attributes.
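The coefficient test described above can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce Equation 1. It assumes a single predictor (a pairwise attribute similarity) and a binary response (edge present or absent), fits the model by Newton-Raphson, and computes a Wald test of H0 : β1 = 0. The toy data and the names `fit_logistic`, `similarity`, and `edge` are hypothetical.

```python
import math

def score_and_information(b0, b1, x, y):
    """Gradient and 2x2 information matrix of the logistic log-likelihood."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11

def fit_logistic(x, y, iterations=25):
    """Newton-Raphson fit of P(edge) = sigmoid(b0 + b1 * similarity)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0, g1, h00, h01, h11 = score_and_information(b0, b1, x, y)
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta + H^{-1} g
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard error of b1 from the inverse information matrix at the estimate
    _, _, h00, h01, h11 = score_and_information(b0, b1, x, y)
    se_b1 = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    return b0, b1, se_b1

# Hypothetical toy data: one entry per node pair (assumption, not from the paper)
similarity = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.35, 0.65, 0.55]
edge = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1]

b0, b1, se_b1 = fit_logistic(similarity, edge)
z = b1 / se_b1                                  # Wald statistic for H0: b1 = 0
p_value = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided normal p-value
print(f"b1 = {b1:.3f}, SE = {se_b1:.3f}, z = {z:.3f}, p = {p_value:.4f}")
```

Under these assumptions, a small p-value for β1 would suggest that the chosen similarity measure carries information about edge presence. Repeating the fit with a different similarity measure gives one way to compare candidates.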

An important feature of the airMCL approach is that the derived inputs for the logistic regression can be handled in a flexible manner. If the set of attributes is heterogeneous, one can partition the attributes into multiple subsets, and estimate distance matrices over these subsets independently. This approach enables the choice of the similarity measure most appropriate for a given attribute or set of attributes. Differences in scales, even within variables of the same type, can also be managed by subsetting attributes. Collectively, the vectorization of the different distances gives rise to multiple predictors for the logistic regression.
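The subsetting idea above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the toy attributes, the choice of Euclidean distance for the continuous subset, the mismatch fraction for the categorical subset, and all function names are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical attributes for 4 nodes, split into two subsets by type
continuous = [[1.0, 2.0], [1.5, 1.8], [8.0, 9.0], [7.5, 9.5]]
categorical = [["a", "x"], ["a", "y"], ["b", "y"], ["b", "y"]]

def euclidean(u, v):
    """Euclidean distance between two continuous attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mismatch(u, v):
    """Fraction of categorical attributes on which two nodes differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)

def vectorize(rows, dist):
    """Distances for all node pairs (i < j): the upper triangle, flattened."""
    return [dist(rows[i], rows[j])
            for i, j in combinations(range(len(rows)), 2)]

# Each subset gets its own distance measure and yields one predictor column
d_cont = vectorize(continuous, euclidean)
d_cat = vectorize(categorical, mismatch)

# One row per node pair, one column per attribute subset
design = list(zip(d_cont, d_cat))
for pair, row in zip(combinations(range(4), 2), design):
    print(pair, row)
```

Each column of `design` could then enter the logistic regression as a separate predictor, so that subsets on different scales or of different types are not forced into a single distance.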

In the breast cancer application, some of the identified pathways are consistent with those reported by Van’t Veer et al.,29 such as pathways in cell cycle regulation (Module 4) and signal transduction (Module 2). In addition, we found that the ribosome pathway is associated with breast cancer metastasis. This is consistent with the results reported by Belin et al. that dysregulation of ribosome biogenesis is related to enhanced tumor aggressivity.30 Activation of the hedgehog pathway is also reported in tumors including breast cancers,31,32 and is related to cancer metastasis.33 Figure 3 shows that module over-expression (red) is often associated with higher hazards of metastasis. The up-regulation of Module 1 (hedgehog signaling pathway), however, is unexpectedly associated with better prognosis. A biologically plausible explanation is that the up-regulated genes in this module (GAS1, RAB23, and CK1) encode inhibitors of this pathway.

In our simulations, we have simulated balanced communities of moderate size. However, we have also observed good performance, in terms of computational time and accuracy, in the simulation of larger balanced communities. In the case of unbalanced communities, we have achieved good performance in moderate-sized simulated networks and real social networks. However, a limitation of our approach is its application to large (1000+ nodes) unbalanced networks. Addressing this form of scalability will be a direction of future research.

We have focussed on a specific application to gene expression cancer data to showcase our method. However, airMCL is generalizable in the sense that it can be used with any data that contains a network structure and a set of attributes. The term attribute can be loosely defined to encompass demographic information, clinical data, omics data, and combinations of different types of data. The combination of multiple sources of data is known to be a major challenge, and our approach directly integrates them into the community detection. Framing the problem of relating the attributes to the structure via classification has several advantages. Arguably the most important of these advantages is the ability to monitor and quantify loss. Framing the connection between structure and attributes as a supervised learning problem enables the use of statistical classification methods. In this work, we outlined the framework in terms of the classic multiple logistic regression model.22 However, other classification methods may be more or less suitable depending on the dimension of the graph and attributes, and also the correlation of predictors. Within the classification methods framework are opportunities to utilize the bias-variance tradeoff for model and feature selection. This is a direction of future research, which we anticipate will guide the elimination of extraneous attributes (and potentially nodes), and protect against overfitting.
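As a small illustration of what "monitoring and quantifying loss" could look like in this classification framing, the sketch below computes the average log-loss (binary cross-entropy) of predicted edge probabilities against observed edges. The loss function is a standard choice assumed here for illustration; the paper does not specify one, and the toy values and the name `log_loss` are hypothetical.

```python
import math

def log_loss(y_true, p_pred, eps=1e-12):
    """Average binary cross-entropy between edge labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)   # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

# Hypothetical node pairs: observed edge indicators and model probabilities
edges = [1, 0, 1, 0]
p_informative = [0.9, 0.1, 0.8, 0.2]   # attributes agree with the structure
p_baseline = [0.5, 0.5, 0.5, 0.5]      # attributes carry no information

loss_informative = log_loss(edges, p_informative)
loss_baseline = log_loss(edges, p_baseline)
print(loss_informative, loss_baseline)
```

A classifier whose loss is close to the uninformative baseline would indicate that the attributes add little beyond the network structure, which is one way such a quantity could inform model or feature selection.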

5. Acknowledgements

HY and RHB were supported through NSF DMS 1312250 and NSF DMS 1557593.

References

1. L. Danon, A. Diaz-Guilera, J. Duch and A. Arenas, Journal of Statistical Mechanics: Theory and Experiment 2005, p. P09008 (2005).
2. M. E. Newman, The European Physical Journal B-Condensed Matter and Complex Systems 38, 321 (2004).
3. S. E. Schaeffer, Computer Science Review 1, 27 (2007).
4. S. Horvath, Weighted Network Analysis: Applications in Genomics and Systems Biology (Springer Science & Business Media, 2011).
5. B. W. Kernighan and S. Lin, Bell system technical journal 49, 291 (1970).
6. S. C. Johnson, Psychometrika 32, 241 (1967).
7. M. Fiedler, Czechoslovak Mathematical Journal 23, 298 (1973).
8. W. E. Donath and A. J. Hoffman, IBM Journal of Research and Development 17, 420 (1973).
9. M. Girvan and M. E. Newman, Proceedings of the National Academy of Sciences 99, 7821 (2002).
10. M. E. Newman and M. Girvan, Physical review E 69, p. 026113 (2004).
11. A. Clauset, M. E. Newman and C. Moore, Physical review E 70, p. 066111 (2004).
12. M. E. Newman, Physical review E 69, p. 066133 (2004).
13. M. E. Newman, Proceedings of the National Academy of Sciences 103, 8577 (2006).
14. D. Hanisch, A. Zien, R. Zimmer and T. Lengauer, Bioinformatics 18, S145 (2002).
15. Y. Zhou, H. Cheng and J. X. Yu, Proceedings of the VLDB Endowment 2, 718 (2009).
16. L. Akoglu, H. Tong, B. Meeder and C. Faloutsos, Pics: Parameter-free identification of cohesive subgroups in large attributed graphs, in SDM, 2012.
17. F. Moser, R. Colak, A. Rafiey and M. Ester, Mining cohesive patterns from graphs with feature vectors, in SDM, 2009.
18. E. Georgii, S. Dietmann, T. Uno, P. Pagel and K. Tsuda, Bioinformatics 25, 933 (2009).
19. S. Van Dongen, SIAM Journal on Matrix Analysis and Applications 30, 121 (2008).
20. V. Satuluri and S. Parthasarathy, Scalable graph clustering using stochastic flows: applications to community discovery, in Proceedings of the 15th ACM SIGKDD International conference on Knowledge Discovery and Data Mining, 2009.
21. P. J. Rousseeuw, Journal of computational and applied mathematics 20, 53 (1987).
22. D. W. Hosmer Jr and S. Lemeshow, Applied logistic regression (John Wiley & Sons, 2004).
23. W. M. Rand, Journal of the American Statistical Association 66, 846 (1971).
24. L. Hubert and P. Arabie, Journal of classification 2, 193 (1985).
25. M. J. Van De Vijver, Y. D. He, L. J. van’t Veer, H. Dai, A. A. Hart, D. W. Voskuil, G. J. Schreiber, J. L. Peterse, C. Roberts, M. J. Marton et al., New England Journal of Medicine 347, 1999 (2002).
26. M. Kanehisa, S. Goto, Y. Sato, M. Kawashima, M. Furumichi and M. Tanabe, Nucleic acids research 42, D199 (2014).
27. D. R. Cox and D. Oakes, Analysis of survival data (CRC Press, 1984).
28. Y. Benjamini and Y. Hochberg, Journal of the Royal Statistical Society. Series B (Methodological), 289 (1995).
29. L. J. Van’t Veer, H. Dai, M. J. Van De Vijver, Y. D. He, A. A. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen et al., Nature 415, 530 (2002).
30. S. Belin, A. Beghin, E. Solano-Gonzalez, L. Bezin, S. Brunet-Manquat, J. Textoris, A.-C. Prats, H. C. Mertani, C. Dumontet and J.-J. Diaz, PloS one 4, p. e7147 (2009).
31. M. Kubo, M. Nakamura, A. Tasaki, N. Yamanaka, H. Nakashima, M. Nomura, S. Kuroki and M. Katano, Cancer research 64, 6071 (2004).
32. J. Taipale and P. A. Beachy, Nature 411, 349 (2001).
33. J. M. Bailey, P. K. Singh and M. A. Hollingsworth, Journal of cellular biochemistry 102, 829 (2007).
