A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network
Motoki Shiga, Ichigaku Takigawa, Hiroshi Mamitsuka
Bioinformatics Center, ICR, Kyoto University, Japan
KDD 2007, San Jose, California, USA, August 12-15, 2007
Table of Contents
1. Motivation: clustering for heterogeneous data (numerical + network)
2. Proposed method: spectral clustering (numerical vectors + a network)
3. Experiments: synthetic data and real data
4. Summary
Heterogeneous Data Clustering
Heterogeneous data: multiple kinds of information related to one object of interest.
Ex. Gene analysis: gene expression, metabolic pathway, etc. Web page analysis: word frequency, hyperlinks, etc.
[Figure: each gene has (1) a numerical vector of gene expression values over S experiments, typically clustered with k-means, SOM, etc., and (2) a metabolic-pathway network, typically clustered with minimum edge cut, Ratio cut, etc.]
M. Shiga, I. Takigawa and H. Mamitsuka, ISMB/ECCB 2007.
To improve clustering accuracy, combine the numerical vectors with the network.
Related work: semi-supervised clustering
- Uses a local property: neighborhood relations (must-link and cannot-link edges)
- Hard constraints (K. Wagstaff and C. Cardie, 2000)
- Soft constraints (S. Basu et al., 2004): probabilistic model (hidden Markov random field)

Proposed method
- Uses a global property: network modularity
- Soft constraint, realized by spectral clustering
Spectral Clustering
L. Hagen et al., IEEE TCAD, 1992; J. Shi and J. Malik, IEEE PAMI, 2000.
1. Compute an affinity (dissimilarity) matrix M from the data.
2. To optimize the cost J(Z) = tr{Z^T M Z} subject to Z^T Z = I, where Z(i,k) = 1 if node i belongs to cluster k and 0 otherwise, compute the eigenvalues and eigenvectors of M, relaxing Z(i,k) to real values (trace optimization).
3. Represent each node by one or more of the computed eigenvectors (e1, e2, ...) and assign a cluster label to each node by k-means.
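The three steps above can be sketched in numpy. This is a minimal illustration, not the paper's implementation: the choice of eigenvectors (those of the smallest eigenvalues, since the relaxed trace cost is minimized) follows the slide, while the deterministic k-means initialization is an assumption made for the sketch.

```python
import numpy as np

def spectral_clustering(M, K, n_vecs=None, n_iter=100):
    """Steps 2-3 above: eigendecompose the cost matrix M, embed each node
    with the eigenvectors of the smallest eigenvalues (minimizing
    tr{Z^T M Z} after relaxing Z), then k-means the embedded nodes."""
    n_vecs = n_vecs or K
    vals, vecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    E = vecs[:, :n_vecs]             # row i is the spectral embedding of node i
    # Simple deterministic k-means init: spread the K centers along the
    # first spectral coordinate (an illustrative choice, not from the paper).
    order = np.argsort(E[:, 0])
    centers = E[order[np.linspace(0, len(E) - 1, K).astype(int)]]
    for _ in range(n_iter):
        d = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        centers = np.array([E[labels == k].mean(0) if (labels == k).any()
                            else centers[k] for k in range(K)])
    return labels
```

With a two-block dissimilarity matrix (zero within blocks, one across), the smallest-eigenvalue eigenvector separates the blocks and k-means recovers them.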
Cost Combining Numerical Vectors with a Network
What cost should be optimized? The cost for the numerical vectors is the cosine dissimilarity (N: #nodes; Y: matrix of inner products of the normalized numerical vectors). To define a cost for the network, we use a property of complex networks.
Complex Networks
Ex. gene networks, WWW, social networks, etc.
Properties: small-world phenomena, power law, hierarchical structure, network modularity.
Ravasz, et al., Science, 2002. Guimera, et al., Nature, 2005.
Normalized Network Modularity
Z: set of all nodes; Zk: set of nodes in cluster k; L(A,B): #edges between node sets A and B.
Modularity = density of intra-cluster edges: the fraction of intra-cluster edges among all edges (#intra-edges / #total edges), normalized by cluster size. High modularity means dense clusters; low modularity means sparse ones.
Guimera, et al., Nature, 2005. Newman, et al., Phys. Rev. E, 2004.
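The "intra-edge fraction, normalized by cluster size" idea can be sketched as follows. The exact normalization used in the paper is not recoverable from the slide, so dividing each cluster's intra-edge fraction by its relative node count, then averaging over clusters, is an assumption made for illustration.

```python
import numpy as np

def normalized_modularity(A, labels):
    """Density of intra-cluster edges, normalized by cluster size.
    A: symmetric 0/1 adjacency matrix; labels: cluster index per node.
    Sketch only -- the per-cluster normalization here is an assumption,
    not necessarily the paper's exact definition."""
    n = len(labels)
    clusters = np.unique(labels)
    total = A.sum() / 2.0                          # L(Z, Z): #edges in the network
    score = 0.0
    for k in clusters:
        mask = labels == k
        intra = A[np.ix_(mask, mask)].sum() / 2.0  # L(Zk, Zk): intra-cluster edges
        score += (intra / total) / (mask.sum() / n)
    return score / len(clusters)
```

For two disconnected triangles labeled as two clusters this returns 1.0 (all edges intra-cluster, clusters of equal size); mixing the triangles lowers the score.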
Cost Combining Numerical Vectors with a Network
The matrix M_ω combines the cosine-dissimilarity cost of the numerical vectors with the (negative) normalized modularity of the network.

Our Proposed Spectral Clustering
1. Compute the matrix M_ω.
2. To optimize the cost J(Z) = tr{Z^T M_ω Z} subject to Z^T Z = I, compute the eigenvalues and eigenvectors of M_ω, relaxing the elements of Z to real values. Each node is represented by K-1 eigenvectors.
3. Assign a cluster label to each node by k-means (k-means runs in the spectral space).
Optimizing the weight ω:
for ω = 0 … 1:
    cluster with M_ω and compute Cost_spectral
end
Cost_spectral is the sum of dissimilarities between each data point and its cluster center in the eigenvector space (e1, e2, ...); select the ω that minimizes it.
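The full pipeline with the ω sweep can be sketched as below. The linear mixing M_ω = (1-ω)·M_vec + ω·M_net and the squared-distance form of Cost_spectral are assumptions made for this sketch; the paper's exact combination may differ.

```python
import numpy as np

def cluster_and_cost(M, K, n_iter=100):
    """Embed nodes with the K-1 smallest-eigenvalue eigenvectors of M,
    run a tiny k-means, and return (labels, Cost_spectral), the summed
    squared distance of each embedded node to its cluster center."""
    vals, vecs = np.linalg.eigh(M)
    E = vecs[:, :max(K - 1, 1)]            # K-1 eigenvectors per node
    order = np.argsort(E[:, 0])            # deterministic init (illustrative)
    centers = E[order[np.linspace(0, len(E) - 1, K).astype(int)]]
    for _ in range(n_iter):
        d = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        centers = np.array([E[labels == k].mean(0) if (labels == k).any()
                            else centers[k] for k in range(K)])
    cost = d[np.arange(len(E)), labels].sum()
    return labels, cost

def combined_clustering(M_vec, M_net, K, omegas=None):
    """Sweep the weight: M_w = (1-w)*M_vec + w*M_net (assumed linear
    mixing), keeping the w that minimizes Cost_spectral."""
    if omegas is None:
        omegas = np.linspace(0.0, 1.0, 11)
    best = None
    for w in omegas:
        labels, cost = cluster_and_cost((1 - w) * M_vec + w * M_net, K)
        if best is None or cost < best[2]:
            best = (labels, w, cost)
    return best  # (labels, chosen w, Cost_spectral)
```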
Synthetic Data
- Numerical vectors: drawn from the von Mises-Fisher distribution with concentration θ = 1, 5, 50.
[Figure: scatter plots of the vectors over (x1, x2, x3) for each θ]
- Network: random graph with #nodes = 400 and #edges = 1600; modularity = 0.375, 0.450, 0.525.
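A planted-partition generator is one way to imitate such random modular graphs; the slide's generator, which targets the modularity levels 0.375-0.525 directly, is not specified here, so this is only an illustrative stand-in.

```python
import numpy as np

def planted_partition(n_nodes, n_clusters, p_in, p_out, seed=0):
    """Random graph with planted modular structure: nodes are split into
    equal clusters; each intra-cluster pair is linked with probability
    p_in and each inter-cluster pair with probability p_out."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(n_clusters), n_nodes // n_clusters)
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_in, p_out)
    # Sample the upper triangle only, then mirror for a simple graph.
    upper = np.triu(rng.random((n_nodes, n_nodes)) < probs, k=1)
    A = (upper | upper.T).astype(int)
    return A, labels
```

Raising p_in relative to p_out raises the modularity of the planted clustering.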
Results for Synthetic Data (modularity = 0.375)
[Figure: NMI and Cost_spectral as functions of ω for θ = 1, 5, 50, compared against numerical vectors only (k-means) and network only (maximum modularity).]
- The best NMI (Normalized Mutual Information) is attained at some 0 < ω < 1.
- ω can be optimized using Cost_spectral.
Results for Synthetic Data (modularity = 0.450 and 0.525)
[Figure: NMI and Cost_spectral as functions of ω for θ = 1, 5, 50, compared against numerical vectors only (k-means) and network only (maximum modularity).]
- Again, the best NMI is attained at some 0 < ω < 1 and can be found using Cost_spectral.
Synthetic Data (Numerical Vectors) + Real Data (Gene Network)
- Gene network: constructed from the KEGG metabolic pathway; true clusters (#clusters = 10); numerical vectors with concentration θ = 10, 10^2, 10^3.
[Figure: true clusters vs. resulting clusters (ω = 0.5, θ = 10); NMI and Cost_spectral as functions of ω.]
- The best NMI is attained at some 0 < ω < 1 and ω can be optimized using Cost_spectral.
Summary
• Proposed a new spectral clustering method combining numerical vectors with a network
 - uses a global network property (normalized network modularity)
 - clustering can be optimized through the weight ω
• Performance confirmed experimentally
 - better than using the numerical vectors only or the network only
 - the weight ω was optimized successfully on synthetic and semi-real datasets
Thank you for your attention!
Our poster: #16
Spectral Representation of M_ω (concentration θ = 5, modularity = 0.375)
[Figure: nodes plotted in the eigenvector space (e1, e2, e3) for ω = 0, 0.3, 1.]
Cost J_ω = 0.0932, 0.0538, 0.0809, respectively.
Select ω by minimizing Cost_spectral: at that ω the clusters are most clearly separated.
Results for Real Genomic Data
• Numerical vectors: Hughes' expression data (Hughes, et al., Cell, 2000)
• Gene network: constructed using KEGG metabolic pathways (M. Kanehisa, et al., NAR, 2006)
[Figure: NMI and Cost_spectral as functions of ω for our method, Normalized cut, and Ratio cut.]
Evaluation Measure
Normalized Mutual Information (NMI) between the estimated clusters and the standard (ground-truth) clusters.
C: estimated clusters; G: standard clusters; H(C): entropy of the random variable C.
The more similar the clusterings C and G are, the larger the NMI.
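NMI can be computed from the entropies above. Normalizing the mutual information I(C;G) by the average of H(C) and H(G) is one standard variant; the slide does not show which normalization the paper uses, so treat that choice as an assumption.

```python
import numpy as np

def nmi(c, g):
    """Normalized Mutual Information between two labelings c and g
    (non-negative integer labels), using the (H(C)+H(G))/2 normalizer."""
    c, g = np.asarray(c), np.asarray(g)

    def entropy(x):
        p = np.unique(x, return_counts=True)[1] / len(x)
        return -(p * np.log(p)).sum()

    # Encode each (c_i, g_i) pair as a single label for the joint entropy.
    pairs = c.astype(np.int64) * (int(g.max()) + 1) + g
    h_c, h_g, h_cg = entropy(c), entropy(g), entropy(pairs)
    mi = h_c + h_g - h_cg                 # I(C; G)
    denom = (h_c + h_g) / 2
    return mi / denom if denom > 0 else 1.0
```

Identical clusterings (up to label permutation) give NMI = 1; independent ones give NMI near 0.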
Web Page Clustering
[Figure: each web page i has (1) a numerical vector of word frequencies n(A,i), n(B,i), ... for words A, B, ..., Z, and (2) a hyperlink network over the pages.]
To improve accuracy, combine the heterogeneous data.
Spectral Clustering for Graph Partitioning
L. Hagen et al., IEEE TCAD, 1992; J. Shi and J. Malik, IEEE PAMI, 2000.
- Ratio cut: minimize tr{Z^T L Z} subject to Z^T Z = I, where L = D - W is the unnormalized graph Laplacian.
- Normalized cut: minimize tr{Z^T L Z} subject to Z^T D Z = I, where D is the diagonal degree matrix.
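The standard spectral relaxations of both cuts can be sketched together: eigendecompose the (optionally symmetrically normalized) Laplacian and k-means the rows of the smallest-eigenvalue eigenvectors. The tiny k-means and its deterministic initialization are illustrative simplifications, not the original papers' procedures.

```python
import numpy as np

def graph_partition(A, K, normalized=False, n_iter=100):
    """Relaxed Ratio cut (normalized=False) / Normalized cut
    (normalized=True): embed nodes with the K smallest-eigenvalue
    eigenvectors of the graph Laplacian, then run a tiny k-means."""
    d = A.sum(1)
    L = np.diag(d) - A                            # unnormalized Laplacian
    if normalized:                                # symmetric normalization
        inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
        L = inv_sqrt @ L @ inv_sqrt
    vals, vecs = np.linalg.eigh(L)
    E = vecs[:, :K]
    # Deterministic init: spread centers along the last (Fiedler-like) column.
    order = np.argsort(E[:, -1])
    centers = E[order[np.linspace(0, len(E) - 1, K).astype(int)]]
    for _ in range(n_iter):
        dist = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        centers = np.array([E[labels == k].mean(0) if (labels == k).any()
                            else centers[k] for k in range(K)])
    return labels
```

On two triangles joined by a single bridge edge, both variants cut the bridge and recover the triangles.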