a spectral clustering approach to optimally combining numerical vectors with a modular network

23
A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network Motoki Shiga, Ichigaku Takigawa, Hiroshi Mamitsuka Bioinformatics Center, ICR, Kyoto University, Japan KDD 2007, San Jose, California, USA, August 12-15 2007 1

Upload: vidal

Post on 22-Feb-2016

26 views

Category:

Documents


0 download

DESCRIPTION

A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network. Motoki Shiga, Ichigaku Takigawa, Hiroshi Mamitsuka Bioinformatics Center, ICR, Kyoto University, Japan. KDD 2007, San Jose, California, USA, August 12-15 2007. 1. Table of Contents. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Motoki Shiga, Ichigaku Takigawa, Hiroshi Mamitsuka

Bioinformatics Center, ICR, Kyoto University, Japan

KDD 2007, San Jose, California, USA, August 12-15 2007 1

Page 2: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Table of Contents

1. MotivationClustering for heterogeneous data(numerical + network)

2. Proposed method Spectral clustering (numerical vectors + a network)

3. ExperimentsSynthetic data and real data

4. Summary

2

Page 3: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Ex. Gene analysis : gene expression, metabolic pathway, …, etc. Web page analysis : word frequency, hyperlink, …, etc.

Heterogeneous Data ClusteringHeterogeneous data : various information related to an interest

Gene

32

1

4 5

76

1st expression value

S-th

val

ueNumericalVectors

3

k-meansSOM, etc.

M. Shiga, I. Takigawa and H. Mamitsuka, ISMB/ECCB 2007.

Gene expression #experiments = S

metabolicpathway Network

Minimum edge cutRatio cut, etc.

To improve clustering accuracy, combine numerical vectors + network

Page 4: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Related work : semi-supervised clustering・ Local property Neighborhood relation -must-link edge, cannot-link edge

4

Proposed method・ Global property (network modularity)・ Soft constraint -Spectral clustering

・ Hard constraint (K. Wagstaff and C. Cardie, 2000.)・ Soft constraint (S. Basu etc., 2004.) - Probabilistic model (Hidden Markov random field)

Page 5: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Table of Contents

5

1. MotivationClustering for heterogeneous data(numerical + network)

2. Proposed method Spectral clustering (numerical vectors + a network)

3. ExperimentsSynthetic data and real data

4. Summary

Page 6: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Spectral Clustering

6

L. Hagen, etc., IEEE TCAD, 1992., J. Shi and J. Malik, IEEE PAMI, 2000.

1. Compute affinity(dissimilarity) matrix M from data

3. Assign a cluster label to each node ( by k-means )

2. To optimize cost J(Z) = tr{ZT M Z} subject to ZTZ=I where Z(i,k) is 1 when node i belong to cluster k, otherwise 0, compute eigen-values and -vectors of matrix M by relaxing Z(i,k) to a real value

Eigen-vector e1

e 2

Each node is by one or more computed eigenvectors

Trace optimization

Page 7: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Cost combining numerical vectors with a network

7

What cost?

To define a cost of a network, use a property of complex networks

networkCost of numerical vector cosine dissimilarity

N : #nodes, Y : inner product of normalized numerical vectors

Page 8: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Complex NetworksEx. Gene networks, WWW, Social networks, …, etc.Property•Small world phenomena•Power law•Hierarchical structure•Network modularity

Ravasz, et al., Science, 2002.Guimera, et al., Nature, 2005.

8

Page 9: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Z : set of whole nodesZk : set of nodes in cluster kL(A,B) : #edges between A and B

Normalized Network Modularity

High Low

= density of intra-cluster edges

Guimera, et al., Nature, 2005., Newman, et al., Phy. Rev. E, 2004. 9

# total edges# intra-edgesnormalize by cluster size

Page 10: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Cost Combining Numerical Vectors with a Network

10Mω

networkCost of numerical vector cosine dissimilarity Normalized modularity

(Negative)

Page 11: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Our Proposed Spectral Clustering1. Compute matrix Mω=2. To optimize cost J(Z) = tr{ZT Mω Z} subjet to ZTZ=I ,

compute eigen-values and -vectors of matrix Mω byrelaxing elements of Z to a real value

Each node is represented by K-1 eigen-vectors3. Assign a cluster label to each node by k-means.

(k-means outputs in spectral space.)

11

for ω = 0…1

end・ Optimize weight ω

Eigen-vector e1

e 2

x

x is sum of dissimilarity (cluster center <-> data)

e1

e 2

Page 12: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

1. MotivationClustering for heterogeneous data(numerical + network)

2. Proposed method Spectral clustering (numerical vectors + a network)

3. ExperimentsSynthetic data and real data

4. Summary

Table of Contents

12

Page 13: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Synthetic DataNumerical vectors (von Mises-Fisher distribution)

θ = 1 50 5

x1x2

x3

x1x2

x3

x1x2

x3

Network (Random graph) #nodes = 400, #edges = 1600Modularity = 0.375 0.450 0.525

13

Page 14: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Results for Synthetic Data

ω

NM

IModularity = 0.375

・ Best NMI (Normalized Mutual Information) is in 0 < ω < 1・ Can be optimized using Costspectral 14

Cost

spec

tral

θ = 1

θ = 5 θ = 50

Numerical vectors only (k-means)

Network only(maximum modularity)

θ = 1 5 50

x1x2

x3

x1x2

x3

x1x2

x3

Numerical vectors

#nodes = 400, #edges = 1600Modularity = 0.375

Network

Page 15: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Results for Synthetic Data

ω ωω

NM

I0.450 0.525

・ Best NMI (Normalized Mutual Information) is in 0 < ω < 1・ Can be optimized using Costspectral 15

Cost

spec

tral

θ = 1

θ = 50

Numerical vectors only (k-means)

Network only(maximum modularity)

Modularity = 0.375

θ = 5

Page 16: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Synthetic Data (Numerical Vector) + Real Data (Gene Network)

True cluster(#clusters = 10)

Resultant cluster(ω=0.5, θ=10)

・ Best NMI is in 0 < ω < 1 ・ Can be optimized using Costspectral 16

Gene network by KEGG metabolic pathway N

MI

ωCo

stsp

ectr

al

θ = 10

θ = 102 θ = 103

Page 17: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Summary

• New spectral clustering method proposedcombining numerical vectors with a network・ Global network property (normalized network modularity)・ Clustering can be optimized by the weight

• Performance confirmed experimentally・ Better than numerical vectors only and a network only・ Optimizing the weight with synthetic dataset and semi-real dataset

17

Page 18: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Thank you for your attention!

our poster #16

Page 19: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Spectral Representation of Mω(concentration θ = 5 , Modularity = 0.375)

e1

e2

e3

e1

e2

e3

e1

e2

e3

ω = 0 ω = 0.3 ω = 1

Cost Jω = 0.0932 0.0538 0.0809

Select ω by minimizing Costspectral

(clusters are divided most separately)

Page 20: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Result for Real Genomic Data• Numerical vectors : Hughes’ expression data (Hughes, et al., cell, 2000)

• Gene network : Constructed using KEGG metabolic pathways (M. Kanehisa, etc. NAR, 2006)

NM

I

Cos

t spec

tral

ω ω

Our methodNormalized cutRatio cut

Our method

Page 21: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Evaluation Measure

H(C) : Entropy of probability variable C, C : Estimated clusters, G : Standard clusters

The more similar clustersC and G are, the larger the NMI.

Normalized Mutual Information (NMI) between estimated cluster and the standard cluster

14

Page 22: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Word FrequencyA n(A,2)

B n(B,2)

… …

Web Page Clustering

1

Word FrequencyA n(A,1)

B n(B,1)

… …

2 3

4

6

5

7

…Frequency of word A

Freq

uenc

y of

wor

d Z

NumericalVector

Network

To improve accuracy,combine heterogynous data

Page 23: A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network

Spectral Clustering for Graph PartitioningRatio cut

Normalized cutSubject to

Subject to

L. Hagen, etc., IEEE TCAD, 1992., J. Shi and J. Malik, IEEE PAMI, 2000.