A Spectral Clustering Approach to Optimally Combining Numerical Vectors with a Modular Network
Motoki Shiga, Ichigaku Takigawa, Hiroshi Mamitsuka
Bioinformatics Center, ICR, Kyoto University, Japan
KDD 2007, San Jose, California, USA, August 12-15, 2007
Table of Contents
1. Motivation: clustering for heterogeneous data (numerical + network)
2. Proposed method: spectral clustering (numerical vectors + a network)
3. Experiments: synthetic data and real data
4. Summary
Heterogeneous Data Clustering
Heterogeneous data: multiple kinds of information related to one object of interest.
Ex. Gene analysis: gene expression, metabolic pathway, etc. Web page analysis: word frequency, hyperlinks, etc.
[Figure: each gene has (1) a numerical vector of gene expression values over S experiments, typically clustered with k-means, SOM, etc., and (2) a metabolic-pathway network, typically clustered with minimum edge cut, Ratio cut, etc.]
M. Shiga, I. Takigawa and H. Mamitsuka, ISMB/ECCB 2007.
To improve clustering accuracy, combine the numerical vectors with the network.
Related work: semi-supervised clustering
- Uses a local property: neighborhood relations (must-link and cannot-link edges)
- Hard constraints (K. Wagstaff and C. Cardie, 2000)
- Soft constraints (S. Basu et al., 2004): probabilistic model (hidden Markov random field)

Proposed method
- Uses a global property: network modularity
- Soft constraint, realized by spectral clustering
Spectral Clustering
L. Hagen et al., IEEE TCAD, 1992; J. Shi and J. Malik, IEEE PAMI, 2000.
1. Compute an affinity (dissimilarity) matrix M from the data.
2. To optimize the cost J(Z) = tr{Z^T M Z} subject to Z^T Z = I, where Z(i,k) = 1 if node i belongs to cluster k and 0 otherwise, compute the eigenvalues and eigenvectors of M, relaxing Z(i,k) to real values (trace optimization).
3. Represent each node by one or more of the computed eigenvectors (e1, e2, ...) and assign a cluster label to each node by k-means.
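The three steps above can be sketched in numpy. This is a minimal illustration, not the paper's implementation: the choice of eigenvectors (those of the smallest eigenvalues, since the relaxed trace cost is minimized) follows the slide, while the deterministic k-means initialization is an assumption made for the sketch.

```python
import numpy as np

def spectral_clustering(M, K, n_vecs=None, n_iter=100):
    """Steps 2-3 above: eigendecompose the cost matrix M, embed each node
    with the eigenvectors of the smallest eigenvalues (minimizing
    tr{Z^T M Z} after relaxing Z), then k-means the embedded nodes."""
    n_vecs = n_vecs or K
    vals, vecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    E = vecs[:, :n_vecs]             # row i is the spectral embedding of node i
    # Simple deterministic k-means init: spread the K centers along the
    # first spectral coordinate (an illustrative choice, not from the paper).
    order = np.argsort(E[:, 0])
    centers = E[order[np.linspace(0, len(E) - 1, K).astype(int)]]
    for _ in range(n_iter):
        d = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        centers = np.array([E[labels == k].mean(0) if (labels == k).any()
                            else centers[k] for k in range(K)])
    return labels
```

With a two-block dissimilarity matrix (zero within blocks, one across), the smallest-eigenvalue eigenvector separates the blocks and k-means recovers them.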
Cost Combining Numerical Vectors with a Network
What cost should be optimized? The cost for the numerical vectors is the cosine dissimilarity (N: #nodes; Y: matrix of inner products of the normalized numerical vectors). To define a cost for the network, we use a property of complex networks.
Complex Networks
Ex. gene networks, WWW, social networks, etc.
Properties: small-world phenomena, power law, hierarchical structure, network modularity.
Ravasz, et al., Science, 2002. Guimera, et al., Nature, 2005.
Normalized Network Modularity
Z: set of all nodes; Zk: set of nodes in cluster k; L(A,B): #edges between node sets A and B.
Modularity = density of intra-cluster edges: the fraction of intra-cluster edges among all edges (#intra-edges / #total edges), normalized by cluster size. High modularity means dense clusters; low modularity means sparse ones.
Guimera, et al., Nature, 2005. Newman, et al., Phys. Rev. E, 2004.
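The "intra-edge fraction, normalized by cluster size" idea can be sketched as follows. The exact normalization used in the paper is not recoverable from the slide, so dividing each cluster's intra-edge fraction by its relative node count, then averaging over clusters, is an assumption made for illustration.

```python
import numpy as np

def normalized_modularity(A, labels):
    """Density of intra-cluster edges, normalized by cluster size.
    A: symmetric 0/1 adjacency matrix; labels: cluster index per node.
    Sketch only -- the per-cluster normalization here is an assumption,
    not necessarily the paper's exact definition."""
    n = len(labels)
    clusters = np.unique(labels)
    total = A.sum() / 2.0                          # L(Z, Z): #edges in the network
    score = 0.0
    for k in clusters:
        mask = labels == k
        intra = A[np.ix_(mask, mask)].sum() / 2.0  # L(Zk, Zk): intra-cluster edges
        score += (intra / total) / (mask.sum() / n)
    return score / len(clusters)
```

For two disconnected triangles labeled as two clusters this returns 1.0 (all edges intra-cluster, clusters of equal size); mixing the triangles lowers the score.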
Cost Combining Numerical Vectors with a Network
The matrix M_ω combines the cosine-dissimilarity cost of the numerical vectors with the (negative) normalized modularity of the network.

Our Proposed Spectral Clustering
1. Compute the matrix M_ω.
2. To optimize the cost J(Z) = tr{Z^T M_ω Z} subject to Z^T Z = I, compute the eigenvalues and eigenvectors of M_ω, relaxing the elements of Z to real values. Each node is represented by K-1 eigenvectors.
3. Assign a cluster label to each node by k-means (k-means runs in the spectral space).
Optimizing the weight ω:
for ω = 0 … 1:
    cluster with M_ω and compute Cost_spectral
end
Cost_spectral is the sum of dissimilarities between each data point and its cluster center in the eigenvector space (e1, e2, ...); select the ω that minimizes it.
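The full pipeline with the ω sweep can be sketched as below. The linear mixing M_ω = (1-ω)·M_vec + ω·M_net and the squared-distance form of Cost_spectral are assumptions made for this sketch; the paper's exact combination may differ.

```python
import numpy as np

def cluster_and_cost(M, K, n_iter=100):
    """Embed nodes with the K-1 smallest-eigenvalue eigenvectors of M,
    run a tiny k-means, and return (labels, Cost_spectral), the summed
    squared distance of each embedded node to its cluster center."""
    vals, vecs = np.linalg.eigh(M)
    E = vecs[:, :max(K - 1, 1)]            # K-1 eigenvectors per node
    order = np.argsort(E[:, 0])            # deterministic init (illustrative)
    centers = E[order[np.linspace(0, len(E) - 1, K).astype(int)]]
    for _ in range(n_iter):
        d = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        centers = np.array([E[labels == k].mean(0) if (labels == k).any()
                            else centers[k] for k in range(K)])
    cost = d[np.arange(len(E)), labels].sum()
    return labels, cost

def combined_clustering(M_vec, M_net, K, omegas=None):
    """Sweep the weight: M_w = (1-w)*M_vec + w*M_net (assumed linear
    mixing), keeping the w that minimizes Cost_spectral."""
    if omegas is None:
        omegas = np.linspace(0.0, 1.0, 11)
    best = None
    for w in omegas:
        labels, cost = cluster_and_cost((1 - w) * M_vec + w * M_net, K)
        if best is None or cost < best[2]:
            best = (labels, w, cost)
    return best  # (labels, chosen w, Cost_spectral)
```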
Synthetic Data
- Numerical vectors: drawn from the von Mises-Fisher distribution with concentration θ = 1, 5, 50.
[Figure: scatter plots of the vectors over (x1, x2, x3) for each θ]
- Network: random graph with #nodes = 400 and #edges = 1600; modularity = 0.375, 0.450, 0.525.
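A planted-partition generator is one way to imitate such random modular graphs; the slide's generator, which targets the modularity levels 0.375-0.525 directly, is not specified here, so this is only an illustrative stand-in.

```python
import numpy as np

def planted_partition(n_nodes, n_clusters, p_in, p_out, seed=0):
    """Random graph with planted modular structure: nodes are split into
    equal clusters; each intra-cluster pair is linked with probability
    p_in and each inter-cluster pair with probability p_out."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(n_clusters), n_nodes // n_clusters)
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_in, p_out)
    # Sample the upper triangle only, then mirror for a simple graph.
    upper = np.triu(rng.random((n_nodes, n_nodes)) < probs, k=1)
    A = (upper | upper.T).astype(int)
    return A, labels
```

Raising p_in relative to p_out raises the modularity of the planted clustering.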
Results for Synthetic Data (modularity = 0.375)
[Figure: NMI and Cost_spectral as functions of ω for θ = 1, 5, 50, compared against numerical vectors only (k-means) and network only (maximum modularity).]
- The best NMI (Normalized Mutual Information) is attained at some 0 < ω < 1.
- ω can be optimized using Cost_spectral.
Results for Synthetic Data (modularity = 0.450 and 0.525)
[Figure: NMI and Cost_spectral as functions of ω for θ = 1, 5, 50, compared against numerical vectors only (k-means) and network only (maximum modularity).]
- Again, the best NMI is attained at some 0 < ω < 1 and can be found using Cost_spectral.
Synthetic Data (Numerical Vectors) + Real Data (Gene Network)
- Gene network: constructed from the KEGG metabolic pathway; true clusters (#clusters = 10); numerical vectors with concentration θ = 10, 10^2, 10^3.
[Figure: true clusters vs. resulting clusters (ω = 0.5, θ = 10); NMI and Cost_spectral as functions of ω.]
- The best NMI is attained at some 0 < ω < 1 and ω can be optimized using Cost_spectral.
Summary
• Proposed a new spectral clustering method combining numerical vectors with a network
 - uses a global network property (normalized network modularity)
 - clustering can be optimized through the weight ω
• Performance confirmed experimentally
 - better than using the numerical vectors only or the network only
 - the weight ω was optimized successfully on synthetic and semi-real datasets
Thank you for your attention!
Our poster: #16
Spectral Representation of M_ω (concentration θ = 5, modularity = 0.375)
[Figure: nodes plotted in the eigenvector space (e1, e2, e3) for ω = 0, 0.3, 1.]
Cost J_ω = 0.0932, 0.0538, 0.0809, respectively.
Select ω by minimizing Cost_spectral: at that ω the clusters are most clearly separated.
Results for Real Genomic Data
• Numerical vectors: Hughes' expression data (Hughes, et al., Cell, 2000)
• Gene network: constructed using KEGG metabolic pathways (M. Kanehisa, et al., NAR, 2006)
[Figure: NMI and Cost_spectral as functions of ω for our method, Normalized cut, and Ratio cut.]
Evaluation Measure
Normalized Mutual Information (NMI) between the estimated clusters and the standard (ground-truth) clusters.
C: estimated clusters; G: standard clusters; H(C): entropy of the random variable C.
The more similar the clusterings C and G are, the larger the NMI.
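NMI can be computed from the entropies above. Normalizing the mutual information I(C;G) by the average of H(C) and H(G) is one standard variant; the slide does not show which normalization the paper uses, so treat that choice as an assumption.

```python
import numpy as np

def nmi(c, g):
    """Normalized Mutual Information between two labelings c and g
    (non-negative integer labels), using the (H(C)+H(G))/2 normalizer."""
    c, g = np.asarray(c), np.asarray(g)

    def entropy(x):
        p = np.unique(x, return_counts=True)[1] / len(x)
        return -(p * np.log(p)).sum()

    # Encode each (c_i, g_i) pair as a single label for the joint entropy.
    pairs = c.astype(np.int64) * (int(g.max()) + 1) + g
    h_c, h_g, h_cg = entropy(c), entropy(g), entropy(pairs)
    mi = h_c + h_g - h_cg                 # I(C; G)
    denom = (h_c + h_g) / 2
    return mi / denom if denom > 0 else 1.0
```

Identical clusterings (up to label permutation) give NMI = 1; independent ones give NMI near 0.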
Web Page Clustering
[Figure: each web page i has (1) a numerical vector of word frequencies n(A,i), n(B,i), ... for words A, B, ..., Z, and (2) a hyperlink network over the pages.]
To improve accuracy, combine the heterogeneous data.
Spectral Clustering for Graph Partitioning
L. Hagen et al., IEEE TCAD, 1992; J. Shi and J. Malik, IEEE PAMI, 2000.
- Ratio cut: minimize tr{Z^T L Z} subject to Z^T Z = I, where L = D - W is the unnormalized graph Laplacian.
- Normalized cut: minimize tr{Z^T L Z} subject to Z^T D Z = I, where D is the diagonal degree matrix.
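The standard spectral relaxations of both cuts can be sketched together: eigendecompose the (optionally symmetrically normalized) Laplacian and k-means the rows of the smallest-eigenvalue eigenvectors. The tiny k-means and its deterministic initialization are illustrative simplifications, not the original papers' procedures.

```python
import numpy as np

def graph_partition(A, K, normalized=False, n_iter=100):
    """Relaxed Ratio cut (normalized=False) / Normalized cut
    (normalized=True): embed nodes with the K smallest-eigenvalue
    eigenvectors of the graph Laplacian, then run a tiny k-means."""
    d = A.sum(1)
    L = np.diag(d) - A                            # unnormalized Laplacian
    if normalized:                                # symmetric normalization
        inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
        L = inv_sqrt @ L @ inv_sqrt
    vals, vecs = np.linalg.eigh(L)
    E = vecs[:, :K]
    # Deterministic init: spread centers along the last (Fiedler-like) column.
    order = np.argsort(E[:, -1])
    centers = E[order[np.linspace(0, len(E) - 1, K).astype(int)]]
    for _ in range(n_iter):
        dist = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        centers = np.array([E[labels == k].mean(0) if (labels == k).any()
                            else centers[k] for k in range(K)])
    return labels
```

On two triangles joined by a single bridge edge, both variants cut the bridge and recover the triangles.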