learning spectral clustering, with application to speech separation f. r. bach and m. i. jordan,...
TRANSCRIPT
Learning Spectral Clustering, With Application to Speech Separation
F. R. Bach and M. I. Jordan, JMLR 2006
Outline
Introduction
Normalized cuts -> Cost functions for spectral clustering
Learning the similarity matrix
Approximate scheme
Examples
Conclusions
Introduction 1/2
Traditional spectral clustering techniques:
– Assume a metric/similarity structure, then apply clustering algorithms.
– Manual feature selection and weighting are time-consuming.
Proposed method:
– A general framework for learning the similarity matrix for spectral clustering from data. Assume we are given data with known partitions, and want to build similarity matrices that will lead to these partitions in spectral clustering.
– Motivations:
Hand-labelled databases are available: image, speech.
Robustness to irrelevant features.
Introduction 2/2
What's new?
– Two cost functions J1(W, E) and J2(W, E), where W is a similarity matrix and E a partition.
Minimizing J1 over E yields new clustering algorithms;
minimizing J1 over W yields learning of the similarity matrix;
W is not necessarily positive semidefinite.
– A numerical approximation scheme is designed for the large-scale setting.
Spectral Clustering & NCuts 1/4
R-way Normalized Cuts
– Each data point is a node in a graph; the weight on the edge connecting two nodes is the similarity of those two points.
– The graph is partitioned into R disjoint clusters by minimizing the normalized cut cost function C(A, W):

  C(A, W) = Σ_{r=1..R} cut(A_r, V \ A_r) / cut(A_r, V)

V = {1,…,P} is the index set of all data points;
A = {A_r}, r ∈ {1,…,R}, with the union of the A_r equal to V;
cut(A, B) = Σ_{i∈A, j∈B} W_ij is the total weight between A and B;
the normalizing term cut(A_r, V) penalizes unbalanced partitions.
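The cost above transcribes directly into code. A minimal sketch, where `labels` encodes the partition A as cluster indices:

```python
import numpy as np

def normalized_cut(W, labels, R):
    """R-way normalized cut: sum over clusters r of
    cut(A_r, V \\ A_r) / cut(A_r, V)."""
    cost = 0.0
    for r in range(R):
        in_r = labels == r
        cut_out = W[in_r][:, ~in_r].sum()   # weight leaving cluster r
        cut_all = W[in_r].sum()             # normalizing term cut(A_r, V)
        cost += cut_out / cut_all
    return cost

# A tight 3-point clique weakly linked to a singleton: cutting the weak
# link costs less than cutting through the clique.
W = np.array([[0., 1., 1., 0.1],
              [1., 0., 1., 0.1],
              [1., 1., 0., 0.1],
              [0.1, 0.1, 0.1, 0.]])
good = normalized_cut(W, np.array([0, 0, 0, 1]), 2)
bad = normalized_cut(W, np.array([0, 0, 1, 1]), 2)
```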
Spectral Clustering & NCuts 2/4
– Another form of NCuts. Let E = (e_1, …, e_R), where e_r is the P-by-1 indicator vector of the r-th cluster, and let D = Diag(W1) be the diagonal degree matrix:

  C(W, E) = Σ_{r=1..R} e_r^T (D − W) e_r / (e_r^T D e_r)

subject to the constraint (a) that E is a valid partition-indicator matrix.
Spectral Relaxation
– Removing the constraint (a), the relaxed optimization problem is solved by an eigendecomposition: the relaxed solution is spanned by the R leading eigenvectors U of D^{-1/2} W D^{-1/2}.
– The relaxed solutions generally are not piecewise constant, so they have to be projected back to the subset defined by (a).
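The relaxation step can be sketched as follows; on a perfectly block-structured similarity matrix the relaxed solution happens to be piecewise constant, which makes the behavior easy to check:

```python
import numpy as np

def relaxed_solution(W, R):
    """Relaxed R-way NCut solution: the R leading eigenvectors U of the
    normalized matrix D^{-1/2} W D^{-1/2}, mapped back through D^{-1/2}.
    In general the result is not piecewise constant."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    M = D_inv_sqrt @ W @ D_inv_sqrt
    # np.linalg.eigh returns eigenvalues in ascending order
    _, vecs = np.linalg.eigh(M)
    U = vecs[:, -R:]                 # R leading eigenvectors
    return D_inv_sqrt @ U

# Two disconnected pairs: rows of Y coincide within each cluster.
W = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
Y = relaxed_solution(W, 2)
```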
Spectral Clustering & NCuts 3/4
Rounding
– Minimize a metric between the relaxed solution and the entire set of discrete allowed solutions.
Relaxed solution: Y_eig, whose columns span the R leading eigenvectors U.
Desired solution: a piecewise-constant matrix built from a partition E.
– To compare the subspaces spanned by their columns, compare the orthogonal projection operators on those subspaces, i.e., the Frobenius norm between Y_eig Y_eig^T = U U^T and the projection onto the span of D^{1/2} E.
– The cost function is given as

  J1(W, E) = (1/2) || U U^T − D^{1/2} E (E^T D E)^{-1} E^T D^{1/2} ||_F^2
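A numerical sketch of this comparison of projections, assuming the discrete-side projection is onto the span of D^{1/2}E (the D-weighted indicator subspace):

```python
import numpy as np

def j1_cost(W, E, R):
    """Half the squared Frobenius distance between the projection onto
    the R leading eigenvectors U of D^{-1/2} W D^{-1/2} and the
    projection onto the span of D^{1/2} E."""
    d = W.sum(axis=1)
    Dm = np.diag(d ** -0.5)
    Dp = np.diag(d ** 0.5)
    _, vecs = np.linalg.eigh(Dm @ W @ Dm)
    U = vecs[:, -R:]                        # leading eigenvectors
    Z = Dp @ E                              # D^{1/2} E
    Pi_E = Z @ np.linalg.inv(Z.T @ Z) @ Z.T # projection onto span(Z)
    return 0.5 * np.linalg.norm(U @ U.T - Pi_E, 'fro') ** 2

# The true partition of two disconnected pairs has zero cost; a
# mismatched partition does not.
W = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
E_true = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
E_bad = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
```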
Spectral Clustering & NCuts 4/4
Spectral clustering algorithms
– A variational form of the cost function rewrites J1(W, E) as a weighted distortion.
– A weighted K-means algorithm can then be used to solve the minimization over E.
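The minimization step can be sketched as a generic weighted K-means; the choice of embedded points X and weights w here is an assumption of the sketch (in the relaxation they would come from the eigenvectors and the degrees d_p):

```python
import numpy as np

def weighted_kmeans(X, w, R, n_iter=50, seed=0):
    """Weighted K-means: points are assigned to the nearest centroid,
    and centroids are recomputed as weight-weighted means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), R, replace=False)]
    for _ in range(n_iter):
        # assign each point to its closest centroid
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        # recompute centroids as weighted means of their members
        for r in range(R):
            mask = labels == r
            if mask.any():
                centers[r] = np.average(X[mask], axis=0, weights=w[mask])
    return labels

# Two well-separated pairs of embedded points, unit weights.
X = np.array([[0., 0.], [0.1, 0.], [5., 5.], [5.1, 5.]])
labels = weighted_kmeans(X, np.ones(4), 2)
```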
Learning the Similarity Matrix 1/2
Objective
– Assume a known partition E and a parametric form for W; learn parameters that generalize to unseen data sets.
Naïve approach
– Minimize the distance between the true E and the output of the spectral clustering algorithm (a function of W).
– Hard to optimize because the cost function is not continuous.
Cost functions as upper bounds of the naïve cost function
– Minimizing the cost functions J1(W, E) and J2(W, E) is equivalent to minimizing an upper bound on the true cost function.
Learning the Similarity Matrix 2/2
Algorithms
– Given N data sets D_n, where each D_n is composed of P_n points;
– each data set is segmented, i.e., has a known partition E_n;
– the cost function H(α) averages the spectral cost over the data sets and adds an L1 penalty on the parameters α;
– the L1 norm performs feature selection;
– the steepest descent method is used to minimize H(α) w.r.t. α.
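The descent step can be sketched as follows. The quadratic `loss` below is a hypothetical stand-in for the averaged spectral cost (it depends only on the first parameter), used only to show the L1 penalty driving an irrelevant parameter toward zero:

```python
import numpy as np

def l1_steepest_descent(loss, alpha0, lam=0.1, lr=0.05, n_steps=200):
    """Sketch of minimizing H(alpha) = loss(alpha) + lam * ||alpha||_1
    by steepest descent, with a finite-difference gradient for the
    smooth part and the L1 subgradient for the penalty."""
    alpha = alpha0.astype(float).copy()
    eps = 1e-5
    for _ in range(n_steps):
        g = np.zeros_like(alpha)
        for k in range(len(alpha)):   # central finite differences
            e = np.zeros_like(alpha)
            e[k] = eps
            g[k] = (loss(alpha + e) - loss(alpha - e)) / (2 * eps)
        alpha -= lr * (g + lam * np.sign(alpha))
    return alpha

# The second parameter never affects the loss, so the L1 term shrinks
# it to (near) zero: feature selection.
loss = lambda a: (a[0] - 1.0) ** 2
alpha = l1_steepest_descent(loss, np.array([0.0, 1.0]))
```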
Approximation Scheme
Low-rank nonnegative decomposition
– Approximate each column of W by a linear combination of a set of randomly chosen columns (indexed by I): w_j = Σ_{i∈I} H_ij w_i, j ∈ J.
– H is chosen so that the approximation error is minimal.
– Decomposition:
Randomly select a set of columns (I).
Approximate W(I, J) as W(I, I) H.
Approximate W(J, J) as W(J, I) H + H^T W(I, J).
– Complexity:
The storage requirement is O(MP), where M is the number of selected columns.
The overall complexity is O(M^2 P).
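A simplified sketch of the column-sampling idea, using plain least squares for H (the scheme above additionally constrains the decomposition to be nonnegative and combines the W(J, I)H and H^T W(I, J) terms for the W(J, J) block):

```python
import numpy as np

def lowrank_column_approx(W, I):
    """Approximate the unsampled columns J of W as combinations of the
    sampled columns I: W(:, J) ~ W(:, I) H, where H solves
    W(I, I) H = W(I, J) in the least-squares sense."""
    P = W.shape[0]
    J = np.setdiff1d(np.arange(P), I)
    H, *_ = np.linalg.lstsq(W[np.ix_(I, I)], W[np.ix_(I, J)], rcond=None)
    W_approx = W.copy()
    W_approx[:, J] = W[:, I] @ H     # only W(:, I) and H need be stored
    return W_approx

# For a similarity matrix of exact rank M, M well-chosen columns
# reconstruct W exactly.
A = np.array([[1., 0.], [0., 1.], [1., 1.], [2., 1.], [1., 2.], [3., 1.]])
W = A @ A.T                          # rank-2 PSD similarity matrix
W_hat = lowrank_column_approx(W, np.array([0, 1]))
```

Storing only W(:, I) and H gives the O(MP) memory footprint quoted on the slide.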
Line Drawings (figure slides)
– Training set 1: favors connectedness.
– Training set 2: favors direction continuity.
– Examples of test segmentations are shown for models trained with Training set 1 and with Training set 2.
Conclusions
Two sets of algorithms are presented: one for spectral clustering and one for learning the similarity matrix.
Minimization of a single cost function with respect to its two arguments leads to these algorithms.
The approximation scheme is efficient.
The new approach is more robust to irrelevant features than current methods.