
A Topographical Nonnegative Matrix Factorization Algorithm

Rogovschi Nicoleta
LIPADE, Paris Descartes University
45, rue des Saints Pères, 75006 Paris, France
nicoleta.rogovschi@parisdescartes.fr

Lazhar Labiod
LIPADE, Paris Descartes University
45, rue des Saints Pères, 75006 Paris, France
lazhar.labiod@parisdescartes.fr

Mohamed Nadif
LIPADE, Paris Descartes University
45, rue des Saints Pères, 75006 Paris, France
mohamed.nadif@parisdescartes.fr

Abstract—We explore in this paper a novel topological organization algorithm for data clustering and visualization named TPNMF. It leads to a clustering of the data, as well as the projection of the clusters on a two-dimensional grid, while preserving the topological order of the initial data. The proposed algorithm is based on an NMF (Nonnegative Matrix Factorization) formalism using a neighborhood function which takes into account the topological order of the data. TPNMF was validated on various real datasets, and the experimental results show a good quality of topological ordering and homogeneous clustering.

I. INTRODUCTION

Clustering has received a significant amount of attention as an important problem with many applications, and a number of different algorithms and methods have emerged over the years. Recently, the use of NMF for partitional clustering of nonnegative data has attracted much interest. Some examples can be found in [1], [2], [3], [4]. The popularity of NMF increased significantly after Lee and Seung published simple multiplicative NMF algorithms which they applied to image data [5], [6]. At present, NMF and its variants have found a wide spectrum of applications in areas such as pattern recognition and feature extraction [7], [5], [8], dimensionality reduction, segmentation and clustering [9], [10], [11], language modeling, text mining [3], [12], music transcription [13], and neurobiology (gene separation) [2].

The concept of matrix factorization is used in a wide range of important applications, and each matrix factorization makes a different assumption about the components (factors) of the matrices and their underlying structures; this choice is an essential step in each application domain. Very often, the datasets to be analyzed are nonnegative, and sometimes they also have a sparse representation. In machine learning, sparseness is closely related to feature selection and certain generalizations in learning algorithms, while nonnegativity relates to probability distributions. Ding et al. [14] showed the equivalence between NMF, spectral clustering and K-means clustering. Zass and Shashua [15] demonstrated that spectral clustering, normalized cuts, and kernel K-means are particular cases of clustering with nonnegative matrix factorization under a doubly stochastic constraint. They also considered a symmetric matrix decomposition under nonnegativity constraints similar to the one formulated by Ding et al.

However, in [14] the optimization strategy leads to different multiplicative update rules. The analysis of NMF versus K-means clustering was recently discussed by Kim and Park [16], who proposed a Sparse NMF (SNMF) algorithm for data clustering. Their algorithm outperforms K-means and ordinary NMF in terms of the consistency of the results. When the data to be clustered are not constrained to be nonnegative, we may scale the data accordingly, or another version of NMF, namely convex NMF [17], can be used. For document clustering, a typical method is Latent Semantic Indexing (LSI) [18], which involves a Singular Value Decomposition (SVD) of the term-document matrix. Ding et al. [9] explored the relationship between NMF and Probabilistic LSI (PLSI), concluding that hybrid connections of NMF and PLSI give the best results. Gaussier and Goutte [19] analyzed NMF with respect to Probabilistic Latent Semantic Analysis (PLSA) [20]; they claimed that PLSA solves NMF with the KL I-divergence, and that for this cost function PLSA provides better consistency. A comparison of several NMF algorithms on various databases was performed by Li and Ding [21]. They concluded that NMF algorithms generally give better performance than K-means. In fact, the NMF approach is roughly equivalent to soft K-means, and PLSI usually gives the same results as NMF.

In this study, we focus on reducing the dimensions of the feature space as part of unsupervised learning through matrix factorization, and on visualizing the results of the clustering in a low-dimensional space. Topological learning is one of the best-known techniques that allow clustering and visualization simultaneously. It is a recent direction in machine learning which aims to develop methods grounded in statistics to recover topological invariants from the observed data points. Most existing topological learning approaches are based on graph theory or graph-based clustering methods. At the end of the topographic learning, similar data are collected in clusters, which correspond to sets of similar observations. These clusters can be represented by information more concise than the raw listing of their patterns, such as their center of gravity or various statistical moments. As expected, this information is easier to manipulate than the original data points.

In this paper we propose a new approach called TPNMF (Topographical Projective NMF). The proposed model achieves data clustering and visualization simultaneously; indeed, it automatically provides a natural partition and a self-organization of the clusters on a two-dimensional map while preserving the a priori topological structure of the data (i.e., two close clusters on the map consist of close observations in the input space).

The rest of the paper is organized as follows. Section 2 introduces the formalism of the traditional K-means algorithm in algebraic terms. Section 3 describes the NMF framework for K-means. Section 4 introduces our notation and describes the topographical clustering model, providing details on the proposed TPNMF algorithm. The results obtained on real datasets are presented in Section 5. Finally, the conclusion summarizes the advantages of our contribution.

II. K-MEANS

Given a data matrix $A = (a_{ij}) \in \mathbb{R}_+^{N \times M}$, the aim of clustering is to cluster the rows or the columns of $A$ so as to optimize the difference between $A$ and the clustered matrix revealing significant block structure. More formally, we seek to partition the set of rows $I = \{1, \ldots, N\}$ into $K$ clusters $C = \{C_1, \ldots, C_K\}$. The partitioning naturally induces a clustering index matrix $R = (r_{ik}) \in \mathbb{R}_+^{N \times K}$, defined as a binary classification matrix such that $\sum_{k=1}^{K} r_{ik} = 1$. Specifically, we have $r_{ik} = 1$ if the row $a_i \in C_k$, and $0$ otherwise. On the other hand, we denote by $S = (s_{kj}) \in \mathbb{R}_+^{K \times M}$ a reduced matrix specifying the cluster representation.

The detection of homogeneous clusters of objects can be achieved by looking for the two matrices $R$ and $S$ minimizing the total squared residue measure

$$J_{kmeans} = \sum_{k} \sum_{i \,|\, r_{ik}=1} \|a_i - s_k\|^2.$$

In matrix form,

$$J_{kmeans} = \mathcal{J}(A, RS) = \|A - RS\|^2. \qquad (1)$$

The term $RS$ characterizes the information of $A$ that can be described by the cluster structures.
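To make the matrix form concrete, here is a minimal NumPy sketch (variable names are ours, not the paper's) that builds the binary index matrix R from a label vector, computes the summary S as the matrix of cluster means, and evaluates J(A, RS):

```python
import numpy as np

def kmeans_objective(A, labels, K):
    """Evaluate J(A, RS) = ||A - RS||^2 for a given hard partition."""
    N = A.shape[0]
    R = np.zeros((N, K))                       # binary clustering index matrix
    R[np.arange(N), labels] = 1.0              # r_ik = 1 iff row a_i is in C_k
    S = (R.T @ A) / R.sum(axis=0)[:, None]     # rows of S are the cluster means
    return np.linalg.norm(A - R @ S) ** 2

A = np.abs(np.random.randn(8, 5))              # toy nonnegative data
print(kmeans_objective(A, np.array([0, 0, 1, 1, 1, 2, 2, 2]), K=3))
```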

III. NMF FRAMEWORK FOR K-MEANS

We can view K-means as a constrained NMF problem. The detection of homogeneous blocks in $A$ can be achieved by looking for the two matrices $R$ and $S$ minimizing the total squared residue measure

$$\mathcal{J}(A, RS) = \|A - RS\|^2, \qquad (2)$$

where $R \in \mathbb{R}_+^{N \times K}$ is the clustering index matrix of $A$ while $S \in \mathbb{R}_+^{K \times M}$ is its summary.

A. Classical NMF

The clustering problem can be formulated as a matrix approximation problem where the aim is to minimize the approximation error between the original data $A$ and the matrix reconstructed from the cluster structures. For nonnegative data $A$, if we relax the binary constraint on $R$ and only require the nonnegativity of $R$ and $S$, we obtain the NMF model proposed by Lee and Seung [6]:

$$\min_{R, S \ge 0} \|A - RS\|^2. \qquad (3)$$
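For reference, a minimal sketch of the Lee-Seung multiplicative updates that minimize (3); the small epsilon guarding the denominators is our addition for numerical safety, not part of the original formulation:

```python
import numpy as np

def nmf(A, K, n_iter=200, eps=1e-10):
    """Lee-Seung multiplicative updates for min_{R,S >= 0} ||A - RS||^2."""
    rng = np.random.default_rng(0)
    R = rng.random((A.shape[0], K))
    S = rng.random((K, A.shape[1]))
    for _ in range(n_iter):
        R *= (A @ S.T) / (R @ (S @ S.T) + eps)   # update the index factor
        S *= (R.T @ A) / ((R.T @ R) @ S + eps)   # update the summary factor
    return R, S
```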

B. Projective NMF

In this subsection, a projective nonnegative matrix factorization (PNMF) is derived from the K-means objective; K-means is thus shown to be an algebraic optimization problem under suitable constraints. For a fixed clustering, the matrix summary $S$ can be expressed as $S = D_r^{-1} R^{\top} A$, where $D_r^{-1} \in \mathbb{R}^{K \times K}$ is a diagonal matrix defined as

$$D_r^{-1} = \mathrm{diag}^{-1}(R^{\top}\mathbf{1}),$$

where $\mathbf{1}$ is a vector of appropriate dimension whose entries are all equal to one. Plugging $S$ into the objective function (2), the expression to optimize becomes

$$\|A - R D_r^{-1} R^{\top} A\|^2 = \|A - \tilde{R}\tilde{R}^{\top} A\|^2, \qquad (4)$$

where $\tilde{R} = R D_r^{-1/2}$ (hereafter we simply write $R$ for this normalized matrix). Note that this formulation holds even if $A$ is not nonnegative, i.e., if $A$ has mixed-sign entries.
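As a quick sanity check of the identity $S = D_r^{-1} R^{\top} A$, the following sketch (illustrative names) verifies that it reproduces the cluster means computed directly:

```python
import numpy as np

A = np.abs(np.random.randn(6, 4))
R = np.zeros((6, 3))
R[np.arange(6), [0, 0, 1, 1, 2, 2]] = 1.0     # a hard 3-cluster partition

Dr_inv = np.diag(1.0 / (R.T @ np.ones(6)))    # D_r^{-1} = diag(R^T 1)^{-1}
S = Dr_inv @ R.T @ A                          # matrix form of the cluster means

for k in range(3):
    assert np.allclose(S[k], A[R[:, k] == 1].mean(axis=0))
```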

In order to solve this optimization problem we need to spell out proper constraints on $R$, reported in Table I. These properties of $R$ can be easily proved.

TABLE I
PROPERTIES OF $R$

Non-negativity:    $R \ge 0$
Orthonormality:    $\|R_k\|^2 = 1$
Orthogonality:     $R^{\top}R = I_K$
Bi-stochasticity:  $RR^{\top}\mathbf{1} = \mathbf{1}$
Trace:             $\mathrm{Trace}(RR^{\top}) = K$
Norm:              $\|RR^{\top}\|^2 = K$
Idempotence:       $(RR^{\top})^2 = RR^{\top}$

This new formulation of the K-means objective function highlights some useful properties of the matrix $R$. These properties are rich in opportunities; indeed, taking these constraints on $R$ into account allows us to develop different variants of NMF algorithms.

Given a nonnegative matrix $A$, with respect to the nonnegativity of $R$ the minimization of (4) leads to the following update rule:

$$R \leftarrow R \odot \frac{2\,AA^{\top}R}{RR^{\top}AA^{\top}R + AA^{\top}RR^{\top}R}, \qquad (5)$$

where $\odot$ and the fraction bar denote element-wise multiplication and division. This multiplicative update formula is similar to the one obtained with projective nonnegative matrix factorization [22].
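A direct NumPy transcription of update (5) could look as follows; precomputing $AA^{\top}$ and adding an epsilon are implementation choices of ours:

```python
import numpy as np

def pnmf(A, K, n_iter=100, eps=1e-10):
    """Multiplicative update (5) for min_{R >= 0} ||A - R R^T A||^2."""
    AAt = A @ A.T                                   # precompute A A^T (N x N)
    R = np.random.default_rng(0).random((A.shape[0], K))
    for _ in range(n_iter):
        numer = 2.0 * AAt @ R                       # 2 A A^T R
        denom = (R @ (R.T @ AAt @ R)                # R R^T A A^T R
                 + AAt @ R @ (R.T @ R) + eps)       # + A A^T R R^T R
        R *= numer / denom
    return R
```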


We can also show the connection between NMF and PNMF as follows:

$$\min_{R, S \ge 0} \|A - RS\|^2 \;\Leftrightarrow\; \min_{R \ge 0,\, S = R^{\top}A} \|A - RR^{\top}A\|^2. \qquad (6)$$

IV. TPNMF: TOPOGRAPHIC PROJECTIVE NMF

TPNMF incorporates neighborhood connections between PNMF basis functions arranged on a 2D topographic map; the objective to optimize becomes

$$\|A - RHR^{\top}A\|^2, \qquad (7)$$

where $A$ and $R$ are a nonnegative input matrix and a nonnegative coefficient matrix, respectively, as in the original NMF. The new term $H = (h_{rs})$ is a $K \times K$ nonnegative matrix that defines neighborhood connections between the $K$ basis functions. Choosing $H$ as the identity matrix reduces TPNMF to PNMF. We arrange the basis functions on a two-dimensional square-lattice topographic map, and set the neighborhood connection weights to be Gaussian functions on the map. The model consists of a discrete set $C$ of cells called the "map". This map has a discrete topology defined by an undirected graph, which usually is a regular grid in two dimensions. For each pair of cells $(r, s)$ on the map, the distance $\delta(r, s)$ is defined as the length of the shortest chain linking cells $r$ and $s$ on the grid. In order to control the neighborhood area, we introduce a positive kernel function $h$ ($h \ge 0$ and $\lim_{|y| \to \infty} h(y) = 0$) and define the mutual influence of two cells $r$ and $s$ by $h_{r,s}$. In practice, as for traditional topological maps, we use a smooth function to control the size of the neighborhood:

$$H = (h_{r,s}) = \exp\left(-\frac{\delta(r, s)}{T}\right).$$
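A sketch of building $H$ for a $w \times w$ map; we assume 4-neighbor grid connectivity, under which the shortest-chain distance $\delta(r, s)$ reduces to the Manhattan distance between cell coordinates:

```python
import numpy as np

def neighborhood_matrix(w, T):
    """H = exp(-delta(r, s) / T) on a w x w square-lattice map (K = w*w cells)."""
    coords = np.array([(i, j) for i in range(w) for j in range(w)])
    # shortest-chain (here: Manhattan) distance between every pair of cells
    delta = np.abs(coords[:, None, :] - coords[None, :, :]).sum(axis=-1)
    return np.exp(-delta / T)

H = neighborhood_matrix(5, T=2.0)   # a 25-cell map, as used in Section V-B
```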

Using this kernel function, $T$ becomes a parameter of the model and, as in the Kohonen [23] algorithm, we decrease $T$ from an initial value $T_{\max}$ to a final value $T_{\min}$. The minimization of (7) leads to the following update rule:

$$R \leftarrow R \odot \frac{2\,AA^{\top}RH}{RHR^{\top}AA^{\top}RH + AA^{\top}RHR^{\top}RH}. \qquad (8)$$

The pseudo-code of the proposed algorithm is given in Algorithm 1.

Algorithm 1: TPNMF
Input: data matrix $A \in \mathbb{R}_+^{N \times M}$ and $K \le \min(N, M)$.
Output: $R$, $H$.
Initialize: select a random nonnegative $R \in \mathbb{R}_+^{N \times K}$ and $H \in \mathbb{R}_+^{K \times K}$; choose $T_{\max}$, $T_{\min}$ and $N_{iter}$; set $t = 0$.
repeat
  $T = T_{\max}\left(T_{\min}/T_{\max}\right)^{t/(N_{iter}-1)}$  (9)
  $H = (h_{r,s}) = \exp\left(-\delta(r, s)/T\right)$  (10)
  $R \leftarrow R \odot \dfrac{2\,AA^{\top}RH}{RHR^{\top}AA^{\top}RH + AA^{\top}RHR^{\top}RH}$  (11)
  $t \leftarrow t + 1$
until stabilization of $R$ or $t > N_{iter}$.
Classification step: for $i = 1, \ldots, N$, assign each $a_i$ to the cluster $k = \arg\max_{k'} R_{ik'}$, $k' = 1, \ldots, K$.
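Putting the pieces together, a minimal sketch of Algorithm 1, reusing neighborhood_matrix from the earlier sketch; the epsilon, the fixed random seed, and the fixed iteration count (no early stabilization test) are our simplifications:

```python
import numpy as np

def tpnmf(A, w, T_max=10.0, T_min=0.1, n_iter=100, eps=1e-10):
    """Sketch of Algorithm 1 (TPNMF) on a w x w map, so K = w*w."""
    N = A.shape[0]
    AAt = A @ A.T
    R = np.random.default_rng(0).random((N, w * w))
    for t in range(n_iter):
        T = T_max * (T_min / T_max) ** (t / (n_iter - 1))   # schedule (9)
        H = neighborhood_matrix(w, T)                       # kernel (10)
        RH = R @ H
        numer = 2.0 * AAt @ RH                              # update (11)
        denom = RH @ (R.T @ AAt @ RH) + AAt @ RH @ (R.T @ RH) + eps
        R *= numer / denom
    return R, R.argmax(axis=1)    # classification step: k = argmax_k' R_ik'
```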

V. NUMERICAL EXPERIMENTS

To evaluate the quality of clustering, we adopt the approach of comparing the results to a "ground truth", using clustering accuracy to measure the clustering results; this is a common approach in the general area of data clustering. In general, the result of clustering is usually assessed on the basis of some external knowledge about how clusters should be structured. This may imply evaluating separation, density, connectedness, and so on. The only way to assess the usefulness of a clustering result is indirect validation, whereby clusters are applied to the solution of a problem and the correctness is evaluated against objective external knowledge. This procedure is defined by [24] as "validating clustering by extrinsic classification", and has been followed in many other studies [25], [26]. We feel that this approach is reasonable if we do not want to judge clustering results by some cluster validity index, which is nothing but a bias toward some preferred cluster property (e.g., compact, well separated, or connected). Thus, to adopt this approach we need labelled data sets, where the external (extrinsic) knowledge is the class information provided by the labels. Hence, if TPNMF finds significant clusters in the data, these will be reflected by the distribution of classes. Therefore we apply a vote step to the clusters and compare them with methods from the literature. The vote step, a sketch of which is given at the end of this subsection, consists in the following. For each cluster $c_k \in C$:
• Count the number of observations of each class $\ell$ (call it $N_{k\ell}$).

• Count the total number of observations assigned to cell $k$ (call it $N_k$).
• Compute the proportion of observations of each class (call it $S_{k\ell} = N_{k\ell} / N_k$).
• Assign to the cluster the label of the most represented class: $\ell^* = \arg\max_{\ell} S_{k\ell}$.

A cluster $k$ for which $S_{k\ell} = 1$ for some class label $\ell$ is usually termed a "pure" cluster, and a purity measure can be expressed as the percentage of elements of the assigned class in a cluster. The experimental results are then expressed as the fraction of observations falling in clusters labelled with a class different from that of the observation; this quantity is expressed as a percentage and termed the "error percentage" (indicated as Err% in the results). Regarding the evaluation method, we chose not to perform cross-validation or similar procedures, considering that the algorithm is trained in a completely unsupervised manner and that calibration already occurs (in a sense) on an external validation set, namely the set of class labels. Cross-validation or resampling methods, however, could be very useful to assess the stability of the proposed method, by comparing clustering structures in repeated experiments.
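A sketch of the vote step and the resulting Err% and purity computation, with cluster assignments and true classes given as integer arrays (names are illustrative):

```python
import numpy as np

def vote_and_error(clusters, classes, K):
    """Majority-vote labelling of clusters and the resulting error percentage."""
    errors = 0
    for k in range(K):
        members = classes[clusters == k]        # classes of observations in c_k
        if members.size == 0:                   # skip empty clusters
            continue
        counts = np.bincount(members)           # N_kl for each class l
        errors += members.size - counts.max()   # observations not matching l*
    err_pct = 100.0 * errors / classes.size
    return err_pct, 100.0 - err_pct             # (Err%, overall purity %)
```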

A. Textual datasets

In order to compare the performance of TPNMF with that of other traditional unsupervised clustering algorithms, we use several text datasets, which represent the frequency of words in documents.

We used seven datasets for document clustering. "Classic30", "Classic150", "Classic300" and "Classic400" are extracts of Classic3 [27], which contains three classes denoted Medline, Cisi and Cranfield after their original database sources; Classic30, for instance, consists of 30 random documents and Classic150 of 150 random documents, with the corresponding vocabulary sizes given in Table II. Tr11 and Tr12 were extracted from the Cluto toolkit. Finally, NG5 (5 classes) is a subset of the 20-Newsgroups data NG20 concerning, among other topics, talk.politics.mideast and talk.politics.misc. A short description of these datasets is presented in Table II. Note that the normalized cut weighting defined by [28] is applied to the data before applying the clustering algorithms. To compute the quality of the performed clustering we adopted an evaluation approach which uses external knowledge (the class information provided by labels); thus we use the purity index to evaluate the results of the document clustering. We compared our method with the Spherical K-means [29] and SOM [30] approaches. Table III presents the performance obtained by our method; we observe an improvement of the purity on all the databases.

TABLE II
DESCRIPTION OF THE DATABASES USED FOR THE EVALUATION; # DENOTES THE CARDINALITY.

Database      # Documents   # Words   # Classes
Classic30              30      1073           3
Classic150            150      3625           3
Classic300            300      5577           3
Classic400            400      6205           3
NG5                   878      7453           5
Tr11                  414      6424           5
Tr12                  313      5799           5

TABLE III
COMPARISON OF THE PURITY INDEX (%) FOR SPHERICAL K-MEANS, SOM AND TPNMF METHODS

Database      Spherical K-means   Map size   SOM     TPNMF
Classic30               83.31     (2 × 3)    83.33   87
Classic150              90        (3 × 3)    90.24   94
Classic300              93.66     (5 × 6)    82.4    94.33
Classic400              82        (5 × 5)    85.16   98.5
NG5                     56.91     (6 × 7)    67.31   69.3
Tr11                    53.86     (4 × 4)    55.42   57.8
Tr12                    58.14     (7 × 6)    54.61   59.2

B. Facial images data

To illustrate the visualization of the clusters, we apply our algorithm to the FERET database of facial images [31]. After face segmentation, 2409 frontal facial images (poses fa and fb) of 867 subjects were stored in the database for the experiments. For this study we obtained the coordinates of the eyes from the ground-truth data of the FERET collection and calibrated the head rotation so that all faces are upright. All face boxes were normalized to the size of 32 × 32 pixels, with fixed locations for the left eye (26, 9) and the right eye (7, 9).

In order to analyze the convergence of our algorithm, we plot in Figure 1 the GOF (goodness of fit). As can be seen, after only a few iterations our algorithm converges quickly to the optimal solution.
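The paper does not spell out how the GOF is computed; one common definition is the share of the data norm explained by the reconstruction, sketched here under that assumption:

```python
import numpy as np

def goodness_of_fit(A, R, H):
    """GOF as the explained share of ||A||^2 (an assumed definition,
    not stated explicitly in the paper)."""
    recon = R @ H @ (R.T @ A)                  # TPNMF reconstruction of A
    return 1.0 - np.linalg.norm(A - recon) ** 2 / np.linalg.norm(A) ** 2
```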

Fig. 1. Convergence of our algorithm.

Our results on this dataset are presented in Figure 2, where each image corresponds to the most representative object of its cell, obtained through the classification step. We can observe on this map that similar cells are grouped together. For instance, in the upper left corner we can observe individuals with a smiling facial expression, characterized by prominent cheekbones and a large nose. In the bottom right corner of the map, we can observe a cluster which represents men with mustaches. The same analysis can be made for the rest of the "groups" of the map. Figure 3 represents the results obtained by classical NMF on the FERET database, each image corresponding to the most representative one for each cell. In order to compare classical NMF with our approach we chose the same parameters, i.e., the same number of clusters (25 cells). Compared to TPNMF, as we can see in Figure 3, there is clearly no topological link between the images (cells). Furthermore, through different means of visualization, the TPNMF algorithm provides various kinds of information that could be used in practical applications. In order to analyze the clustering capabilities of our approach, we can also visualize the images captured by each cell. An example of the images captured by cells 11 and 20 is presented in Figure 4. We can see that cell 11 captured similar images, representing a smiling facial expression, while cell 20 captured images with a more serious facial expression.

VI. CONCLUSION

In this paper we proposed a new approach to topological learning in an NMF style. This approach was derived from the cost function of K-means.

We started by presenting K-means as a problem of algebraic optimization under certain constraints. Afterwards, we introduced a neighborhood function which allows preserving locality between data points. The proposed TPNMF algorithm does not involve an explicit orthogonality step, yet the resulting matrix $R$ exhibits high sparsity, locality and orthogonality. All the topographic visualizations show that the topological order obtained can be used for a meaningful interpretation of the TPNMF clustering.

As future work, the different constraints on $R$ should be considered according to this new formulation of the objective function. In fact, each time the constraints are changed, another model is obtained, and it will be interesting to explore and analyze the effect of each model on clustering and topology.

REFERENCES

[1] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, pp. 1705–1749, 2005.

[2] H. Cho, I. Dhillon, Y. Guan, and S. Sra, "Minimum sum-squared residue based co-clustering of gene expression data," in Proc. 4th SIAM International Conference on Data Mining (SDM), Florida, 2004, pp. 114–125.

[3] F. Shahnaz, M. Berry, P. Pauca, and R. Plemmons, "Document clustering using non-negative matrix factorization," Information Processing and Management, vol. 42, pp. 373–386, 2006.

[4] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley, 2009.

[5] D. Lee and H. S. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, pp. 788–791, 1999.

[6] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, vol. 13, 2001, pp. 556–562.

[7] W. Liu and N. Zheng, "Non-negative matrix factorization based methods for object recognition," Pattern Recogn. Lett., vol. 25, no. 8, pp. 893–897, Jun. 2004. [Online]. Available: http://dx.doi.org/10.1016/j.patrec.2004.02.002

[8] M. W. Spratling, "Learning image components for object recognition," J. Mach. Learn. Res., vol. 7, pp. 793–815, Dec. 2006. [Online]. Available: http://dl.acm.org/citation.cfm?id=1248547.1248575

[9] C. Ding, T. Li, and W. Peng, "Nonnegative matrix factorization and probabilistic latent semantic indexing: Equivalence, chi-square statistic, and a hybrid method," in Proc. of AAAI National Conf. on Artificial Intelligence, 2006, pp. 137–143.

[10] P. Carmona-Saez, R. D. Pascual-Marqui, F. Tirado, J. M. Carazo, and A. Pascual-Montano, "Biclustering of gene expression data by non-smooth nonnegative matrix factorization," BMC Bioinformatics, vol. 7(78), pp. 1–18, 2006.

[11] D. Guillamet, J. Vitrià, and B. Schiele, "Introducing a weighted nonnegative matrix factorization for image classification," Pattern Recognition Letters, vol. 24(14), pp. 2447–2454, 2003.

[12] F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons, "Document clustering using nonnegative matrix factorization," Information Processing and Management, vol. 42, no. 2, pp. 373–386, 2006.

[13] P. Smaragdis and J. C. Brown, "Non-negative matrix factorization for polyphonic music transcription," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003.

[14] C. Ding, X. He, and H. D. Simon, "On the equivalence of nonnegative matrix factorization and spectral clustering," in Proceedings of the SIAM Data Mining Conference, 2005, pp. 606–610.

[15] R. Zass and A. Shashua, "A unifying approach to hard and probabilistic clustering," in 10th IEEE International Conference on Computer Vision (ICCV 2005), Beijing, China. IEEE Computer Society, 2005, pp. 294–301.

[16] J. Kim and H. Park, "Sparse nonnegative matrix factorization for clustering," Technical report, Georgia Institute of Technology, 2008.

[17] C. Ding, T. Li, and M. I. Jordan, "Convex and semi-nonnegative matrix factorizations," Tech. Rep., 2006.

[18] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.

[19] E. Gaussier and C. Goutte, "Relation between PLSA and NMF and implications," in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005, pp. 601–602.

[20] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, vol. 42, pp. 177–196, 2001.

[21] T. Li and C. Ding, "The relationships among various nonnegative matrix factorization methods for clustering," in 6th International Conference on Data Mining (ICDM'06), Washington, DC, USA. IEEE Computer Society, 2006, pp. 362–371.

[22] Z. Yang, Z. Yuan, and J. Laaksonen, "Projective nonnegative matrix factorization with applications to facial image processing," International Journal of Pattern Recognition and Artificial Intelligence, vol. 21(8), pp. 1353–1362, 2007.

[23] T. Kohonen, S. Kaski, and H. Lappalainen, "Self-organized formation of various invariant-feature filters in the adaptive-subspace SOM," Neural Comput., vol. 9, no. 6, pp. 1321–1344, 1997.

[24] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1988.

[25] S. S. Khan and S. Kant, "Computation of initial modes for k-modes clustering algorithm using evidence accumulation," in IJCAI, 2007, pp. 2784–2789.

[26] B. Andreopoulos, A. An, and X. Wang, "Bi-level clustering of mixed categorical and numerical biomedical data," International Journal of Data Mining and Bioinformatics, vol. 1, no. 1, pp. 19–56, 2006.

[27] I. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in ACM SIGKDD International Conference, 2001, pp. 269–274.

[28] W. Xu, X. Liu, and Y. Gong, "Document clustering based on non-negative matrix factorization," in Proceedings of SIGIR'03, 2003, pp. 267–273.

[29] I. S. Dhillon and D. S. Modha, "Concept decompositions for large sparse text data using clustering," Machine Learning, vol. 42, pp. 143–175, 2001.

[30] T. Kohonen, Self-Organizing Maps. Springer, Berlin, 2001.

[31] P. J. Phillips, H. Moon, P. Rauss, and S. A. Rizvi, "The FERET evaluation methodology for face-recognition algorithms," in Proceedings of Computer Vision and Pattern Recognition, 1997, pp. 137–143.


Fig. 2. The TPNMF 5 × 5 map.

Fig. 3. The representative images of the microclusters for classical NMF.

Fig. 4. The images captured by cell 11 and cell 20.
