
Neurocomputing 74 (2011) 1351–1358


Relevance learning in generative topographic mapping

Andrej Gisbrecht, Barbara Hammer

University of Bielefeld, CITEC Cluster of Excellence, Germany

Article info

Available online 22 February 2011

Keywords: GTM; Topographic mapping; Relevance learning; Supervised visualization

doi:10.1016/j.neucom.2010.12.015

Corresponding author: B. Hammer. Tel.: +49 521 106 12115; fax: +49 521 106 12181. E-mail address: [email protected]

Abstract

Generative topographic mapping (GTM) provides a flexible statistical model for unsupervised data inspection and topographic mapping. Since it yields an explicit mapping of a low-dimensional latent space to the observation space and an explicit formula for a constrained Gaussian mixture model induced thereof, it offers diverse functionalities including clustering, dimensionality reduction, topographic mapping, and the like. However, it shares the property of most unsupervised tools that noise in the data cannot be recognized as such and, in consequence, is visualized in the map. The framework of visualization based on auxiliary information and, more specifically, the framework of learning metrics as introduced in [14,21] constitutes an elegant way to shape the metric according to auxiliary information at hand such that only those aspects which are relevant for a given classification task are displayed in distance-based approaches. Here we introduce the concept of relevance learning into GTM such that the metric is shaped according to auxiliary class labels. Relying on the prototype-based nature of GTM, efficient realizations of this paradigm are developed and compared on a couple of benchmarks to state-of-the-art supervised dimensionality reduction techniques.

1. Introduction

Generative topographic mapping (GTM) has been introduced as a generative statistical model corresponding to the classical self-organizing map for unsupervised data inspection and topographic mapping [1]. An explicit statistical model has the benefit of great flexibility and easy adaptability to complex situations by means of appropriate statistical assumptions. Further, by offering an explicit mapping of the latent space to the observation space and a constrained Gaussian mixture model based thereof, GTM offers diverse functionality including visualization, clustering, topographic mapping, and various forms of data inspection. Like standard unsupervised machine learning and data inspection methods, however, GTM shares the garbage in, garbage out problem: the information inherent in the data is displayed independent of the specific user intention. Hence, if 'garbage' is present in the data, this noise is presented to the user since the statistical model has no way to identify the noise as such.

The domain of data visualization by means of dimensionality reduction techniques constitutes a mature field of research, with many powerful nonlinear reduction techniques as well as a Matlab implementation readily available, see e.g. [27,28,16,35,12,4,22,36,26]. However, researchers in the community have started to appreciate that the inherently ill-posed problem of unsupervised data visualization and dimensionality reduction has to be shaped according to the user's needs to arrive at optimum results. This is particularly pronounced for real-life data sets which frequently do not allow a largely loss-free embedding into low dimensionality. Therefore, it has to be specified which parts of the available information should be preserved while embedding.

On the one hand, formal evaluation measures have been developed which allow an explicit formulation and evaluation based on the desired result, see e.g. [31-33,17]. On the other hand, researchers have started to develop methods which can take auxiliary information into account. This way, the user can specify which information in the data is interesting for the current situation at hand, for example by means of labeled data.

There exist a few classical mechanisms which take class labeling into account to reduce the data dimensionality. Feature selection constitutes one specific type of dimensionality reduction; it is a well investigated research topic with numerous proposals based on general principles such as information theory or on dedicated approaches developed for specific classifiers, see e.g. [13] for an overview. However, this way, the dimensionality reduction is restricted to very simple projections onto coordinate axes.

More variable albeit still linear projection methods are in the focus of several classical discriminative dimensionality reduction tools: Fisher's linear discriminant analysis (LDA) projects data such that within-class distances are minimized while between-class distances are maximized. One important restriction of LDA is given by the fact that, this way, a meaningful projection to dimensionality at most c-1, c being the number of classes, can be obtained. Hence, for two-class problems only a projection onto a single dimension is found. Partial least squares regression (PLS) constitutes another classical method whose objective is to maximize the covariance of the projected data and the given auxiliary information. It is also suited for situations where the data dimensionality is larger than the number of data points; in such cases a linear projection is often sufficient and the problem is to find good regularizations to adjust the parameters accordingly. Informed projection [9] extends principal component analysis (PCA) to also minimize the sum squared error of data projections and the mean value of given classes, this way achieving a compromise between dimensionality reduction and clustering in the projection space. Another technique relies on metric learning according to auxiliary class information. For a metric which corresponds to a global linear matrix transform to low dimensionality, this results in a linear discriminative projection of data, as proposed e.g. in [11,7].

Modern techniques extend these settings to general nonlinear projections of data into low dimensionality such that the given auxiliary information is taken into account. One way to extend linear approaches to nonlinear settings is offered by kernelization. This incorporates an implicit nonlinear mapping to a high-dimensional feature space together with the linear low-dimensional mapping. It can be used for every linear approach which relies on dot products in the feature space only, such that an efficient computation is possible, as in several variants of kernel LDA [18,3]. However, it is not clear how to choose the kernel since its form severely influences the final shape of the visualization. In addition, the method has quadratic complexity with respect to the number of data points due to its dependency on the full Gram matrix.

Another principled way to extend dimensionality reduction to auxiliary information is offered by an adaptation of the underlying metric which measures similarity in the original data space, see e.g. [6,34]. The principle of learning metrics has been introduced in [20,21]: the standard Riemannian metric of the given data manifold is substituted by a form which measures the information of the data for the given classification task. The Fisher information matrix induces the local structure of this metric, and it can be expanded globally in terms of path integrals. This metric has been integrated into self-organizing maps (SOM), multidimensional scaling (MDS), and a recent information theoretic model for data visualization which directly relies on the metric in the data space [20,21,33]. A drawback of the proposed method is its high computational complexity due to the dependency of the metric on path integrals or approximations thereof. A slightly different approach is taken in [10]: instead of learning the metric, an ad hoc adaptation is used which also takes the given class labeling into account. The corresponding metric induces a k-nearest neighbor graph which is shaped according to the given auxiliary information. This can directly be integrated into a supervised version of Isomap. The principle of discriminative visualization by means of a change of the metric is considered in more generality in the approach [8]. Here, a metric induced by prototype-based matrix adaptation as introduced e.g. in [24,25] is integrated into several popular visualization schemes including Isomap, manifold charting, locally linear embedding, etc.

Alternative approaches to incorporate auxiliary information modify the cost function of dimensionality reduction tools to include the given class information. The approaches introduced in [15,19] can both be understood as extensions of stochastic neighbor embedding (SNE). SNE tries to minimize the deviation of the distributions of data induced by pairwise distances in the original data space and the projection space, respectively. Parametric embedding (PE) substitutes these distributions by conditional probabilities of classes, given a data point, this way mapping both data points and class centers at the same time. For this procedure, however, an assignment of data to unimodal class centers needs to be known in advance. Multiple relational embedding (MRE) incorporates several dissimilarity structures in the data space, induced by labeling, for example, into one latent space representation. For this purpose, the differences between the distribution of each dissimilarity matrix and the distribution of an appropriate transform of the latent space are accumulated, whereby the transform is adapted during training according to the given task. The weighting of the single components is chosen according to the task at hand, whereby the authors report only a mild influence of the weighting on the final outcome. It is not clear, however, how to pick the form of the transformation to take multimodal classes into account.

Colored maximum variance unfolding (MVU) incorporates auxiliary information into MVU by substituting the raw data which is unfolded in MVU by the combination of the data and the covariance matrix induced by the given auxiliary information. This way, differences which should be emphasized in the visualization are weighted by the differences given by the prior labeling. Like MVU, however, the method depends on the full Gram matrix and is computationally demanding, such that approximations have to be used.

These approaches constitute promising candidates which emphasize the relevance of discriminative nonlinear dimensionality reduction. Only a few of these methods allow an easy extension to new data points or approximate inverse mappings. Further, most methods suffer from high computational costs which make them infeasible for large data sets.

In this contribution, we extend GTM to the principle of learning metrics by combining the technique of relevance learning as introduced in supervised prototype-based classification schemes and the prototype-based unsupervised representation of data as provided by GTM. We propose two different ways to adapt the relevance terms which rely on different cost functions connected to prototype-based classification of data. Unlike [2], where a separate supervised model is trained to arrive at appropriate metrics for unsupervised data visualization, we can directly integrate the metric adaptation step into GTM due to the prototype-based nature of GTM. We test the ability of the model to visualize and cluster given data sets on a couple of benchmarks. It turns out that, this way, an efficient and flexible discriminative data mining and visualization technique arises.

2. The generative topographic mapping

The GTM as introduced in [1] models data x ∈ R^D by means of a mixture of Gaussians which is induced by a lattice of points w in a low-dimensional latent space which can be used for visualization. The lattice points are mapped via w ↦ t = y(w, W) to the data space, where the function is parameterized by W; one can, for example, pick a generalized linear regression model based on Gaussian base functions

y : w \mapsto \Phi(w) \cdot W    (1)

where the base functions Φ are equally spaced Gaussians with variance σ^{-1}. Every latent point induces a Gaussian

p(x \mid w, W, \beta) = \left(\frac{\beta}{2\pi}\right)^{D/2} \exp\left(-\frac{\beta}{2}\, \lVert x - y(w, W) \rVert^2\right)    (2)

with variance β^{-1}, which gives the data distribution as a mixture of K modes

p(x \mid W, \beta) = \sum_{k=1}^{K} p(w_k)\, p(x \mid w_k, W, \beta)    (3)
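For concreteness, the following Python sketch assembles the ingredients of Eqs. (1)-(3) for a two-dimensional latent lattice. It is a minimal illustration, not the authors' implementation: the grid sizes, the bandwidth sigma of the base functions, and all variable names are our own choices.

import numpy as np

def latent_grid(n):
    # K = n*n lattice points w on a regular grid in [-1, 1]^2
    g = np.linspace(-1.0, 1.0, n)
    return np.array([[a, b] for b in g for a in g])

def basis(lattice, centers, sigma):
    # Phi(w): equally spaced Gaussian base functions plus a constant bias term
    d2 = ((lattice[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.hstack([Phi, np.ones((len(lattice), 1))])        # K x (M+1)

def mixture_density(X, Phi, W, beta):
    # Eq. (3) with uniform priors p(w_k); prototypes t_k = y(w_k, W) as in Eq. (1)
    Y = Phi @ W
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)        # N x K squared distances
    comp = (beta / (2 * np.pi)) ** (X.shape[1] / 2) * np.exp(-0.5 * beta * d2)   # Eq. (2)
    return comp.mean(axis=1)

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Phi = basis(latent_grid(10), latent_grid(4), sigma=0.5)        # 10x10 prototypes, 4x4 base functions
W = rng.normal(scale=0.1, size=(Phi.shape[1], X.shape[1]))
print(mixture_density(X, Phi, W, beta=1.0).shape)              # (100,)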


where, usually, p(w_k) is taken as the uniform distribution over the prototypes. Training of GTM optimizes the data log-likelihood

\ln \prod_{n=1}^{N} \Big( \sum_{k=1}^{K} p(w_k)\, p(x^n \mid w_k, W, \beta) \Big)    (4)

by means of an expectation maximization (EM) approach with respect to the parameters W and β. In the E step, the responsibility of mixture component k for point n is determined as

r_{kn} = p(w_k \mid x^n, W, \beta) = \frac{p(x^n \mid w_k, W, \beta)\, p(w_k)}{\sum_{k'} p(x^n \mid w_{k'}, W, \beta)\, p(w_{k'})}    (5)

In the M step, the weights W are determined by solving the equality

U^T G_{\mathrm{old}} U W_{\mathrm{new}}^T = U^T R_{\mathrm{old}} X    (6)

where U refers to the matrix of base functions Φ evaluated at the points w_k, X to the data points, R to the responsibilities, and G is a diagonal matrix with accumulated responsibilities G_{kk} = \sum_n r_{kn}(W, \beta).

The variance can be computed by solving

\frac{1}{\beta_{\mathrm{new}}} = \frac{1}{ND} \sum_{k,n} r_{kn}(W_{\mathrm{old}}, \beta_{\mathrm{old}})\, \lVert \Phi(w_k) W_{\mathrm{new}} - x^n \rVert^2    (7)

where D is the data dimensionality and N is the number of data points.
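A compact sketch of one EM cycle following Eqs. (4)-(7) is given below. It assumes the quantities from the previous sketch (a base function matrix Phi of shape K x M and data X of shape N x D), works with uniform priors p(w_k), and uses np.linalg.lstsq as a pragmatic stand-in for solving Eq. (6); names and numerical details are our own choices.

import numpy as np

def em_step(X, Phi, W, beta):
    # E step, Eq. (5): responsibilities r_kn for uniform priors p(w_k)
    Y = Phi @ W                                              # K x D prototypes
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)      # N x K squared distances
    logp = -0.5 * beta * d2
    logp -= logp.max(axis=1, keepdims=True)                  # numerical stability
    R = np.exp(logp)
    R /= R.sum(axis=1, keepdims=True)                        # rows sum to one

    # M step, Eq. (6): solve Phi^T G Phi W_new = Phi^T R^T X with G_kk = sum_n r_kn
    G = np.diag(R.sum(axis=0))
    W_new = np.linalg.lstsq(Phi.T @ G @ Phi, Phi.T @ R.T @ X, rcond=None)[0]

    # variance update, Eq. (7)
    d2_new = ((X[:, None, :] - (Phi @ W_new)[None, :, :]) ** 2).sum(-1)
    beta_new = X.shape[0] * X.shape[1] / (R * d2_new).sum()
    return W_new, beta_new, R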

Fig. 1. Log-likelihood (scale on the left) using the adapted metric before and after metric learning for every epoch of the adaptation of the prototypes of GTM by means of EM, and the difference (scale on the right). Obviously, the difference is positive in all but the first few epochs, in which the decrease of the log-likelihood is very small as compared to its size. Hence this way of adapting the metric parameters seems reasonable.

3. Relevance learning

The principle of relevance learning has been introduced in [14] as a particularly simple and efficient method to adapt the metric of prototype-based classifiers according to the given situation at hand. It takes into account a relevance scheme of the data dimensions by substituting the squared Euclidean metric by the weighted form

d_{\lambda}(x, t) = \sum_{d=1}^{D} \lambda_d^2 (x_d - t_d)^2    (8)

In [14], the Euclidean metric is substituted by the more general form (8) and, parallel to the prototype updates, the metric parameters λ are adapted according to the given classification task. The principle is extended in [24,25] to the more general metric form

d_{\Omega}(x, t) = (x - t)^T \Omega^T \Omega\, (x - t)    (9)

Using a square matrix Ω, a positive semi-definite matrix Ω^T Ω which gives rise to a valid pseudo-metric is achieved this way. In [24,25], these metrics are considered in local and global form, i.e. the adaptive metric parameters can be identical for the full model, or they can be attached to every prototype present in the model. Here we introduce the same principle into GTM.
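The two metric forms can be illustrated with a few lines of Python; this is only a sketch with our own variable names, and the normalizations mirror the constraints used later in Section 3.2.

import numpy as np

def dist_relevance(x, t, lam):
    # Eq. (8): weighted squared Euclidean distance with relevance vector lambda
    return float(np.sum((lam ** 2) * (x - t) ** 2))

def dist_matrix(x, t, Omega):
    # Eq. (9): (x - t)^T Omega^T Omega (x - t); Omega^T Omega is positive semi-definite
    diff = Omega @ (x - t)
    return float(diff @ diff)

x = np.array([1.0, 2.0, 0.5])
t = np.array([0.5, 1.0, 0.0])
lam = np.ones(3) / np.sqrt(3)               # normalized such that ||lambda|| = 1
Omega = np.eye(3) / np.sqrt(3)              # normalized such that trace(Omega^T Omega) = 1
print(dist_relevance(x, t, lam), dist_matrix(x, t, Omega))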

3.1. Labeling of GTM

Assume that data point x^n is equipped with label information l_n which is an element of a finite set of different labels. By posterior labeling, GTM gives rise to a probabilistic classification of data points, assigning the label of prototype t_k to data point x^n with probability r_{kn}. Thereby, posterior labeling of GTM can be done in such a way that the classification error \sum_{n=1}^{N} \sum_{k=1}^{K} r_{kn}\, s_k(x^n) is minimized, with s_k(x^n) equal to zero if the prototype t_k has the same label as x^n and equal to one otherwise. Thus the prototype t_k = y(w_k, W) is labeled

c(t_k) = \operatorname{argmax}_c \Big( \sum_{n : l_n = c} r_{kn} \Big)    (10)
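A minimal sketch of this labeling rule, assuming a responsibility matrix R of shape N x K and integer class labels; the function name and the toy data are our own.

import numpy as np

def label_prototypes(R, labels, classes):
    # Eq. (10): c(t_k) = argmax_c of the summed responsibilities of points with label c
    votes = np.stack([R[labels == c].sum(axis=0) for c in classes])   # |C| x K
    return classes[np.argmax(votes, axis=0)]                          # one label per prototype

# toy usage: 6 points, 4 prototypes, 2 classes
R = np.random.default_rng(1).dirichlet(np.ones(4), size=6)            # rows sum to one
labels = np.array([0, 0, 1, 1, 1, 0])
print(label_prototypes(R, labels, classes=np.array([0, 1])))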

3.2. Metric adaptation in GTM

We can introduce relevance learning into GTM by substituting the Euclidean metric in the Gaussian functions (2) by the more general diagonal metric (8) which includes relevance terms, or by the metric induced by a full matrix (9) which can also take correlations of the dimensions into account. Thereby, we can introduce one global metric for the full model, or, alternatively, we can introduce local metric parameters λ_k or Ω_k, respectively, for every prototype t_k. We refer to the latter version as the local method.

Using this posterior labeling of the prototypes, the parameters of the GTM model should be adapted such that the data log-likelihood is optimum. Analogous to [1], it can be seen that optimization of the parameters W and β of GTM can be done in the same way as beforehand, whereby the new metric structure (8) and (9) has to be used when computing the responsibilities (5).

We assume that the metric is changed during this optimization process on a slower time scale such that the auxiliary information is mirrored in the metric parameters. Thereby, we assume quasi-stationarity of the metric parameters when performing the original EM training. A similar procedure has been used in [24] for simultaneous metric and prototype learning, and [30] provides an explanation of to what extent this procedure is reasonable in the context of self-organizing maps. Essentially, the adaptation can be understood as an adiabatic process [5], overlaying fast parameter adaptation by EM optimization of the log-likelihood with slow metric adaptation according to the objectives as detailed below. This assumption is substantiated if the data log-likelihood is evaluated in a concrete learning process. Fig. 1 displays the value of the data log-likelihood before and after metric adaptation for every epoch in a typical learning process as detailed below (the adaptation concerns local full matrices adapted using the robust soft learning vector quantization cost function for the letter data set, using the parameters as detailed in the experiments). Obviously, the data log-likelihood increases using metric adaptation for all but the first few epochs, in which the size of the decrease is almost negligible as compared to the size of the log-likelihood.


Now the question is how to design an efficient scheme for metric learning based on the structure as provided by GTM and the given auxiliary labeling. Unlike approaches such as fuzzy-labeled SOM [23], we use a fully supervised scheme to learn the metric parameters. Unlike the original framework of learning metrics [20,21], however, we make use of the prototype-based structure of the induced GTM classifier which allows us to efficiently update local metric parameters without the necessity of a computationally costly approximation of the Riemannian metric induced by the general Fisher information. For this purpose, we introduce two different cost functions E (see below) motivated from prototype-based learning which are used to optimize the metric parameters.

For metric adaptation, we simply use a stochastic gradient descent on the cost functions. Naturally, more advanced schemes would be possible, but a simple gradient descent already leads to satisfactory results, as we will demonstrate in experiments. To avoid convergence to trivial optima such as zero, we pose constraints on the metric parameters of the form ‖λ‖ = 1 or trace(Ω^T Ω) = 1, respectively. This is achieved by normalization of the values, i.e. after every gradient step, λ is divided by its length, and Ω is divided by the square root of trace(Ω^T Ω). Thus, a high-level description of the algorithm is possible as depicted in Table 1; a corresponding sketch in code is given after the table. Usually, we alternate between one EM step, one epoch of gradient descent, and normalization in our experiments. Since EM optimization is much faster than gradient descent, this way, we can enforce that the metric parameters are adapted on a slower time scale. Hence we can assume an approximately constant metric for the EM optimization, i.e. the EM scheme optimizes the likelihood as before. Metric adaptation takes place considering quasi-stationary states of the GTM solution due to the slower time scale.

Note that metric adaptation introduces a large number of additional parameters into the model, depending on the input dimensionality. One can raise the question whether this leads to strong overfitting of the model. We will see in experiments that this is not the case: when evaluating the clustering performance of the resulting GTM, the training error is representative of the generalization error. One can substantiate this experimental finding with a theoretical counterpart: using posterior labeling, GTM offers a prototype-based classification scheme with local adaptive metrics. This function class has a supervised pendant: generalized matrix learning vector quantization as introduced in [24]. The worst case generalization ability of the latter class can be investigated based on classical computational learning theory. It turns out that its generalization ability does not depend on the number of parameters adapted during training; rather, large margin generalization bounds can be derived. In consequence, very good generalization ability can be proved (and experimentally observed) as detailed in [24]. Since the formal argumentation in [24] depends on the considered function class only and not on the way in which training takes place, the same generalization bounds apply to GTM with adaptive metrics as introduced here.

Now, we discuss concrete cost functions E for the metric adaptation.

Table 1. Integration of relevance learning into GTM.

INIT
REPEAT
  E-STEP: DETERMINE r_kn BASED ON THE GENERAL METRIC
  M-STEP: DETERMINE W AND β AS IN GTM
  LABEL PROTOTYPES
  ADAPT METRIC PARAMETERS BY STOCHASTIC GRADIENT DESCENT ON E
  NORMALIZE THE METRIC PARAMETERS
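The alternation of Table 1 can be written down compactly as follows. This is a hedged sketch only: em_step stands for an EM cycle that computes the responsibilities with the general metric (a variant of the sketch in Section 2), grad_E stands for one of the metric gradients derived in Sections 3.3 and 3.4, only the relevance vector case is shown, and all names and the learning rate are our own choices.

import numpy as np

def train_relevance_gtm(X, labels, classes, em_step, grad_E, Phi, W, beta, lam,
                        lr=1e-3, epochs=100):
    # Table 1: alternate EM for (W, beta) with one epoch of stochastic gradient
    # descent on the metric parameters, followed by normalization.
    for _ in range(epochs):
        W, beta, R = em_step(X, Phi, W, beta, lam)            # E/M step with the general metric
        votes = np.stack([R[labels == c].sum(axis=0) for c in classes])
        proto_labels = classes[np.argmax(votes, axis=0)]      # posterior labeling, Eq. (10)
        for x, l in zip(X, labels):                           # stochastic gradient epoch on E
            lam = lam - lr * grad_E(x, l, Phi @ W, proto_labels, lam)
        lam = lam / np.linalg.norm(lam)                       # enforce ||lambda|| = 1
    return W, beta, lam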

3.3. Generalized relevance GTM (GRGTM)

Metric parameters have the form λ or λ_k for a diagonal metric (8) and Ω or Ω_k for a full matrix (9), depending on whether a global or local scheme is considered. In the following, we define the general parameter Θ_k which can be chosen as one of these four possibilities depending on the given setting. Thereby, we can assume that Θ_k can be realized by a matrix which has diagonal form (for relevance learning) or full matrix form (for matrix updates).

The cost function of generalized relevance GTM is taken from generalized relevance learning vector quantization (GRLVQ), which can be interpreted as maximizing the hypothesis margin of a prototype-based classification scheme such as LVQ [14,24]. The cost function has the form

E(\Theta) = \sum_{n} E_n(\Theta) = \sum_{n} \operatorname{sgd}\!\left( \frac{d_{\Theta^+}(x^n, t^+) - d_{\Theta^-}(x^n, t^-)}{d_{\Theta^+}(x^n, t^+) + d_{\Theta^-}(x^n, t^-)} \right)    (11)

where sgd(x) = (1 + exp(-x))^{-1}, t^+ is the closest prototype in the data space with the same label as x^n and t^- is the closest prototype with a different label.

The adaptation formulas can be derived thereof by taking the derivatives. Depending on the form of the metric, the derivative of the metric is

\frac{\partial d_{\lambda}(x, t)}{\partial \lambda_i} = 2 \lambda_i (x_i - t_i)^2    (12)

for a diagonal metric and

\frac{\partial d_{\Omega}(x, t)}{\partial \Omega_{ij}} = 2 (x_j - t_j) \sum_{d} \Omega_{id} (x_d - t_d)    (13)

for a full matrix. For simplicity, we denote the respective squared distances to the closest correct and wrong prototype by d^+ = d_{\Theta^+}(x^n, t^+) and d^- = d_{\Theta^-}(x^n, t^-), respectively. The term sgd' is a shorthand notation for sgd'((d^+ - d^-)/(d^+ + d^-)). Given a data point x^n, the derivative of the corresponding summand of the cost function E with respect to the metric parameters yields

\frac{\partial E_n}{\partial \Theta^+} = 2\, \operatorname{sgd}' \cdot \frac{d^-}{(d^+ + d^-)^2} \cdot \frac{\partial d^+}{\partial \Theta^+}    (14)

for the parameters of the closest correct prototype and

\frac{\partial E_n}{\partial \Theta^-} = -2\, \operatorname{sgd}' \cdot \frac{d^+}{(d^+ + d^-)^2} \cdot \frac{\partial d^-}{\partial \Theta^-}    (15)

for the parameters attached to the closest wrong prototype. All other parameters are not affected. These updates take place for the local modeling of parameters, which we refer to as local generalized relevance GTM (LGRGTM) or local generalized matrix GTM (LGMGTM), respectively. If the metric parameters are global, the update corresponds to the sum of these two derivatives, referred to as generalized relevance GTM (GRGTM) or generalized matrix GTM (GMGTM), respectively.
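A sketch of the resulting metric update for a single data point and a global relevance vector is given below, combining Eqs. (12), (14) and (15) as described above; the handling of prototypes and labels is simplified and all names are our own.

import numpy as np

def grgtm_relevance_gradient(x, label, prototypes, proto_labels, lam):
    # squared distances d_lambda(x, t_k) to all prototypes, Eq. (8)
    d = np.sum((lam ** 2) * (x - prototypes) ** 2, axis=1)
    correct = proto_labels == label
    kp = np.argmin(np.where(correct, d, np.inf))         # closest correct prototype t+
    km = np.argmin(np.where(~correct, d, np.inf))        # closest wrong prototype t-
    dp, dm = d[kp], d[km]

    mu = (dp - dm) / (dp + dm)
    sgd_prime = np.exp(-mu) / (1.0 + np.exp(-mu)) ** 2   # derivative of the logistic function

    ddp = 2.0 * lam * (x - prototypes[kp]) ** 2          # Eq. (12) at t+
    ddm = 2.0 * lam * (x - prototypes[km]) ** 2          # Eq. (12) at t-

    # Eqs. (14) and (15), summed because a single global relevance vector is used
    return sgd_prime * (2.0 * dm * ddp - 2.0 * dp * ddm) / (dp + dm) ** 2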

3.4. Robust soft GTM (RSGTM)

Unlike GRLVQ, robust soft LVQ (RSLVQ) [29] has the goal to optimize a statistical model which defines the data distribution. It is assumed that data are given by a Gaussian mixture of prototypes which are labeled. The objective is to maximize the logarithm of the probability of a data point being generated by a prototype of the correct class versus the overall probability. In the limit of small variance of the Gaussians, a learning rule which is similar to the standard LVQ rule results. The objective for a general variance β^{-1} of the Gaussian modes corresponds to the following cost function:

E(\Theta) = \sum_{n} E_n(\Theta) = \sum_{n} \log\!\left( \frac{\sum_{k : c(t_k) = l_n} p(w_k)\, p(x^n \mid w_k, W, \beta)}{p(x^n \mid W, \beta)} \right)    (16)

Here, we can choose Gaussian modes as provided by GTM, i.e. the modes and the corresponding mixture are given in analogy to formulas (2) and (3), where the new parameterized metric (8) and (9) as well as the labeling (10) of GTM is used.

We obtain the update rules by taking the derivatives, as beforehand:

\frac{\partial E_n}{\partial \Theta_k} = \big( s_k(x^n)(q_{kn} - r_{kn}) - (1 - s_k(x^n))\, r_{kn} \big) \left( \frac{1}{S_k} \cdot \frac{\partial S_k}{\partial \Theta_k} - \frac{\beta}{2} \cdot \frac{\partial d_{\Theta_k}(x^n, t_k)}{\partial \Theta_k} \right)    (17)

where s_k(x^n) indicates whether the prototype label and the data label coincide,

q_{kn} = \frac{p(x^n \mid w_k, W, \beta)\, p(w_k)}{\sum_{k' : c(t_{k'}) = l_n} p(x^n \mid w_{k'}, W, \beta)\, p(w_{k'})}    (18)

refers to the probability of mode k among the correct modes, and

S_k = \left(\frac{\beta}{2\pi}\right)^{D/2} \cdot \det(\Theta_k)    (19)

normalizes the Gaussian modes to arrive at valid probabilities. The derivative is

\frac{1}{S} \cdot \frac{\partial S}{\partial \lambda_i} = \frac{1}{\lambda_i}    (20)

for a relevance vector and

\frac{1}{S} \cdot \frac{\partial S}{\partial \Omega_{ij}} = \Omega^{-1}_{ji}    (21)

for full matrices.
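For comparison with the GRGTM sketch above, the following lines implement the update (17)-(21) for a single data point and one global full matrix Ω, with uniform priors p(w_k); again this is a sketch under our own naming conventions, not the authors' code.

import numpy as np

def rsgtm_matrix_gradient(x, label, prototypes, proto_labels, Omega, beta):
    diff = x - prototypes                                     # K x D differences x - t_k
    mapped = diff @ Omega.T                                   # rows are Omega (x - t_k)
    d = np.sum(mapped ** 2, axis=1)                           # d_Omega(x, t_k), Eq. (9)

    p = np.exp(-0.5 * beta * (d - d.min()))                   # unnormalized Gaussian modes
    r = p / p.sum()                                           # responsibilities, Eq. (5)
    correct = (proto_labels == label).astype(float)           # s_k(x): label coincidence
    q = correct * p / max((correct * p).sum(), 1e-12)         # Eq. (18)

    inv_OT = np.linalg.inv(Omega).T                           # (1/S) dS/dOmega, Eq. (21)
    grad = np.zeros_like(Omega)
    for k in range(len(prototypes)):
        coef = correct[k] * (q[k] - r[k]) - (1.0 - correct[k]) * r[k]
        dd_dOmega = 2.0 * np.outer(mapped[k], diff[k])        # Eq. (13) in matrix form
        grad += coef * (inv_OT - 0.5 * beta * dd_dOmega)      # Eq. (17), summed over k
    return grad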

We refer to this version as local relevance robust soft GTM (LRSGTM) and local matrix robust soft GTM (LMRSGTM), respectively. The global versions can be obtained by adding the derivatives; we refer to these algorithms as relevance robust soft GTM (RSGTM) and matrix robust soft GTM (MRSGTM), respectively.

Table 2. Parameters used for training.

Data      Number of prototypes   Number of base functions
Landsat   10 × 10                4 × 4
Phoneme   10 × 10                4 × 4
Letter    30 × 30                30 × 30

4. Experiments

4.1. Classification

We test the efficiency of relevance learning in GTM on three benchmark data sets as described in [21,33]: Landsat Satellite data with 36 dimensions, 6 classes, and 6435 samples, Letter Recognition data with 16 dimensions, 26 classes, and 20,000 samples, and Phoneme data with 20 dimensions, 13 classes, and 3656 samples. Prior to training, all data sets were normalized by subtracting the mean and dividing by the standard deviation. GTM is initialized using the first two principal components. The mapping y(w, W) is induced by generalized linear regression based on Gaussian base functions. The learning rate of the gradient descent for the metric parameters has been optimized for the data and is chosen in the range of 10^{-6} to 10^{-2}. More precisely, an exhaustive search of the parameter range is done and the value which leads to the best convergence of the relevance profile is picked as the learning rate. Thereby, the number of epochs is chosen as 100, which is sufficient to allow convergence of the matrix parameters; typically, convergence of the EM scheme can be observed on a faster scale. The number of prototypes and base functions has been chosen to suit the size of the data; it is shown in Table 2. Due to the complexity of the training, an exhaustive search of these parameters has been avoided, but reasonable numbers have been chosen. Typically, the results are only mildly influenced by small changes of these numbers. The variance of the Gaussian base functions has been chosen such that it coincides with the distance between neighboring base functions.
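A brief sketch of this preprocessing and initialization, assuming the latent grid and base function matrix Phi from the earlier sketches; the exact PCA scaling and all names are our own choices, and the learning-rate grid simply spans the stated range.

import numpy as np

def zscore(X):
    # subtract the mean and divide by the standard deviation, per feature
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca_init(X, Phi, lattice):
    # initialize W such that the latent grid is mapped onto the plane spanned
    # by the first two principal components of the data (rough scaling)
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    targets = lattice @ (Vt[:2] * (s[:2, None] / np.sqrt(len(X))))
    return np.linalg.lstsq(Phi, targets + X.mean(axis=0), rcond=None)[0]

# candidate learning rates for the metric gradient descent, 1e-6 ... 1e-2
learning_rates = 10.0 ** np.arange(-6, -1)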

We report the results of a repeated stratified 10-fold cross-validation with one repeat (letter) and 10 repeats (phoneme, landsat), respectively, also reporting the variance over the repeats. We evaluate the models in comparison to several recent alternative supervised visualization tools by means of the test error obtained in the cross-validation. These alternatives are taken from [33] (we would like to thank the authors of [34] for providing the results). The alternative methods include parametric embedding (PE), supervised Isomap (S-Isomap), colored maximum variance unfolding (MUHSIC), multiple relational embedding (MRE), neighborhood component analysis (NCA), and the supervised neighborhood retrieval visualizer (SNeRV) based on different weightings λ of the retrieval objectives in the cost function of the model and on a Riemannian metric based on the Fisher information matrix.

Fig. 2. Mean accuracy of the classification obtained by diverse supervised GTM schemes as introduced in this article and alternative state-of-the-art approaches.

Fig. 3. Visualization of the result of GTM (top) and robust soft GTM with local matrix learning (bottom) on the MNIST data set. Pie charts give the responsibility of the prototypes for the given classes. Supervision achieves a better separation of the classes within receptive fields of prototypes, introducing dead units if necessary.

Fig. 4. Visualization of the result of GTM (top) and robust soft GTM with local matrix learning (bottom) on the Phoneme data set. Pie charts give the responsibility of the prototypes for the given classes. Supervision achieves a better separation of the classes within receptive fields, as can be seen by the pie charts.

The results are shown in Fig. 2. In two of the three cases, metric adaptation improves the classification accuracy compared to simple GTM. Thereby, matrix adaptation yields superior results compared to the adaptation of a simple relevance vector. Further, results based on the robust soft learning vector quantization cost function seem slightly better for all data sets. For all three data sets, we obtain state-of-the-art results which are comparable to the best alternative supervised visualization tools which are currently available in the literature.


We also report the results obtained by a five-nearest neighbor classifier for the original data in the original (high-dimensional) data space. Interestingly, in all cases, the supervised results using the full information are only slightly better than the results obtained by GTM with local matrix adaptation in two dimensions. This demonstrates the high quality of the supervised visualization. Note that, unlike the experiments from [33], which are restricted to a subset of 1500 samples in all cases due to complexity issues, we can train on the full data set due to the efficiency of relevance GTM, and, in two of the three cases, we can even perform a 10-fold repeat of the experiments in reasonable time.

4.2. Visualization

The results of visualizing the Phoneme data set and the MNIST data set (the latter consists of 60,000 points with 768 dimensions representing the 10 digits; a subsample of 6000 images was used in this case) using robust soft GTM with local matrix adaptation are shown in Figs. 3 and 4, whereby the full data set is used for training. A comparison to simple GTM shows the ability of matrix learning to arrive at a topographic mapping which better mirrors the underlying class structures: the pie charts display the percentage of points of the different classes assigned to the respective prototype based on the receptive fields. Interestingly, in both cases, the pie charts obtained with metric learning display fewer classes for the single prototypes, corresponding to better separated receptive fields, whereas the classes are spread among the prototypes if metric adaptation does not take place. This is also mirrored in the better classification accuracy of GTM with matrix learning. The arrangement of the classes on the map differs for the different visualizations. For metric learning, multiple modes of the classes can be observed. For standard GTM, the distribution is less clear since the single prototypes combine different classes in their receptive fields.

5. Discussion

In this contribution, a method has been proposed to integrate auxiliary information in terms of relevance updates into GTM; the benefit of this approach has been demonstrated on several benchmarks. Unlike approaches such as fuzzy-labeled SOM [24], metric parameters are adapted in a supervised fashion based on the classification ability of the model. As in [22], the work is based on adaptive metrics to incorporate auxiliary information into the model. Unlike this latter work [22], however, the proposed method relies on the prototype-based nature of GTM and transfers the relevance update scheme of supervised learning schemes such as [15,25] to this setting, resulting in an efficient topographic mapping. As demonstrated on several benchmarks, the classification accuracy is competitive with state-of-the-art methods for supervised visualization, whereby GTM provides additional functionality due to the explicit topographic mapping of the latent space into the observation space accompanied by an explicit generative statistical model. As demonstrated by means of the visualizations, the class separation is much more accurate for supervised GTM compared to the original method, thus clearly focusing on the aspects relevant for the given classification.

However, the evaluation as proposed in this contribution can only serve as an indicator of whether useful mappings are obtained by relevance GTM. Since data visualization is an inherently ill-posed problem, a clear evaluation by means of a single (or a few) quantitative measures seems hardly satisfactory, the respective goal often being situation dependent. For a proper evaluation of the model for concrete tasks, an empirical user study would be interesting.

References

[1] C. Bishop, M. Svensen, C. Williams, The generative topographic map, Neural Computation 10 (1) (1998) 215–234.
[2] K. Bunte, B. Hammer, P. Schneider, M. Biehl, Nonlinear discriminative data visualization, in: M. Verleysen (Ed.), ESANN 2009, d-side Publishing, 2009, pp. 65–70.
[3] G. Baudat, F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12 (2000) 2385–2404.
[4] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (2003) 1373–1396.
[5] M. Born, V.A. Fock, Beweis des Adiabatensatzes, Zeitschrift für Physik A Hadrons and Nuclei 51 (3–4) (1928) 165–180.
[6] K. Bunte, B. Hammer, T. Villmann, M. Biehl, A. Wismüller, Exploratory observation machine (XOM) with Kullback–Leibler divergence for dimensionality reduction and visualization, in: M. Verleysen (Ed.), ESANN 2010, d-side Publishing, 2010, pp. 87–92.
[7] K. Bunte, B. Hammer, M. Biehl, Nonlinear dimension reduction and visualization of labeled data, in: X. Jiang, N. Petkov (Eds.), International Conference on Computer Analysis of Images and Patterns, Springer, 2009, pp. 1162–1170.
[8] K. Bunte, B. Hammer, A. Wismüller, M. Biehl, Adaptive local dissimilarity measures for discriminative dimension reduction of labeled data, Neurocomputing 73 (7–9) (2010) 1074–1092.
[9] D. Cohn, Informed projections, in: S. Becker, S. Thrun, K. Obermayer (Eds.), NIPS, MIT Press, 2003, pp. 849–856.
[10] X. Geng, D.-C. Zhan, Z.-H. Zhou, Supervised nonlinear dimensionality reduction for visualization and classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B 35 (6) (2005) 1098–1107.
[11] J. Goldberger, S. Roweis, G. Hinton, R. Salakhutdinov, Neighbourhood components analysis, in: Advances in Neural Information Processing Systems, vol. 17, MIT Press, 2004, pp. 513–520.
[12] A. Gorban, B. Kegl, D. Wunsch, A. Zinovyev (Eds.), Principal Manifolds for Data Visualization and Dimensionality Reduction, Springer, 2008.
[13] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research 3 (2003) 1157–1182.
[14] B. Hammer, T. Villmann, Generalized relevance learning vector quantization, Neural Networks 15 (8–9) (2002) 1059–1068.
[15] T. Iwata, K. Saito, N. Ueda, S. Stromsten, T.L. Griffiths, J.B. Tenenbaum, Parametric embedding for class visualization, Neural Computation 19 (9) (2007) 2536–2556.
[16] J.A. Lee, M. Verleysen, Nonlinear Dimensionality Reduction, Springer, 2007.
[17] J.A. Lee, M. Verleysen, Quality assessment of dimensionality reduction: rank-based criteria, Neurocomputing 72 (7–9) (2009) 1431–1443.
[18] B. Ma, H. Qu, H. Wong, Kernel clustering-based discriminant analysis, Pattern Recognition 40 (1) (2007) 324–327.
[19] R. Memisevic, G. Hinton, Multiple relational embedding, in: L.K. Saul, Y. Weiss, L. Bottou (Eds.), Advances in Neural Information Processing Systems, vol. 17, MIT Press, Cambridge, MA, 2005, pp. 913–920.
[20] J. Peltonen, Data exploration with learning metrics, D.Sc. Thesis, Dissertations in Computer and Information Science, Report D7, Espoo, Finland, 2004.
[21] J. Peltonen, A. Klami, S. Kaski, Improved learning of Riemannian metrics for exploratory analysis, Neural Networks 17 (2004) 1087–1100.
[22] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
[23] F.-M. Schleif, B. Hammer, M. Kostrzewa, T. Villmann, Exploration of mass-spectrometric data in clinical proteomics using learning vector quantization methods, Briefings in Bioinformatics 9 (2) (2008) 129–143.
[24] P. Schneider, M. Biehl, B. Hammer, Adaptive relevance matrices in learning vector quantization, Neural Computation 21 (2009) 3532–3561.
[25] P. Schneider, M. Biehl, B. Hammer, Distance learning in discriminative vector quantization, Neural Computation 21 (2009) 2942–2969.
[26] J. Tenenbaum, V. de Silva, J. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323.
[27] L. van der Maaten, G. Hinton, Visualizing high-dimensional data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.
[28] L. van der Maaten, E. Postma, H. van den Herik, Dimensionality reduction: a comparative review, Technical Report TiCC-TR 2009-005, Tilburg University, 2009.
[29] S. Seo, K. Obermayer, Soft learning vector quantization, Neural Computation 15 (7) (2003) 1589–1604.
[30] A. Spitzner, D. Polani, Order parameters for self-organizing maps, in: L. Niklasson, M. Boden, T. Ziemke (Eds.), Proceedings of the 8th International Conference on Artificial Neural Networks (ICANN 98), vol. 2, Springer, 1998, pp. 517–522.
[31] J. Venna, Dimensionality reduction for visual exploration of similarity structures, Ph.D. Thesis, Helsinki University of Technology, Espoo, Finland, 2007.
[32] J. Venna, S. Kaski, Local multidimensional scaling, Neural Networks 19 (2006) 89–99.
[33] J. Venna, J. Peltonen, K. Nybo, H. Aidos, S. Kaski, Information retrieval perspective to nonlinear dimensionality reduction for data visualization, Journal of Machine Learning Research 11 (2010) 451–490.
[34] T. Villmann, B. Hammer, F.-M. Schleif, T. Geweniger, W. Herrmann, Fuzzy classification by fuzzy labeled neural gas, Neural Networks 19 (2006) 772–779.
[35] M. Ward, G. Grinstein, D.A. Keim, Interactive Data Visualization: Foundations, Techniques, and Applications, A.K. Peters, Ltd., 2010.
[36] K.Q. Weinberger, L.K. Saul, An introduction to nonlinear dimensionality reduction by maximum variance unfolding, in: Proceedings of the 21st National Conference on Artificial Intelligence, AAAI, 2006.

Andrej Gisbrecht received his Diploma in Computer Science in 2009 from the Clausthal University of Technology, Germany, and continued there as a Ph.D. student. Since early 2010 he has been a Ph.D. student at the Cognitive Interaction Technology Center of Excellence at Bielefeld University, Germany.

Barbara Hammer received her Ph.D. in Computer Science in 1995 and her venia legendi in Computer Science in 2003, both from the University of Osnabrueck, Germany. From 2000 to 2004, she was the leader of the junior research group 'Learning with Neural Methods on Structured Data' at the University of Osnabrueck before accepting an offer as Professor for Theoretical Computer Science at Clausthal University of Technology, Germany, in 2004. Since 2010, she has been holding a professorship for Theoretical Computer Science for Cognitive Systems at the CITEC Cluster of Excellence at Bielefeld University, Germany. Several research stays have taken her to Italy, the UK, India, France, the Netherlands, and the USA. Her areas of expertise include hybrid systems, self-organizing maps, clustering, and recurrent networks as well as applications in bioinformatics, industrial process monitoring, and cognitive science.