Neurocomputing 74 (2011) 1351–1358
Relevance learning in generative topographic mapping
Andrej Gisbrecht, Barbara Hammer
University of Bielefeld, CITEC Cluster of Excellence, Germany
Available online 22 February 2011
Keywords:
GTM
Topographic mapping
Relevance learning
Supervised visualization
doi:10.1016/j.neucom.2010.12.015
Abstract
Generative topographic mapping (GTM) provides a flexible statistical model for unsupervised data inspection and topographic mapping. Since it yields an explicit mapping of a low-dimensional latent space to the observation space and an explicit formula for the constrained Gaussian mixture model induced thereof, it offers diverse functionalities including clustering, dimensionality reduction, topographic mapping, and the like. However, it shares the property of most unsupervised tools that noise in the data cannot be recognized as such and, in consequence, is visualized in the map. The framework of visualization based on auxiliary information and, more specifically, the framework of learning metrics as introduced in [14,21] constitutes an elegant way to shape the metric according to the auxiliary information at hand, such that only those aspects which are relevant for a given classification task are displayed in distance-based approaches. Here we introduce the concept of relevance learning into GTM such that the metric is shaped according to auxiliary class labels. Relying on the prototype-based nature of GTM, efficient realizations of this paradigm are developed and compared on a couple of benchmarks to state-of-the-art supervised dimensionality reduction techniques.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction
Generative topographic mapping (GTM) has been introduced as a generative statistical model corresponding to the classical self-organizing map for unsupervised data inspection and topographic mapping [1]. An explicit statistical model has the benefit of great flexibility and easy adaptability to complex situations by means of appropriate statistical assumptions. Further, by offering an explicit mapping of latent space to observation space and a constrained Gaussian mixture model based thereof, GTM offers diverse functionality including visualization, clustering, topographic mapping, and various forms of data inspection. Like standard unsupervised machine learning and data inspection methods, however, GTM shares the garbage-in, garbage-out problem: the information inherent in the data is displayed independent of the specific user intention. Hence, if 'garbage' is present in the data, this noise is presented to the user, since the statistical model has no way to identify the noise as such.
The domain of data visualization by means of dimensionality reduction techniques constitutes a matured field of research, with many powerful nonlinear reduction techniques as well as Matlab implementations readily available, see e.g. [27,28,16,35,12,4,22,36,26]. However, researchers in the community start to appreciate that the inherently ill-posed problem of unsupervised data visualization and dimensionality reduction has to be shaped according to the user's needs to arrive at optimum results. This is particularly pronounced for real-life data sets, which frequently do not allow a widely loss-free embedding into low dimensionality. Therefore, it has to be specified which parts of the available information should be preserved while embedding.
On the one hand, formal evaluation measures have been developed which allow an explicit formulation and evaluation based on the desired result, see e.g. [31–33,17]. On the other hand, researchers start to develop methods which can take auxiliary information into account. This way, the user can specify which information in the data is interesting for the situation at hand, by means of e.g. labeled data.
There exist a few classical mechanisms which take class labeling into account to reduce the data dimensionality. Feature selection constitutes one specific type of dimensionality reduction; it is a well-investigated research topic with numerous proposals based on general principles such as information theory or dedicated approaches developed for specific classifiers, see e.g. [13] for an overview. However, this way, the dimensionality reduction is restricted to very simple projections onto coordinate axes.
More variable albeit still linear projection methods are in the focus of several classical discriminative dimensionality reduction tools: Fisher's linear discriminant analysis (LDA) projects data such that within-class distances are minimized while between-class distances are maximized. One important restriction of LDA is given by the fact that, this way, only a meaningful projection to dimensionality at most $c-1$, $c$ being the number of classes, can be obtained. Hence, for two-class problems only a linear visualization is found. Partial least squares regression (PLS) constitutes another classical method, whose objective is to maximize the covariance of the projected data and the given auxiliary information. It is also suited for situations where the data dimensionality is larger than the number of data points; in such cases a linear projection is often sufficient and the problem is to find good regularizations to adjust the parameters accordingly. Informed projection [9] extends principal component analysis (PCA) to also minimize the sum squared error of data projections and the mean value of given classes, this way achieving a compromise of dimensionality reduction and clustering in the projection space. Another technique relies on metric learning according to auxiliary class information. For a metric which corresponds to a global linear matrix transform to low dimensionality, this results in a linear discriminative projection of data, as proposed e.g. in [11,7].
Modern techniques extend these settings to general nonlinear projections of data into low dimensionality such that the given auxiliary information is taken into account. One way to extend linear approaches to nonlinear settings is offered by kernelization. This incorporates an implicit nonlinear mapping to a high-dimensional feature space together with the linear low-dimensional mapping. It can be used for every linear approach which relies on dot products in the feature space only, such that an efficient computation is possible, as in several variants of kernel LDA [18,3]. However, it is not clear how to choose the kernel, since its form severely influences the final shape of the visualization. In addition, the method has quadratic complexity with respect to the number of data points due to its dependency on the full Gram matrix.
Another principled way to extend dimensionality reduction to auxiliary information is offered by an adaptation of the underlying metric which measures similarity in the original data space, see e.g. [6,34]. The principle of learning metrics has been introduced in [20,21]: the standard Riemannian metric of the given data manifold is substituted by a form which measures the information of the data for the given classification task. The Fisher information matrix induces the local structure of this metric, and it can be expanded globally in terms of path integrals. This metric is integrated into self-organizing maps (SOM), multidimensional scaling (MDS), and a recent information-theoretic model for data visualization which directly relies on the metric in the data space [20,21,33]. A drawback of the proposed method is its high computational complexity due to the dependency of the metric on path integrals or approximations thereof. A slightly different approach is taken in [10]: instead of learning the metric, an ad hoc adaptation is used which also takes the given class labeling into account. The corresponding metric induces a k-nearest-neighbor graph which is shaped according to the given auxiliary information. This can directly be integrated into a supervised version of Isomap. The principle of discriminative visualization by means of a change of the metric is considered in more generality in the approach [8]. Here, a metric induced by prototype-based matrix adaptation as introduced e.g. in [24,25] is integrated into several popular visualization schemes including Isomap, manifold charting, locally linear embedding, etc.
Alternative approaches to incorporate auxiliary information modify the cost function of dimensionality reduction tools to include the given class information. The approaches introduced in [15,19] can both be understood as extensions of stochastic neighbor embedding (SNE). SNE tries to minimize the deviation of the distributions of data induced by pairwise distances in the original data space and the projection space, respectively. Parametric embedding (PE) substitutes these distributions by conditional probabilities of classes, given a data point, this way mapping both data points and class centers at the same time. For this procedure, however, an assignment of data to unimodal class centers needs to be known in advance. Multiple relational embedding (MRE) incorporates several dissimilarity structures in the data space, induced by labeling, for example, into one latent space representation. For this purpose, the differences of the distribution of each dissimilarity matrix and the distribution of an appropriate transform of the latent space are accumulated, whereby the transform is adapted during training according to the given task. The weighting of the single components is taken according to the task at hand, whereby the authors report only a mild influence of the weighting on the final outcome. It is not clear, however, how to pick the form of the transformation to take multimodal classes into account.
Colored maximum variance unfolding (MVU) incorporates auxiliary information into MVU by substituting the raw data which is unfolded in MVU by the combination of the data and the covariance matrix induced by the given auxiliary information. This way, differences which should be emphasized in the visualization are weighted by the differences given by the prior labeling. Like MVU, however, the method depends on the full Gram matrix and is computationally demanding, such that approximations have to be used.
These approaches constitute promising candidates which emphasize the relevance of discriminative nonlinear dimensionality reduction. Only a few of these methods allow an easy extension to new data points or approximate inverse mappings. Further, most methods suffer from high computational costs which make them infeasible for large data sets.
In this contribution, we extend GTM to the principle of learning metrics by combining the technique of relevance learning as introduced in supervised prototype-based classification schemes and the prototype-based unsupervised representation of data as provided by GTM. We propose two different ways to adapt the relevance terms which rely on different cost functions connected to prototype-based classification of data. Unlike [2], where a separate supervised model is trained to arrive at appropriate metrics for unsupervised data visualization, we can directly integrate the metric adaptation step into GTM due to the prototype-based nature of GTM. We test the ability of the model to visualize and cluster given data sets on a couple of benchmarks. It turns out that, this way, an efficient and flexible discriminative data mining and visualization technique arises.
2. The generative topographic mapping
The GTM as introduced in [1] models data $x \in \mathbb{R}^D$ by means of a mixture of Gaussians which is induced by a lattice of points $w$ in a low-dimensional latent space which can be used for visualization. The lattice points are mapped via $w \mapsto t = y(w,W)$ to the data space, where the function is parameterized by $W$; one can, for example, pick a generalized linear regression model based on Gaussian base functions

$$y : w \mapsto \Phi(w) \cdot W \qquad (1)$$

where the base functions $\Phi$ are equally spaced Gaussians with variance $\sigma^{-1}$. Every latent point induces a Gaussian

$$p(x \mid w, W, \beta) = \left(\frac{\beta}{2\pi}\right)^{D/2} \exp\left(-\frac{\beta}{2}\,\|x - y(w,W)\|^2\right) \qquad (2)$$

with variance $\beta^{-1}$, which gives the data distribution as a mixture of $K$ modes

$$p(x \mid W, \beta) = \sum_{k=1}^{K} p(w_k)\, p(x \mid w_k, W, \beta) \qquad (3)$$
where, usually, $p(w_k)$ is taken as the uniform distribution over the prototypes. Training of GTM optimizes the data log-likelihood

$$\ln \prod_{n=1}^{N} \left( \sum_{k=1}^{K} p(w_k)\, p(x^n \mid w_k, W, \beta) \right) \qquad (4)$$

by means of an expectation maximization (EM) approach with respect to the parameters $W$ and $\beta$. In the E step, the responsibility of mixture component $k$ for point $n$ is determined as

$$r_{kn} = p(w_k \mid x^n, W, \beta) = \frac{p(x^n \mid w_k, W, \beta)\, p(w_k)}{\sum_{k'} p(x^n \mid w_{k'}, W, \beta)\, p(w_{k'})} \qquad (5)$$

In the M step, the weights $W$ are determined by solving the equality

$$\Phi^T G_{\mathrm{old}} \Phi\, W_{\mathrm{new}}^T = \Phi^T R_{\mathrm{old}} X \qquad (6)$$

where $\Phi$ refers to the matrix of base functions evaluated at the points $w_k$, $X$ to the data points, $R$ to the responsibilities, and $G$ is a diagonal matrix with accumulated responsibilities $G_{kk} = \sum_n r_{kn}(W, \beta)$. The variance can be computed by solving

$$\frac{1}{\beta_{\mathrm{new}}} = \frac{1}{ND} \sum_{k,n} r_{kn}(W_{\mathrm{old}}, \beta_{\mathrm{old}})\, \|\Phi(w_k)\, W_{\mathrm{new}} - x^n\|^2 \qquad (7)$$

where $D$ is the data dimensionality and $N$ is the number of data points.
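To make the EM scheme concrete, the following NumPy sketch implements one EM cycle along the lines of Eqs. (5)–(7), assuming a uniform prior $p(w_k)$ and the squared Euclidean metric. All function and variable names are our own illustration, not part of any reference implementation.

```python
import numpy as np

def gtm_em_step(X, Phi, W, beta):
    """One EM cycle of GTM (sketch of Eqs. (5)-(7), uniform prior p(w_k)).

    X    : (N, D) data matrix
    Phi  : (K, M) base functions evaluated at the K latent points w_k
    W    : (M, D) weight matrix of the generalized linear map y(w, W)
    beta : inverse variance of the Gaussian modes
    """
    N, D = X.shape
    Y = Phi @ W                                              # prototypes t_k, shape (K, D)
    # E step: responsibilities r_kn (Eq. (5)); the uniform prior cancels
    d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # (K, N) squared distances
    log_p = -0.5 * beta * d2
    log_p -= log_p.max(axis=0)                               # stabilize the softmax over modes
    R = np.exp(log_p)
    R /= R.sum(axis=0)                                       # columns sum to one
    # M step: solve Phi^T G Phi W_new = Phi^T R X (Eq. (6))
    G = np.diag(R.sum(axis=1))                               # G_kk = sum_n r_kn
    W_new = np.linalg.solve(Phi.T @ G @ Phi, Phi.T @ R @ X)
    # variance update (Eq. (7))
    d2_new = (((Phi @ W_new)[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    beta_new = N * D / (R * d2_new).sum()
    return W_new, beta_new, R
```

Note that the responsibilities are computed in the log domain with the maximum subtracted per column, which avoids numerical underflow of the Gaussian densities for distant prototypes.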
Fig. 1. Log-likelihood (scale on the left) using the adapted metric before and after metric learning for every epoch of the adaptation of prototypes of GTM by means of EM, and the difference (scale on the right). Obviously, the difference is positive in all but the first few epochs, in which the decrease of the log-likelihood is very small as compared to its size. Hence this way of adapting the metric parameters seems reasonable.
3. Relevance learning
The principle of relevance learning has been introduced in [14] as a particularly simple and efficient method to adapt the metric of prototype-based classifiers according to the given situation at hand. It takes into account a relevance scheme of the data dimensions by substituting the squared Euclidean metric by the weighted form
$$d_\lambda(x, t) = \sum_{d=1}^{D} \lambda_d^2\, (x_d - t_d)^2 \qquad (8)$$
In [14], the Euclidean metric is substituted by the more general form (8) and, parallel to the prototype updates, the metric parameters $\lambda$ are adapted according to the given classification task. The principle is extended in [24,25] to the more general metric form
$$d_\Omega(x, t) = (x - t)^T \Omega^T \Omega\, (x - t) \qquad (9)$$
Using a square matrix $\Omega$, the product $\Omega^T \Omega$ is positive semi-definite, so a valid pseudo-metric is obtained this way. In [24,25], these metrics are considered in local and global form, i.e. the adaptive metric parameters can be identical for the full model, or they can be attached to every prototype present in the model. Here we introduce the same principle into GTM.
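The two metric forms can be sketched directly from Eqs. (8) and (9); the function names below are our own illustration.

```python
import numpy as np

def relevance_distance(x, t, lam):
    """Weighted squared distance of Eq. (8): sum_d lam_d^2 (x_d - t_d)^2."""
    return float(np.sum((lam ** 2) * (x - t) ** 2))

def matrix_distance(x, t, Omega):
    """Quadratic form of Eq. (9): (x - t)^T Omega^T Omega (x - t)."""
    z = Omega @ (x - t)
    return float(z @ z)
```

A diagonal $\Omega$ with entries $\lambda_d$ reproduces the relevance metric, so (8) is the special case of (9) with $\Omega = \mathrm{diag}(\lambda)$.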
3.1. Labeling of GTM
Assume that data point $x^n$ is equipped with label information $l_n$ which is an element of a finite set of different labels. By posterior labeling, GTM gives rise to a probabilistic classification of data points, assigning the label of prototype $t_k$ to data point $x^n$ with probability $r_{kn}$. Thereby, posterior labeling of GTM can be done in such a way that the classification error $\sum_{n=1}^{N} \sum_{k=1}^{K} r_{kn}\, s_k(x^n)$ is minimized, with $s_k(x^n)$ equal to zero if the prototype $t_k$ has the same label as $x^n$ and equal to one otherwise. Thus the prototype $t_k = y(w_k, W)$ is labeled

$$c(t_k) = \arg\max_c \left( \sum_{n : l_n = c} r_{kn} \right) \qquad (10)$$
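The posterior labeling of Eq. (10) amounts to accumulating responsibilities per class and taking the winning class per prototype; a minimal sketch (names are our own):

```python
import numpy as np

def label_prototypes(R, labels, n_classes):
    """Posterior labeling of Eq. (10): c(t_k) = argmax_c sum_{n: l_n = c} r_kn.

    R        : (K, N) responsibility matrix r_kn
    labels   : (N,) integer labels l_n in {0, ..., n_classes - 1}
    """
    K = R.shape[0]
    votes = np.zeros((K, n_classes))
    for c in range(n_classes):
        votes[:, c] = R[:, labels == c].sum(axis=1)  # accumulated responsibility per class
    return votes.argmax(axis=1)                      # winning class label per prototype
```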
3.2. Metric adaptation in GTM
We can introduce relevance learning into GTM by substituting the Euclidean metric in the Gaussian functions (2) by the more general diagonal metric (8), which includes relevance terms, or by the metric induced by a full matrix (9), which can also take correlations of the dimensions into account. Thereby, we can introduce one global metric for the full model, or, alternatively, we can introduce local metric parameters $\lambda_k$ or $\Omega_k$, respectively, for every prototype $t_k$. We refer to the latter version as the local method.
Using this posterior labeling of the prototypes, the parameters of the GTM model should be adapted such that the data log-likelihood is optimal. Analogous to [1], it can be seen that optimization of the parameters $W$ and $\beta$ of GTM can be done in the same way as beforehand, whereby the new metric structure (8) or (9) has to be used when computing the responsibilities (5).
We assume that the metric is changed during this optimization process on a slower time scale such that the auxiliary information is mirrored in the metric parameters. Thereby, we assume quasi-stationarity of the metric parameters when performing the original EM training. A similar procedure has been used in [24] for simultaneous metric and prototype learning, and [30] provides an explanation of to what extent this procedure is reasonable in the context of self-organizing maps. Essentially, the adaptation can be understood as an adiabatic process [5], overlaying fast parameter adaptation by EM optimization of the log-likelihood with slow metric adaptation according to the objectives detailed below. This assumption is substantiated if the data log-likelihood is evaluated in a concrete learning process. Fig. 1 displays the value of the data log-likelihood before and after metric adaptation for every epoch in a typical learning process as detailed below (the adaptation concerns local full matrices adapted using the robust soft learning vector quantization cost function for the letter data set, using the parameters detailed in the experiments). Obviously, the data log-likelihood increases using metric adaptation for all but the first few epochs, in which the size of the decrease is almost negligible as compared to the size of the log-likelihood.
Now the question is how to design an efficient scheme for metric learning based on the structure as provided by GTM and the given auxiliary labeling. Unlike approaches such as fuzzy-labeled SOM [23], we use a fully supervised scheme to learn the metric parameters. Unlike the original framework of learning metrics [20,21], however, we make use of the prototype-based structure of the induced GTM classifier, which allows us to efficiently update local metric parameters without the necessity of a computationally costly approximation of the Riemannian metric induced by the general Fisher information. For this purpose, we introduce two different cost functions $E$ (see below) motivated from prototype-based learning which are used to optimize the metric parameters.
For metric adaptation, we simply use a stochastic gradient descent on the cost functions. Naturally, more advanced schemes would be possible, but a simple gradient descent already leads to satisfactory results, as we will demonstrate in experiments. To avoid convergence to trivial optima such as zero, we pose constraints on the metric parameters of the form $\|\lambda\| = 1$ or $\mathrm{trace}(\Omega^T \Omega) = 1$, respectively. This is achieved by normalization of the values, i.e. after every gradient step, $\lambda$ is divided by its length, and $\Omega$ is divided by the square root of $\mathrm{trace}(\Omega^T \Omega)$. Thus, a high-level description of the algorithm is possible as depicted in Table 1. Usually, we alternate between one EM step, one epoch of gradient descent, and normalization in our experiments. Since EM optimization is much faster than gradient descent, this way, we can enforce that the metric parameters are adapted on a slower time scale. Hence we can assume an approximately constant metric for the EM optimization, i.e. the EM scheme optimizes the likelihood as before. Metric adaptation takes place considering quasi-stationary states of the GTM solution due to the slower time scale.
Note that metric adaptation introduces a large number of additional parameters into the model, depending on the input dimensionality. One can raise the question whether this leads to strong overfitting of the model. We will see in experiments that this is not the case: when evaluating the clustering performance of the resulting GTM, the training error is representative of the generalization error. One can substantiate this experimental finding with a theoretical counterpart: using posterior labeling, GTM offers a prototype-based classification scheme with local adaptive metrics. This function class has a supervised pendant: generalized matrix learning vector quantization as introduced in [24]. The worst-case generalization ability of the latter class can be investigated based on classical computational learning theory. It turns out that its generalization ability does not depend on the number of parameters adapted during training; rather, large-margin generalization bounds can be derived. In consequence, very good generalization ability can be proved (and experimentally observed) as detailed in [24]. Since the formal argumentation in [24] depends on the considered function class only and not on the way in which training takes place, the same generalization bounds apply to GTM with adaptive metrics as introduced here.
Now, we discuss concrete cost functions $E$ for the metric adaptation.
Table 1. Integration of relevance learning into GTM.

INIT
REPEAT
  E step: determine $r_{kn}$ based on the general metric
  M step: determine $W$ and $\beta$ as in GTM
  label prototypes
  adapt metric parameters by stochastic gradient descent of $E$
  normalize the metric parameters
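The normalization step of Table 1 can be sketched as follows; the helper names in the commented loop (`e_step`, `m_step`, `gradient_epoch`) are hypothetical placeholders for the steps above, not functions defined in the paper.

```python
import numpy as np

def normalize_metric(lam=None, Omega=None):
    """Normalization step of Table 1: enforce ||lam|| = 1 or trace(Omega^T Omega) = 1."""
    if lam is not None:
        lam = lam / np.linalg.norm(lam)
    if Omega is not None:
        Omega = Omega / np.sqrt(np.trace(Omega.T @ Omega))
    return lam, Omega

# Sketch of the alternation of Table 1 (hypothetical helpers):
# for epoch in range(n_epochs):
#     R = e_step(X, Phi, W, beta, Omega)           # responsibilities with the general metric
#     W, beta = m_step(X, Phi, R)                  # standard GTM M step
#     c = label_prototypes(R, labels)              # posterior labeling, Eq. (10)
#     Omega = gradient_epoch(X, labels, c, Omega)  # stochastic gradient descent on E
#     _, Omega = normalize_metric(Omega=Omega)
```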
3.3. Generalized relevance GTM (GRGTM)
Metric parameters have the form $\lambda$ or $\lambda_k$ for a diagonal metric (8) and $\Omega$ or $\Omega_k$ for a full matrix (9), depending on whether a global or local scheme is considered. In the following, we define the general parameter $\Lambda_k$ which can be chosen as one of these four possibilities depending on the given setting. Thereby, we can assume that $\Lambda_k$ can be realized by a matrix which has diagonal form (for relevance learning) or full matrix form (for matrix updates).
The cost function of generalized relevance GTM is taken from generalized relevance learning vector quantization (GRLVQ), which can be interpreted as maximizing the hypothesis margin of a prototype-based classification scheme such as LVQ [14,24]. The cost function has the form
$$E(\Lambda) = \sum_n E^n(\Lambda) = \sum_n \mathrm{sgd}\left( \frac{d_{\Lambda^+}(x^n, t^+) - d_{\Lambda^-}(x^n, t^-)}{d_{\Lambda^+}(x^n, t^+) + d_{\Lambda^-}(x^n, t^-)} \right) \qquad (11)$$
where $\mathrm{sgd}(x) = (1 + \exp(-x))^{-1}$, $t^+$ is the closest prototype in the data space with the same label as $x^n$, and $t^-$ is the closest prototype with a different label.
The adaptation formulas can be derived thereof by taking the derivatives. Depending on the form of the metric, the derivative of the metric is
$$\frac{\partial d_\lambda(x,t)}{\partial \lambda_i} = 2\, \lambda_i\, (x_i - t_i)^2 \qquad (12)$$
for a diagonal metric and
$$\frac{\partial d_\Omega(x,t)}{\partial \Omega_{ij}} = 2\, (x_j - t_j) \sum_d \Omega_{id}\, (x_d - t_d) \qquad (13)$$
for a full matrix. For simplicity, we denote the respective squared distances to the closest correct and wrong prototype by $d^+ = d_{\Lambda^+}(x^n, t^+)$ and $d^- = d_{\Lambda^-}(x^n, t^-)$. The term $\mathrm{sgd}'$ is a shorthand notation for $\mathrm{sgd}'\left((d^+ - d^-)/(d^+ + d^-)\right)$. Given a data point $x^n$, the derivative of the corresponding summand of the cost function $E$ with respect to the metric parameters yields
$$\frac{\partial E^n}{\partial \Lambda^+} = 2\, \mathrm{sgd}' \cdot \frac{d^-}{(d^+ + d^-)^2} \cdot \frac{\partial d^+}{\partial \Lambda^+} \qquad (14)$$
for the parameters of the closest correct prototype and
$$\frac{\partial E^n}{\partial \Lambda^-} = -2\, \mathrm{sgd}' \cdot \frac{d^+}{(d^+ + d^-)^2} \cdot \frac{\partial d^-}{\partial \Lambda^-} \qquad (15)$$
for the parameters attached to the closest wrong prototype. All other parameters are not affected. These updates take place for the local modeling of parameters, which we refer to as local generalized relevance GTM (LGRGTM) or local generalized matrix GTM (LGMGTM), respectively. If metric parameters are global, the update corresponds to the sum of these two derivatives, referred to as generalized relevance GTM (GRGTM) or generalized matrix GTM (GMGTM), respectively.
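As an illustration, one stochastic gradient step for local relevance vectors can be sketched by combining Eqs. (12), (14) and (15); the function and the learning rate `lr` are our own hypothetical sketch, not code from the paper.

```python
import numpy as np

def grgtm_gradient_step(x, t_plus, t_minus, lam_plus, lam_minus, lr):
    """One stochastic gradient descent step on Eq. (11) for local relevance vectors.

    x         : data point
    t_plus    : closest prototype with the correct label
    t_minus   : closest prototype with a wrong label
    lam_*     : relevance vectors attached to the two prototypes
    lr        : gradient step size
    """
    d_plus = np.sum(lam_plus ** 2 * (x - t_plus) ** 2)       # Eq. (8)
    d_minus = np.sum(lam_minus ** 2 * (x - t_minus) ** 2)
    s = (d_plus - d_minus) / (d_plus + d_minus)
    sgd = 1.0 / (1.0 + np.exp(-s))
    sgd_prime = sgd * (1.0 - sgd)                            # derivative of the logistic function
    dd_plus = 2.0 * lam_plus * (x - t_plus) ** 2             # Eq. (12)
    dd_minus = 2.0 * lam_minus * (x - t_minus) ** 2
    denom = (d_plus + d_minus) ** 2
    # descent: subtract Eq. (14) for lambda+, subtract the negative Eq. (15) for lambda-
    lam_plus = lam_plus - lr * 2.0 * sgd_prime * d_minus / denom * dd_plus
    lam_minus = lam_minus + lr * 2.0 * sgd_prime * d_plus / denom * dd_minus
    return lam_plus, lam_minus
```

After each such epoch the relevance vectors would be renormalized to unit length, as described above.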
3.4. Robust soft GTM (RSGTM)
Unlike GRLVQ, robust soft LVQ (RSLVQ) [29] has the goal to optimize a statistical model which defines the data distribution. It is assumed that data are given by a Gaussian mixture of labeled prototypes. The objective is to maximize the logarithm of the probability of a data point being generated by a prototype of the correct class versus the overall probability. In the limit of small variance of the Gaussians, a learning rule results which is similar to the standard LVQ rule. The objective for a general variance $\beta^{-1}$ of the Gaussian modes corresponds to the
following cost function:

$$E(\Lambda) = \sum_n E^n(\Lambda) = \sum_n \log\left( \frac{\sum_{k : c(t_k) = l_n} p(w_k)\, p(x^n \mid w_k, W, \beta)}{p(x^n \mid W, \beta)} \right) \qquad (16)$$
Here, we can choose Gaussian modes as provided by GTM, i.e. the modes and the corresponding mixture are given in analogy to formulas (2) and (3), where the new parameterized metric (8) or (9) as well as the labeling (10) of GTM is used.
We obtain the update rules by taking the derivatives, as beforehand:
$$\frac{\partial E^n}{\partial \Lambda_k} = \left( s_k(x^n)(q_{kn} - r_{kn}) - (1 - s_k(x^n))\, r_{kn} \right) \left( \frac{1}{S_k} \cdot \frac{\partial S_k}{\partial \Lambda_k} - \frac{\beta}{2} \cdot \frac{\partial d_{\Lambda_k}(x^n, t_k)}{\partial \Lambda_k} \right) \qquad (17)$$
where $s_k(x^n)$ indicates whether prototype and data label coincide,
$$q_{kn} = \frac{p(x^n \mid w_k, W, \beta)\, p(w_k)}{\sum_{k' : c(t_{k'}) = l_n} p(x^n \mid w_{k'}, W, \beta)\, p(w_{k'})} \qquad (18)$$
refers to the probability of mode k among the correct modes, and
$$S_k = \left(\frac{\beta}{2\pi}\right)^{D/2} \cdot \det(\Lambda_k) \qquad (19)$$
normalizes the Gaussian modes to arrive at valid probabilities. The derivative is
$$\frac{1}{S} \cdot \frac{\partial S}{\partial \lambda_i} = \frac{1}{\lambda_i} \qquad (20)$$
for a relevance vector and
$$\frac{1}{S} \cdot \frac{\partial S}{\partial \Omega_{ij}} = (\Omega^{-1})_{ji} \qquad (21)$$
for full matrices.
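Eq. (21) is the standard derivative of a log-determinant; the following sketch verifies it numerically (the matrix values are arbitrary illustrations).

```python
import numpy as np

def logdet_grad(Omega):
    """(1/S) dS/dOmega for S proportional to det(Omega), Eq. (21): the transposed inverse."""
    return np.linalg.inv(Omega).T

# finite-difference check of a single entry of Eq. (21)
Omega = np.array([[2.0, 0.5], [0.3, 1.5]])
eps = 1e-6
Op = Omega.copy()
Op[0, 1] += eps
numeric = (np.log(np.linalg.det(Op)) - np.log(np.linalg.det(Omega))) / eps
```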
Table 2. Parameters used for training.

Data     | Number of prototypes | Number of base functions
Landsat  | 10×10                | 4×4
Phoneme  | 10×10                | 4×4
Letter   | 30×30                | 30×30
Fig. 2. Mean accuracy of the classification obtained by diverse supervised GTM schemes as introduced in this article and alternative state-of-the-art approaches.
We refer to this version as local relevance robust soft GTM (LRSGTM) and local matrix robust soft GTM (LMRSGTM), respectively. The global versions can be obtained by adding the derivatives; we refer to these algorithms as relevance robust soft GTM (RSGTM) and matrix robust soft GTM (MRSGTM), respectively.
4. Experiments
4.1. Classification
We test the efficiency of relevance learning in GTM on three benchmark data sets as described in [21,33]: Landsat Satellite data with 36 dimensions, 6 classes, and 6435 samples, Letter Recognition data with 16 dimensions, 26 classes, and 20,000 samples, and Phoneme data with 20 dimensions, 13 classes, and 3656 samples. Prior to training, all data sets were normalized by subtracting the mean and dividing by the standard deviation. GTM is initialized using the first two principal components. The mapping $y(w, W)$ is induced by generalized linear regression based on Gaussian base functions. The learning rate of the gradient descent for the metric parameters has been optimized for the data and is chosen in the range of $10^{-6}$ to $10^{-2}$. More precisely, an exhaustive search of the parameter range is done and the value which leads to the best convergence of the relevance profile is picked for the learning rate. Thereby, the number of epochs is chosen as 100, which is sufficient to allow convergence of the matrix parameters; typically, convergence of the EM scheme can be observed on a faster scale. The number of prototypes and base functions has been chosen to suit the size of the data; it is shown in Table 2. Due to the complexity of the training, an exhaustive search of these parameters has been avoided, but reasonable numbers have been chosen. Typically, the results are only mildly influenced by small changes of these numbers. The variance of the Gaussian base functions has been chosen such that it coincides with the distance between neighboring base functions.
We report the results of a repeated stratified 10-fold cross-validation with one repeat (letter) and 10 repeats (phoneme, landsat), respectively, reporting also the variance over the repeats. We evaluate the models in comparison to several recent alternative supervised visualization tools by means of the test error
Fig. 3. Visualization of the result of GTM (top) and robust soft GTM with local matrix learning (bottom) on the MNIST data set. Pie charts give the responsibility of the prototypes for the given classes. Supervision achieves a better separation of the classes within receptive fields of prototypes, introducing dead units if necessary.
Fig. 4. Visualization of the result of GTM (top) and robust soft GTM with local matrix learning (bottom) on the Phoneme data set. Pie charts give the responsibility of the prototypes for the given classes. Supervision achieves a better separation of the classes within receptive fields as can be seen by the pie charts.
obtained in the cross-validation. These alternatives are taken from [33].¹ The alternative methods include parametric embedding (PE), supervised Isomap (S-Isomap), colored maximum variance unfolding (MUHSIC), multiple relational embedding (MRE), neighborhood component analysis (NCA), and supervised neighborhood retrieval visualizer (SNeRV) based on different weightings of the retrieval
1 We would like to thank the authors of [34] for providing the results.
objectives in the cost function of the model ($\lambda$) and a Riemannian metric based on the Fisher information matrix.
The results are shown in Fig. 2. In two of the three cases, metric adaptation improves the classification accuracy compared to simple GTM. Thereby, matrix adaptation yields superior results compared to the adaptation of a simple relevance vector. Further, results based on robust soft learning vector quantization seem slightly better for all data sets. For all three data sets, we obtain state-of-the-art results which are comparable to the best alternative supervised visualization tools currently available in the literature.
We also report the results obtained by a five-nearest-neighbor classifier for the original data in the original (high-dimensional) data space. Interestingly, for all cases, the supervised results using the full information are only slightly better than the results obtained by GTM with local matrix adaptation in two dimensions. This demonstrates the high quality of the supervised visualization. Note that, unlike the experiments in [33], which are restricted to a subset of 1500 samples in all cases due to complexity issues, we can train on the full data set due to the efficiency of relevance GTM, and, in two of the three cases, we can even perform a 10-fold repeat of the experiments in reasonable time.
4.2. Visualization
The result of a visualization of the Phoneme data set and the MNIST data set (this data set consists of 60,000 points with 768 dimensions representing the 10 digits; a subsample of 6000 images was used in this case) using robust soft GTM with local matrix adaptation is shown in Figs. 3 and 4, whereby the full data set is used for training. A comparison to simple GTM shows the ability of matrix learning to arrive at a topographic mapping which better mirrors the underlying class structures: the pie charts display the percentage of points of the different classes assigned to the respective prototype based on the receptive fields. Interestingly, in both cases, the pie charts obtained with metric learning display fewer classes for the single prototypes, corresponding to better separated receptive fields, whereas the classes are spread among the prototypes if metric adaptation does not take place. This is also mirrored in the better classification accuracy of GTM with matrix learning. The arrangement of the classes on the map differs for the different visualizations. For metric learning, multiple modes of the classes can be observed. For standard GTM, the distribution is less clear since the single prototypes combine different classes in their receptive fields.
5. Discussion
In this contribution, a method has been proposed to integrate auxiliary information in terms of relevance updates into GTM; the benefit of this approach has been demonstrated on several benchmarks. Unlike approaches such as fuzzy-labeled SOM [24], metric parameters are adapted in a supervised fashion based on the classification ability of the model. Like [22], the work is based on adaptive metrics to incorporate auxiliary information into the model. Unlike this latter work [22], however, the proposed method relies on the prototype-based nature of GTM and transfers the relevance update scheme of supervised learning schemes such as [15,25] to this setting, resulting in an efficient topographic mapping. As demonstrated on several benchmarks, the classification accuracy is competitive with state-of-the-art methods for supervised visualization, whereby GTM provides additional functionality due to the explicit topographic mapping of the latent space into the observation space accompanied by an explicit generative statistical model. As demonstrated by means of visualization, the class separation is much more accurate for supervised GTM compared to the original method, thus clearly focusing on the relevant aspects for the given classification.
However, the evaluation as proposed in this contribution can only serve as an indicator of whether useful mappings are obtained by relevance GTM. Since data visualization is an inherently ill-posed problem, a clear evaluation by means of a single (or a few) quantitative measures seems hardly satisfactory, the respective goal often being situation dependent. For a proper evaluation of the model for concrete tasks, an empirical user study would be interesting.
References
[1] C. Bishop, M. Svensen, C. Williams, The generative topographic map, Neural Computation 10 (1) (1998) 215–234.
[2] K. Bunte, B. Hammer, P. Schneider, M. Biehl, Nonlinear discriminative data visualization, in: M. Verleysen (Ed.), ESANN 2009, d-side Publishing, 2009, pp. 65–70.
[3] G. Baudat, F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12 (2000) 2385–2404.
[4] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (2003) 1373–1396.
[5] M. Born, V.A. Fock, Beweis des Adiabatensatzes, Zeitschrift für Physik A Hadrons and Nuclei 51 (3–4) (1928) 165–180.
[6] K. Bunte, B. Hammer, T. Villmann, M. Biehl, A. Wismuller, Exploratory observation machine (XOM) with Kullback–Leibler divergence for dimensionality reduction and visualization, in: M. Verleysen (Ed.), ESANN'10, D-side, 2010, pp. 87–92.
[7] K. Bunte, B. Hammer, M. Biehl, Nonlinear dimension reduction and visualization of labeled data, in: X. Jiang, N. Petkov (Eds.), International Conference on Computer Analysis of Images and Patterns, Springer, 2009, pp. 1162–1170.
[8] K. Bunte, B. Hammer, A. Wismueller, M. Biehl, Adaptive local dissimilarity measures for discriminative dimension reduction of labeled data, Neurocomputing 73 (7–9) (2010) 1074–1092.
[9] D. Cohn, Informed projections, in: S. Becker, S. Thrun, K. Obermayer (Eds.), NIPS, MIT Press, 2003, pp. 849–856.
[10] X. Geng, D.-C. Zhan, Z.-H. Zhou, Supervised nonlinear dimensionality reduction for visualization and classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B 35 (6) (2005) 1098–1107.
[11] J. Goldberger, S. Roweis, G. Hinton, R. Salakhutdinov, Neighbourhood components analysis, Advances in Neural Information Processing Systems, vol. 17, MIT Press, 2004, pp. 513–520.
[12] A. Gorban, B. Kegl, D. Wunsch, A. Zinovyev (Eds.), Principal Manifolds for Data Visualization and Dimensionality Reduction, Springer, 2008.
[13] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research 3 (2003) 1157–1182.
[14] B. Hammer, T. Villmann, Generalized relevance learning vector quantization, Neural Networks 15 (8–9) (2002) 1059–1068.
[15] T. Iwata, K. Saito, N. Ueda, S. Stromsten, T.L. Griffiths, J.B. Tenenbaum, Parametric embedding for class visualization, Neural Computation 19 (9) (2007) 2536–2556.
[16] J.A. Lee, M. Verleysen, Nonlinear Dimensionality Reduction, Springer, 2007.
[17] J.A. Lee, M. Verleysen, Quality Assessment of Dimensionality Reduction: Rank-based Criteria, Neurocomputing 72 (7–9), Elsevier, 2009, pp. 1431–1443.
[18] B. Ma, H. Qu, H. Wong, Kernel clustering-based discriminant analysis, Pattern Recognition 40 (1) (2007) 324–327.
[19] R. Memisevic, G. Hinton, Multiple relational embedding, in: L.K. Saul, Y. Weiss, L. Bottou (Eds.), Advances in Neural Information Processing Systems, vol. 17, MIT Press, Cambridge, MA, 2005, pp. 913–920.
[20] J. Peltonen, Data exploration with learning metrics, D.Sc. Thesis, Dissertations in Computer and Information Science, Report D7, Espoo, Finland, 2004.
[21] J. Peltonen, A. Klami, S. Kaski, Improved learning of Riemannian metrics for exploratory analysis, Neural Networks 17 (2004) 1087–1100.
[22] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
[23] F.-M. Schleif, B. Hammer, M. Kostrzewa, T. Villmann, Exploration of mass-spectrometric data in clinical proteomics using learning vector quantization methods, Briefings in Bioinformatics 9 (2) (2008) 129–143.
[24] P. Schneider, M. Biehl, B. Hammer, Adaptive relevance matrices in learning vector quantization, Neural Computation 21 (2009) 3532–3561.
[25] P. Schneider, M. Biehl, B. Hammer, Distance learning in discriminative vector quantization, Neural Computation 21 (2009) 2942–2969.
[26] J. Tenenbaum, V. de Silva, J. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323.
[27] L. van der Maaten, G. Hinton, Visualizing high-dimensional data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.
[28] L. van der Maaten, E. Postma, H. van den Herik, Dimensionality reduction: a comparative review, Technical Report TiCC-TR 2009-005, Tilburg University, 2009.
[29] S. Seo, K. Obermayer, Soft learning vector quantization, Neural Computation 15 (7) (2003) 1589–1604.
[30] A. Spitzner, D. Polani, Order parameters for self-organizing maps, in: L. Niklasson, M. Boden, T. Ziemke (Eds.), Proceedings of the 8th International Conference on Artificial Neural Networks (ICANN 98), vol. 2, Springer, 1998, pp. 517–522.
[31] J. Venna, Dimensionality reduction for visual exploration of similarity structures, Ph.D. Thesis, Helsinki University of Technology, Espoo, Finland, 2007.
[32] J. Venna, S. Kaski, Local multidimensional scaling, Neural Networks 19 (2006)89–99.
[33] J. Venna, J. Peltonen, K. Nybo, H. Aidos, S. Kaski, Information retrieval perspective to nonlinear dimensionality reduction for data visualization, Journal of Machine Learning Research 11 (2010) 451–490.
[34] T. Villmann, B. Hammer, F.-M. Schleif, T. Geweniger, W. Herrmann, Fuzzy classification by fuzzy labeled neural gas, Neural Networks 19 (2006) 772–779.
[35] M. Ward, G. Grinstein, D.A. Keim, Interactive Data Visualization: Foundations, Techniques, and Applications, A.K. Peters, Ltd., 2010.
[36] K.Q. Weinberger, L.K. Saul, An introduction to nonlinear dimensionality reduction by maximum variance unfolding, in: Proceedings of the 21st National Conference on Artificial Intelligence, AAAI, 2006.
Andrej Gisbrecht received his Diploma in Computer Science in 2009 from the Clausthal University of Technology, Germany, and continued there as a Ph.D. student. Since early 2010 he has been a Ph.D. student at the Cognitive Interaction Technology Center of Excellence at Bielefeld University, Germany.
Barbara Hammer received her Ph.D. in Computer Science in 1995 and her venia legendi in Computer Science in 2003, both from the University of Osnabrueck, Germany. From 2000 to 2004, she was leader of the junior research group 'Learning with Neural Methods on Structured Data' at the University of Osnabrueck before accepting an offer as Professor for Theoretical Computer Science at Clausthal University of Technology, Germany, in 2004. Since 2010, she has held a professorship for Theoretical Computer Science for Cognitive Systems at the CITEC Cluster of Excellence at Bielefeld University, Germany. Several research stays have taken her to Italy, the UK, India, France, the Netherlands, and the USA. Her areas of expertise include hybrid systems, self-organizing maps, clustering, and recurrent networks, as well as applications in bioinformatics, industrial process monitoring, and cognitive science.