


Neurocomputing 74 (2011) 1359–1371


Relational generative topographic mapping

Andrej Gisbrecht, Bassam Mokbel, Barbara Hammer

CITEC Cluster of Excellence, Bielefeld University, Germany

Article info

Available online 24 February 2011

Keywords: Topographic mapping; Dissimilarity data; Relational data mining

0925-2312/$ - see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2010.12.011

This work has been supported by the DFG under Grant no. HA2719/4-1 and by the Cluster of Excellence 277 Cognitive Interaction Technology funded in the framework of the German Excellence Initiative.

Corresponding author: B. Hammer. Tel.: +49 521 106 12115; fax: +49 521 106 12181. E-mail address: [email protected].

Abstract

The generative topographic mapping (GTM) has been proposed as a statistical model to represent high-dimensional data by a distribution induced by a sparse lattice of points in a low-dimensional latent space, such that visualization, compression, and data inspection become possible. The formulation in terms of a generative statistical model has the benefit that relevant parameters of the model can be determined automatically based on an expectation maximization scheme. Further, the model offers a large flexibility, such as a direct out-of-sample extension and the possibility to obtain different degrees of granularity of the visualization without the need of additional training. Original GTM is restricted to Euclidean data points in a given Euclidean vector space. Often, data are not explicitly embedded in a Euclidean vector space; rather, pairwise dissimilarities of data can be computed, i.e. the relations between data points are given rather than the data vectors themselves. We propose a method which extends the GTM to relational data and which allows us to achieve a sparse representation of data characterized by pairwise dissimilarities in latent space. The method, relational GTM, is demonstrated on several benchmarks.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

More and more electronic data become available in virtually all areas of life including, for example, biomedical domains, robotics, the web, or multimedia applications, such that powerful data mining tools are needed to support humans to inspect and interpret this information. Also, rapidly advancing technology, such as improved sensor technology and advanced methods of data preprocessing and data storage, makes the data more and more complex concerning data dimensionality and the information content contained in the representation. Therefore, often, a simple comparison of data in terms of the Euclidean norm and a standard representation by means of Euclidean vectors is no longer appropriate to capture the relevant aspects of the data. Rather, dissimilarity measures which are adjusted to the data type and application area at hand should be used, including, for example, alignment distances for genomic sequence analysis in bioinformatics, the compression distance to compare texts, or structure kernels to compare complex graphs and tree structures. For this reason, data mining tools which rely solely on a dissimilarity representation of data offer powerful methods for problem adapted data modeling via the canonical interface offered by the dissimilarity matrix.

Classical data mining tools such as the self-organizing map (SOM) or its statistical counterpart, the generative topographic mapping (GTM), provide a sparse representation of high-dimensional data by means of latent points arranged in a low-dimensional neighborhood structure which is useful for visualization. However, they have been introduced for Euclidean vectors only [20,3]. Several extensions of SOM to the more general setting of data characterized by pairwise relations have been proposed, including median SOM, which restricts prototype locations to data points [21], online and batch SOM using a kernelization of the classical approach [4,30], and methods which rely on deterministic annealing techniques borrowed from statistical physics [7]. These methods have the drawback that they can deal with discrete and restricted prototypes only (median SOM), they are restricted to kernels (kernel SOM), or they require an additional inner loop due to the necessary annealing step (deterministic annealing techniques). For specific data types such as recursive structures, the dynamics of SOM can be extended to incorporate the dependencies of data constituents; see e.g. the overviews [1,10]. For GTM, a complex noise model as proposed in [29] allows the extension of the method to discrete structures such as sequences. Further, a kernelization of the methods is possible, as described in [4,24]. These proposals, however, are applicable to specific (recursive) data structures or kernels only.

Recently, an intuitive extension of SOM to dissimilarity data has been proposed in [11] which relies on techniques as introduced in [12]: assume that only a dissimilarity matrix characterizes the data and an explicit vectorial representation is unknown. If prototypes have the special form of convex combinations of data points, classical SOM can be computed indirectly by adapting the coefficient vectors without any explicit reference to the underlying vector space or an explicit formula of the dissimilarity measure. The resulting algorithm, relational SOM, arrives at a sparse representation of dissimilarity data in terms of virtual prototypes represented by coefficient vectors. Unlike median SOM, a continuous adaptation of prototypes is possible via the implicit representation of prototypes in terms of coefficient vectors. Interestingly, the algorithm can be interpreted as an implicit application of the SOM algorithm for an unknown vector space embedding of the underlying data, as shown in [11]. Since the algorithm relies on the dissimilarities only, this shows the invariance of the method with respect to the chosen embedding. The algorithm can be extended to an approximate iterative scheme which drastically reduces the computation time and space requirements, resulting in a linear algorithm for dissimilarity data, as proposed in [11]. This way, an efficient data mining method for very large dissimilarity data results.

SOM has the drawback that it relies on a heuristic motivation, albeit a foundation of a slightly altered version in terms of a cost function is possible [13]. The generative topographic mapping offers an alternative based on a generative statistical model [3]. It models a restricted Gaussian mixture model where the Gaussian centers are induced by a mapping of prototypes from a low-dimensional latent space. This way, visualization and sparse representation of data become possible. Unlike SOM, GTM training can be derived as a maximization of the data log-likelihood by an expectation maximization scheme. Further, an explicit mapping of the latent space to the data space is learned, such that data can be visualized at any desired degree of granularity by choosing appropriate lattice points in the latent space.

In this contribution, we extend the principle of relational data processing by means of an implicit representation of prototypes to GTM. For this purpose, we use the trick of an indirect representation of prototypes in the image space in terms of linear combinations of data points and the associated possibility to compute distances in this space without an explicit reference to the vector representation of points. This way, the EM scheme of GTM can be transferred to the new setting to obtain the parameters of the model by maximizing the data log-likelihood. The efficiency and feasibility of this method, relational GTM, is demonstrated on several benchmark data sets given by dissimilarity matrices.

2. The generative topographic mapping

The GTM [3] provides a generative stochastic model of data $x \in \mathbb{R}^D$ which is induced by a mixture of Gaussians with centers induced by a regular lattice of points $w$ in latent space. These are mapped to prototypical target vectors

$w \mapsto t = y(w,W)$   (1)

in the data space, where the function $y$ is parameterized by $W$. Typically, a generalized linear regression model

$y : w \mapsto \Phi(w) \cdot W$   (2)

induced by base functions $\Phi$ such as equally spaced Gaussians with variance $\sigma^{-1}$ is chosen. Every latent point induces a Gaussian distribution

$p(x \mid w,W,\beta) = \left(\frac{\beta}{2\pi}\right)^{D/2} \exp\left(-\frac{\beta}{2}\,\|x - y(w,W)\|^2\right)$   (3)

with variance $\beta^{-1}$, which generates a mixture of $K$ modes

$p(x \mid W,\beta) = \sum_{k=1}^{K} p(w_k)\, p(x \mid w_k,W,\beta)$   (4)

where $p(w_k)$ is often chosen as a uniform distribution, i.e. $p(w_k) = 1/K$. GTM training optimizes the data log-likelihood

$\ln \prod_{n=1}^{N} \left( \sum_{k=1}^{K} p(w_k)\, p(x_n \mid w_k,W,\beta) \right)$   (5)

with respect to $W$ and $\beta$, where independence of the data points $x_n$ is assumed. This can be done by means of an EM approach which treats the generative mixture component $w_k$ for a data point $x_n$ as hidden parameter. Choosing a generalized linear regression model and a distribution of the latent points which is uniformly peaked at the lattice positions, EM training can be computed explicitly. It in turn computes the responsibilities

$R_{kn}(W,\beta) = p(w_k \mid x_n,W,\beta) = \frac{p(x_n \mid w_k,W,\beta)\, p(w_k)}{\sum_{k'} p(x_n \mid w_{k'},W,\beta)\, p(w_{k'})}$   (6)

of component $k$ for point number $n$, and the model parameters by means of the formulas

$\Phi^T G_{\mathrm{old}} \Phi\, W_{\mathrm{new}} = \Phi^T R_{\mathrm{old}} X$   (7)

for $W$, where $\Phi$ refers to the matrix of base functions evaluated at the points $w_k$, $X$ to the data points, $R$ to the responsibilities, and $G$ is a diagonal matrix with accumulated responsibilities $G_{kk} = \sum_n R_{kn}(W,\beta)$. The variance can be computed by

$\frac{1}{\beta_{\mathrm{new}}} = \frac{1}{ND} \sum_{k,n} R_{kn}(W_{\mathrm{old}},\beta_{\mathrm{old}})\, \|\Phi(w_k)\, W_{\mathrm{new}} - x_n\|^2$   (8)

where $D$ is the data dimensionality and $N$ the number of data points.
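For concreteness, the EM updates (6)-(8) can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions chosen for brevity (toy data, small lattices, random initialization), not the authors' implementation; a full GTM would additionally augment the basis matrix with bias and linear terms and initialize from PCA.

import numpy as np

rng = np.random.default_rng(0)
N, D, K, M = 200, 3, 25, 9            # data points, data dim, latent points, basis functions
X = rng.normal(size=(N, D))           # toy data; any (N, D) array works

# regular lattices of latent points and Gaussian basis-function centres
grid = np.linspace(-1, 1, 5)
latent = np.array([(a, b) for a in grid for b in grid])                    # K x 2
centres = np.array([(a, b) for a in np.linspace(-1, 1, 3)
                            for b in np.linspace(-1, 1, 3)])               # M x 2
sigma2 = 1.0                          # basis-function variance

# Phi: K x M matrix of basis functions evaluated at the latent points, Eq. (2)
Phi = np.exp(-((latent[:, None, :] - centres[None, :, :]) ** 2).sum(-1) / (2 * sigma2))

W = rng.normal(scale=0.1, size=(M, D))
beta = 1.0

for _ in range(30):
    T = Phi @ W                                             # K x D targets, Eqs. (1)-(2)
    dist = ((X[None, :, :] - T[:, None, :]) ** 2).sum(-1)   # K x N squared distances
    # E-step: responsibilities, Eq. (6), with uniform p(w_k) = 1/K
    logp = -0.5 * beta * dist
    logp -= logp.max(axis=0)                                # numerical stabilization
    R = np.exp(logp)
    R /= R.sum(axis=0)
    # M-step: solve Eq. (7) for W, then update beta via Eq. (8)
    G = np.diag(R.sum(axis=1))                              # G_kk = sum_n R_kn
    W = np.linalg.lstsq(Phi.T @ G @ Phi, Phi.T @ R @ X, rcond=None)[0]
    dist = ((X[None, :, :] - (Phi @ W)[:, None, :]) ** 2).sum(-1)
    beta = N * D / (R * dist).sum()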

3. Relational GTM

We assume that data $x$ are given only indirectly in terms of pairwise dissimilarities $d_{ij} = \|x_i - x_j\|^2$, but the vector representation $x$ of the data is unknown. We assume, however, that vectors $x$ exist which yield the dissimilarity matrix, albeit their representation is not known. Thus, for general prototypes $t$, the probability (3) cannot be computed, nor is it possible to determine prototypical targets at all, if no embedding vector space is known. In [12], the following fundamental observation is presented: assume that prototypes are restricted to linear combinations of data points of the following form:

$t_k = \sum_{n=1}^{N} \alpha_{kn}\, x_n \quad \text{where} \quad \sum_{n=1}^{N} \alpha_{kn} = 1$   (9)

Then, the prototypes $t_k$ can be represented indirectly by means of the coefficient vector $\alpha_k$ and, further, distances of data points and prototypes can be computed as in [12]

$\|x_n - t_k\|^2 = [D\alpha_k]_n - \tfrac{1}{2}\, \alpha_k^T D \alpha_k$   (10)

where $D$ refers to the matrix of pairwise dissimilarities of data points and $[\cdot]_i$ is component $i$ of the vector. It has been shown in [11] that relation (10) even holds for every vector space equipped with a bilinear form if the targets fulfill (9), whereby the coefficients $\alpha_{kn}$ must sum to 1 but can be negative. This observation has been used in [11] to derive a relational variant of SOM. We show that the same principle allows us to generalize GTM to relational data described by a dissimilarity matrix $D$.
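As a quick illustration, formula (10) translates directly into vectorized NumPy. The helper below and its sanity check against an explicit Euclidean embedding are a sketch on assumed toy data, not part of the original algorithmic description.

import numpy as np

def relational_distances(D, alpha):
    """D: (N, N) matrix of squared dissimilarities; alpha: (K, N) coefficient
    rows, each summing to one. Returns the (K, N) matrix of ||x_n - t_k||^2."""
    first = alpha @ D                                         # row k holds [D alpha_k]_n
    second = 0.5 * np.einsum('kn,nm,km->k', alpha, D, alpha)  # alpha_k^T D alpha_k / 2
    return first - second[:, None]

# sanity check in a known Euclidean embedding
X = np.random.default_rng(1).normal(size=(50, 4))
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
alpha = np.random.default_rng(2).dirichlet(np.ones(50), size=3)  # rows sum to one
direct = (((alpha @ X)[:, None, :] - X[None, :, :]) ** 2).sum(-1)
assert np.allclose(relational_distances(D, alpha), direct)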

Thus, we assume a dissimilarity matrix $D$ is given. We restrict prototype vectors $t_k$ to linear combinations of data points as in (9). That means, the relation

$T = \alpha \cdot X$   (11)

holds, where $T$ denotes the target vectors, $\alpha$ denotes the matrix of their implicit coefficient-based representation in terms of $\alpha_k$, and $X$ is the matrix of observed data vectors. Note that the coefficients are not restricted to nonnegative values, since target vectors can lie outside the convex hull of the data points, i.e. $\alpha_{kn} \in \mathbb{R}$. Depending on the smoothness of the mapping of the latent space to the data space, this fact seems reasonable to arrive at a topology representing map. We can represent these prototypes indirectly in terms of coefficients $\alpha_k$ without any reference to an explicit vectorial representation.

As before, the targets $t_k$ induce a distribution in the data space given by a mixture of Gaussians centered around these points (3). The targets $t_k$ are restricted to images of data points on a regular lattice in a low-dimensional latent space, i.e. they are obtained via a generalized linear regression model of points $w$ in latent space. Since the embedding space of $t_k$ is not known, we directly treat the mapping of latent points to prototype points as a mapping of the latent space to the coefficients which represent the targets:

$y : w_k \mapsto \alpha_k = \Phi(w_k) \cdot W$   (12)

where, now, $W \in \mathbb{R}^{d \times N}$. This corresponds to a generalized linear regression of the latent space into the (unknown) surrounding vector space due to the linear dependency of the targets and coefficients (11). As before, $\Phi$ refers to base functions such as equally spaced Gaussians with variance $\sigma^{-1}$ in the latent space. In the $\alpha$-space of linear combinations of data points, the data points $x_i$ themselves are represented by unit vectors $(0,\ldots,0,1,0,\ldots,0)$; in consequence, the data matrix $X$ is now the identity matrix $I$.

To apply (10), we impose the restriction

$\sum_n [\Phi(w_k) \cdot W]_n = 1$   (13)

This way, the likelihood function (5) can be computed based on (3), where the distance computation can be performed indirectly using (10).

As for GTM, we can use an EM optimization scheme to arrive at solutions for the parameters $\beta$ and $W$, where, again, the mode $w_k$ responsible for data point $x_n$ serves as hidden parameter. An EM algorithm in turn computes the responsibilities (6) using the alternative formula for the distances (10), and it optimizes the expectation

$\sum_{k,n} R_{kn}(W_{\mathrm{old}},\beta_{\mathrm{old}})\, \ln p(x_n \mid w_k,W_{\mathrm{new}},\beta_{\mathrm{new}})$   (14)

with respect to $W$ and $\beta$ under the constraint (13). This latter problem reads as

$\max\ l(W) := \ln \prod_{n=1}^{N} \left( \sum_{k=1}^{K} p(w_k)\, p(x_n \mid w_k,W,\beta) \right)$   (15)

subject to

$g_k(W) := \sum_n [\Phi(w_k) \cdot W]_n - 1 = 0$   (16)

The resulting Lagrangian function is

$L(W,\mu) := l(W) + \sum_k \mu_k\, g_k(W)$   (17)

The optimum of this function can be derived by solving $\nabla_{W,\mu} L = 0$. This leads to the equation

$\nabla_W L = 0 \;\Leftrightarrow\; \beta\, \Phi^T G \Phi W = \beta\, \Phi^T R I - \Phi^T \mu \mathbf{1}_N^T$   (18)

and

$\nabla_\mu L = 0 \;\Leftrightarrow\; \Phi W \mathbf{1}_N = \mathbf{1}_K$   (19)

By left-multiplying (19) with $\beta\, \Phi^T G$ we get

$\beta\, \Phi^T G \Phi W \mathbf{1}_N = \beta\, \Phi^T G \mathbf{1}_K$

A substitution with (18) leads to

$(\beta\, \Phi^T R I - \Phi^T \mu \mathbf{1}_N^T)\, \mathbf{1}_N = \beta\, \Phi^T G \mathbf{1}_K$   (20)

$\beta\, \Phi^T (R I \mathbf{1}_N - G \mathbf{1}_K) = \Phi^T \mu N$   (21)

In Eqs. (18)-(21), $I$ refers to the representation of the data points in the space of linear combinations; it is the identity matrix. In consequence, the left side vanishes. Because of the linear independence of the base functions $\Phi$, $\mu$ must be a zero vector. Thus, it follows that the standard solution of this cost function without constraints automatically fulfills the given constraints. Hence the model parameters can be determined in analogy to (7), (8) where, now, the functions $\Phi$ map from the latent space to the space of coefficients $\alpha$, i.e. we solve

$\Phi^T G_{\mathrm{old}} \Phi\, W_{\mathrm{new}} = \Phi^T R_{\mathrm{old}} I$   (22)

for the weights and we calculate

$\frac{1}{\beta_{\mathrm{new}}} = \frac{1}{ND} \sum_{k,n} R_{kn}(W_{\mathrm{old}},\beta_{\mathrm{old}})\, \|\Phi(w_k)\, W_{\mathrm{new}} - x_n\|^2$   (23)

for the variance, where we use (10) to compute the dissimilarity. Here, $D$ denotes the intrinsic dimensionality of the space of coefficients. This is upper bounded by the number of data points but in general smaller. It has to be estimated based on the given data set. In practice, setting $D$ to a larger value or even the upper bound $N$ hardly affects the result of the method. We refer to this iterative update scheme as relational GTM (RGTM). The pseudo-code of the full algorithm is shown as Algorithm 1.

Algorithm 1. Relational GTM.

input: symmetric dissimilarity matrix $D \in \mathbb{R}^{N \times N}$
begin
  generate the grid of latent points $\{w_k\}$, $k = 1,\ldots,K$;
  prepare the generalized linear regression model;
  init $W$ using MDS;
  init $\beta$;
  compute $\alpha_k = [\Phi]_k W$;
  compute Dist where $\|x_n - t_k\|^2 = [D\alpha_k]_n - \tfrac{1}{2}\, \alpha_k^T D \alpha_k$;
  for $i = 1{:}\mathrm{epochs}$ do
    compute $R$ from (6) using Dist and $\beta$;
    compute $G$ where $G_{kk} = \sum_n R_{kn}$;
    compute $W = (\Phi^T G \Phi)^{-1} \Phi^T R$ where $(\cdot)^{-1}$ denotes the pseudo-inverse;
    compute $\alpha_k = [\Phi]_k W$;
    compute Dist where $\|x_n - t_k\|^2 = [D\alpha_k]_n - \tfrac{1}{2}\, \alpha_k^T D \alpha_k$;
    compute $\beta = ND\, \left(\sum_{k,n} R_{kn}\, \mathrm{Dist}_{kn}\right)^{-1}$;
  end
  return $\Phi$, $W$ and $\beta$;
end
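The pseudo-code translates almost line by line into NumPy. The following compact sketch is an illustrative reimplementation under simplifying assumptions (a random initialization with coefficient rows summing to one instead of the MDS initialization of Section 3.1, a fixed intrinsic dimensionality, and a bias column appended to the basis matrix so that constraint (13) is representable); it is not the authors' reference code.

import numpy as np

def rgtm(D, grid=10, basis=2, sigma2=1.0, epochs=30, dim=None, seed=0):
    """Relational GTM on a symmetric dissimilarity matrix D with zero diagonal."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    dim = N if dim is None else dim       # intrinsic dimensionality, bounded by N
    lin = lambda g: np.array([(a, b) for a in np.linspace(-1, 1, g)
                                      for b in np.linspace(-1, 1, g)])
    latent, centres = lin(grid), lin(basis)               # K x 2 and M x 2
    Phi = np.exp(-((latent[:, None] - centres[None]) ** 2).sum(-1) / (2 * sigma2))
    Phi = np.hstack([Phi, np.ones((len(latent), 1))])     # bias column
    # init: coefficient rows sum to one exactly (noise rows sum to zero)
    W = rng.normal(scale=1e-2, size=(Phi.shape[1], N))
    W[:-1] -= W[:-1].mean(axis=1, keepdims=True)
    W[-1] = 1.0 / N
    beta = 1.0
    for _ in range(epochs):
        alpha = Phi @ W                                   # Eq. (12)
        dist = alpha @ D - 0.5 * np.einsum('kn,nm,km->k', alpha, D, alpha)[:, None]
        logp = -0.5 * beta * dist                         # E-step, Eq. (6)
        logp -= logp.max(axis=0)
        R = np.exp(logp); R /= R.sum(axis=0)
        G = np.diag(R.sum(axis=1))
        W = np.linalg.pinv(Phi.T @ G @ Phi) @ (Phi.T @ R) # Eq. (22), data matrix is I
        alpha = Phi @ W
        dist = alpha @ D - 0.5 * np.einsum('kn,nm,km->k', alpha, D, alpha)[:, None]
        beta = N * dim / (R * dist).sum()                 # Eq. (23); cf. Section 4.2 for
                                                          # possible issues on non-Euclidean data
    return Phi, W, beta

Posterior labeling then assigns each data point to the latent point with the smallest relational distance (equivalently, the highest responsibility), e.g. winners = dist.argmin(axis=0).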

3.1. Initialization

Original GTM is initialized based on a principal component analysis (PCA) to avoid convergence to local optima. The risk of convergence to local optima would be large, otherwise, due to the rapid convergence of the EM scheme. PCA is applied to the matrix of vectorial data points and yields eigenvalues $e$ and eigenvectors $A$ of the covariance matrix, which correspond to the first two principal components of the data. The Euclidean GTM is initialized by solving

$\Phi W = V A^T$   (24)

where the left hand side denotes the nonlinear mapping of the latent points to the data space and the right hand side denotes the linear projection of the latent points $V = (w_i)_i$ to the two primary components $A$. To obtain an appropriate scaling of the grid, the eigenvectors $[A]_i$ are multiplied with the square roots of their eigenvalues $e_i$, and the latent points $V$ are normalized to have zero mean and standard deviation one. Afterwards, the mean of the data, which is removed by the PCA, is added to the linear component of $W$.

A similar principle can be applied to RGTM: in the case of RGTM, we can obtain the first two principal components of the data points, which are given only indirectly, by using multidimensional scaling (MDS). MDS is applied to the dissimilarity matrix and yields a matrix $A$ whose $N$ rows are two-dimensional representations of the $N$ data points. The columns of $A$ denote the two principal components of the data, given as linear combinations of the data points. Based on this observation, the initialization of RGTM is done via (24), where, now, the left hand side denotes the nonlinear mapping of the latent points to the space of the coefficients, i.e. affine combinations, and the right hand side denotes the linear projection of the latent points $V$ to the two primary components of the data in the affine space. To obtain the same scaling as in the vectorial case, the latent points $V$ are normalized as above, and the columns of the matrix $A$ are multiplied by their standard deviation and divided by the corresponding eigenvalues of $AA^T$.

The data points in the space of affine combinations lie on a hyperplane which has distance $\sqrt{1/N}$ from the origin. Since MDS removes the mean of the data, the resulting mapped manifold contains the origin. It should be shifted to match the data. This can be achieved by adding $1/N$ to the linear component of $W$.
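The classical MDS step just described can be sketched via double centering of the dissimilarity matrix; this is a minimal illustration, with the fit of (24) indicated in a comment since the exact rescaling follows the description above.

import numpy as np

def mds_components(D):
    """Two-dimensional classical MDS coordinates of a symmetric matrix D
    of squared dissimilarities; returns the N x 2 matrix A."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    B = -0.5 * J @ D @ J                         # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)             # ascending eigenvalues
    idx = np.argsort(evals)[::-1][:2]            # two leading components
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0))

# Given normalized latent points V (K x 2) and the basis matrix Phi, initial
# weights could then be fitted to Eq. (24) by least squares, e.g.
# W0 = np.linalg.lstsq(Phi, V @ mds_components(D).T + 1.0 / N, rcond=None)[0],
# where adding 1/N to the target is a simplification of the shift onto the
# data hyperplane described in the text.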

4. Convergence of RGTM

4.1. Euclidean case

Fig. 1. Comparison of (a) GTM and (b) RGTM on a Euclidean toy data set. The grid was plotted in the original data space. Obviously, for Euclidean data, the results are identical.

RGTM has been derived under the assumption that a vector space exists with data points $x_i$ such that the dissimilarities can be expressed as $d_{ij} = \|x_i - x_j\|^2$. Under this assumption, instead of performing GTM in the unknown vector space, RGTM optimizes the data log-likelihood implicitly in the space of coefficient vectors $\alpha_i$ which induce prototypes $t$ in the vector space by a linear combination $T = \alpha \cdot X$, $T$ being the matrix of prototype vectors in the (unknown) data space. The procedure of RGTM is equivalent to original GTM in the data space due to the following reasons:

- The constraint $\sum_n \alpha_{kn} = 1$ is automatically fulfilled for solutions of GTM. Therefore, because of the equality of the distances (10), there is a one-one correspondence of target vectors found by GTM and coefficient vectors found by RGTM.
- The solution found by RGTM depends on the model for $y$. We can choose it as a generalized linear regression model for GTM and for RGTM. Since targets and coefficient vectors are linearly dependent, these two choices correspond to each other.
- PCA initialization of the weights $W$ based on the data points $X$ corresponds to an MDS initialization of the weights according to the dissimilarity matrix $D$.
- The variance $\beta^{-1}$ is adapted using the dimensionality of the data. If the intrinsic dimensionality is known, then the variance is computed in the same way for RGTM.

Therefore, if a Euclidean embedding of the data exists, convergence of RGTM is guaranteed, and the procedure implicitly optimizes the data log-likelihood of the underlying (unknown) data space. GTM and RGTM yield the same results. We demonstrate this fact in a simple example involving a two-dimensional mixture of Gaussians (see Fig. 1). The results of GTM and RGTM are exactly identical, as can be seen in Fig. 1.

4.2. Pseudo-Euclidean case

In general, a Euclidean embedding of an arbitrary dissimilarity matrix $D$ need not exist. We assume that $D$ has zero diagonal $d_{ii} = 0$ and symmetric entries $d_{ij} = d_{ji}$. In this case, a so-called pseudo-Euclidean embedding in a vector space can be found, as explained e.g. in [25]. That means, one can find a vector space with a bilinear form. This form need not be positive definite, but there can exist negative eigenvalues; more precisely, $p$ components are positive and $q$ are negative. In formulas, the bilinear form is given as

$\langle x, y \rangle_{p,q} = \sum_{i=1}^{p} x_i y_i - \sum_{i=p+1}^{p+q} x_i y_i$   (25)

In this setting, we can find vectors $x_i$ in pseudo-Euclidean space with the property $d_{ij} = \langle x_i - x_j, x_i - x_j \rangle_{p,q}$. The bilinear form need not correspond to a positive semidefinite form, the number of negative contributions $q$ in (25) referring to the necessary corrections of the data to achieve Euclideanity. Data are Euclidean iff $q = 0$.
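Whether a given dissimilarity matrix is Euclidean, and what the signature $(p,q)$ of its pseudo-Euclidean embedding is, can be read off the spectrum of the double-centered Gram matrix. The following is a small illustrative check, not part of the RGTM algorithm itself.

import numpy as np

def pseudo_euclidean_signature(D, tol=1e-9):
    """Signature (p, q) of the pseudo-Euclidean embedding of a symmetric
    dissimilarity matrix D with zero diagonal; q == 0 iff D is Euclidean."""
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    B = -0.5 * J @ D @ J                     # Gram matrix of an embedding
    evals = np.linalg.eigvalsh(B)
    return int((evals > tol).sum()), int((evals < -tol).sum())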

In this general setting, RGTM can in principle be applied as beforehand, since all operations of RGTM are defined. In fact, all operations of GTM being vector space operations (and thus being well defined also in pseudo-Euclidean space), RGTM corresponds to an application of the GTM algorithm in the pseudo-Euclidean vector space also in this general setting; however, without the guarantee that the data log-likelihood is optimized by this procedure. In fact, a few things can happen:

- The distance as computed by (10) can become negative due to the negative eigenvalues of the bilinear form. Then the corresponding probability (3) does not constitute a valid probability. Experimental observations indicate that this situation happens in practical experiments, but it does not seem to harm the result. The mathematical counterpart of this operation, however, is not clear, one major problem being that an appropriate mathematical definition of probability measures in pseudo-Euclidean space does not yet exist [25]. The problem of non-Euclideanity and, in consequence, of no well-defined probability could be cured by a transformation of the negative parts of pseudo-Euclidean space, operations such as flipping negative eigenvalues, clipping negative eigenvalues, or performing a spread transformation having been proposed in the literature [26,5]. However, important information can be lost this way, and results which incorporate the full information can be better, as demonstrated in [22,11].
- $\beta$ as computed in (8) can become negative. In this case, numerical problems occur, apart from the fact that a negative variance no longer corresponds to a valid probability measure. Although this is a theoretical possibility, we never observed this behavior in practice.
- The algorithm can diverge, since it does not optimize the log-likelihood by an EM scheme. Like for its deterministic counterpart, the relational SOM, this setting can be observed in theoretical model situations [11]. It is due to the fact that the weight matrix as computed by (7) can correspond to a saddle point of the maximization step. However, as also reported for its deterministic counterpart [11], we never observed this behavior for any real-life data set, due to the fact that the positive parts of the pseudo-Euclidean space usually outweigh contributions due to the negative eigenvalues.

Thus, it seems possible to safely use RGTM also for general dissimilarity data sets, albeit a clear mathematical foundation in terms of a likelihood optimization is so far not available in this case.

4.3. Complexity

The memory complexity of the original GTM algorithm is $O(KN)$, where $K$ is the number of latent points. This is due to the storage of the pairwise distances of prototypes and data points, as well as the corresponding responsibilities. The computational complexity of PCA, which is used for the initialization, is $O(N)$. The most demanding task in training is the computation of the distances, which is $O(KN)$. The matrix inversion necessary for computing $W$ is cubic in the number of basis functions, which is cheap for a small number of basis functions. Thus, the overall complexity of GTM is linear in the number of data points.

RGTM requires more memory than GTM. In addition to the distances and responsibilities, the larger parameter matrix $W$ and the coefficients $\alpha$ have to be stored. These matrices are $O(MN)$ and $O(KN)$, respectively, where $M$ denotes the number of basis functions. Still, the memory complexity stays linear. In RGTM, instead of PCA, MDS is used for initialization. Its computational complexity is $O(N^3)$, but since we project into 2D, only the first two components are needed, which can be computed in $O(N^2)$. Since the distances are calculated using the dissimilarity matrix and coefficients, the complexity becomes $O(KN^2)$. Hence, the overall computational complexity of RGTM per epoch is dominated by $O(N^2)$ for large data sets. That means, while RGTM extends the applicability of GTM to settings where pairwise dissimilarities rather than vectors are given, it pays the price of an increased quadratic instead of linear complexity. This is still more efficient than an explicit embedding of the dissimilarity data into a vector space, which would be $O(N^3)$.

In addition, this effort is comparable to the effort of current state-of-the-art visualization techniques for dissimilarity data. The computational complexity of t-distributed stochastic neighbor embedding (t-SNE), as one of the most popular visualization tools [23], is $O(N^2)$ per epoch, being a gradient method for a cost function involving $O(N^2)$ terms. The memory requirement is dominated by the embedding coefficients, i.e. $O(N)$. Thus, assuming $K$ is constant, the requirements of RGTM and t-SNE match.

Note that RGTM can be accelerated to linear time using the Nyström approximation, which is a popular technique to speed up kernel methods. An investigation of the Nyström method for dissimilarities in the context of RGTM can be found in [18].

5. Experiments

5.1. Evaluation on benchmark data sets

First, we test RGTM on several benchmark dissimilarity data sets as introduced in [5,9]:

- Cat cortex data: The cat cortex data originates from anatomic studies of cats' brains. The dissimilarity matrix displays the connection strength between 65 cortical areas [7]. For our purposes, a preprocessed version as presented in [8] was used. The matrix is symmetric with zero diagonal, but the triangle inequality does not hold. The data is labeled with four classes.
- Protein data: The protein data set, as described in [27], consists of 226 globin proteins which are compared based on their evolutionary distance. The samples originate from different protein families: hemoglobin-α, hemoglobin-β, myoglobin, etc. Here, we distinguish five classes as proposed in [8]: HA, HB, MY, GG/GP, and others. Unlike the other data sets considered here, the protein data set has a highly unbalanced class structure, with class distribution HA (31.86%), HB (31.86%), MY (17.26%), GG/GP (13.27%), and others (5.75%).
- Aural sonar data: The aural sonar data set, as described in [5], consists of 100 returns from a broadband active sonar system, which are labeled in two classes, target-of-interest versus clutter. The dissimilarity is scored by two independent human subjects, each resulting in a dissimilarity score in $\{0, 0.1, \ldots, 1\}$.
- Patrol data: The patrol data set describes 241 members of seven patrol units and one class corresponding to people not in any unit, i.e. eight classes. Dissimilarities are computed based on every person in the patrol units naming five other persons in their unit, whereby the responses were partially inaccurate. Every mentioning yields an entry of the dissimilarity matrix, see [5]. Data are sparse in the sense that most entries of the matrix correspond to the maximum dissimilarity, which we set to 3.
- Voting data: The voting data set describes a two-class classification problem incorporating 435 samples which are given by 16 categorical features with three different possible values each. The dissimilarity is determined based on the value difference metric, see [5].

If necessary, the data sets were linearly transformed from similarities to dissimilarities prior to training.

Table 1. Mean classification accuracy on the data sets obtained by a repeated cross-validation; the standard deviation is given in parentheses.

              RNG            DA             RGTM
Cat cortex    0.698 (0.076)  0.803 (0.083)  0.765 (0.063)
Proteins      0.919 (0.016)  0.907 (0.008)  0.936 (0.004)
Aural sonar   0.834 (0.014)  0.856 (0.026)  0.837 (0.026)
Patrol        0.665 (0.024)  0.521 (0.051)  0.666 (0.046)
Voting        0.950 (0.004)  0.951 (0.005)  0.938 (0.006)

Fig. 2. The cat cortex and proteins benchmark data sets visualized by (a,d) RGTM, (b,e) t-SNE and (c,f) RSOM.

Also, in case the dissimilarities were not symmetric, we symmetrized the dissimilarity matrix $D$ by setting $\tilde{D} = (D + D^T)/2$. Diagonal values were set to zero, ignoring any self-dissimilarities: $\tilde{d}_{ii} = 0$ for all $i \in \{1, \ldots, N\}$.
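As a direct transcription of this preprocessing step (a trivial helper, included only for completeness):

import numpy as np

def preprocess(D):
    """Symmetrize a dissimilarity matrix and zero its diagonal."""
    D = 0.5 * (D + D.T)          # D~ = (D + D^T) / 2
    np.fill_diagonal(D, 0.0)     # d~_ii = 0 for all i
    return D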

For one data point, we refer to the nearest prototype in data space as the winner. In case of RGTM, the winner for a certain point is therefore the latent point with the highest responsibility with respect to the data point. Since all benchmark data sets are labeled, it is possible to evaluate the clustering result by the classification accuracy obtained by posterior labeling. For RGTM, one can choose different labeling strategies depending on the application context: we can use standard labeling given by a majority vote, as usually done for crisp approaches such as self-organizing maps or neural gas. As an alternative, we can rely on the averaged responsibilities $R_{kn}$ of prototypes for the data and label prototypes according to accumulated responsibilities. Here, we used majority voting to ensure comparability to neural gas: a latent point is assigned the label which the majority of data points in its receptive field carries, these being the data points for which it is the winner.

We report the results of a repeated cross-validation with 10 repeats, where we used two folds for the cat cortex data and aural sonar data and 10 folds for the other data sets to maintain comparability with the results from [9]. For the cross-validation, out-of-sample extensions of the assignments can be computed in the same way as for relational neural gas, see [9]. To classify out-of-sample data, we assigned the class label of the closest prototype in data space that does carry a class label. The classification accuracy obtained on the respective test set is listed in Table 1. For comparison, we report the classification accuracy of deterministic annealing (DA) and relational neural gas (RNG) as presented in [9]. In the RGTM, we used 900 latent points (a 30-by-30 regular grid) and four Gaussian base functions (a 2-by-2 grid) for all data sets. The number of base functions determines the degrees of freedom of the manifold in data space to which the latent points are mapped. The number of base functions was generally chosen as small as possible to preserve the topology of the data. The target manifold therefore has few degrees of freedom, so a certain amount of unlabeled prototypes is to be expected, since they are mapped to locations with no data in their receptive fields. This fact justifies the use of more prototypes in RGTM in comparison with the experiments performed with RNG in [9], the latter not being restricted by topological constraints. The variance of the base functions, $\sigma^{-1}$, has been chosen such that it fits the distance between neighboring base function centers. This parameter setting was chosen with regard to all data sets.

The classification accuracy on the test set and the corresponding standard deviation are reported in Table 1. Obviously, the results of RGTM are comparable to these two alternatives and are even better for two of the five classification tasks. Hence, RGTM offers a feasible alternative to DA and RNG, where RGTM provides additional functionality such as topographic mapping and visualization due to an explicit modeling by means of a low-dimensional latent space.

Fig. 3. The aural sonar and patrol benchmark data sets visualized by (a,d) RGTM, (b,e) t-SNE and (c,f) RSOM.

5.1.1. Visualization experiments

In Figs. 2, 3, and 4, we show mappings of the introduced benchmark data sets obtained by RGTM, in comparison with the respective visualizations by t-distributed stochastic neighbor embedding (t-SNE), as one of the currently best nonlinear data visualization techniques, and the relational self-organizing map (RSOM), as an alternative model which relies on topology preservation, see [23,11,20]. In addition to each map, we display quality measures for the given visualization in the graphs in Figs. 5 and 6.

Several quantitative evaluation measures for data visualization have recently been proposed: these include techniques which rely on neighborhood ranking, such as the trustworthiness and continuity [14,15], which can be put into a very elegant, more general framework by means of the co-ranking matrix as recently proposed in [16]. As an alternative, the contribution [17] proposes an information theoretic point of view, measuring precision and recall of local neighborhoods induced by the low-dimensional visualization as compared to the original data. In our case, the projection of data is induced by a clustering in low-dimensional space such that several data points are represented by the same location in low dimensions. In this case, the evaluation measure as proposed by [16] is not applicable (see also [19]). Hence, we use the information retrieval perspective on visualization, see [17].

The framework yields the mean precision and mean recall for dimensionality reduction scenarios by evaluating errors in the proximity relationships between data points occurring under the spatial transformation.


Fig. 4. The voting benchmark data set visualized by (a) RGTM, (b) t-SNE and (c) RSOM.

Fig. 5. Mean precision and mean recall for the RGTM, RSOM, and t-SNE mappings of three benchmark data sets. (a) Cat cortex data set, (b) proteins data set, (c) aural sonar data set.

Fig. 6. Mean precision and mean recall for the RGTM, RSOM, and t-SNE mappings of two benchmark data sets. (a) Patrol data set, (b) voting data set.

Table 2. Overview of the parameter settings used in the visualization experiments on the benchmark data sets.

Method  Parameter                    Setting
RGTM    Number of latent points      900 (30-by-30 grid)
RGTM    Number of base functions     4 (2-by-2 grid)
RGTM    Number of training epochs    30
RSOM    Number of neurons            900 (30-by-30 grid)
RSOM    Number of training epochs    500
RSOM    Initial neighborhood range   N/2
t-SNE   Initial dimensionality       ⌊N/4⌋
t-SNE   Perplexity                   30

Table 3. The topographic products for the RSOM and RGTM grids produced in the visualization experiments on the benchmark data sets, see Figs. 2-4.

             RSOM      RGTM
Cat cortex    0.03285   0.00008
Proteins     -0.03613   0.00019
Aural sonar  -0.02585   0.00034
Patrol       -0.14677   0.00009
Voting       -0.00197  -0.00426


For every data point, a neighborhood is defined in the original data space and, correspondingly, in the visualization space. Following the information retrieval perspective, the former represents the truthful information, and the latter is viewed as the retrieval result. Their consistency can be compared in terms of true positives, false positives, and misses, which yields the basis for precision and recall. The neighborhood of a data point is the set of all other points within a fixed radius from the respective data point, where the radius can be defined by any consistent notion of proximity. One possibility is to use a rank-based distance for the radius, i.e. define the neighborhood set as the $k$ nearest neighbors, breaking ties deterministically. Therefore, the mean precision and mean recall are calculated for every possible neighborhood radius $k \in \{1, \ldots, N-1\}$, where the respective mean is taken over the $k$-ary neighborhoods of every data point.
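A simplified sketch of these curves is given below. It fixes the ground-truth neighborhood to the $k$ nearest neighbors in the original dissimilarities and lets the retrieved set include ties in the visualization (relevant when several points share one grid position); the exact definitions of [17,19] are more careful, so this is an approximation for illustration only.

import numpy as np

def mean_precision_recall(D_orig, D_vis, k):
    """Mean precision and recall at neighborhood size k. Both arguments are
    (N, N) dissimilarity matrices (original data and visualization)."""
    N = D_orig.shape[0]
    prec = rec = 0.0
    for i in range(N):
        truth = set(np.argsort(D_orig[i])[1:k + 1])   # k nearest, skipping i itself
        d = D_vis[i].copy()
        d[i] = np.inf
        kth = np.partition(d, k - 1)[k - 1]           # distance of the k-th neighbor
        found = set(np.flatnonzero(d <= kth))         # retrieved set, including ties
        tp = len(truth & found)
        prec += tp / len(found)
        rec += tp / len(truth)
    return prec / N, rec / N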

In our case, we obtain a visualization of the data by the prescription $x \mapsto w_k$ such that $\|x - y(w_k,W)\|^2$ is minimal (computed implicitly as given by (10)). Hence, data points are displayed at the position of their winning prototype point in the grid. Note that, this way, all points in the receptive field of prototype $w_k$ are displayed at the same position. As we will see later, it is possible to increase the granularity of the RGTM grid without retraining, such that a finer resolution of the grid structure would lead to a finer resolution of the representation of data. To evaluate the precision and recall, we use the definitions in [17], which are applicable to this setting as discussed in [19]. The graphs in Figs. 5 and 6 show the measurements for every number $k$ of nearest neighbors, ranging from 1 to $N-1$. For both the mean precision and the mean recall, a value of 1 represents the highest and 0 the lowest quality, while the combination of high precision and high recall is the desired situation, since it marks the highest preservation of spatial relationships. For further evaluation, we report the topographic product, see [2], for each RGTM and RSOM visualization in Table 3. The topographic product constitutes an efficient measurement which approximately measures the degree of neighborhood preservation as given by the grid structure. Here, a value of 0 refers to a perfect preservation of the map topology. To calculate the topographic product properly, only absolute values of distances between prototype positions in data space were considered, since these distances can become negative due to the non-Euclidean characteristics of the data space. Also, if the values were smaller than $10^{-7}$, we reset them to this value to avoid numerical instabilities.

The parameter settings used in the experiments are listed in Table 2. As before, we used the majority vote for posterior labeling, and the variance of the base functions, $\sigma^{-1}$, was set such that it fits the distance between neighboring base function centers.


Fig. 7. RGTM (left) and RSOM (right) visualization of classical (First Viennese School) and baroque sonatas by Beethoven (102), Haydn (172), Mozart (147), Bach (92), and Scarlatti (555). The prototypes in the grid are marked using posterior labeling by the majority vote principle. The RGTM grid shows a noticeable separation of the musical pieces by composer, where mostly the comprehensive work of Bach marks a blend between the classical and baroque era. The arrangement seems meaningful since Bach's work is considered influential for both musical eras. Also the distinct style of Scarlatti is represented. In the grid on the right, generated with RSOM, the separation of the composers is less distinct. (a) RGTM, (b) legend, (c) RSOM.


For the RSOM, the initial neighborhood range $r_0$ was set to one-half of the number of data points, as stated in Table 2. The neighborhood range defines how much the update process of one neuron influences the neighboring neurons in the RSOM grid; for details, see [11]. It is annealed exponentially to 0.01 during training, by calculating the range for the current epoch $e_c$ as $r_c = r_0 \cdot (0.01/r_0)^{e_c/e}$, where $e$ refers to the total number of epochs. As it is common when applying t-SNE, the dimensionality of the data is first reduced by PCA before the t-SNE mapping is calculated. This initial projection dimensionality was set to one-fourth of the number of data points, as stated in Table 2.
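The annealing schedule is a one-liner; the following direct transcription is included only to make the formula concrete.

def neighborhood_range(r0, epoch, total_epochs):
    """Exponential annealing r_c = r_0 * (0.01 / r_0)^(e_c / e)."""
    return r0 * (0.01 / r0) ** (epoch / total_epochs)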

As can be seen from the visualization and the evaluation, RGTM displays clear class structures and clear separations of the clusters in the form of unlabeled units, where appropriate. While RSOM tries to optimize the quantization error in the limit, this way spreading all prototypes among the data even at the cost of topological deformations or neighborhood deformations, RGTM preserves the topology quite well, resulting in visualizations which are much closer to the corresponding t-SNE visualization as compared to RSOM. This fact is mirrored in the values of the topographic product as shown in Table 3, for which RGTM yields much smaller values for all but one example. Unlike t-SNE, however, RGTM provides additional functionality such as grouping and an explicit lattice structure.

Interestingly, the quality as evaluated by precision and recall is much less clear. Here, RGTM leads to worse values for a very small neighborhood range $k$ as compared to t-SNE and RSOM in almost all cases. This can be attributed to two facts: on the one hand, it clusters points such that, due to identical positions for several data points, the precision is low for small neighborhood sizes. In addition, RGTM is quite constrained in its local projections provided a small number of base functions is chosen, such that almost locally linear projections of the data manifold are observed. While this ensures an excellent global image and topology preservation as measured by the topographic product, local nonlinearities cannot be precisely captured. Due to the clustering, which prevents very good values at small ranges $k$, the interesting region for medium-sized $k$, starting from about $k = 10$, displays a different behavior. RGTM is at least competitive to t-SNE and RSOM, being even clearly superior to the latter in some cases and approaching the quality of t-SNE in these cases.

5.1.2. Visualization of MIDI files

To further demonstrate the visualization features of RGTM, we show the topographic mapping for a dissimilarity data set of classical music, similar to [28]. It is composed of pairwise dissimilarities between 1068 sonatas from the classical period of the First Viennese School (by Beethoven, Mozart and Haydn) and the baroque era (by Scarlatti and Bach). The musical pieces were given in the MIDI file format, taken from the online MIDI collection Kunst der Fuge (http://www.kunstderfuge.com). Their mutual dissimilarities were measured with the normalized compression distance (NCD), see [6], using a specific preprocessing which provides meaningful invariances for music information retrieval, such as invariance to pitch translation (transposition) and time scaling. This method uses a graph-based representation of the musical pieces to construct reasonable strings as input for the NCD, see [28]. Since there is no ground truth for this kind of musical dissimilarity measure, no direct quantitative interpretation and evaluation of the results is possible. Still, the visualization features of RGTM, which result in clearer class structures of the pieces according to the composers, can be demonstrated in comparison to the existing RSOM, as shown in Fig. 7. Here, RGTM and RSOM were again trained with the parameters listed in Table 2.

On this data set, we additionally demonstrate different labeling strategies which can be used for RGTM after the training. Labeling by majority vote, as previously applied, means that a prototype in the grid is assigned the label of the majority of data points in its receptive field.


Fig. 9. The trained RGTM on the cat cortex data reused for new mappings in different grid sizes. (a) 30-by-30 grid, (b) 10-by-10 grid (original), (c) 3-by-3 grid.

This is displayed in Fig. 7 for both the RGTM and the RSOM. Alternatively, one can label the RGTM prototypes as follows: a latent point in the grid is assigned the class label which is carried by the data points that have the highest accumulated responsibility for this latent point. Note that every latent point will have at least some small responsibility with respect to any data point. Therefore, all prototypes in the map would get assigned a class label, and no unlabeled points (dead units) would appear. To control this behavior for visualization tasks, it is useful to set a threshold for the value of responsibility below which the latent point remains without any label, i.e. a class label is only assigned to a latent point if the responsibility of at least one data point for this latent point exceeds the threshold. Thus, adjusting this threshold controls how many dead units will appear in the map eventually. The resulting maps are shown in Fig. 8 for two different threshold values. These visualizations emphasize more the overall class distribution, as opposed to the map with majority vote labeling in Fig. 7, where local spatial and structural relationships are more accurately represented.

5.1.3. Parameters for data visualization

The parameter $\sigma$, which determines the variance of the base functions, has only a slight effect on the algorithm if it stays in a reasonable interval. Throughout the experiments, it was set to fit the distance between neighboring base function centers, and the number of base functions was chosen as small as possible to preserve the topology of the data. Changing the number of latent points generally changes only the sampling of the data, but qualitatively the shape of the map stays the same. So with a smaller number, the algorithm is faster and the sparsity of the representation is increased; with a larger number, the algorithm is slower, but more details in the data relations can be discovered.

As already mentioned before, data points are projected to the grid position of their respective winning prototype in latent space. Interestingly, it is possible to use different grid resolutions for data display based on a trained map (trained using only one fixed resolution): RGTM yields an explicit mapping of the latent space to the data space by means of the function (1). This can directly be used to create an image of any grid in latent space.

Fig. 8. Two visualizations of the sonatas data set using the posterior labeling by responsibilities with different threshold values. On the left, where the threshold is higher (10^-3), there are more unlabeled prototypes than in the map on the right, where the threshold is set lower (10^-8). For comparison, see Fig. 7 (left), where the posterior labeling was done by majority vote. (a) RGTM, responsibility threshold 10^-3, (b) legend, (c) RGTM, responsibility threshold 10^-8.


Fig. 10. Visual comparison of the mapping of the cat cortex data, obtained by t-SNE and RGTM. Here, the RGTM was trained using a fine grid resolution, and only winner prototypes were labeled after training. Therefore, the ratio of labeled prototypes to original data points is approximately 1:1.18, so a one-to-one mapping is nearly achieved. (a) Projection of data points to their position on a 30-by-30 grid by RGTM, (b) t-SNE projection.

This behavior is displayed in Fig. 9 on the cat cortex data, trained originally with a 10-by-10 grid. Here, the grid size is changed, and the trained mapping of latent points into the data space is reused to do the posterior labeling for the new grids. The labeling procedure is the same as in Fig. 8: the prototype gets the class label of the data points with the highest accumulated responsibility. Obviously, the overall structure of the map remains the same, but it is possible to focus on a different level of detail on demand, using an appropriate grid resolution.

By setting the grid resolution reasonably large, a very detailed resolution, in the limit a one-to-one mapping of data points to prototypes and corresponding grid positions, can be achieved, i.e. all data points are individually mapped to different latent points. To favor such a mapping in an experiment, the number of latent points in the grid has to be larger than the total number of data points. Of course, depending on the topology, idle prototypes are present in the map to represent empty space. Thus, the grid resolution should be chosen as a multiple of the number of data points. The standard course of action would be to first choose a reasonable grid size and to increase it if the data are not yet represented individually. The latter is possible without retraining, relying on the explicit mapping of the latent space to the data space. Such an almost one-to-one mapping is presented for the cat cortex data in Fig. 10, in comparison with a t-SNE mapping. Here, a latent point is only assigned a label if it is the winner for some data point. This way, RGTM arrives at an almost individual projection of points, rather than prototypes which deliver a compressed display of the data.
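Re-using a trained map at a new resolution amounts to re-evaluating the basis functions on a finer latent grid and recomputing winners. The sketch below illustrates this, assuming the trained quantities (W, the basis-function centres, sigma2, and a basis matrix with an appended bias column) from the RGTM sketch in Section 3; it is an illustration, not the authors' code.

import numpy as np

def remap(D, W, centres, sigma2, grid_size):
    """Winner indices per data point on a new grid_size x grid_size latent grid,
    reusing the trained weight matrix W without any retraining."""
    g = np.linspace(-1, 1, grid_size)
    latent = np.array([(a, b) for a in g for b in g])
    Phi = np.exp(-((latent[:, None] - centres[None]) ** 2).sum(-1) / (2 * sigma2))
    Phi = np.hstack([Phi, np.ones((len(latent), 1))])   # bias column as in training
    alpha = Phi @ W                                     # coefficients on the new grid
    dist = alpha @ D - 0.5 * np.einsum('kn,nm,km->k', alpha, D, alpha)[:, None]
    return dist.argmin(axis=0)                          # winner per data point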

6. Conclusions

In this contribution, the generative topographic mapping has been extended towards data given by a dissimilarity matrix rather than Euclidean vectors. The resulting algorithm, relational GTM, can be used directly on the dissimilarity matrix. It has been demonstrated in the experiments that RGTM provides a reasonable topographic mapping of the data which is competitive to alternatives for clustering of dissimilarity data, such as deterministic annealing or relational neural gas, and to alternatives for projection of dissimilarity data, such as t-SNE and relational SOM.

Note that RGTM leads to a sparse representation of data in terms of a set of latent points in latent space, together with a prescription of how this generates a probability distribution in data space. In particular, due to an explicit mapping of the latent space to the data space, an appropriate resolution of the mapping can be chosen posterior to training.

One drawback of RGTM compared to vectorial approaches is its dependency on the dissimilarity matrix, which is quadratic in the number of points. This causes difficulties if large data sets are dealt with. One can incorporate the classical Nyström technique to approximate the dissimilarity matrix in linear time. This way, a linear time complexity algorithm is achieved, as demonstrated in preliminary experiments in [18]. As an alternative, in [9], approximation schemes have been proposed in the context of relational neural gas which, on the one hand, result in a sparse representation of prototypes and, on the other hand, allow a patch processing of huge dissimilarity matrices for which the computational load would otherwise be too big. This way, the resulting topographic mapping scheme is linear in the number of data points. The transfer of this method to RGTM and further experimental comparisons are the subject of ongoing research.
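
To illustrate the idea behind the Nyström speed-up, the following is a sketch of the classical technique, not of the exact scheme used in [18] (all names are hypothetical): only m landmark columns of the full matrix are evaluated and stored, which is linear in N for fixed m.

```python
import numpy as np

def nystroem(D, m, seed=0):
    """Classical Nystroem approximation of a symmetric (N, N) matrix D
    from m randomly chosen landmark columns: D is approximated by
    C @ W_pinv @ C.T, so only the (N, m) slice C has to be computed."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(D.shape[0], size=m, replace=False)
    C = D[:, idx]                   # (N, m) landmark columns
    W = C[idx, :]                   # (m, m) intersection block
    W_pinv = np.linalg.pinv(W)      # pseudo-inverse handles rank deficiency
    return C, W_pinv                # approximate entry (i, j): C[i] @ W_pinv @ C[j]
```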

References

[1] G.A. Barreto, A.F.R. Araujo, S. Kremer, A taxonomy for spatiotemporal connectionist networks revisited: the unsupervised case, Neural Computation 15 (6) (2003) 1255–1320.

[2] H.-U. Bauer, K. Pawelzik, Quantifying the neighborhood preservation of self-organizing feature maps, IEEE Transactions on Neural Networks 3 (4) (1992) 570–579.

[3] C. Bishop, M. Svensen, C. Williams, The generative topographic mapping, Neural Computation 10 (1) (1998) 215–234.

[4] R. Boulet, B. Jouve, F. Rossi, N. Villa, Batch kernel SOM and related Laplacian methods for social network analysis, Neurocomputing 71 (7–9) (2008) 1257–1273.

[5] Y. Chen, E.K. Garcia, M.R. Gupta, A. Rahimi, L. Cazzanti, Similarity-based classification: concepts and algorithms, Journal of Machine Learning Research 10 (2009) 747–776.

[6] R. Cilibrasi, P.M.B. Vitanyi, Clustering by compression, IEEE Transactions on Information Theory 51 (4) (2005) 1523–1545.

[7] T. Graepel, K. Obermayer, A stochastic self-organizing map for proximity data, Neural Computation 11 (1999) 139–155.

[8] B. Haasdonk, C. Bahlmann, Learning with distance substitution kernels, in: Pattern Recognition—Proceedings of the 26th DAGM Symposium, 2004.

[9] B. Hammer, A. Hasenfuss, Topographic mapping of large dissimilarity datasets, Neural Computation 22 (9) (2010) 2229–2284.

[10] B. Hammer, A. Micheli, A. Sperduti, M. Strickert, Recursive self-organizing network models, Neural Networks 17 (8–9) (2004) 1061–1086.

[11] A. Hasenfuss, B. Hammer, Relational topographic maps, in: M.R. Berthold, J. Shawe-Taylor, N. Lavrac (Eds.), IDA 2007, 2007, pp. 93–105.

[12] R.J. Hathaway, J.C. Bezdek, Nerf c-means: non-Euclidean relational fuzzy clustering, Pattern Recognition 27 (3) (1994) 429–437.

[13] T. Heskes, Energy functions for self-organizing maps, in: E. Oja, S. Kaski (Eds.), Kohonen Maps, Elsevier, Amsterdam, 1999, pp. 303–315.

[14] S. Kaski, J. Nikkila, M. Oja, J. Venna, P. Toronen, E. Castren, Trustworthiness and metrics in visualizing similarity of gene expression, BMC Bioinformatics 4 (2003) 48.

[15] S. Kaski, J. Venna, Local multidimensional scaling, Neural Networks 19 (6) (2006) 889–899.

[16] J.A. Lee, M. Verleysen, Quality assessment of dimensionality reduction: rank-based criteria, Neurocomputing 72 (7–9) (2009) 1431–1443.

[17] J. Venna, J. Peltonen, K. Nybo, H. Aidos, S. Kaski, Information retrieval perspective to nonlinear dimensionality reduction for data visualization, Journal of Machine Learning Research 11 (2010) 451–490.

[18] A. Gisbrecht, B. Mokbel, B. Hammer, The Nyström approximation for relational generative topographic mappings, in: B. Hammer, F. Sha, L. van der Maaten, A. Smola (Eds.), Workshop Proceedings of the NIPS Workshop on Challenges of Data Visualization, 2010.

[19] B. Mokbel, A. Gisbrecht, B. Hammer, On the effect of clustering on quality assessment measures for dimensionality reduction, in: B. Hammer, F. Sha, L. van der Maaten, A. Smola (Eds.), Workshop Proceedings of the NIPS Workshop on Challenges of Data Visualization, 2010.

[20] T. Kohonen, Self-organizing Maps, Springer, 1995.

[21] T. Kohonen, P. Somervuo, How to make large self-organizing maps for nonvectorial data, Neural Networks 15 (2002) 945–952.

[22] J. Laub, V. Roth, J.M. Buhmann, K.-R. Müller, On the information and representation of non-Euclidean pairwise data, Pattern Recognition 39 (2006) 1815–1826.

[23] L.J.P. van der Maaten, G.E. Hinton, Visualizing high-dimensional data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.

[24] I. Olier, A. Vellido, J. Giraldo, Kernel generative topographic mapping, in: M. Verleysen (Ed.), ESANN'2010, 2010, pp. 481–486.

[25] E. Pekalska, R.P.W. Duin, The Dissimilarity Representation for Pattern Recognition, World Scientific, 2005.

[26] V. Roth, J. Laub, M. Kawanabe, J.M. Buhmann, Optimal cluster preserving embedding of nonmetric proximity data, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (12) (2003) 1540–1551.

[27] H. Mevissen, M. Vingron, Quantifying the local reliability of a sequence alignment, Protein Engineering 9 (1996) 127–132.

[28] B. Mokbel, A. Hasenfuss, B. Hammer, Graph-based representation of symbolic musical data, in: A. Torsello, F. Escolano, L. Brun (Eds.), GbRPR 2009, Springer, 2009, pp. 42–51.

[29] P. Tino, A. Kaban, Y. Sun, A generative probabilistic approach to visualizing sets of symbolic sequences, in: R. Kohavi, J. Gehrke, W. DuMouchel, J. Ghosh (Eds.), Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD-2004, ACM Press, 2004, pp. 701–706.

[30] H. Yin, On the equivalence between kernel self-organising maps and self-organising mixture density network, Neural Networks 19 (6) (2006) 780–784.

Andrej Gisbrecht received his Diploma in Computer Science in 2009 from the Clausthal University of Technology, Germany, and continued there as a PhD student. Since early 2010 he has been a PhD student at the Cognitive Interaction Technology Center of Excellence at Bielefeld University, Germany.

Bassam Mokbel is currently a PhD student at Bielefeld University, Germany, in the research group for theoretical computer science within the Center of Excellence for Cognitive Interaction Technology (CITEC). He studied computer science in Germany at the Clausthal University of Technology, where he received his Diploma degree in 2009, and afterwards became a research assistant. Before he joined the CITEC in April 2010, he worked in the state-funded collaborative research program ''IT Ecosystems'' about complex and heterogeneous distributed computer systems.

Barbara Hammer received her PhD in Computer Science in 1995 and her venia legendi in Computer Science in 2003, both from the University of Osnabrueck, Germany. From 2000 to 2004, she was the leader of the junior research group 'Learning with Neural Methods on Structured Data' at the University of Osnabrueck before accepting an offer as professor for Theoretical Computer Science at Clausthal University of Technology, Germany, in 2004. Since 2010, she has held a professorship for Theoretical Computer Science for Cognitive Systems at the CITEC cluster of excellence at Bielefeld University, Germany. Several research stays have taken her to Italy, the UK, India, France, the Netherlands, and the USA. Her areas of expertise include hybrid systems, self-organizing maps, clustering, and recurrent networks, as well as applications in bioinformatics, industrial process monitoring, and cognitive science.