
An Alternative Infinite Mixture Of Gaussian Process Experts

Edward Meeds and Simon Osindero
Department of Computer Science
University of Toronto
Toronto, M5S 3G4
{ewm,osindero}@cs.toronto.edu

Abstract

We present an infinite mixture model in which each component comprises a multivariate Gaussian distribution over an input space, and a Gaussian Process model over an output space. Our model is neatly able to deal with non-stationary covariance functions, discontinuities, multi-modality and overlapping output signals. The work is similar to that by Rasmussen and Ghahramani [1]; however, we use a full generative model over input and output space rather than just a conditional model. This allows us to deal with incomplete data, to perform inference over inverse functional mappings as well as for regression, and also leads to a more powerful and consistent Bayesian specification of the effective ‘gating network’ for the different experts.

1 Introduction

Gaussian process (GP) models are powerful tools for regression, function approximation, and predictive density estimation. However, despite their power and flexibility, they suffer from several limitations. The computational requirements scale cubically with the number of data points, thereby necessitating a range of approximations for large datasets. Another problem is that it can be difficult to specify priors and perform learning in GP models if we require non-stationary covariance functions, multi-modal output, or discontinuities.

There have been several attempts to circumvent some of these lacunae, for example [2, 1]. In particular the Infinite Mixture of Gaussian Process Experts (IMoGPE) model proposed by Rasmussen and Ghahramani [1] neatly addresses the aforementioned key issues. In a single GP model, an n by n matrix must be inverted during inference. However, if we use a model composed of multiple GP's, each responsible only for a subset of the data, then the computational complexity of inverting an n by n matrix is replaced by several inversions of smaller matrices — for large datasets this can result in a substantial speed-up and may allow one to consider large-scale problems that would otherwise be unwieldy. Furthermore, by combining multiple stationary GP experts, we can easily accommodate non-stationary covariance and noise levels, as well as distinctly multi-modal outputs. Finally, by placing a Dirichlet process prior over the experts we can allow the data and our prior beliefs (which may be rather vague) to automatically determine the number of components to use.

In this work we present an alternative infinite model that is strongly inspired by the work in [1], but which uses a different formulation for the mixture of experts that is in the style presented in, for example, [3, 4].


Figure 1: Left: Graphical model for the standard MoE model [6]. The expert indicators z(i) are specified by a gating network applied to the inputs x(i). Right: An alternative view of the MoE model using a full generative model [4]. The distribution of input locations is now given by a mixture model, with components for each expert. Conditioned on the input locations, the posterior responsibilities for each mixture component behave like a gating network.

This alternative approach effectively uses posterior responsibilities from a mixture distribution as the gating network. Even if the task at hand is simply output density estimation or regression, we suggest that a full generative model over inputs and outputs might be preferable to a purely conditional model. The generative approach retains all the strengths of [1] and also has a number of potential advantages, such as being able to deal with partially specified data (e.g. missing input co-ordinates) and being able to infer inverse functional mappings (i.e. the input space given an output value). The generative approach also affords us a richer and more consistent way of specifying our prior beliefs about how the covariance structure of the outputs might vary as we move within input space.

An example of the type of generative model which we propose is shown in figure 2. We use a Dirichlet process prior over a countably infinite number of experts and each expert comprises two parts: a density over input space describing the distribution of input points associated with that expert, and a Gaussian Process model over the outputs associated with that expert. In this preliminary exposition, we restrict our attention to experts whose input space densities are given by a single full-covariance Gaussian. Even this simple approach demonstrates interesting performance and capabilities. However, in a more elaborate setup the input density associated with each expert might itself be an infinite mixture of simpler distributions (for instance, an infinite mixture of Gaussians [5]) to allow for the most flexible partitioning of input space amongst the experts.

The structure of the paper is as follows. We begin in section 2 with a brief overview of two ways of thinking about Mixtures of Experts. Then, in section 3, we give the complete specification and graphical depiction of our generative model, and in section 4 we outline the steps required to perform Monte Carlo inference and prediction. In section 5 we present the results of several simple simulations that highlight some of the salient features of our proposal, and finally in section 6, we discuss our work and place it in relation to similar techniques.

2 Mixtures of Experts

In the standard mixture of experts (MoE) model [6], a gating network probabilistically mixes regression components. One subtlety in using GP's in a mixture of experts model is that IID assumptions on the data no longer hold and we must specify joint distributions for each possible assignment of experts to data. Let {x(i)} be the set of d-dimensional input vectors, {y(i)} be the set of scalar outputs, and {z(i)} be the set of expert indicators which assign data points to experts.

The likelihood of the outputs, given the inputs, is specified in equation 1, where θ^GP_r represents the GP parameters of the rth expert, θ_g represents the parameters of the gating network, and the summation is over all possible configurations of indicator variables.

[Figure 2 appears here: a graphical model with plates over j = 1:D, r = 1:K, and i = 1:N_r.]

Figure 2: The graphical model representation of the alternative infinite mixture of GP experts (AiMoGPE) model proposed in this paper. We have used x_r(i) to represent the ith data point in the set of input data whose expert label is r, and Y_r to represent the set of all output data whose expert label is r. In other words, input data are IID given their expert label, whereas the sets of output data are IID given their corresponding sets of input data. The lightly shaded boxes with rounded corners represent hyper-hyper parameters that are fixed (Ω in the text). The DP concentration parameter α_0, the expert indicator variables z(i), the gate hyperparameters φ_x = {µ_0, Σ_0, ν_c, S}, the gate component parameters ψ^x_r = {µ_r, Σ_r}, and the GP expert parameters θ^GP_r = {v_0r, v_1r, w_jr} are all updated for all r and j.

P({y(i)} | {x(i)}, θ) = Σ_Z P({z(i)} | {x(i)}, θ_g) ∏_r P({y(i) : z(i) = r} | {x(i) : z(i) = r}, θ^GP_r)    (1)

There is an alternative view of the MoE model in which the experts also generate the inputs, rather than simply being conditioned on them [3, 4] (see figure 1). This alternative view employs a joint mixture model over input and output space, even though the objective is still primarily that of estimating conditional densities, i.e. outputs given inputs. The gating network effectively gets specified by the posterior responsibilities of each of the different components in the mixture. An advantage of this perspective is that it can easily accommodate partially observed inputs and it also allows ‘reverse-conditioning’, should we wish to estimate where in input space a given output value is likely to have originated. For a mixture model using Gaussian Process experts, the likelihood is given by

P({x(i)}, {y(i)} | θ) = Σ_Z P({z(i)} | θ_g) ∏_r P({y(i) : z(i) = r} | {x(i) : z(i) = r}, θ^GP_r) P({x(i) : z(i) = r} | θ_g)    (2)

where the description of the density over input space is encapsulated in θ_g.
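To make the role of the posterior responsibilities concrete, the following sketch (our own illustration, not code from the paper) computes the effective gating weights for a Gaussian-gated mixture: each expert r with mixing weight π_r and input density N(x; µ_r, Σ_r) receives responsibility proportional to π_r N(x; µ_r, Σ_r).

```python
import numpy as np
from scipy.stats import multivariate_normal

def gating_responsibilities(x, weights, means, covs):
    """Posterior responsibilities p(z = r | x) for a Gaussian-gated mixture.

    x       : (d,) input location
    weights : (K,) mixing proportions (sum to 1)
    means   : list of K mean vectors for the experts' input densities
    covs    : list of K covariance matrices
    """
    # Unnormalised responsibilities: pi_r * N(x; mu_r, Sigma_r)
    dens = np.array([w * multivariate_normal.pdf(x, mean=m, cov=C)
                     for w, m, C in zip(weights, means, covs)])
    return dens / dens.sum()   # normalise so the 'gate' outputs sum to 1

# Toy usage: two experts partitioning a 1-D input space
resp = gating_responsibilities(np.array([0.5]),
                               weights=[0.5, 0.5],
                               means=[np.array([-1.0]), np.array([2.0])],
                               covs=[np.eye(1), np.eye(1)])
print(resp)  # the expert centred nearer x = 0.5 gets most of the responsibility
```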

3 Infinite Mixture of Gaussian Processes: A Joint Generative Model

The graphical structure for our full generative model is shown in figure 2. Our generative process does not produce IID data points and is therefore most simply formulated either as a joint distribution over a dataset of a given size, or as a set of conditionals in which we incrementally add data points. To construct a complete set of N sample points from the prior (specified by top-level hyper-parameters Ω) we would perform the following operations (a simplified code sketch of this procedure follows the list):

1. Sample the Dirichlet process concentration variable α_0 given the top-level hyper-parameters.

2. Construct a partition of N objects into at most N groups using a Dirichlet process. This assignment of objects is denoted by the set of indicator variables {z(i)}_{i=1}^N.

3. Sample the gate hyperparameters φ_x given the top-level hyperparameters.

4. For each grouping of indicators {z(i) : z(i) = r}, sample the input space parameters ψ^x_r conditioned on φ_x. ψ^x_r defines the density in input space, in our case a full-covariance Gaussian.

5. Given the parameters ψ^x_r for each group, sample the locations of the input points X_r ≡ {x(i) : z(i) = r}.

6. For each group, sample the hyper-parameters for the GP expert associated with that group, θ^GP_r.

7. Using the input locations X_r and hyper-parameters θ^GP_r for the individual groups, formulate the GP output covariance matrix and sample the set of output values, Y_r ≡ {y(i) : z(i) = r}, from this joint Gaussian distribution.
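The sketch below walks through these seven steps in simplified form. It is meant to illustrate the generative flow only: the specific hyperprior shapes and scales are arbitrary stand-ins, ψ^x_r is drawn from a fixed Normal/inverse-Wishart rather than the full hierarchy of equations (7)–(8), and the GP is taken to have zero mean with the covariance of equation (4).

```python
import numpy as np
from scipy.stats import gamma, invwishart, lognorm, multivariate_normal

rng = np.random.default_rng(0)
N, D = 200, 1                          # number of points, input dimension

# 1. DP concentration from a gamma prior (assumed shape/scale values)
alpha0 = gamma.rvs(a=2.0, scale=1.0, random_state=rng)

# 2. Partition of N points via the Polya urn / Chinese restaurant process
z = np.zeros(N, dtype=int)
counts = [1]
for i in range(1, N):
    p = np.array(counts + [alpha0], dtype=float)
    z[i] = rng.choice(len(p), p=p / p.sum())
    if z[i] == len(counts):
        counts.append(1)
    else:
        counts[z[i]] += 1

data_x, data_y = np.zeros((N, D)), np.zeros(N)
for r in range(len(counts)):
    idx = np.where(z == r)[0]
    # 3-4. Input-space parameters for this expert (simplified priors)
    mu_r = rng.normal(0.0, 3.0, size=D)
    Sigma_r = np.atleast_2d(invwishart.rvs(df=D + 2, scale=np.eye(D), random_state=rng))
    # 5. Input locations for this expert
    Xr = multivariate_normal.rvs(mean=mu_r, cov=Sigma_r, size=len(idx), random_state=rng)
    Xr = np.atleast_2d(Xr).reshape(len(idx), D)
    # 6. GP hyper-parameters: signal variance v0, noise v1, length scales w_j
    v0 = gamma.rvs(a=2.0, scale=1.0, random_state=rng)
    v1 = gamma.rvs(a=2.0, scale=0.05, random_state=rng)
    w = lognorm.rvs(s=0.5, scale=1.0, size=D, random_state=rng)
    # 7. Covariance from equation (4), then a joint Gaussian draw for the outputs
    diff = (Xr[:, None, :] - Xr[None, :, :]) / w
    K = v0 * np.exp(-0.5 * (diff ** 2).sum(-1)) + v1 * np.eye(len(idx))
    Yr = rng.multivariate_normal(np.zeros(len(idx)), K)
    data_x[idx], data_y[idx] = Xr, Yr
```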

We write the full joint distribution of our model as follows.

P({x(i), y(i)}_{i=1}^N, {z(i)}_{i=1}^N, {ψ^x_r}_{r=1}^N, {θ^GP_r}_{r=1}^N, α_0, φ_x | N, Ω) =
    ∏_{r=1}^N [ H^N_r P(ψ^x_r | φ_x) P(X_r | ψ^x_r) P(θ^GP_r | Ω) P(Y_r | X_r, θ^GP_r) + (1 − H^N_r) D_0(ψ^x_r, θ^GP_r) ]
    × P({z(i)}_{i=1}^N | N, α_0) P(α_0 | Ω) P(φ_x | Ω)    (3)

where we have used the supplementary notation: H^N_r = 0 if {z(i) : z(i) = r} is the empty set and H^N_r = 1 otherwise; and D_0(ψ^x_r, θ^GP_r) is a delta function on an (irrelevant) dummy set of parameters to ensure proper normalisation.

For the GP components, we use a standard, stationary covariance function of the form

Q(x(i), x(h)) = v_0 exp( −(1/2) Σ_{j=1}^D (x(i)_j − x(h)_j)² / w_j² ) + δ(i, h) v_1    (4)
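As a concrete reference, a direct NumPy transcription of the covariance function in equation (4) might look as follows (our own sketch; variable names mirror the symbols above).

```python
import numpy as np

def gp_covariance(X1, X2, v0, v1, w):
    """Stationary covariance of equation (4).

    X1 : (n, D) and X2 : (m, D) input locations
    v0 : signal variance, v1 : noise variance
    w  : (D,) per-dimension length scales w_j
    """
    # Scaled squared distances, summed over input dimensions
    diff = (X1[:, None, :] - X2[None, :, :]) / w
    K = v0 * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))
    if X1.shape == X2.shape and np.array_equal(X1, X2):
        K = K + v1 * np.eye(X1.shape[0])   # the delta(i, h) * v1 noise term
    return K
```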

The individual distributions in equation 3 are defined as follows¹:

P(α_0 | Ω) = G(α_0; a_α0, b_α0)    (5)
P({z(i)}_{i=1}^N | N, Ω) = PU(α_0, N)    (6)
P(φ_x | Ω) = N(µ_0; µ_x, Σ_x/f_0) W(Σ_0^{−1}; ν_0, f_0 Σ_x^{−1}/ν_0) G(ν_c; a_νc, b_νc) W(S^{−1}; ν_S, f_S Σ_x/ν_S)    (7)
P(ψ^x_r | Ω) = N(µ_r; µ_0, Σ_0) W(Σ_r^{−1}; ν_c, S/ν_c)    (8)
P(X_r | ψ^x_r) = N(X_r; µ_r, Σ_r)    (9)
P(θ^GP_r | Ω) = G(v_0r; a_0, b_0) G(v_1r; a_1, b_1) ∏_{j=1}^D LN(w_jr; a_w, b_w)    (10)
P(Y_r | X_r, θ^GP_r) = N(Y_r; µ_Qr, σ²_Qr)    (11)

¹ We use the notation N, W, G, and LN to represent the normal, the Wishart, the gamma, and the log-normal distributions, respectively; we use the parameterizations found in [7] (Appendix A). The notation PU refers to the Polya urn distribution [8].

In an approach similar to Rasmussen [5], we use the input data mean µ_x and covariance Σ_x to provide an automatic normalisation of our dataset. We also incorporate additional hyperparameters f_0 and f_S, which allow prior beliefs about the variation in location of µ_r and size of Σ_r, relative to the data covariance.
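As an illustration of this data-driven normalisation, the sketch below centres the prior over µ_0 on the empirical mean and ties the prior over S to the empirical covariance, in the spirit of equation (7). The values of f_0, f_S and the degrees of freedom are placeholders, and SciPy's (df, scale) Wishart parameterization is used, which may differ from the parameterization in [7].

```python
import numpy as np
from scipy.stats import multivariate_normal, wishart

def gate_hyperprior_draw(X, f0=1.0, fS=1.0, nu0=None, nuS=None, rng=None):
    """Draw mu_0 and S from data-normalised priors (illustrative values only)."""
    rng = np.random.default_rng() if rng is None else rng
    D = X.shape[1]
    nuS = D if nuS is None else nuS
    mu_x = X.mean(axis=0)                             # empirical input mean
    Sigma_x = np.cov(X, rowvar=False).reshape(D, D)   # empirical input covariance
    # mu_0 ~ N(mu_x, Sigma_x / f0): location prior tied to the data spread
    mu0 = multivariate_normal.rvs(mean=mu_x, cov=Sigma_x / f0, random_state=rng)
    # S^{-1} ~ W(nu_S, f_S * Sigma_x / nu_S): expert-size prior tied to the data covariance
    S_inv = wishart.rvs(df=nuS, scale=fS * Sigma_x / nuS, random_state=rng)
    return mu0, np.linalg.inv(np.atleast_2d(S_inv))
```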

4 Monte Carlo Updates

Almost all the integrals and summations required for inference and learning operations within our model are analytically intractable, and therefore necessitate Monte Carlo approximations. Fortunately, all the necessary updates are relatively straightforward to carry out using a Markov Chain Monte Carlo (MCMC) scheme employing Gibbs sampling and Hybrid Monte Carlo. We also note that in our model the predictive density depends on the entire set of test locations (in input space). This transductive behaviour follows from the non-IID nature of the model and the influence that test locations have on the posterior distribution over mixture parameters. Consequently, the marginal predictive distribution at a given location can depend on the other locations for which we are making simultaneous predictions. This may or may not be desired. In some situations the ability to incorporate the additional information about the input density at test time may be beneficial. However, it is also straightforward to effectively ‘ignore’ this new information and simply compute a set of independent single location predictions.

Given a set of test locations {x*(t)}, along with training data pairs {x(i), y(i)} and top-level hyper-parameters Ω, we iterate through the following conditional updates to produce our predictive distribution for unknown outputs {y*(t)}. The parameter updates are all conjugate with the prior distributions, except where noted:

1. Update indicators z(i) by cycling through the data and sampling one indicator variable at a time. We use algorithm 8 from [9] with m = 1 to explore new experts.
2. Update input space parameters.
3. Update GP hyper-parameters using Hybrid Monte Carlo [10].
4. Update gate hyperparameters. Note that ν_c is updated using slice sampling [11].
5. Update DP hyperparameter α_0 using the data augmentation technique of Escobar and West [12] (a sketch of this update follows the list).
6. Resample missing output values by cycling through the experts, and jointly sampling the missing outputs associated with that GP.
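As an illustration of step 5, the standard Escobar and West auxiliary-variable update for a DP concentration parameter under a Gamma(a, b) prior (shape a, rate b) can be sketched as below. This is the generic form of the update, not the authors' code.

```python
import numpy as np

def update_alpha0(alpha0, k, n, a, b, rng=None):
    """One Gibbs update for the DP concentration parameter alpha0.

    k : current number of occupied experts, n : number of data points,
    (a, b) : shape and rate of the Gamma prior on alpha0.
    Auxiliary-variable scheme of Escobar and West [12].
    """
    rng = np.random.default_rng() if rng is None else rng
    # Auxiliary variable eta ~ Beta(alpha0 + 1, n)
    eta = rng.beta(alpha0 + 1.0, n)
    # Mixture weight for the two possible gamma components
    odds = (a + k - 1.0) / (n * (b - np.log(eta)))
    pi_eta = odds / (1.0 + odds)
    shape = a + k if rng.random() < pi_eta else a + k - 1.0
    rate = b - np.log(eta)
    return rng.gamma(shape, 1.0 / rate)   # numpy's gamma takes a scale parameter
```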

We perform some preliminary runs to estimate the longest auto-covariance time, τ_max, for our posterior estimates, and then use a burn-in period that is about 10 times this timescale before taking samples every τ_max iterations.² For our simulations the auto-covariance time was typically 40 complete update cycles, so we use a burn-in period of 500 iterations and collect samples every 50.
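A rough way to carry out this diagnostic, assuming we monitor a scalar summary of the chain (e.g. the number of occupied experts per sweep), is sketched below; the threshold and the 10x burn-in rule simply mirror the heuristic described above. In practice one would monitor several summaries and take the largest estimate as τ_max.

```python
import numpy as np

def autocovariance_time(chain, threshold=0.1):
    """Smallest lag at which the normalised autocovariance of a scalar chain
    drops below `threshold` (a crude stand-in for tau_max)."""
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    acf = np.correlate(x, x, mode='full')[len(x) - 1:]   # lags 0 .. n-1
    acf = acf / acf[0]
    below = np.nonzero(acf < threshold)[0]
    return int(below[0]) if len(below) else len(x)

def burnin_and_thin(chain, factor=10):
    """Apply the '10 x tau_max burn-in, thin every tau_max' heuristic."""
    tau = max(1, autocovariance_time(chain))
    return np.asarray(chain)[factor * tau::tau], tau
```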

5 Experiments

5.1 Samples From The Prior

In figure 3 (A) we give an example of data drawn from our model which is multi-modal and non-stationary. We also use this artificial dataset to confirm that our MCMC algorithm performs well and is able to recover sensible posterior distributions. Posterior histograms for some of the inferred parameters are shown in figure 3 (B), and we see that they are well clustered around the ‘true’ values.

² This is primarily for convenience. It would also be valid to use all the samples after the burn-in period, and although they could not be considered independent, they could be used to obtain a more accurate estimator.


Figure 3: (A) A set of samples from our model prior. The different marker styles are used to indicate the sets of points from different experts. (B) The posterior distribution of log α_0 with its true value indicated by the dashed line (top) and the distribution of occupied experts (bottom). We note that the posterior mass is located in the vicinity of the true values.

5.2 Inference On Toy Data

To illustrate some of the features of our model we constructed a toy dataset consisting of 4 continuous functions, to which we added different levels of noise. The functions used were as follows (a short data-generation sketch is given after the equations):

f_1(a_1) = 0.25 a_1² − 40,    a_1 ∈ (0 … 15),    Noise SD: 7    (12)
f_2(a_2) = −0.0625 (a_2 − 18)² + 0.5 a_2 + 20,    a_2 ∈ (35 … 60),    Noise SD: 7    (13)
f_3(a_3) = 0.008 (a_3 − 60)³ − 70,    a_3 ∈ (45 … 80),    Noise SD: 4    (14)
f_4(a_4) = −sin(0.25 a_4) − 6,    a_4 ∈ (80 … 100),    Noise SD: 2    (15)
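For reference, data of this form can be regenerated roughly as follows. The number of points per segment and the uniform placement of inputs within each range are assumptions, not details stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

def segment(f, lo, hi, noise_sd, n):
    """Sample n inputs uniformly on (lo, hi) and add Gaussian noise to f."""
    a = rng.uniform(lo, hi, size=n)
    return a, f(a) + rng.normal(0.0, noise_sd, size=n)

segments = [
    (lambda a: 0.25 * a**2 - 40,                       0,  15, 7),  # f1, eq (12)
    (lambda a: -0.0625 * (a - 18)**2 + 0.5 * a + 20,  35,  60, 7),  # f2, eq (13)
    (lambda a: 0.008 * (a - 60)**3 - 70,              45,  80, 4),  # f3, eq (14)
    (lambda a: -np.sin(0.25 * a) - 6,                 80, 100, 2),  # f4, eq (15)
]

n_per_segment = 50   # assumed; not specified in the text
xs, ys = zip(*[segment(f, lo, hi, sd, n_per_segment) for f, lo, hi, sd in segments])
x, y = np.concatenate(xs), np.concatenate(ys)
```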

The resulting data has non-stationary noise levels, non-stationary covariance, discontinuities and significant multi-modality. Figure 4 shows our results on this dataset along with those from a single GP for comparison.

We see that in order to account for the entire data set with a single GP, we are forced to infer an unnecessarily high level of noise in the function. Also, a single GP is unable to capture the multi-modality or non-stationarity of the data distribution. In contrast, our model seems much more able to deal with these challenges.

Since we have a full generative model over both input and output space, we are also able to use our model to infer likely input locations given a particular output value. There are a number of applications for which this might be relevant, for example if one wanted to sample candidate locations at which to evaluate a function we are trying to optimise. We provide a simple illustration of this in figure 4 (B). We choose three output levels and, conditioned on the output having these values, we sample for the input location. The inference seems plausible and our model is able to suggest locations in input space for a maximal output value (+40) that was not seen in the training data.
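A simple grid-based caricature of this ‘reverse-conditioning’ step is sketched below. It assumes a hypothetical interface in which each expert is summarised by a mixing weight, 1-D Gaussian input-density parameters, and callables giving its GP predictive mean and variance; the paper's actual procedure works within the MCMC scheme rather than on a grid.

```python
import numpy as np
from scipy.stats import norm

def sample_input_given_output(y_star, experts, x_grid, n_samples=1, rng=None):
    """Sample likely 1-D input locations for a given output value y_star.

    experts : list of (weight, mu_r, var_r, pred_mean_fn, pred_var_fn) tuples,
              where pred_mean_fn / pred_var_fn return the expert's GP predictive
              mean and variance at the grid points (hypothetical interface).
    """
    rng = np.random.default_rng() if rng is None else rng
    density = np.zeros_like(x_grid, dtype=float)
    for weight, mu_r, var_r, pred_mean, pred_var in experts:
        p_x = norm.pdf(x_grid, loc=mu_r, scale=np.sqrt(var_r))       # p(x | expert r)
        p_y = norm.pdf(y_star, loc=pred_mean(x_grid),
                       scale=np.sqrt(pred_var(x_grid)))              # p(y* | x, expert r)
        density += weight * p_x * p_y
    density /= density.sum()
    return rng.choice(x_grid, size=n_samples, p=density)

# Toy usage with two hand-made 'experts' (purely illustrative)
x_grid = np.linspace(-20, 120, 500)
experts = [
    (0.5,  7.5, 20.0, lambda x: 0.25 * x**2 - 40, lambda x: np.full_like(x, 49.0)),
    (0.5, 90.0, 35.0, lambda x: -np.sin(0.25 * x) - 6, lambda x: np.full_like(x, 4.0)),
]
print(sample_input_given_output(10.0, experts, x_grid, n_samples=5))
```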

5.3 Regression on a simple “real-world” dataset

We also apply our model and algorithm to the motorcycle dataset of [13]. This is a commonly used dataset in the GP community and therefore serves as a useful basis for comparison. In particular, it also makes it easy to see how our model compares with standard GP's and with the work of [1]. Figure 5 compares the performance of our model with that of a single GP.


Figure 4: Results on a toy dataset. (A) The training data is shown along with the predictive mean of a stationary covariance GP and the median of the predictive distribution of our model. (B) The small dots are samples from the model (160 samples per location) evaluated at 80 equally spaced locations across the range (but plotted with a small amount of jitter to aid visualisation). These illustrate the predictive density from our model. The solid lines show the ±2 SD interval from a regular GP. The circular markers at ordinates of 40, 10 and −100 show samples from ‘reverse-conditioning’, where we sample likely abscissa locations given the test ordinate and the set of training data.

In particular, we note that although the median of our model closely resembles the mean of the single GP, our model is able to more accurately model the low noise level on the left side of the dataset. For the remainder of the dataset, the noise levels modeled by our model and by a single GP are very similar, although our model is better able to capture the behaviour of the data at around 30 ms. It is difficult to make an exact comparison to [1]; however, we can speculate that our model is more realistically modeling the noise at the beginning of the dataset by not inferring an overly "flat" GP expert at that location. We can also report that our expert adjacency matrix closely resembles that of [1].

6 Discussion

We have presented an alternative framework for an infinite mixture of GP experts. We feel that our proposed model carries over the strengths of [1] and augments these with several desirable additional features. The pseudo-likelihood objective function used to adapt the gating network defined in [1] is not guaranteed to lead to a self-consistent distribution, and therefore the results may depend on the order in which the updates are performed; our model incorporates a consistent Bayesian density formulation for both input and output spaces by definition. Furthermore, in our most general framework we are more naturally able to specify priors over the partitioning of space between different expert components. Also, since we have a full joint model, we can infer inverse functional mappings.

There should be considerable gains to be made by allowing the input density models to be more powerful. This would make it easier for arbitrary regions of space to share the same covariance structures; at present the areas ‘controlled’ by a particular expert tend to be local. Consequently, a potentially undesirable aspect of the current model is that strong clustering in input space can lead us to infer several expert components even if a single GP would do a good job of modelling the data. An elegant way of extending the model in this way might be to use a separate infinite mixture distribution for the input density of each expert, perhaps incorporating a hierarchical DP prior across the infinite set of experts to allow information to be shared.

With regard to applications, it might be interesting to further explore our model's capability to infer inverse functional mappings; perhaps this could be useful in an optimisation or active learning context. Finally, we note that although we have focused on rather small examples so far, it seems that the inference techniques should scale well to larger problems and more practical tasks.


Figure 5: (A) Motorcycle impact data together with the median of our model's point-wise predictive distribution and the predictive mean of a stationary covariance GP model. (B) The small dots are samples from our model (160 samples per location) evaluated at 80 equally spaced locations across the range (but plotted with a small amount of jitter to aid visualisation). The solid lines show the ±2 SD interval from a regular GP.


Acknowledgments

Thanks to Ben Marlin for sharing slice sampling code and to Carl Rasmussen for making minimize.m available.

References

[1] C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14, pages 881–888. MIT Press, 2002.
[2] V. Tresp. Mixture of Gaussian processes. In Advances in Neural Information Processing Systems, volume 13. MIT Press, 2001.
[3] Z. Ghahramani and M. I. Jordan. Supervised learning from incomplete data via an EM approach. In Advances in Neural Information Processing Systems 6, pages 120–127. Morgan Kaufmann, 1995.
[4] L. Xu, M. I. Jordan, and G. E. Hinton. An alternative model for mixtures of experts. In Advances in Neural Information Processing Systems 7, pages 633–640. MIT Press, 1995.
[5] C. E. Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, volume 12, pages 554–560. MIT Press, 2000.
[6] R. A. Jacobs, M. I. Jordan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3, 1991.
[7] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall, 2nd edition, 2004.
[8] D. Blackwell and J. B. MacQueen. Ferguson distributions via Polya urn schemes. The Annals of Statistics, 1(2):353–355, 1973.
[9] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000.
[10] R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, University of Toronto, 1993.
[11] R. M. Neal. Slice sampling (with discussion). Annals of Statistics, 31:705–767, 2003.
[12] M. Escobar and M. West. Computing Bayesian nonparametric hierarchical models. In Practical Nonparametric and Semiparametric Bayesian Statistics, number 133 in Lecture Notes in Statistics. Springer-Verlag, 1998.
[13] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, Series B, 47:1–52, 1985.