expectation propagation in dynamical systems · 8/10/2012 · linear systems: kalman filter/smoother...

Expectation Propagation in Dynamical Systems

Marc Peter Deisenroth

Joint Work with Shakir Mohamed (UBC)

August 10, 2012

Marc Deisenroth (TU Darmstadt) EP in Dynamical Systems 1

Motivation

Figure: Complex time series: motion capture, GDP, climate

Time series in economics, robotics, motion capture, etc. have unknown dynamical structure, are high-dimensional, and are noisy

Flexible and accurate models: nonlinear (Gaussian process) dynamical systems (GPDS)

Accurate inference in (GP)DS is important for
    Better knowledge about latent structures
    Parameter learning


Outline

1 Inference in Time Series Models
    Filtering and Smoothing
    Expectation Propagation
    Approximating the Partition Function
    Relation to Smoothing

2 EP in Gaussian Process Dynamical Systems
    Gaussian Processes
    Filtering/Smoothing in GPDS
    Expectation Propagation in GPDS

3 Results


Inference in Time Series Models Filtering and Smoothing

Time Series Models

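Figure: Graphical model with latent states $x_{t-1}, x_t, x_{t+1}$ and measurements $z_{t-1}, z_t, z_{t+1}$

$x_t = f(x_{t-1}) + w, \quad w \sim \mathcal{N}(0, Q)$
$z_t = g(x_t) + v, \quad v \sim \mathcal{N}(0, R)$

Latent state $x \in \mathbb{R}^D$

Measurement/observation $z \in \mathbb{R}^E$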

Transition function f

Measurement function g



Inference in Time Series Models

Objective: Posterior distribution over latent variables $x_t$

Filtering (forward inference)
    Compute $p(x_t | z_{1:t})$ for $t = 1, \dots, T$
Smoothing (forward-backward inference)
    Compute $p(x_t | z_{1:t})$ for $t = 1, \dots, T$ (forward sweep)
    Compute $p(x_t | z_{1:T})$ for $t = T, \dots, 1$ (backward sweep)

Examples:

Linear systems: Kalman filter/smoother (Kalman, 1960)
Nonlinear systems: Approximate inference
    Extended Kalman Filter/Smoother (Kalman, 1959–1961)
    Unscented Kalman Filter/Smoother (Julier & Uhlmann, 1997)
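The forward and backward sweeps can be sketched for the linear case. This is a minimal illustration, assuming a scalar linear-Gaussian model; the coefficients `a`, `c`, the noise variances `q`, `r`, and the measurements `zs` are made-up values, not taken from the slides. The backward sweep is the standard Rauch-Tung-Striebel recursion.

```python
def kalman_filter(zs, a, c, q, r, m0, p0):
    """Forward sweep: means/variances of p(x_t | z_{1:t}).

    Assumed scalar model (for illustration only):
        x_t = a * x_{t-1} + w,  w ~ N(0, q)
        z_t = c * x_t     + v,  v ~ N(0, r)
    (m0, p0) are the mean and variance of the prior on x_1.
    """
    means, variances = [], []
    m_pred, p_pred = m0, p0
    for z in zs:
        # Measurement update: condition the prediction on z_t.
        s = c * p_pred * c + r          # predicted measurement variance
        k = p_pred * c / s              # Kalman gain
        m = m_pred + k * (z - c * m_pred)
        p = (1.0 - k * c) * p_pred
        means.append(m)
        variances.append(p)
        # Time update: predict x_{t+1} from the filtered x_t.
        m_pred = a * m
        p_pred = a * p * a + q
    return means, variances


def rts_smoother(means, variances, a, q):
    """Backward sweep: means/variances of p(x_t | z_{1:T})."""
    s_means, s_vars = list(means), list(variances)
    for t in range(len(means) - 2, -1, -1):
        m_pred = a * means[t]
        p_pred = a * variances[t] * a + q
        g = variances[t] * a / p_pred   # smoother gain
        s_means[t] = means[t] + g * (s_means[t + 1] - m_pred)
        s_vars[t] = variances[t] + g * (s_vars[t + 1] - p_pred) * g
    return s_means, s_vars


zs = [0.9, 1.1, 1.0, 1.3, 0.8]          # made-up measurements
f_mean, f_var = kalman_filter(zs, a=1.0, c=1.0, q=0.1, r=0.5, m0=0.0, p0=1.0)
s_mean, s_var = rts_smoother(f_mean, f_var, a=1.0, q=0.1)
```

At the final time step the smoothed and filtered marginals coincide; at earlier steps the smoothed variance is never larger than the filtered one, because smoothing conditions on more data.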



Machine Learning Perspective

Treat filtering/smoothing as an inference problem in graphical models with hidden variables

Allows for efficient, distributed local message passing

Messages are unnormalized probability distributions

Iterative refinement of the posterior marginals $p(x_t)$, $t = 1, \dots, T$: multiple forward-backward sweeps until global consistency (convergence)

Here: Expectation Propagation (Minka, 2001)


Inference in Time Series Models Expectation Propagation

Expectation Propagation

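Figure: Factor graph with variable nodes $x_t$ and $x_{t+1}$, transition factor $p(x_{t+1} | x_t)$, and measurement factors $p(z_t | x_t)$ and $p(z_{t+1} | x_{t+1})$

Inference in factor graphs

$p(x_t) = \prod_{i=1}^{n} t_i(x_t)$

$q(x_t) = \prod_{i=1}^{n} \tilde{t}_i(x_t)$

Approximate factors $\tilde{t}_i$ are members of the Exponential Family (e.g., Multinomial, Gamma, Gaussian)

Find a good approximation such that $q \approx p$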




Figure: Moment matching vs. mode matching. Borrowed from Bishop (2006)

EP locally minimizes KL(p||q), where p is the true distribution and q is an approximation (from the Exponential Family) to it.

EP = moment matching (unlike Variational Bayes ["mode matching"], which minimizes KL(q||p))

EP exploits properties of the Exponential Family: compute moments of distributions via derivatives of the log-partition function
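The difference between moment matching and fitting a single mode can be checked numerically. The sketch below assumes an illustrative bimodal target (an equal mixture of two Gaussians, not from the slides) and evaluates KL(p||q) on a grid for a moment-matched Gaussian and for a Gaussian placed on one mode.

```python
import math


def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


# Illustrative bimodal target p(x): equal mixture of N(-2, 0.25) and N(2, 0.25).
weights, means, variances = [0.5, 0.5], [-2.0, 2.0], [0.25, 0.25]


def p_pdf(x):
    return sum(w * math.exp(log_normal_pdf(x, m, v))
               for w, m, v in zip(weights, means, variances))


# Moment matching: q takes the mean and variance of p.
mm_mean = sum(w * m for w, m in zip(weights, means))
mm_var = sum(w * (v + m * m) for w, m, v in zip(weights, means, variances)) - mm_mean ** 2


def kl_p_q(q_mean, q_var, lo=-10.0, hi=10.0, n=4001):
    """KL(p || q) for Gaussian q, approximated by a Riemann sum on a grid."""
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        px = p_pdf(x)
        if px > 0.0:
            total += px * (math.log(px) - log_normal_pdf(x, q_mean, q_var)) * dx
    return total


kl_mm = kl_p_q(mm_mean, mm_var)     # broad Gaussian covering both modes
kl_mode = kl_p_q(2.0, 0.25)         # Gaussian sitting on the right-hand mode
```

Among all Gaussians, the moment-matched one minimizes KL(p||q), so `kl_mm` is smaller than `kl_mode` and also smaller than the value for any Gaussian with a shifted mean.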




Figure: Factor graph (left) and fully factored factor graph (right).

Write down the (fully factored) factor graph

$p(x_t) = \prod_{i=1}^{n} t_i(x_t)$

$q(x_t) = \prod_{i=1}^{n} \tilde{t}_i(x_t)$

Find approximate factors $\tilde{t}_i$ such that KL(p||q) is minimized.

Multiple sweeps through the graph until global consistency of the messages is assured



Messages in a Dynamical System

Approximate (factored) marginal: $q(x_t) = \prod_i \tilde{t}_i(x_t)$

Here, our messages $\tilde{t}_i$ have names:
    Measurement message $q_M$
    Forward message $q_B$
    Backward message $q_C$

Define cavity distribution: $q^{\backslash i}(x_t) = q(x_t) / \tilde{t}_i(x_t) = \prod_{k \neq i} \tilde{t}_k(x_t)$
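For Gaussian messages, products and quotients are easiest in natural parameters, where multiplying adds and dividing subtracts. The sketch below assumes scalar messages with made-up (mean, variance) values; the helper names are hypothetical.

```python
def to_natural(mean, var):
    """(mean, variance) -> (precision, precision * mean)."""
    return 1.0 / var, mean / var


def to_moments(precision, shift):
    """(precision, precision * mean) -> (mean, variance)."""
    return shift / precision, 1.0 / precision


def multiply(a, b):
    """Unnormalized product of two Gaussians given as (mean, variance)."""
    pa, sa = to_natural(*a)
    pb, sb = to_natural(*b)
    return to_moments(pa + pb, sa + sb)


def divide(a, b):
    """Unnormalized quotient a / b of two Gaussians given as (mean, variance)."""
    pa, sa = to_natural(*a)
    pb, sb = to_natural(*b)
    if pa - pb <= 0.0:
        raise ValueError("quotient is not a proper Gaussian (non-positive precision)")
    return to_moments(pa - pb, sa - sb)


# Made-up messages at one time step, each as (mean, variance).
q_M = (1.0, 2.0)    # measurement message
q_B = (0.0, 1.0)    # forward message
q_C = (0.5, 4.0)    # backward message

q = multiply(multiply(q_M, q_B), q_C)   # marginal q(x_t)
cavity_M = divide(q, q_M)               # cavity: q(x_t) / q_M(x_t)
```

Here `cavity_M` equals the product of the remaining two messages, `multiply(q_B, q_C)`, which is the product form of the cavity definition.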



Gaussian EP in More Detail

1 Write down the factor graph

2 Initialize all messages $\tilde{t}_i$, $i = M, B, C$

Until convergence:

3 For all latent variables $x_t$ and corresponding messages $\tilde{t}_i(x_t)$ do

    1 Compute the cavity distribution $q^{\backslash i}(x_t) = \mathcal{N}(x_t \,|\, \mu_t^{\backslash i}, \Sigma_t^{\backslash i})$ by Gaussian division.
    2 Compute the moments of $t_i(x_t)\, q^{\backslash i}(x_t)$: these are the updated moments of $q(x_t)$
    3 Compute updated message $\tilde{t}_i(x_t) = q(x_t) / q^{\backslash i}(x_t)$
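Steps 2 and 3 of the inner loop can be sketched for a single scalar variable. This illustration assumes the moments of $t_i\, q^{\backslash i}$ are computed by brute-force grid integration, which only works in low dimensions and is not the approximation used in the talk; the function names and numbers are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def ep_site_update(factor, cav_mean, cav_var, half_width=10.0, n=20001):
    """One EP update for a scalar variable.

    factor: true factor t_i(x), a callable.
    Returns (Z, post_mean, post_var, site_precision, site_shift), where the
    updated message is Gaussian with natural parameters (site_precision,
    site_shift). For factors that are not log-concave the site precision
    can come out negative.
    """
    sd = math.sqrt(cav_var)
    lo = cav_mean - half_width * sd
    dx = 2.0 * half_width * sd / (n - 1)
    z0 = z1 = z2 = 0.0
    for i in range(n):
        x = lo + i * dx
        w = factor(x) * normal_pdf(x, cav_mean, cav_var) * dx
        z0 += w
        z1 += w * x
        z2 += w * x * x
    # Step 2: moments of t_i(x) q^{\i}(x) become the moments of q(x).
    post_mean = z1 / z0
    post_var = z2 / z0 - post_mean ** 2
    # Step 3: updated message = q / cavity, a subtraction in natural parameters.
    site_precision = 1.0 / post_var - 1.0 / cav_var
    site_shift = post_mean / post_var - cav_mean / cav_var
    return z0, post_mean, post_var, site_precision, site_shift


# Sanity check with a linear-Gaussian factor p(z | x) = N(z | x, 0.5), z = 1.2,
# and cavity N(0, 1): EP must reproduce the exact Kalman update.
Z, post_mean, post_var, site_precision, site_shift = ep_site_update(
    lambda x: normal_pdf(1.2, x, 0.5), cav_mean=0.0, cav_var=1.0)
```

In this linear case the updated message is the likelihood itself, with precision 1/0.5 and shift 1.2/0.5, so a single update is already exact.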





Updating the Measurement Message

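Measurement message

$$q_M(x_t) = \frac{\mathrm{proj}\big[\, t_M(x_t)\; q^{\backslash M}(x_t) \,\big]}{q^{\backslash M}(x_t)}$$

where $t_M(x_t)$ is the true factor and $q^{\backslash M}(x_t)$ is the cavity distribution.

The proj[·] operator projects onto Exponential Family distributions. It is implemented by taking derivatives of the log-partition function $\log Z_M$, where

$$Z_M = \int t_M(x_t)\, q^{\backslash M}(x_t)\, dx_t\,, \qquad t_M(x_t) = p(z_t | x_t)$$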
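How derivatives of $\log Z_M$ yield the projected moments can be illustrated in the linear-Gaussian case, where $Z_M$ is available in closed form. The sketch uses the standard Gaussian EP identities $\mu = \mu^{\backslash M} + \Sigma^{\backslash M} \nabla_\mu$ and $\Sigma = \Sigma^{\backslash M} - \Sigma^{\backslash M} (\nabla_\mu \nabla_\mu^\top - 2 \nabla_\Sigma) \Sigma^{\backslash M}$; the measurement matrix `J`, the noise covariance `R`, and all numbers are assumptions made for the example.

```python
import numpy as np


def log_z_gradients(z, J, R, mu, Sigma):
    """Gradients of log Z_M w.r.t. the cavity mean and covariance.

    Assumes a linear measurement z = J x + v, v ~ N(0, R), so that
    Z_M = N(z | J mu, J Sigma J^T + R) holds exactly.
    """
    S = J @ Sigma @ J.T + R
    d_mu = J.T @ np.linalg.solve(S, z - J @ mu)
    d_Sigma = 0.5 * (np.outer(d_mu, d_mu) - J.T @ np.linalg.solve(S, J))
    return d_mu, d_Sigma


def moments_from_gradients(mu, Sigma, d_mu, d_Sigma):
    """Moments of the projected posterior from the gradients of log Z_M."""
    new_mu = mu + Sigma @ d_mu
    new_Sigma = Sigma - Sigma @ (np.outer(d_mu, d_mu) - 2.0 * d_Sigma) @ Sigma
    return new_mu, new_Sigma


# Made-up cavity distribution and measurement model.
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
J = np.array([[1.0, 0.5]])
R = np.array([[0.2]])
z = np.array([1.5])

d_mu, d_Sigma = log_z_gradients(z, J, R, mu, Sigma)
ep_mu, ep_Sigma = moments_from_gradients(mu, Sigma, d_mu, d_Sigma)

# Reference: textbook Kalman measurement update.
S = J @ Sigma @ J.T + R
K = Sigma @ J.T @ np.linalg.inv(S)
kf_mu = mu + K @ (z - J @ mu)
kf_Sigma = Sigma - K @ J @ Sigma
```

Both routes give the same posterior moments, which is the linear special case in which the EP measurement update reduces to the Kalman filter update.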



Updating in Context: Forward Message

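Figure: Factor graph (left) and fully factored factor graph (right).

Forward message: need to take the coupling between $x_t$ and $x_{t+1}$ into account (lost when writing down the fully factored factor graph).

Key insight: want a close approximation

$$\underbrace{q_C(x_{t+1})\, q_M(x_{t+1})}_{\text{context } q^{\backslash B}(x_{t+1})}\; q_B(x_{t+1}) \;\approx\; q^{\backslash B}(x_{t+1}) \int p(x_{t+1} | x_t)\, q_B(x_t)\, q_M(x_t)\, dx_t$$

Achieve this by projection

$$q_B(x_{t+1}) = \frac{\mathrm{proj}\big[\, q^{\backslash B}(x_{t+1})\; t_B(x_{t+1}) \,\big]}{q^{\backslash B}(x_{t+1})}\,, \qquad t_B(x_{t+1}) = \int p(x_{t+1} | x_t)\, q_B(x_t)\, q_M(x_t)\, dx_t$$

where $q^{\backslash B}(x_{t+1})$ is the cavity distribution and $t_B(x_{t+1})$ is the true factor.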







Inference in Time Series Models Approximating the Partition Function

Key Points and Challenge

EP is based on matching the moments of $t_i(x_t)\, q^{\backslash i}(x_t)$

Computing the partition function

$$Z_i(\mu_t^{\backslash i}, \Sigma_t^{\backslash i}) = \int t_i(x_t)\, q^{\backslash i}(x_t)\, dx_t$$

and its derivatives with respect to $\mu_t^{\backslash i}$ and $\Sigma_t^{\backslash i}$ is sufficient for EP (a property of the Exponential Family)

Tricky part: the integral is generally not solvable in closed form for nonlinear systems with continuous variables



Approach

Interpretation of the partition function $Z_i$ as a probability distribution.

Example: Measurement message

$$Z_M = \int t_M(x)\, q^{\backslash M}(x)\, dx = \int p(z | x)\, q^{\backslash M}(x)\, dx = p(z)$$

Idea: approximate $Z_M = p(z)$ by a Gaussian distribution

Take the derivatives of the log of this Gaussian approximation with respect to the moments of the cavity distribution

Get updated moments for the posterior and the messages

Fixes the intractability problems, but we are no longer exact



Possible Gaussian Approximations

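Example: Measurement message

$$Z_M = \int t_M(x)\, q^{\backslash M}(x)\, dx = \int t_M(x)\, \mathcal{N}(x \,|\, \mu^{\backslash M}, \Sigma^{\backslash M})\, dx\,, \qquad t_M(x) = \mathcal{N}(z \,|\, g(x), S)$$

Linearize $g$ at $\mu^{\backslash M}$: integral tractable

Gaussian moment matching: compute mean and variance of $Z_M$, then approximate $Z_M$ by a Gaussian with the correct mean/variance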
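The two Gaussian approximations can be compared on a scalar example. The sketch below assumes an illustrative measurement function $g(x) = 4\sin(4x)$, a made-up cavity distribution, and grid quadrature for the moment-matching integrals (a stand-in for whatever analytic or sigma-point scheme is actually used).

```python
import math


def g(x):
    return 4.0 * math.sin(4.0 * x)


def g_prime(x):
    return 16.0 * math.cos(4.0 * x)


def linearized_moments(cav_mean, cav_var, noise_var):
    """EKF-style: first-order Taylor expansion of g at the cavity mean."""
    mu_z = g(cav_mean)
    var_z = g_prime(cav_mean) ** 2 * cav_var + noise_var
    return mu_z, var_z


def moment_matched_moments(cav_mean, cav_var, noise_var, half_width=8.0, n=16001):
    """Mean/variance of z = g(x) + v under the Gaussian cavity, by grid quadrature."""
    sd = math.sqrt(cav_var)
    lo = cav_mean - half_width * sd
    dx = 2.0 * half_width * sd / (n - 1)
    e1 = e2 = 0.0
    for i in range(n):
        x = lo + i * dx
        w = math.exp(-0.5 * (x - cav_mean) ** 2 / cav_var) \
            / math.sqrt(2.0 * math.pi * cav_var) * dx
        e1 += w * g(x)
        e2 += w * g(x) ** 2
    return e1, e2 - e1 ** 2 + noise_var


# Made-up cavity N(0.3, 0.25) and measurement noise variance 0.01.
lin_mean, lin_var = linearized_moments(0.3, 0.25, 0.01)
mm_mean, mm_var = moment_matched_moments(0.3, 0.25, 0.01)
```

With a cavity this broad relative to the period of $g$, linearization places the predicted measurement mean near the peak of the sine, while the moment-matched mean is much closer to zero; either pair `(mean, var)` defines the Gaussian that replaces $Z_M$.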





Inference in Time Series Models Relation to Smoothing

Theoretical Results

$$Z_M = \int t_M(x)\, q^{\backslash M}(x)\, dx = \int t_M(x)\, \mathcal{N}(x \,|\, \mu^{\backslash M}, \Sigma^{\backslash M})\, dx\,, \qquad t_M(x) = \mathcal{N}(z \,|\, g(x), S)$$

Relation to Common Filters/Smoothers

Approximating $Z_M$ by a Gaussian is equivalent to approximating $p(x, z)$ by a Gaussian, an approximation that is common to almost all filtering algorithms (Deisenroth & Ohlsson, ACC 2011)

Generalizing Common Smoothers

Linearizing g(x) in ZM generalizes the EKS to an iterative procedure

Moment matching generalizes the ADS to an iterative procedure



Interesting Side Effects

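To minimize the KL divergence, EP updates require the derivatives

$$\frac{\partial \log Z_M}{\partial \mu^{\backslash M}}\,, \qquad \frac{\partial \log Z_M}{\partial \Sigma^{\backslash M}}$$

The Gaussian approximation $Z_M = p(z) \approx \mathcal{N}(\mu_z, \Sigma_z)$ is exact if and only if there is a linear relationship between $x$ and $z$, i.e.,

$$z = Jx\,, \qquad x \sim \mathcal{N}(\mu^{\backslash M}, \Sigma^{\backslash M})$$

for some $J$, in which case $\mu_z, \Sigma_z$ have a special form

Linearity must be explicitly encoded in the partial derivatives!

Example:

$$\frac{\partial \log Z_M}{\partial \mu^{\backslash M}} = \frac{\partial \log Z_M}{\partial \mu_z} \frac{\partial \mu_z}{\partial \mu^{\backslash M}} = (z - \mu_z)^\top \Sigma_z^{-1} J^\top$$

Even if $\mu_z$ is a general function of $\mu^{\backslash M}$ and $\Sigma^{\backslash M}$, this must be ignored. Otherwise: inconsistent EP updates! (Deisenroth & Mohamed, arXiv preprint, 2012)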


Illustration: Toy Tracking Problem

Figure: State vs. time step for the toy tracking problem. Left: ground truth and EKS. Right: ground truth and EP-EKS.

Iteratively improving the posteriors via EP can heal the EKS





EP in Gaussian Process Dynamical Systems

Gaussian Process Dynamical Systems

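Figure: Graphical model with latent states $x_{t-1}, x_t, x_{t+1}$ and measurements $z_{t-1}, z_t, z_{t+1}$

$x_t = f(x_{t-1}) + w, \quad w \sim \mathcal{N}(0, Q)$
$z_t = g(x_t) + v, \quad v \sim \mathcal{N}(0, R)$

State x (not observed)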

Measurement/observation z

GP distribution p(f) over transition function f

GP distribution p(g) over measurement function g


EP in Gaussian Process Dynamical Systems Gaussian Processes

Gaussian Processes for Flexible Modeling

Non-parametric method: flexible, i.e., shape of function adapts to data
Probabilistic method: consistently describes uncertainties about the unknown function
Sufficient: specification of high-level assumptions (e.g., smoothness)
Automatic trade-off between data-fit and complexity of the function (Occam's razor)

Figure: GP model of a transition function (horizontal axis: $(x_{t-1}, u_{t-1})$; vertical axis: $x_t$)



Gaussian Process Regression

Mathematically: Probability distribution over functions

Bayesian inference tractable:
    1 Specify high-level prior beliefs $p(f)$ about the function (e.g., smoothness)
    2 Observe data $X$, $y = f(X) + \varepsilon$
    3 Compute posterior distribution $p(f | X, y)$ over functions

Bayes' theorem:

$$p(f | X, y) = \frac{p(y | X, f)\, p(f)}{p(y | X)}$$

$p(f)$: Prior (over functions)
$p(y | X, f)$: Likelihood (noise model)
$p(f | X, y)$: Posterior (over functions)
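The three steps can be sketched with the standard closed-form GP regression equations. This is a minimal illustration assuming a zero-mean prior, a squared-exponential kernel, scalar inputs, and made-up training data; the function names and hyperparameter values are hypothetical and not the models used in the talk.

```python
import numpy as np


def se_kernel(A, B, lengthscale=1.0, signal_var=1.0):
    """Squared-exponential covariance between 1-D input arrays A and B."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return signal_var * np.exp(-0.5 * d2 / lengthscale ** 2)


def gp_posterior(X, y, X_star, noise_var=1e-2, lengthscale=1.0, signal_var=1.0):
    """Posterior mean and covariance of f at test inputs X_star."""
    K = se_kernel(X, X, lengthscale, signal_var) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_s = se_kernel(X, X_star, lengthscale, signal_var)
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = se_kernel(X_star, X_star, lengthscale, signal_var) - v.T @ v
    return mean, cov


# Made-up training data: noisy-free samples of sin(x) at five inputs.
X = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
y = np.sin(X)
X_star = np.linspace(-5.0, 5.0, 11)

post_mean, post_cov = gp_posterior(X, y, X_star, noise_var=1e-4)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))
```

Near the training inputs the posterior standard deviation shrinks towards the noise level, and far from the data it returns to the prior standard deviation.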



Pictorial Introduction to Gaussian Processes

Figure: Prior belief about the function (f(x) vs. x).

Figure: Observe some function values (f(x) vs. x).

Figure: Posterior belief about the function (f(x) vs. x).


EP in Gaussian Process Dynamical Systems Filtering/Smoothing in GPDS

Gaussian Process Dynamical Systems

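Figure: Graphical model with latent states $x_{t-1}, x_t, x_{t+1}$ and measurements $z_{t-1}, z_t, z_{t+1}$

$x_t = f(x_{t-1}) + w, \quad w \sim \mathcal{N}(0, Q)$
$z_t = g(x_t) + v, \quad v \sim \mathcal{N}(0, R)$

GP distribution p(f) over transition function f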

GP distribution p(g) over measurement function g

Let’s talk about inference in GPDSs



Inference in GPDS

Figure: Mapping an input distribution $p(x_{t-1}, u_{t-1})$ through a GP model of $\Delta_t$, giving the predictive distribution $p(\Delta_t)$

Objective: Gaussian approximations to the joints $p(x_t, z_t | z_{1:t-1})$ and $p(x_{t-1}, x_t | z_{1:t-1})$ are sufficient for Gaussian filtering/smoothing (Deisenroth & Ohlsson, ACC 2011)

Mapping distributions through a GP requires approximations, e.g.,
    Linearization of the posterior GP mean function (red)
    Moment matching (blue)

Filtering/smoothing in GPDS: GP-EKS, GP-ADS, GP-CKS, ... (Deisenroth et al., ICML 2009; Deisenroth et al., IEEE-TAC, 2012)





EP in Gaussian Process Dynamical Systems Expectation Propagation in GPDS

EP in GPDS

Generalize single-sweep forward-backward smoothing in GPDSs to an iterative procedure using EP

Slightly more involved than EP in nonlinear systems (e.g., EP-EKS): also have to average over the function distribution (GP)

Key idea the same as before: approximate the partition function by a Gaussian distribution (Deisenroth & Mohamed, arXiv preprint, 2012)
    Linearization of the posterior mean function (e.g., Ko & Fox, 2009): EP-GPEKS
    Moment matching (e.g., Quinonero-Candela et al., 2003): EP-GPADS

Results

Results: Synthetic Data (1)

Figure: GP model with training set and ground truth (f(x) vs. x)

$x_{t+1} = 4\sin(4x_t) + w, \quad w \sim \mathcal{N}(0, 0.1^2)$
$z_t = 4\sin(4x_t) + v, \quad v \sim \mathcal{N}(0, 0.1^2)$

Initial state distribution $p(x_1) = \mathcal{N}(0, 1)$: very broad

30 training points for GP models, randomly selected

Tracking horizon: 20 time steps
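Sampling one trajectory from this system takes only a few lines. The sketch below is a plain simulation of the two equations; the random seed and the function name are arbitrary choices for the illustration.

```python
import math
import random


def simulate(T=20, noise_std=0.1, seed=0):
    """Sample states x_1..x_T and measurements z_1..z_T from the toy system."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)                 # x_1 ~ N(0, 1)
    xs, zs = [], []
    for _ in range(T):
        xs.append(x)
        # Measurement: z_t = 4 sin(4 x_t) + v
        zs.append(4.0 * math.sin(4.0 * x) + rng.gauss(0.0, noise_std))
        # Transition: x_{t+1} = 4 sin(4 x_t) + w
        x = 4.0 * math.sin(4.0 * x) + rng.gauss(0.0, noise_std)
    return xs, zs


xs, zs = simulate()
```

Because the sine wraps many states onto the same output, a single measurement is compatible with several latent states, which is what makes the posterior hard to track from a broad prior.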


Results: Synthetic Data (2)

Figure (a): Posterior trajectories with confidence bounds (state vs. time step): true state, posterior state distribution (EP-GPADS), posterior state distribution (GPADS).

Figure (b): Average NLL per data point as a function of the EP iteration with standard error: EP-GPADS vs. GPADS.

After convergence, the posterior is spot on (left)

Iterating EP greatly improves predictive power (right)


Results: Pendulum Tracking

Pendulum

Method      NLLx            MAEx           LPUx
GPEKS       −0.29 ± 0.30    0.30 ± 0.02    −2.76 ± 0.12
EP-GPEKS    −0.24 ± 0.33    0.31 ± 0.02    −2.77 ± 0.12
GPADS       −0.75 ± 0.06    0.29 ± 0.02    −2.52 ± 0.06
EP-GPADS    −0.79 ± 0.06    0.29 ± 0.02    −2.58 ± 0.04

NLL: negative log likelihood (predictive performance)
MAE: mean absolute error (error of the posterior mean)
LPU: log posterior uncertainty (tightness of the posterior)

Linearization-based inference: variances too small, EP makes things worse

Moment-matching based inference: coherent estimates, EP improves posterior
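Two of these measures can be sketched for scalar Gaussian posterior marginals. The definitions below are assumptions based on the names only (average Gaussian negative log likelihood of the true state, and average absolute error of the posterior mean); the slides do not spell out the formulas, and LPU is left out because its exact definition is not given here.

```python
import math


def average_nll(true_states, post_means, post_vars):
    """Average negative log likelihood of the true states under N(mean, var)."""
    total = 0.0
    for x, m, v in zip(true_states, post_means, post_vars):
        total += 0.5 * math.log(2.0 * math.pi * v) + 0.5 * (x - m) ** 2 / v
    return total / len(true_states)


def mean_absolute_error(true_states, post_means):
    """Average absolute distance between true states and posterior means."""
    return sum(abs(x - m) for x, m in zip(true_states, post_means)) / len(true_states)


# Made-up example: same posterior means, different claimed variances.
truth = [1.0, -0.5, 0.2]
means = [0.8, -0.1, 0.0]
nll_calibrated = average_nll(truth, means, [0.1, 0.1, 0.1])
nll_overconfident = average_nll(truth, means, [0.001, 0.001, 0.001])
mae = mean_absolute_error(truth, means)
```

The overconfident posterior has the same MAE but a much larger NLL, which is why variances that are too small hurt the NLL column even when the mean error is unchanged.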


Results: Motion Capture Data

10 trials of golf swings recorded at 40 Hz (mocap.cs.cmu.edu)

Observations $z \in \mathbb{R}^{56}$

Latent space $x \in \mathbb{R}^{3}$

7 training sequences, 3 test sequences

GPDS learning via GPDM approach (Wang et al., 2008)

Summary

General framework for iterative inference in dynamical systems

Key: Approximation of the partition function

Rederive classical filters/smoothers as a special case

Promising results in (GP)DS

[email protected]

http://www.ias.tu-darmstadt.de/Team/MarcDeisenroth



References

[1] C. M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer-Verlag, 2006.

[2] M. P. Deisenroth, M. F. Huber, and U. D. Hanebeck. Analytic Moment-based Gaussian Process Filtering. In L. Bottou and M. L. Littman, editors, Proceedings of the 26th International Conference on Machine Learning, pages 225–232, Montreal, QC, Canada, June 2009. Omnipress.

[3] M. P. Deisenroth and S. Mohamed. Expectation Propagation in Gaussian Process Dynamical Systems, July 2012. http://arxiv.org/abs/1207.2940.

[4] M. P. Deisenroth and H. Ohlsson. A General Perspective on Gaussian Filtering and Smoothing: Explaining Current and Deriving New Algorithms. In Proceedings of the American Control Conference, 2011.

[5] M. P. Deisenroth, R. Turner, M. Huber, U. D. Hanebeck, and C. E. Rasmussen. Robust Filtering and Smoothing with Gaussian Processes. IEEE Transactions on Automatic Control, 57(7):1865–1871, 2012. doi:10.1109/TAC.2011.2179426.

[6] S. J. Julier and J. K. Uhlmann. A New Extension of the Kalman Filter to Nonlinear Systems. In Proceedings of AeroSense: 11th Symposium on Aerospace/Defense Sensing, Simulation and Controls, pages 182–193, 1997.

[7] R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems. Transactions of the ASME, Journal of Basic Engineering, 82(Series D):35–45, 1960.

[8] J. Ko and D. Fox. GP-BayesFilters: Bayesian Filtering using Gaussian Process Prediction and Observation Models. Autonomous Robots, 27(1):75–90, July 2009.

[9] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, January 2001.

[10] J. Quinonero-Candela, A. Girard, J. Larsen, and C. E. Rasmussen. Propagation of Uncertainty in Bayesian Kernel Models: Application to Multiple-Step Ahead Forecasting. In IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 701–704, April 2003.

[11] J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian Process Dynamical Models for Human Motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2008.

