TRANSCRIPT
-
Stochastic optimal control theory with applications in neuroscience

Bert Kappen
SNN Donders Institute, Radboud University, Nijmegen
Gatsby Unit, UCL London
August 26, 2013
-
How to control a device?
Plant is unknown
Exploration of state space
Motor babbling in infants
Problem for brains and for robots
Bert Kappen Nijmegen Summerschool 1/43
-
How to find your way home?
How to navigate to previously visited locations?
-
Intractability due to uncertainty
Noise affects the optimal control qualitatively.

Optimal control computation is only tractable for simple cases:
- deterministic problems, using the PMP approach
- LQ problems
-
The big idea
Linear Bellman equation and path integral
Express a control computation as an inference computation

Approximate inference
Intractable inference problems can be made efficient using statistical physics methods
-
Outline
• Link between control theory, inference and statistical physics
– Hopf ’50, Fleming Mitter ’82, Kappen ’05
• How to control a device?
– Motor babbling as importance sampling
• How to find your way home?
– KL control theory
– Efficient alternative for RL
– Model of hippocampus
– Computation by simulation
-
Discrete time optimal control
Consider the control of a discrete time deterministic dynamical system:
x_{t+1} = x_t + f(x_t, u_t),   t = 0, 1, ..., T−1

x_t describes the state and u_t specifies the control or action at time t.

Given x_0 and u_{0:T−1}, we can compute x_{1:T}.

Define a cost for each sequence of controls:

C(x_0, u_{0:T−1}) = ∑_{t=0}^{T−1} R(x_t, u_t)

Find the sequence u_{0:T−1} that minimizes C(x_0, u_{0:T−1}).
-
Dynamic programming
Find the minimal cost path from A to J.
C(J) = 0, C(H) = 3, C(I) = 4
C(F ) = min(6 + C(H), 3 + C(I)) = 7
The minimal cost at time t is easily expressible in terms of the minimal cost at time t+1.
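The backward recursion can be sketched in code. The slide's full graph is not reproduced here, so the edges below form a minimal hypothetical stage graph consistent with the stated values (H → J costs 3, I → J costs 4, and F has the two options with costs 6 and 3):

```python
# Backward dynamic programming on a directed stage graph.
# cost_to_go[x] = min over successors y of (edge_cost(x, y) + cost_to_go[y])
edges = {
    "F": {"H": 6, "I": 3},
    "H": {"J": 3},
    "I": {"J": 4},
}

def cost_to_go(node, goal="J"):
    if node == goal:
        return 0
    # Bellman recursion: minimal cost over all allowed moves
    return min(c + cost_to_go(y) for y, c in edges[node].items())

print(cost_to_go("H"))  # 3
print(cost_to_go("I"))  # 4
print(cost_to_go("F"))  # min(6+3, 3+4) = 7
```

The recursion visits each node only once per path; for larger graphs one would memoize the cost-to-go table instead of recomputing it.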
-
Discrete time optimal control
Dynamic programming uses the concept of the optimal cost-to-go J(t, x).

One can recursively compute J(t, x) from J(t+1, x) for all x in the following way:

J(t, x_t) = min_{u_{t:T−1}} ∑_{s=t}^{T−1} R(x_s, u_s) = min_{u_t} ( R(x_t, u_t) + J(t+1, x_t + f(x_t, u_t)) )

J(T, x) = 0

J(0, x) = min_{u_{0:T−1}} C(x, u_{0:T−1})

This is called the Bellman equation. It computes the optimal control u_t(x) for all intermediate t, x.
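A minimal tabular sketch of this recursion, on an illustrative integer state space with dynamics x_{t+1} = x + u and cost R(x, u) = x² + u² (these choices are mine, not from the slides):

```python
# Finite-horizon Bellman recursion J(t, x) on a small integer state space.
# Dynamics: x_{t+1} = x + u with u in {-1, 0, +1}; cost R(x, u) = x**2 + u**2.
T = 5
states = range(-5, 6)
controls = [-1, 0, 1]

J = {(T, x): 0 for x in states}          # boundary condition J(T, x) = 0
policy = {}
for t in reversed(range(T)):
    for x in states:
        best = None
        for u in controls:
            xn = max(-5, min(5, x + u))  # keep the next state on the grid
            c = x**2 + u**2 + J[(t + 1, xn)]
            if best is None or c < best:
                best, policy[(t, x)] = c, u
        J[(t, x)] = best

print(J[(0, 3)], policy[(0, 3)])  # optimal cost-to-go and first action from x = 3
```

As on the slide, the sweep runs backward from t = T and yields the optimal control u_t(x) for every intermediate t and x, not just for the initial state.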
-
Stochastic optimal control
Consider a stochastic dynamical system

dx_i = f_i(x, u) dt + dξ_i,   ⟨dξ_i dξ_j⟩ = ν_{ij} dt

Given x(0), find the control sequence u(0 → T) that minimizes the expected future cost

C = ⟨ φ(x(T)) + ∫_0^T dt R(x(t), u(t)) ⟩

The expectation is over all trajectories given the control path.

J(t, x) = min_u ( R(x, u) dt + ⟨J(t+dt, x+dx)⟩ )

−∂_t J(t, x) = min_u ( R(x, u) + f(x, u)^T ∇_x J(t, x) + (1/2) Tr(ν ∇_x^2 J(t, x)) )

with boundary condition J(x, T) = φ(x). This is the HJB equation.
-
Path integral control theory
dx = f(x, t) dt + g(x, t)(u dt + dξ)

C = ⟨ φ(x(T)) + ∫_t^T ds ( V(x(s), s) + (1/2) u(s)^T R u(s) ) ⟩

with ⟨dξ_a dξ_b⟩ = ν_{ab} dt and R = λν^{−1}, λ > 0.

The HJB equation becomes

−∂_t J = min_u ( (1/2) u^T R u + V + (f + gu)^T ∇J + (1/2) Tr(g ν g^T ∇^2 J) )

with boundary condition J(x, T) = φ(x).
-
Path integral control theory
Minimization wrt u yields the non-linear HJB:

u = −R^{−1} g^T ∇J

−∂_t J = −(1/2) (∇J)^T g R^{−1} g^T (∇J) + V + f^T ∇J + (1/2) Tr(g ν g^T ∇^2 J)

Define ψ(x, t) through J(x, t) = −λ log ψ(x, t). We obtain a linear HJB:

∂_t ψ = ( V/λ − f^T ∇ − (1/2) Tr(g ν g^T ∇^2) ) ψ
-
Feynman-Kac formula
Denote by Q(τ|x, t) the distribution over uncontrolled trajectories that start at x, t:

dx = f(x, t) dt + g(x, t) dξ

with τ a trajectory x(t → T). Then

ψ(x, t) = ∫ dQ(τ|x, t) exp( −S(τ)/λ ) = E_Q( e^{−S/λ} )

S(τ) = φ(x(T)) + ∫_t^T ds V(x(s), s)

ψ can be computed by forward sampling of the uncontrolled process.
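This forward sampling can be sketched for an illustrative one-dimensional case (f = 0, g = 1, V = 0, end cost φ(x) = x²; none of these choices come from the slides), for which a closed-form value of ψ is available for comparison:

```python
import math, random

# Monte Carlo estimate of psi(x, t) = E_Q[exp(-S/lambda)] by forward
# sampling of the uncontrolled process.  Illustrative 1-d case:
# f = 0, g = 1, V = 0, end cost phi(x) = x**2, so S(tau) = x(T)**2.
def psi_mc(x, t, T=1.0, lam=1.0, nu=0.5, n_steps=20, n_samples=20_000, seed=0):
    rng = random.Random(seed)
    dt = (T - t) / n_steps
    total = 0.0
    for _ in range(n_samples):
        xs = x
        for _ in range(n_steps):              # uncontrolled dynamics dx = dxi
            xs += rng.gauss(0.0, math.sqrt(nu * dt))
        total += math.exp(-xs ** 2 / lam)     # exp(-S/lambda)
    return total / n_samples

est = psi_mc(0.0, 0.0)
# Gaussian integral: for x(T) ~ N(0, sigma^2), sigma^2 = nu*(T-t),
# E[exp(-x(T)**2/lam)] = 1/sqrt(1 + 2*sigma^2/lam)
exact = 1.0 / math.sqrt(1.0 + 2 * 0.5 / 1.0)
print(est, exact)
```

For this Gaussian case x(T) could be sampled in a single step; the explicit time stepping is kept to mirror the general uncontrolled diffusion.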
-
Posterior distribution over optimal trajectories
ψ(x, t) can be interpreted as a partition sum for the distribution over paths under optimal control:

P(τ|x, t) = (1/ψ(x, t)) Q(τ|x, t) exp( −S(τ)/λ )

The optimal cost-to-go is a free energy:

J(x, t) = −λ log E_Q( e^{−S/λ} )

The optimal control is an expectation wrt P:

u(x, t) dt = E_P(dξ) = E_Q( dξ e^{−S/λ} ) / E_Q( e^{−S/λ} )
-
Recap
Control problem:

dx = f dt + g(u dt + dξ),   C = ⟨ φ + ∫_t^T ( V + (1/2) u^T R u ) ⟩,   R = λν^{−1}

The HJB is linear:

∂_t ψ = Hψ,   J = −λ log ψ

The solution is given by the Feynman-Kac formula: ψ = E_Q(e^{−S/λ}), with Q the distribution over the uncontrolled dynamics (u = 0).

The optimal control is an expectation value:

u dt = E_Q( dξ e^{−S/λ} ) / E_Q( e^{−S/λ} )
-
Motor babbling: estimate the optimal control by importance sampling

Initialize û = 0.

Iterate:

• Generate samples from Q′(τ) using the random control û dt + dξ, ν = λR^{−1}:

dx = f dt + g(û dt + dξ)

(block diagram: the plant maps x_t, u_t to x_{t+dt})

This can be computed using a simulator, without knowledge of f, g.

• Update the control:

u dt = û dt + E_{Q′}( dξ e^{−S′/λ} ) / E_{Q′}( e^{−S′/λ} )

This converges to the optimal stochastic control solution.
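A minimal sketch of this iteration for an illustrative one-dimensional problem (f = 0, g = 1, V = 0, end cost φ(x) = x², R = 1 so ν = λ; all of these choices are mine, not from the slides). The weight e^{−S′/λ} includes the Girsanov correction for sampling with the control û:

```python
import math, random

# Motor-babbling importance-sampling sketch.  u_hat is an open-loop
# control path that is re-estimated at every iteration from reweighted
# rollouts of the sampling control u_hat*dt + dxi.
def motor_babbling(x0=1.0, T=1.0, lam=1.0, n_steps=5, n_samples=5000,
                   n_iter=20, seed=0):
    rng = random.Random(seed)
    dt = T / n_steps
    nu = lam                          # nu = lam * R^{-1} with R = 1
    u_hat = [0.0] * n_steps
    for _ in range(n_iter):
        num = [0.0] * n_steps         # accumulates dxi * exp(-S'/lam)
        den = 0.0                     # accumulates exp(-S'/lam)
        for _ in range(n_samples):
            x, s_u, dxis = x0, 0.0, []
            for k in range(n_steps):
                dxi = rng.gauss(0.0, math.sqrt(nu * dt))
                dxis.append(dxi)
                # Girsanov correction for sampling with control u_hat
                s_u += 0.5 * u_hat[k] ** 2 * dt + u_hat[k] * dxi
                x += u_hat[k] * dt + dxi
            w = math.exp(-(x ** 2 + s_u) / lam)   # S' = phi(x(T)) + correction
            den += w
            for k in range(n_steps):
                num[k] += dxis[k] * w
        # u dt = u_hat dt + E_{Q'}[dxi e^{-S'/lam}] / E_{Q'}[e^{-S'/lam}]
        u_hat = [u_hat[k] + num[k] / (den * dt) for k in range(n_steps)]
    return u_hat

u = motor_babbling()
print(u[0])   # close to the LQ value -2*x0/3 for these parameters
```

Because the reweighting is unbiased, each iteration produces a fresh estimate of the optimal control; iterating with the improved û concentrates the samples where the weights are large, which is the importance-sampling benefit.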
-
Acrobot
Joint angles q_1, q_2:

d_{11}(q) q̈_1 + d_{12}(q) q̈_2 + h_1(q, q̇) + φ_1(q) = 0
d_{21}(q) q̈_1 + d_{22}(q) q̈_2 + h_2(q, q̇) + φ_2(q) = u

We can write these equations in standard form

dx_i = f_i(x) dt + g_i(x) u dt

with x_1 = q_1, x_2 = q_2, x_3 = q̇_1, x_4 = q̇_2.
-
Acrobot
(figure: four panels over the iterations — final height per trajectory, the state s, the costs J and φ, and the control increment (mean and std))

100 iterations; at each iteration 50 stochastic trajectories were generated, and the noise was lowered at each iteration. Top left: final height for each stochastic trajectory at each iteration (red) and for each deterministic solution (blue).
-
Acrobot
(movie92.mp4)
Result after 100 trials
-
Darmstadt simulator: Beer pong
(beer pong video) (beer pong video)
Left: a PID controller provides a trajectory-based solution for one particular target location. Right: we demonstrate that PI feedback control can adapt to a changing target location and/or noise.
-
Application in robotics
(ICREA2011.mp4)
(Theodorou et al. 2010)
-
KL control theory
x denotes the state of the agent and x_{1:T} is a path through state space from time t = 1 to T.

q(x_{1:T}|x_0) denotes a probability distribution over possible future trajectories given that the agent at time t = 0 is in state x_0, with

q(x_{1:T}|x_0) = ∏_{t=0}^{T−1} q(x_{t+1}|x_t)

q(x_{t+1}|x_t) implements the allowed moves.

V(x_{1:T}) = ∑_{t=1}^{T} V(x_t) is the total cost when following path x_{1:T}.

The KL control problem is to find the probability distribution p(x_{1:T}|x_0) that minimizes

C(p|x_0) = ∑_{x_{1:T}} p(x_{1:T}|x_0) ( log[ p(x_{1:T}|x_0) / q(x_{1:T}|x_0) ] + V(x_{1:T}) ) = KL(p||q) + ⟨V⟩_p
-
KL control theory
p(x_{1:T}|x_0) and q(x_{1:T}|x_0) are distributions over trajectories.

Given q, find p that minimizes

C(p|x_0) = KL(p||q) + ⟨V⟩_p

The solution and the optimal control cost are

p(x_{1:T}|x_0) = (1/Z(x_0)) q(x_{1:T}|x_0) exp( −V(x_{1:T}) )

C = −log Z(x_0)

Z(x_0) = ∑_{x_{1:T}} q(x_{1:T}|x_0) exp( −V(x_{1:T}) )

NB: Z(x_0) is an integral over paths.
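For a small discrete problem, Z(x_0), the optimal cost and the optimal path distribution can be computed by brute-force enumeration. The two-state chain below (uniform q, state costs V) is an illustrative choice:

```python
import itertools, math

# KL control on a tiny 2-state example: enumerate all paths x_{1:T}
# and compute Z(x0), the optimal cost C and the optimal distribution p.
q = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}   # uncontrolled dynamics
V = {0: 0.0, 1: 1.0}                              # state costs
T, x0 = 3, 0

Z = 0.0
p = {}
for path in itertools.product([0, 1], repeat=T):
    qp = 1.0
    prev = x0
    for x in path:                    # q(x_{1:T}|x0) = prod_t q(x_{t+1}|x_t)
        qp *= q[prev][x]
        prev = x
    w = qp * math.exp(-sum(V[x] for x in path))
    p[path] = w
    Z += w
p = {path: w / Z for path, w in p.items()}        # optimal p(x_{1:T}|x0)
C = -math.log(Z)                                  # optimal control cost

print(C, p[(0, 0, 0)])
```

Enumeration is only feasible for toy problems; the backward messages on the next slide compute the same marginals without summing over all paths.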
-
KL control theory
The optimal control at time t = 0 is given by
p(x_1|x_0) = ∑_{x_{2:T}} p(x_{1:T}|x_0) ∝ q(x_1|x_0) exp(−V(x_1)) β_1(x_1)

with β_t(x) the backward messages on the chain x_0 → x_1 → ... → x_T:

β_T(x_T) = 1

β_{t−1}(x_{t−1}) = ∑_{x_t} q(x_t|x_{t−1}) exp(−V(x_t)) β_t(x_t)
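The backward recursion can be sketched directly; the two-state transition matrix q and the costs V below are illustrative:

```python
import math

# Backward messages beta_t(x) for a two-state KL control problem.
q = [[0.8, 0.2], [0.5, 0.5]]     # q[x][y] = q(y|x), the allowed moves
V = [0.0, 1.0]                   # state costs
T = 3

beta = [1.0, 1.0]                # beta_T(x) = 1
for t in range(T, 1, -1):        # compute beta_{T-1}, ..., beta_1
    beta = [sum(q[x][y] * math.exp(-V[y]) * beta[y] for y in (0, 1))
            for x in (0, 1)]

# optimal first step: p(x1|x0) ~ q(x1|x0) exp(-V(x1)) beta_1(x1)
x0 = 0
unnorm = [q[x0][y] * math.exp(-V[y]) * beta[y] for y in (0, 1)]
p1 = [w / sum(unnorm) for w in unnorm]
print(p1)   # the cheap state 0 dominates
```

The cost of one backward sweep is O(T · n²) for n states, in contrast to the exponential cost of path enumeration.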
-
Link to continuous path integral formulation
The previous continuous path integral control can be obtained as a special case of the KL control formulation.

dx = f(x, t) dt + g(x, t)(u dt + dξ),   ⟨dξ^2⟩ = ν dt

p(x_{t+dt}|x_t, u_t) = N(x_{t+dt} | x_t + f(x_t, t) dt + g(x_t, t) u_t dt, Ξ(x_t, t))

q(x_{t+dt}|x_t) = N(x_{t+dt} | x_t + f(x_t, t) dt, Ξ(x_t, t))

C(p|x_0) = KL(p||q) + ⟨V⟩ = ∑_{x_{dt:T}} p(x_{dt:T}|x_0) ∑_{t=dt}^{T} ( (1/2) u_t^T ν^{−1} u_t + V(x_t) )
-
Average cost KL control
When T → ∞ and q is ergodic, the backward message recursion

β_{t−1}(x_{t−1}) = ∑_{x_t} H(x_{t−1}, x_t) β_t(x_t),   H(x, y) = q(y|x) exp(−V(y))

becomes the computation of the Perron-Frobenius eigenpair (β(·), λ):

Hβ = λβ

The optimal control satisfies

p(y|x) = q(y|x) exp(−V(y)) β(y) / (λ β(x))

C(x_0) = −log β(x_0) − T log λ
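The eigenpair can be computed by power iteration on H, using the definition H(x, y) = q(y|x) exp(−V(y)) from the backward recursion. The three-state ring q and the costs V below are illustrative choices:

```python
import math

# Power iteration for the Perron-Frobenius eigenpair (beta, lam) of
# H(x, y) = q(y|x) exp(-V(y)).
n = 3
V = [0.0, 1.0, 0.5]
q = [[0.0, 0.5, 0.5],            # uncontrolled random walk on a ring
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
H = [[q[x][y] * math.exp(-V[y]) for y in range(n)] for x in range(n)]

beta = [1.0 / n] * n             # start from the uniform vector
for _ in range(200):
    new = [sum(H[x][y] * beta[y] for y in range(n)) for x in range(n)]
    lam = sum(new)               # l1 norm of H*beta estimates the eigenvalue
    beta = [b / lam for b in new]

print(lam, beta)
```

Since H has non-negative entries and the chain is ergodic, the iteration converges to the unique positive eigenpair, which is what the average-cost KL control solution requires.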
Todorov 2006
-
KL-learning
Goal: find the Perron-Frobenius solution Hβ = λβ, with H(x, y) = q(y|x) exp(−V(y)), while stepping through state space according to q and observing the incurred cost.

Algorithm (KL-learning):

Initialize β_0 randomly and λ_0 = ∑_x β_0(x). Initialize x_0 randomly.

For t = 1, 2, ... do

x_t ∼ q(·|x_{t−1})

Δ = exp(−V(x_t)) β_{t−1}(x_t) / λ_{t−1} − β_{t−1}(x_{t−1})

β_t(x_{t−1}) = β_{t−1}(x_{t−1}) + ηΔ

λ_t = λ_{t−1} + ηΔ

This is a generalization of z-learning (Todorov) to λ ≠ 1.
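A sketch of this algorithm on a small illustrative example (three-state ring q and costs V are my choices). The update matches the Δ above; with a constant learning rate the estimate fluctuates around the eigenpair:

```python
import math, random

# KL-learning sketch: estimate the Perron-Frobenius pair (beta, lam) of
# H(x, y) = q(y|x) exp(-V(y)) online, while stepping through the state
# space according to q.
rng = random.Random(0)
n = 3
V = [0.0, 1.0, 0.5]
q = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]

beta = [1.0] * n                 # beta_0
lam = sum(beta)                  # lam_0 = sum_x beta_0(x)
eta = 0.01                       # learning rate
x = rng.randrange(n)             # x_0
for _ in range(50_000):
    y = rng.choices(range(n), weights=q[x])[0]    # x_t ~ q(.|x_{t-1})
    delta = math.exp(-V[y]) * beta[y] / lam - beta[x]
    beta[x] += eta * delta
    lam += eta * delta
    x = y

print(lam, beta)   # lam fluctuates around the Perron-Frobenius eigenvalue
```

Note that β(x_{t−1}) and λ receive the same increment, so the normalization λ = ∑_x β(x) is preserved exactly along the run.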
Bierkens, Kappen 2012
-
Planning of goal directed behaviour
Effective navigation requires planning to goal locations that have been previously visited.

The hippocampus has long been associated with navigation. Hippocampal place cells fire selectively when an animal occupies a restricted location in an environment.

Four well-trained rats performed a spatial memory task in a 2 × 2 meter open area, with up to 250 hippocampal place cells recorded. Two phases: forage to obtain reward in an unknown location; obtain reward in a predictable reward location. Neural activity during many candidate events revealed temporally compressed, two-dimensional trajectories across the environment.
Pfeiffer and Foster 2013
-
Observations and assumptions
• the place cell activity moves from the current location to the goal location, sequentially activating intermediate place cells.
• can be understood as a type of gradient flow in a potential field
• the potential field is shaped around the food locations, which change each day
-
Thinking rats
Pfeiffer and Foster 2013
-
Finite state model
A two-dimensional grid of hippocampal place cells as a finite state model.
Each state x corresponds to one place cell firing and all other place cells silent.
We assume a grid world with one-to-one pre-learned correspondence between theplace cells and the grid locations.
Four food locations
-
Attractor dynamics
(figure: two panels as a function of the learning steps t — left: minimal distance to target; right: path length to minimal distance)

Quality of the controlled dynamics as a function of the learning steps t. Left: average minimal distance to one of the food locations, for trajectories of length 50 starting at (8,8). Right: average corresponding path length.
-
Changing locations
run file7 movie.m
-
Discussion
KL control as a simple alternative for RL:
- only a single eigenvalue computation
- actor-critic or policy iteration requires multiple policy evaluations
- Q-learning requires a state and action representation

KL learning:
- model-free learning
- model-based thinking

Accelerations:
- learn a representation of the uncontrolled dynamics while exploring
- update β in parallel for all states, not only the state that is visited

Neural issues:
- neural 'blob' (Amari, Kohonen) for place cell activity
- topological map learning for place fields (Kohonen)
- β(x) (and λ) as an extra layer of neurons or thresholds
-
Conclusion
Path integral control problems are inference problems:
- decision making by sampling
- phase transitions
- efficient computational methods

(figure: left, cost difference vs noise; right, CPU time vs number of agents)

Theory for sensori-motor integration:
- learning (motor babbling) approach for robotics
- hippocampal model for learning goal directed behavior
www.snn.ru.nl/~bertk