TRANSCRIPT
-
Stochastic optimal control theory with applications in neuroscience

Bert Kappen
SNN Donders Institute, Radboud University, Nijmegen
Gatsby Unit, UCL London
August 26, 2013
-
How to control a device?
Plant is unknown
Exploration of state space
Motor babbling in infants
Problem for brains and for robots
Bert Kappen Nijmegen Summerschool 1/43
-
How to find your way home?
How to navigate to previously visited locations?
-
Intractability due to uncertainty
Noise affects the optimal control qualitatively.

Optimal control computation is only tractable for simple cases:
- deterministic problems, using the PMP approach
- LQ problems
-
The big idea
Linear Bellman equation and path integral
Express a control computation as an inference computation

Approximate inference
Intractable inference problems can be made efficient using statistical physics methods
-
Outline
• Link between control theory, inference and statistical physics
– Hopf ’50, Fleming Mitter ’82, Kappen ’05
• How to control a device?
– Motor babbling as importance sampling
• How to find your way home?
– KL control theory
– Efficient alternative for RL
– Model of hippocampus
– Computation by simulation
-
Discrete time optimal control
Consider the control of a discrete time deterministic dynamical system:
x_{t+1} = x_t + f(x_t, u_t),   t = 0, 1, ..., T−1

x_t describes the state and u_t specifies the control or action at time t.

Given x_0 and u_{0:T−1}, we can compute x_{1:T}.

Define a cost for each sequence of controls:

C(x_0, u_{0:T−1}) = ∑_{t=0}^{T−1} R(x_t, u_t)

Find the sequence u_{0:T−1} that minimizes C(x_0, u_{0:T−1}).
-
Dynamic programming
Find the minimal cost path from A to J.
C(J) = 0, C(H) = 3, C(I) = 4
C(F ) = min(6 + C(H), 3 + C(I)) = 7
The minimal cost at time t is easily expressible in terms of the minimal cost at time t+1.
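The backward recursion can be sketched in code. The slide's full graph is not reproduced here, so the edges below form a minimal hypothetical stage graph consistent with the stated values (H → J costs 3, I → J costs 4, and F has the two options with costs 6 and 3):

```python
# Backward dynamic programming on a directed stage graph.
# cost_to_go[x] = min over successors y of (edge_cost(x, y) + cost_to_go[y])
edges = {
    "F": {"H": 6, "I": 3},
    "H": {"J": 3},
    "I": {"J": 4},
}

def cost_to_go(node, goal="J"):
    if node == goal:
        return 0
    # Bellman recursion: minimal cost over all allowed moves
    return min(c + cost_to_go(y) for y, c in edges[node].items())

print(cost_to_go("H"))  # 3
print(cost_to_go("I"))  # 4
print(cost_to_go("F"))  # min(6+3, 3+4) = 7
```

The recursion visits each node only once per path; for larger graphs one would memoize the cost-to-go table instead of recomputing it.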
-
Discrete time optimal control
Dynamic programming uses the concept of the optimal cost-to-go J(t, x).

One can recursively compute J(t, x) from J(t+1, x) for all x in the following way:

J(t, x_t) = min_{u_{t:T−1}} ∑_{s=t}^{T−1} R(x_s, u_s) = min_{u_t} ( R(x_t, u_t) + J(t+1, x_t + f(x_t, u_t)) )

J(T, x) = 0

J(0, x) = min_{u_{0:T−1}} C(x, u_{0:T−1})

This is called the Bellman equation. It computes the optimal control u_t(x) for all intermediate t, x.
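A minimal tabular sketch of this recursion, on an illustrative integer state space with dynamics x_{t+1} = x + u and cost R(x, u) = x² + u² (these choices are mine, not from the slides):

```python
# Finite-horizon Bellman recursion J(t, x) on a small integer state space.
# Dynamics: x_{t+1} = x + u with u in {-1, 0, +1}; cost R(x, u) = x**2 + u**2.
T = 5
states = range(-5, 6)
controls = [-1, 0, 1]

J = {(T, x): 0 for x in states}          # boundary condition J(T, x) = 0
policy = {}
for t in reversed(range(T)):
    for x in states:
        best = None
        for u in controls:
            xn = max(-5, min(5, x + u))  # keep the next state on the grid
            c = x**2 + u**2 + J[(t + 1, xn)]
            if best is None or c < best:
                best, policy[(t, x)] = c, u
        J[(t, x)] = best

print(J[(0, 3)], policy[(0, 3)])  # optimal cost-to-go and first action from x = 3
```

As on the slide, the sweep runs backward from t = T and yields the optimal control u_t(x) for every intermediate t and x, not just for the initial state.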
-
Stochastic optimal control
Consider a stochastic dynamical system

dx_i = f_i(x, u) dt + dξ_i,   ⟨dξ_i dξ_j⟩ = ν_{ij} dt

Given x(0), find the control sequence u(0 → T) that minimizes the expected future cost

C = ⟨ φ(x(T)) + ∫_0^T dt R(x(t), u(t)) ⟩

The expectation is over all trajectories given the control path.

J(t, x) = min_u ( R(x, u) dt + ⟨J(t+dt, x+dx)⟩ )

−∂_t J(t, x) = min_u ( R(x, u) + f(x, u)^T ∇_x J(t, x) + (1/2) Tr(ν ∇_x^2 J(t, x)) )

with boundary condition J(x, T) = φ(x). This is the HJB equation.
-
Path integral control theory
dx = f(x, t) dt + g(x, t)(u dt + dξ)

C = ⟨ φ(x(T)) + ∫_t^T ds ( V(x(s), s) + (1/2) u(s)^T R u(s) ) ⟩

with ⟨dξ_a dξ_b⟩ = ν_{ab} dt and R = λν^{−1}, λ > 0.

The HJB equation becomes

−∂_t J = min_u ( (1/2) u^T R u + V + (f + gu)^T ∇J + (1/2) Tr(g ν g^T ∇^2 J) )

with boundary condition J(x, T) = φ(x).
-
Path integral control theory
Minimization wrt u yields the non-linear HJB:

u = −R^{−1} g^T ∇J

−∂_t J = −(1/2) (∇J)^T g R^{−1} g^T (∇J) + V + f^T ∇J + (1/2) Tr(g ν g^T ∇^2 J)

Define ψ(x, t) through J(x, t) = −λ log ψ(x, t). We obtain a linear HJB:

∂_t ψ = ( V/λ − f^T ∇ − (1/2) Tr(g ν g^T ∇^2) ) ψ
-
Feynman-Kac formula
Denote by Q(τ|x, t) the distribution over uncontrolled trajectories that start at x, t:

dx = f(x, t) dt + g(x, t) dξ

with τ a trajectory x(t → T). Then

ψ(x, t) = ∫ dQ(τ|x, t) exp( −S(τ)/λ ) = E_Q( e^{−S/λ} )

S(τ) = φ(x(T)) + ∫_t^T ds V(x(s), s)

ψ can be computed by forward sampling of the uncontrolled process.
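This forward sampling can be sketched for an illustrative one-dimensional case (f = 0, g = 1, V = 0, end cost φ(x) = x²; none of these choices come from the slides), for which a closed-form value of ψ is available for comparison:

```python
import math, random

# Monte Carlo estimate of psi(x, t) = E_Q[exp(-S/lambda)] by forward
# sampling of the uncontrolled process.  Illustrative 1-d case:
# f = 0, g = 1, V = 0, end cost phi(x) = x**2, so S(tau) = x(T)**2.
def psi_mc(x, t, T=1.0, lam=1.0, nu=0.5, n_steps=20, n_samples=20_000, seed=0):
    rng = random.Random(seed)
    dt = (T - t) / n_steps
    total = 0.0
    for _ in range(n_samples):
        xs = x
        for _ in range(n_steps):              # uncontrolled dynamics dx = dxi
            xs += rng.gauss(0.0, math.sqrt(nu * dt))
        total += math.exp(-xs ** 2 / lam)     # exp(-S/lambda)
    return total / n_samples

est = psi_mc(0.0, 0.0)
# Gaussian integral: for x(T) ~ N(0, sigma^2), sigma^2 = nu*(T-t),
# E[exp(-x(T)**2/lam)] = 1/sqrt(1 + 2*sigma^2/lam)
exact = 1.0 / math.sqrt(1.0 + 2 * 0.5 / 1.0)
print(est, exact)
```

For this Gaussian case x(T) could be sampled in a single step; the explicit time stepping is kept to mirror the general uncontrolled diffusion.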
-
Posterior distribution over optimal trajectories
ψ(x, t) can be interpreted as a partition sum for the distribution over paths under optimal control:

P(τ|x, t) = (1/ψ(x, t)) Q(τ|x, t) exp( −S(τ)/λ )

The optimal cost-to-go is a free energy:

J(x, t) = −λ log E_Q( e^{−S/λ} )

The optimal control is an expectation wrt P:

u(x, t) dt = E_P(dξ) = E_Q( dξ e^{−S/λ} ) / E_Q( e^{−S/λ} )
-
Recap
Control problem:

dx = f dt + g(u dt + dξ),   C = ⟨ φ + ∫_t^T ( V + (1/2) u^T R u ) ⟩,   R = λν^{−1}

The HJB is linear:

∂_t ψ = Hψ,   J = −λ log ψ

The solution is given by the Feynman-Kac formula: ψ = E_Q(e^{−S/λ}), with Q the distribution over the uncontrolled dynamics (u = 0).

The optimal control is an expectation value:

u dt = E_Q( dξ e^{−S/λ} ) / E_Q( e^{−S/λ} )
-
Motor babbling: estimate the optimal control by importance sampling

Initialize û = 0.

Iterate:

• Generate samples from Q′(τ) using the random control û dt + dξ, ν = λR^{−1}:

dx = f dt + g(û dt + dξ)

(block diagram: the plant maps x_t, u_t to x_{t+dt})

This can be computed using a simulator, without knowledge of f, g.

• Update the control:

u dt = û dt + E_{Q′}( dξ e^{−S′/λ} ) / E_{Q′}( e^{−S′/λ} )

This converges to the optimal stochastic control solution.
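A minimal sketch of this iteration for an illustrative one-dimensional problem (f = 0, g = 1, V = 0, end cost φ(x) = x², R = 1 so ν = λ; all of these choices are mine, not from the slides). The weight e^{−S′/λ} includes the Girsanov correction for sampling with the control û:

```python
import math, random

# Motor-babbling importance-sampling sketch.  u_hat is an open-loop
# control path that is re-estimated at every iteration from reweighted
# rollouts of the sampling control u_hat*dt + dxi.
def motor_babbling(x0=1.0, T=1.0, lam=1.0, n_steps=5, n_samples=5000,
                   n_iter=20, seed=0):
    rng = random.Random(seed)
    dt = T / n_steps
    nu = lam                          # nu = lam * R^{-1} with R = 1
    u_hat = [0.0] * n_steps
    for _ in range(n_iter):
        num = [0.0] * n_steps         # accumulates dxi * exp(-S'/lam)
        den = 0.0                     # accumulates exp(-S'/lam)
        for _ in range(n_samples):
            x, s_u, dxis = x0, 0.0, []
            for k in range(n_steps):
                dxi = rng.gauss(0.0, math.sqrt(nu * dt))
                dxis.append(dxi)
                # Girsanov correction for sampling with control u_hat
                s_u += 0.5 * u_hat[k] ** 2 * dt + u_hat[k] * dxi
                x += u_hat[k] * dt + dxi
            w = math.exp(-(x ** 2 + s_u) / lam)   # S' = phi(x(T)) + correction
            den += w
            for k in range(n_steps):
                num[k] += dxis[k] * w
        # u dt = u_hat dt + E_{Q'}[dxi e^{-S'/lam}] / E_{Q'}[e^{-S'/lam}]
        u_hat = [u_hat[k] + num[k] / (den * dt) for k in range(n_steps)]
    return u_hat

u = motor_babbling()
print(u[0])   # close to the LQ value -2*x0/3 for these parameters
```

Because the reweighting is unbiased, each iteration produces a fresh estimate of the optimal control; iterating with the improved û concentrates the samples where the weights are large, which is the importance-sampling benefit.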
-
Acrobot
Joint angles q_1, q_2:

d_{11}(q) q̈_1 + d_{12}(q) q̈_2 + h_1(q, q̇) + φ_1(q) = 0
d_{21}(q) q̈_1 + d_{22}(q) q̈_2 + h_2(q, q̇) + φ_2(q) = u

We can write these equations in standard form

dx_i = f_i(x) dt + g_i(x) u dt

with x_1 = q_1, x_2 = q_2, x_3 = q̇_1, x_4 = q̇_2.
-
Acrobot
(figure: four panels over the iterations — final height per trajectory, the state s, the costs J and φ, and the control increment (mean and std))

100 iterations; at each iteration 50 stochastic trajectories were generated, and the noise was lowered at each iteration. Top left: final height for each stochastic trajectory at each iteration (red) and for each deterministic solution (blue).
-
Acrobot
(movie92.mp4)
Result after 100 trials
-
Darmstadt simulator: Beer pong
(beer pong video) (beer pong video)
Left: a PID controller provides a trajectory-based solution for one particular target location. Right: we demonstrate that PI feedback control can adapt to a changing target location and/or noise.
-
Application in robotics
(ICREA2011.mp4)
(Theodorou et al. 2010)
-
KL control theory
x denotes the state of the agent and x_{1:T} is a path through state space from time t = 1 to T.

q(x_{1:T}|x_0) denotes a probability distribution over possible future trajectories given that the agent at time t = 0 is in state x_0, with

q(x_{1:T}|x_0) = ∏_{t=0}^{T−1} q(x_{t+1}|x_t)

q(x_{t+1}|x_t) implements the allowed moves.

V(x_{1:T}) = ∑_{t=1}^{T} V(x_t) is the total cost when following path x_{1:T}.

The KL control problem is to find the probability distribution p(x_{1:T}|x_0) that minimizes

C(p|x_0) = ∑_{x_{1:T}} p(x_{1:T}|x_0) ( log[ p(x_{1:T}|x_0) / q(x_{1:T}|x_0) ] + V(x_{1:T}) ) = KL(p||q) + ⟨V⟩_p
-
KL control theory
p(x_{1:T}|x_0) and q(x_{1:T}|x_0) are distributions over trajectories.

Given q, find p that minimizes

C(p|x_0) = KL(p||q) + ⟨V⟩_p

The solution and the optimal control cost are

p(x_{1:T}|x_0) = (1/Z(x_0)) q(x_{1:T}|x_0) exp( −V(x_{1:T}) )

C = −log Z(x_0)

Z(x_0) = ∑_{x_{1:T}} q(x_{1:T}|x_0) exp( −V(x_{1:T}) )

NB: Z(x_0) is an integral over paths.
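For a small discrete problem, Z(x_0), the optimal cost and the optimal path distribution can be computed by brute-force enumeration. The two-state chain below (uniform q, state costs V) is an illustrative choice:

```python
import itertools, math

# KL control on a tiny 2-state example: enumerate all paths x_{1:T}
# and compute Z(x0), the optimal cost C and the optimal distribution p.
q = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}   # uncontrolled dynamics
V = {0: 0.0, 1: 1.0}                              # state costs
T, x0 = 3, 0

Z = 0.0
p = {}
for path in itertools.product([0, 1], repeat=T):
    qp = 1.0
    prev = x0
    for x in path:                    # q(x_{1:T}|x0) = prod_t q(x_{t+1}|x_t)
        qp *= q[prev][x]
        prev = x
    w = qp * math.exp(-sum(V[x] for x in path))
    p[path] = w
    Z += w
p = {path: w / Z for path, w in p.items()}        # optimal p(x_{1:T}|x0)
C = -math.log(Z)                                  # optimal control cost

print(C, p[(0, 0, 0)])
```

Enumeration is only feasible for toy problems; the backward messages on the next slide compute the same marginals without summing over all paths.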
-
KL control theory
The optimal control at time t = 0 is given by
p(x_1|x_0) = ∑_{x_{2:T}} p(x_{1:T}|x_0) ∝ q(x_1|x_0) exp(−V(x_1)) β_1(x_1)

with β_t(x) the backward messages on the chain x_0 → x_1 → ... → x_T:

β_T(x_T) = 1

β_{t−1}(x_{t−1}) = ∑_{x_t} q(x_t|x_{t−1}) exp(−V(x_t)) β_t(x_t)
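The backward recursion can be sketched directly; the two-state transition matrix q and the costs V below are illustrative:

```python
import math

# Backward messages beta_t(x) for a two-state KL control problem.
q = [[0.8, 0.2], [0.5, 0.5]]     # q[x][y] = q(y|x), the allowed moves
V = [0.0, 1.0]                   # state costs
T = 3

beta = [1.0, 1.0]                # beta_T(x) = 1
for t in range(T, 1, -1):        # compute beta_{T-1}, ..., beta_1
    beta = [sum(q[x][y] * math.exp(-V[y]) * beta[y] for y in (0, 1))
            for x in (0, 1)]

# optimal first step: p(x1|x0) ~ q(x1|x0) exp(-V(x1)) beta_1(x1)
x0 = 0
unnorm = [q[x0][y] * math.exp(-V[y]) * beta[y] for y in (0, 1)]
p1 = [w / sum(unnorm) for w in unnorm]
print(p1)   # the cheap state 0 dominates
```

The cost of one backward sweep is O(T · n²) for n states, in contrast to the exponential cost of path enumeration.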
-
Link to continuous path integral formulation
The previous continuous path integral control can be obtained as a special case of the KL control formulation.

dx = f(x, t) dt + g(x, t)(u dt + dξ),   ⟨dξ^2⟩ = ν dt

p(x_{t+dt}|x_t, u_t) = N(x_{t+dt} | x_t + f(x_t, t) dt + g(x_t, t) u_t dt, Ξ(x_t, t))

q(x_{t+dt}|x_t) = N(x_{t+dt} | x_t + f(x_t, t) dt, Ξ(x_t, t))

C(p|x_0) = KL(p||q) + ⟨V⟩ = ∑_{x_{dt:T}} p(x_{dt:T}|x_0) ∑_{t=dt}^{T} ( (1/2) u_t^T ν^{−1} u_t + V(x_t) )
-
Average cost KL control
When T → ∞ and q is ergodic, the backward message recursion

β_{t−1}(x_{t−1}) = ∑_{x_t} H(x_{t−1}, x_t) β_t(x_t),   H(x, y) = q(y|x) exp(−V(y))

becomes the computation of the Perron-Frobenius eigenpair (β(·), λ):

Hβ = λβ

The optimal control satisfies

p(y|x) = q(y|x) exp(−V(y)) β(y) / (λ β(x))

C(x_0) = −log β(x_0) − T log λ
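The eigenpair can be computed by power iteration on H, using the definition H(x, y) = q(y|x) exp(−V(y)) from the backward recursion. The three-state ring q and the costs V below are illustrative choices:

```python
import math

# Power iteration for the Perron-Frobenius eigenpair (beta, lam) of
# H(x, y) = q(y|x) exp(-V(y)).
n = 3
V = [0.0, 1.0, 0.5]
q = [[0.0, 0.5, 0.5],            # uncontrolled random walk on a ring
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
H = [[q[x][y] * math.exp(-V[y]) for y in range(n)] for x in range(n)]

beta = [1.0 / n] * n             # start from the uniform vector
for _ in range(200):
    new = [sum(H[x][y] * beta[y] for y in range(n)) for x in range(n)]
    lam = sum(new)               # l1 norm of H*beta estimates the eigenvalue
    beta = [b / lam for b in new]

print(lam, beta)
```

Since H has non-negative entries and the chain is ergodic, the iteration converges to the unique positive eigenpair, which is what the average-cost KL control solution requires.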
Todorov 2006
-
KL-learning
Goal: find the Perron-Frobenius solution Hβ = λβ, with H(x, y) = q(y|x) exp(−V(y)), while stepping through state space according to q and observing the incurred cost.

Algorithm (KL-learning):

Initialize β_0 randomly and λ_0 = ∑_x β_0(x). Initialize x_0 randomly.

For t = 1, 2, ... do

x_t ∼ q(·|x_{t−1})

Δ = exp(−V(x_t)) β_{t−1}(x_t) / λ_{t−1} − β_{t−1}(x_{t−1})

β_t(x_{t−1}) = β_{t−1}(x_{t−1}) + ηΔ

λ_t = λ_{t−1} + ηΔ

This is a generalization of z-learning (Todorov) to λ ≠ 1.
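A sketch of this algorithm on a small illustrative example (three-state ring q and costs V are my choices). The update matches the Δ above; with a constant learning rate the estimate fluctuates around the eigenpair:

```python
import math, random

# KL-learning sketch: estimate the Perron-Frobenius pair (beta, lam) of
# H(x, y) = q(y|x) exp(-V(y)) online, while stepping through the state
# space according to q.
rng = random.Random(0)
n = 3
V = [0.0, 1.0, 0.5]
q = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]

beta = [1.0] * n                 # beta_0
lam = sum(beta)                  # lam_0 = sum_x beta_0(x)
eta = 0.01                       # learning rate
x = rng.randrange(n)             # x_0
for _ in range(50_000):
    y = rng.choices(range(n), weights=q[x])[0]    # x_t ~ q(.|x_{t-1})
    delta = math.exp(-V[y]) * beta[y] / lam - beta[x]
    beta[x] += eta * delta
    lam += eta * delta
    x = y

print(lam, beta)   # lam fluctuates around the Perron-Frobenius eigenvalue
```

Note that β(x_{t−1}) and λ receive the same increment, so the normalization λ = ∑_x β(x) is preserved exactly along the run.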
Bierkens, Kappen 2012
-
Planning of goal directed behaviour
Effective navigation requires planning to goal locations that have been previously visited.

The hippocampus has long been associated with navigation. Hippocampal place cells fire selectively when an animal occupies a restricted location in an environment.

Four well-trained rats performed a spatial memory task in a 2 × 2 meter open area, with up to 250 hippocampal place cells recorded. Two phases: forage to obtain reward in an unknown location; obtain reward in a predictable reward location. Neural activity during many candidate events revealed temporally compressed, two-dimensional trajectories across the environment.
Pfeiffer and Foster 2013
-
Observations and assumptions
• the place cell activity moves from the current location to the goal location, sequentially activating intermediate place cells.
• can be understood as a type of gradient flow in a potential field
• the potential field is shaped around the food locations, which change each day
-
Thinking rats
Pfeiffer and Foster 2013
-
Finite state model
A two-dimensional grid of hippocampal place cells as a finite state model.
Each state x corresponds to one place cell firing and all other place cells silent.
We assume a grid world with one-to-one pre-learned correspondence between theplace cells and the grid locations.
Four food locations
-
Attractor dynamics
(figure: two panels as a function of the learning steps t — left: minimal distance to target; right: path length to minimal distance)

Quality of the controlled dynamics as a function of the learning steps t. Left: average minimal distance to one of the food locations, for trajectories of length 50 starting at (8,8). Right: average corresponding path length.
-
Changing locations
run file7 movie.m
-
Discussion
KL control as a simple alternative for RL:
- only a single eigenvalue computation
- actor-critic or policy iteration requires multiple policy evaluations
- Q-learning requires a state and action representation

KL learning:
- model-free learning
- model-based thinking

Accelerations:
- learn a representation of the uncontrolled dynamics while exploring
- update β in parallel for all states, not only the state that is visited

Neural issues:
- neural 'blob' (Amari, Kohonen) for place cell activity
- topological map learning for place fields (Kohonen)
- β(x) (and λ) as an extra layer of neurons or thresholds
-
Conclusion
Path integral control problems are inference problems:
- decision making by sampling
- phase transitions
- efficient computational methods

(figure: left, cost difference vs noise; right, CPU time vs number of agents)

Theory for sensori-motor integration:
- learning (motor babbling) approach for robotics
- hippocampal model for learning goal directed behavior
www.snn.ru.nl/~bertk