
Page 1:

iLQR

Deep Reinforcement Learning and Control

Katerina Fragkiadaki

Carnegie Mellon School of Computer Science

Page 2:

Optimal Control (Open Loop)

• The optimal control problem:

\min_{x,u} \sum_{t=0}^{T} c_t(x_t, u_t)

s.t. \; x_0 = \bar{x}_0, \quad x_{t+1} = f(x_t, u_t), \quad t = 0, \ldots, T-1

Page 3:

Optimal Control (Open Loop)

• The optimal control problem:

\min_{x,u} \sum_{t=0}^{T} c_t(x_t, u_t)

s.t. \; x_0 = \bar{x}_0, \quad x_{t+1} = f(x_t, u_t), \quad t = 0, \ldots, T-1

• Solution: a sequence of controls u and the resulting state sequence x

• In general a non-convex optimization problem; can be solved with sequential convex programming (SCP): https://stanford.edu/class/ee364b/lectures/seq_slides.pdf

Page 4:

Optimal Control (Closed Loop a.k.a. MPC)

Closed-loop optimal control = "Model Predictive Control"

Given: \bar{x}_0

For t = 0, 1, 2, \ldots, T:

• Solve

\min_{x,u} \sum_{k=t}^{T} c_k(x_k, u_k)

s.t. \; x_{k+1} = f(x_k, u_k), \quad \forall k \in \{t, t+1, \ldots, T-1\}, \quad x_t = \bar{x}_t

• Execute u_t

• Observe the resulting state x_{t+1}

• Initialize with the solution from t-1 to solve fast at time t

Page 5:

Shooting methods vs collocation methods

Collocation method: optimize over actions and states, with constraints

\min_{u_1, \ldots, u_T, x_1, \ldots, x_T} \sum_{t=1}^{T} c(x_t, u_t) \quad \text{s.t.} \; x_t = f(x_{t-1}, u_{t-1})

Diagram: Sergey Levine

Page 6:

Shooting methods vs collocation methods

Shooting method: optimize over actions only

\min_{u_1, \ldots, u_T} c(x_1, u_1) + c(f(x_1, u_1), u_2) + \cdots + c(f(f(\ldots)\ldots), u_T)

Diagram: Sergey Levine

• Indeed, the states x are not necessary, since every control sequence u results (following the dynamics) in a state sequence x, for which the cost can in turn be computed.

• Not clear how to initialize in a way that nudges towards a goal state.

Page 7:

Bellman’s Curse of Dimensionality


• n-dimensional state space

• Number of states grows exponentially in n (for fixed number of discretization levels per coordinate)

• In practice

• Discretization is considered computationally feasible only up to 5- or 6-dimensional state spaces, even when using:

• Variable resolution discretization

• Highly optimized implementations

Page 8:

Linear case: LQR

• Very special case: optimal control for linear dynamical systems and quadratic cost (a.k.a. the LQ setting)

• Can solve the continuous state-space optimal control problem exactly

• Running time: O(T n^3)

Page 9:

Linear dynamics: Newtonian dynamics

x_{t+1} = x_t + \Delta t \, \dot{x}_t + \Delta t^2 F_x
\dot{x}_{t+1} = \dot{x}_t + \Delta t \, F_x
y_{t+1} = y_t + \Delta t \, \dot{y}_t + \Delta t^2 F_y
\dot{y}_{t+1} = \dot{y}_t + \Delta t \, F_y
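These point-mass updates are linear in state and control, so they fit the x_{t+1} = A x_t + B u_t template directly. A minimal sketch, assuming a unit-mass point so the force (F_x, F_y) is the control input; the time step value is an assumption:

import numpy as np

dt = 0.1  # integration time step (an assumption; the slides leave it symbolic)

# State [x, y, xdot, ydot], control [Fx, Fy] on a unit-mass point:
# the Newtonian updates above are exactly x_{t+1} = A x_t + B u_t.
A = np.array([[1., 0., dt, 0.],
              [0., 1., 0., dt],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])
B = np.array([[dt**2, 0.],
              [0., dt**2],
              [dt, 0.],
              [0., dt]])

x = np.zeros(4)            # at rest at the origin
u = np.array([1.0, 0.0])   # constant force along x
for _ in range(10):
    x = A @ x + B @ u      # roll the linear dynamics forward
print(x)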

Page 10:

What is the state x?

In most robotic tasks, the state is hand-engineered and includes:

• positions and velocities of the robotic joints

• position and velocity of the object being manipulated

Both are usually known: the robot knows its own state, and we perceive the state of the objects in the world. In tasks where we do not even want to bother with object state, we just concatenate the robotic state across multiple time steps to implicitly infer the interaction (e.g., collision with the object).

Page 11:

What is the cost c(x_t, u_t)?

c(x_t, u_t) = \| x_t - x^* \| + \lambda \| u_t \|

where x^* is the target state.

• In the final time step, you can add a term with higher weight (final cost):

c(x_T, u_T) = 2 \left( \| x_T - x^* \| + \lambda \| u_T \| \right)

• For object manipulation, x^* includes not only the desired pose of the end effector but also the desired pose of the objects.
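As a sketch, the running and final costs above translate to a couple of lines; the value of the control penalty weight is an assumption:

import numpy as np

lam = 0.01  # control penalty weight (an assumption; the slides call it lambda)

def cost(x, u, x_star):
    """Running cost ||x - x*|| + lambda ||u||, as on the slide."""
    return np.linalg.norm(x - x_star) + lam * np.linalg.norm(u)

def final_cost(x, u, x_star):
    """Final-step cost with higher weight (factor 2 from the slide)."""
    return 2.0 * cost(x, u, x_star)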

Page 12:

Linear case: LQR

Definitions:

Q(x_t, u_t): optimal action-value function, the optimal cost-to-go at state x_t as a function of u_t, assuming we act optimally past step t

V(x_t): optimal state-value function, the optimal cost-to-go from state x_t

V(x_t) = \min_{u_t} Q(x_t, u_t)

x_0: the initial state, known and given

Page 13:

Principle of Optimality

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)

Page 14:

Linear case: LQR

Value iteration: backward propagation. Start from u_T and work backwards.

Page 15:

Linear case: LQR

Value iteration: backward propagation. Start from u_T and work backwards.

Cost matrices for the last time step:

Page 16:

Linear case: LQR

Cost matrices for the last time step:

Set the derivative to zero (since we have a quadratic) to find the minimizing u_T:


Page 18:

Linear case: LQR

Remember: V(x_t) = \min_{u_t} Q(x_t, u_t)

Substituting the minimizer u_T into Q(x_T, u_T) gives us V(x_T): the optimal cost-to-go as a function of the final state.


Page 23:

Linear case: LQR

We propagate the optimal value function backwards:

cost at T-1 + best cost-to-go

(Equation annotations on the slide mark the linear dynamics terms and the quadratic cost terms.)


Page 25:

Linear case: LQR

We propagate the optimal value function backwards, just as in value iteration for MDPs:

q^*(s, a) = r(s, a) + \gamma \sum_{s' \in S} T(s' | s, a) v^*(s')

Immediate cost at T-1 + best cost-to-go

Page 26:

Linear case: LQR

We propagate the optimal value function backwards:

Immediate cost + best cost-to-go

We can eliminate x_T by writing it only in terms of quantities at T-1!


Page 29:

Linear case: LQR

We can eliminate x_T by writing it only in terms of quantities at T-1:

We have written V(x_T) only in terms of x_{T-1}, u_{T-1}!


Page 31:

Linear case: LQR

We have written the optimal action-value function Q(x_{T-1}, u_{T-1}) only in terms of x_{T-1}, u_{T-1}!

Page 32:

Linear case: LQR

We have written the optimal action-value function Q(x_{T-1}, u_{T-1}) only in terms of x_{T-1}, u_{T-1}.

Let's take the derivative to find the minimizing u_{T-1}!


Page 34:

Linear case: LQR

Diagram: Sergey Levine

Backward recursion:
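The recursion itself appears only as a diagram in the original; for reference, a sketch of the standard form (in Sergey Levine's notation, with dynamics f(x_t, u_t) = F_t [x_t; u_t] + f_t and cost c(x_t, u_t) = \frac{1}{2} [x_t; u_t]^T C_t [x_t; u_t] + [x_t; u_t]^T c_t), running for t = T down to 1:

Q_t = C_t + F_t^T V_{t+1} F_t
q_t = c_t + F_t^T V_{t+1} f_t + F_t^T v_{t+1}
u_t = K_t x_t + k_t, \quad K_t = -Q_{u_t,u_t}^{-1} Q_{u_t,x_t}, \quad k_t = -Q_{u_t,u_t}^{-1} q_{u_t}
V_t = Q_{x_t,x_t} + Q_{x_t,u_t} K_t + K_t^T Q_{u_t,x_t} + K_t^T Q_{u_t,u_t} K_t
v_t = q_{x_t} + Q_{x_t,u_t} k_t + K_t^T q_{u_t} + K_t^T Q_{u_t,u_t} k_t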

Page 35:

Linear case: LQR

Backward recursion:

We know x_0!

Forward recursion:
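Both passes in one minimal numpy sketch, for the simplified time-invariant special case with dynamics x_{t+1} = A x_t + B u_t and cost \frac{1}{2}(x_t^T Q x_t + u_t^T R u_t); the function and variable names are illustrative, not from the slides:

import numpy as np

def lqr(A, B, Q, R, Qf, x0, T):
    """Finite-horizon LQR: backward Riccati recursion, then forward rollout."""
    P = Qf                      # value-function Hessian at the final step
    Ks = [None] * T
    # Backward recursion: propagate the optimal cost-to-go V_t(x) = x' P_t x
    for t in reversed(range(T)):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # u_t = K_t x_t
        P = Q + A.T @ P @ (A + B @ K)
        Ks[t] = K
    # Forward recursion: we know x_0, so roll the policy out through the dynamics
    xs, us = [x0], []
    for t in range(T):
        u = Ks[t] @ xs[-1]
        us.append(u)
        xs.append(A @ xs[-1] + B @ u)
    return np.array(xs), np.array(us)

# e.g., with the point-mass A, B from the Newtonian-dynamics sketch:
# xs, us = lqr(A, B, np.eye(4), 0.1 * np.eye(2), 10 * np.eye(4), np.ones(4), T=50)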

Page 36:

Nonlinear case: DDP/iterative LQR

Non-linear case: use iterative approximations!

First order Taylor expansion for the dynamics around a trajectory \hat{x}_t, \hat{u}_t, t = 1 \cdots T

Second order Taylor expansion for the cost around the same trajectory \hat{x}_t, \hat{u}_t, t = 1 \cdots T
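The expansions themselves were images in the original; writing \delta x_t = x_t - \hat{x}_t and \delta u_t = u_t - \hat{u}_t, they take the usual form:

f(x_t, u_t) \approx f(\hat{x}_t, \hat{u}_t) + \nabla_{x_t, u_t} f(\hat{x}_t, \hat{u}_t) \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}

c(x_t, u_t) \approx c(\hat{x}_t, \hat{u}_t) + \nabla_{x_t, u_t} c(\hat{x}_t, \hat{u}_t)^T \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix} + \frac{1}{2} \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}^T \nabla^2_{x_t, u_t} c(\hat{x}_t, \hat{u}_t) \begin{bmatrix} \delta x_t \\ \delta u_t \end{bmatrix}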


Page 39:

Nonlinear case: DDP/iterative LQR

Iterative LQR (i-LQR)

Initialization: Given x_0, pick a random control sequence \hat{u}_0 \ldots \hat{u}_T and obtain the corresponding state sequence \hat{x}_0 \ldots \hat{x}_T

u_t = \hat{u}_t + K_t (x_t - \hat{x}_t) + k_t \quad \forall t

Page 40:

Nonlinear case: DDP/iterative LQR

Iterative LQR (i-LQR)

Initialization: Given x_0, pick a random control sequence \hat{u}_0 \ldots \hat{u}_T and obtain the corresponding state sequence \hat{x}_0 \ldots \hat{x}_T

• Linear approximation around \hat{x}, \hat{u}

• Find \delta u_t, t = 1 \ldots T, so that \hat{u}_t + \delta u_t minimizes the linear approximation

• Go to the new reference \hat{x}' = \hat{x} + \delta x_t and \hat{u}' = \hat{u} + \delta u_t

u_t = \hat{u}_t + K_t (x_t - \hat{x}_t) + k_t \quad \forall t

Page 41:

Nonlinear case: DDP/iterative LQR

The quadratic approximation is invalid too far away from the reference trajectory. Solution?

Instead of finding the argmin, do a line search for \alpha:

Run the forward pass with the real nonlinear dynamics and u_t = \hat{u}_t + K_t (x_t - \hat{x}_t) + \alpha k_t
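A sketch of that forward pass with a simple line search over \alpha; here `f` is the true nonlinear dynamics, `cost` the running cost, and K, k come from the backward pass (the names and the candidate-\alpha schedule are assumptions):

import numpy as np

def forward_pass(f, cost, x_hat, u_hat, K, k, alphas=(1.0, 0.5, 0.25, 0.1)):
    """Roll out u_t = u_hat_t + K_t (x_t - x_hat_t) + alpha * k_t with the
    real nonlinear dynamics f; keep the best alpha (simple line search)."""
    T = len(u_hat)
    best = None
    for alpha in alphas:
        x = x_hat[0]
        xs, us, total = [x], [], 0.0
        for t in range(T):
            u = u_hat[t] + K[t] @ (x - x_hat[t]) + alpha * k[t]
            total += cost(x, u)
            x = f(x, u)              # true dynamics, not the linearization
            xs.append(x)
            us.append(u)
        if best is None or total < best[0]:
            best = (total, np.array(xs), np.array(us))
    return best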

Page 42:

Nonlinear case: DDP/iterative LQR

• So far we have been planning (e.g. 100 steps ahead) and then closing our eyes, hoping our modeling was accurate enough.

• At convergence of iLQR and DDP, we end up with a linearization around the (state, input) trajectory the algorithm converged to.

• In practice: the system may not be on this trajectory, due to perturbations / the initial state being off / the dynamics model being off / ...

• Can we handle such noise better?

Page 43:

Model Predictive Control

Case study: nonlinear model-predictive control

• Yes! If we close the loop: model predictive control!

• Solution: at time t, when asked to generate the control input u_t, we re-solve the control problem using iLQR or DDP over the time steps t through T

• Re-planning the entire trajectory is often impractical -> in practice: re-plan over a horizon H (receding horizon control); a sketch follows
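A minimal receding-horizon sketch; `plan` stands in for an iLQR/DDP solve over H steps, and both it and its warm-start argument are assumptions, not an API from the slides:

def mpc(x0, f, plan, T, H):
    """Receding-horizon control: re-plan over H steps, execute the first action."""
    x, warm_start = x0, None
    executed = []
    for t in range(T):
        u_seq = plan(x, horizon=H, init=warm_start)  # e.g. an iLQR solve
        u = u_seq[0]                # execute only the first control
        x = f(x, u)                 # observe the resulting state
        executed.append(u)
        warm_start = u_seq[1:]      # initialize the next solve with the shifted solution
    return executed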

Page 44:

i-LQR: When it works

Cost: \| x_t - x^* \|

(Figure: the direction for minimizing the cost points from the current state x_t toward the target x^*.)

Page 45:

i-LQR: When it doesn't work

Cost: \| x_t - x^* \|

Due to the discontinuities of contact, the local search fails! Solution? Initialize using a human demonstration instead of a random control sequence!

Learning Dexterous Manipulation Policies from Experience and Imitation, Kumar, Gupta, Todorov, Levine 2016

Page 46:

Local models

Time-varying linear dynamics

Reference trajectory \hat{x}_t, \hat{u}_t, t = 1, \ldots, T

Page 47:

Local models

Time-varying linear dynamics

Reference trajectory \hat{x}_t, \hat{u}_t, t = 1, \ldots, T

Learn time-varying linear dynamics A_t, B_t:

f(x_t, u_t) \approx A_t x_t + B_t u_t, \quad A_t = \frac{df}{dx_t}, \quad B_t = \frac{df}{du_t}

Page 48:

Local models

Time-varying linear dynamics

f(x_t, u_t) \approx A_t x_t + B_t u_t, \quad A_t = \frac{df}{dx_t}, \quad B_t = \frac{df}{du_t}

How do I get the data to fit my linear dynamics at each time step? We execute the controller u_t at state x_t to explore how the world works in the vicinity of the reference trajectory!

Page 49:

Discrete and continuous version

(The fitted dynamics can also include an affine offset D: x_{t+1} \approx A_t x_t + B_t u_t + D.)

Page 50:

Fitting Dynamics (1): Compute analytically the derivatives of the "true" nonlinear dynamics

• We may not have such analytic nonlinear dynamics equations available

• Very limiting under modeling errors

• Complicated derivations

Page 51:

Fitting Dynamics (2): Finite differences

We need 2 samples per state dimension; a sketch follows.
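The 2-samples-per-dimension count matches central differences, as in this minimal sketch (function and variable names are illustrative):

import numpy as np

def fd_jacobians(f, x, u, eps=1e-5):
    """Estimate A_t = df/dx and B_t = df/du at (x, u) by central differences."""
    n, m = len(x), len(u)
    A = np.zeros((n, n))
    B = np.zeros((n, m))
    for i in range(n):
        dx = np.zeros(n); dx[i] = eps
        A[:, i] = (f(x + dx, u) - f(x - dx, u)) / (2 * eps)  # 2 samples per state dim
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        B[:, j] = (f(x, u + du) - f(x, u - du)) / (2 * eps)  # and per control dim
    return A, B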

Page 52:

Fitting Dynamics (3): Linear regression

Use linear regression to fit A, B, D to samples: x_{t+1} \approx A_t x_t + B_t u_t + D

Use GMM priors as described in the lecture.
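A least-squares sketch: stack the transition samples (x_t, u_t, x_{t+1}) collected near the reference trajectory at a given time step and solve for [A B D] jointly (names are illustrative):

import numpy as np

def fit_dynamics(X, U, Xnext):
    """Fit x_{t+1} ~= A x_t + B u_t + D by least squares.
    X: (N, n) states, U: (N, m) controls, Xnext: (N, n) next states."""
    N, n = X.shape
    m = U.shape[1]
    Phi = np.hstack([X, U, np.ones((N, 1))])        # regressors [x, u, 1]
    W, *_ = np.linalg.lstsq(Phi, Xnext, rcond=None)  # (n+m+1, n) weight matrix
    A, B, D = W[:n].T, W[n:n + m].T, W[-1]
    return A, B, D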

Page 53:

Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics, Levine and Abbeel 2014

Bayesian linear dynamics fitting

Fit a global Gaussian mixture model using all samples (x_t, u_t, x_{t+1}) from all iterations and time steps -> prior.

Use the current samples (from this iteration) to obtain a Gaussian posterior over (x_t, u_t, x_{t+1}), which you condition to obtain p(x_{t+1} | x_t, u_t).

Such a prior results in 4 to 8 times fewer samples needed, despite the fact that it is not accurate enough by itself.

Posterior of the mean and covariance \mu, \Sigma, where \Phi, \mu_0 come from the empirical means and covariances and n_0, m parameterize an inverse-Wishart prior.
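The conditioning step uses the standard Gaussian formula: writing a = (x_t, u_t) and b = x_{t+1} for the fitted joint Gaussian,

p(b \mid a) = \mathcal{N}\left( \mu_b + \Sigma_{ba} \Sigma_{aa}^{-1} (a - \mu_a), \; \Sigma_{bb} - \Sigma_{ba} \Sigma_{aa}^{-1} \Sigma_{ab} \right)

so the conditional mean is linear in (x_t, u_t), i.e. exactly a time-varying linear-Gaussian dynamics model.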

Page 54:

One-Shot Learning of Manipulation Skills with Online Dynamics Adaptation and Neural Network Priors, Fu et al.

Fit a global model of the dynamics by fitting a neural network using all samples (x_t, u_t, x_{t+1}) from all iterations and time steps, and across multiple manipulation tasks -> multi-task learning.

Use model predictive control with iLQR to compute the policy at every time step.

The state is the robotic arm configuration, and the cost depends on a desired end-effector pose. No object is involved in the state.

Page 55:

Local models

Time-varying linear dynamics

We iteratively fit the dynamics and update the policy. Why is this iteration important? So that the region (the state, action distribution) in which our dynamics are estimated is similar to the one our policy visits (last lecture).

Page 56:

Local models

Fitting time-varying linear dynamics

• Can we further improve sample complexity? Right now each sample (x_t, u_t, x_{t+1}) contributes to only one linear model fit.

• Instead of plain linear regression, use Bayesian linear regression! A sketch follows.
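A sketch of the swap, assuming a Gaussian prior on the regression weights centered on a prior estimate W0 (e.g. derived from the GMM or a global model, so every sample informs each per-time-step fit through the prior); the prior strength `tau` and noise level are assumptions:

import numpy as np

def bayes_fit_dynamics(X, U, Xnext, W0, tau=1.0, noise=1e-2):
    """Bayesian linear regression for x_{t+1} ~= [x, u, 1] @ W.
    Returns the posterior mean of W under a Gaussian prior N(W0, tau^2 I)."""
    N, n = X.shape
    Phi = np.hstack([X, U, np.ones((N, 1))])   # regressors [x, u, 1]
    d = Phi.shape[1]
    # Posterior mean: (Phi'Phi/noise + I/tau^2)^{-1} (Phi'Xnext/noise + W0/tau^2)
    Lam = Phi.T @ Phi / noise + np.eye(d) / tau**2
    W = np.linalg.solve(Lam, Phi.T @ Xnext / noise + W0 / tau**2)
    return W  # split W as in fit_dynamics above to recover A, B, D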