TRANSCRIPT
Space-Indexed Dynamic Programming: Learning to
Follow Trajectories
J. Zico Kolter, Adam Coates, Andrew Y. Ng, Yi Gu, Charles DuHadway
Computer Science Department, Stanford University
July 2008, ICML
Outline
• Reinforcement Learning and Following Trajectories
• Space-indexed Dynamical Systems and Space-indexed Dynamic Programming
• Experimental Results
Reinforcement Learning and Following Trajectories
Trajectory Following
• Consider the task of following a trajectory in a vehicle such as a car or helicopter
• The state space is too large to discretize, so tabular RL / dynamic programming cannot be applied
Trajectory Following
• Dynamic programming algorithms with non-stationary policies seem well-suited to the task:
– Policy Search by Dynamic Programming (Bagnell et al.)
– Differential Dynamic Programming (Jacobson and Mayne)
Dynamic Programming
Divide the control task into discrete time steps.
[Figure: trajectory divided into time steps t=1, t=2, t=3, t=4, t=5, …]
Dynamic Programming
Proceeding backwards in time, learn policies for
t = T, T-1, …, 2, 1
[Figure: policies π5, π4, π3, π2, π1 learned in reverse order along the time steps t=1 … t=5]
Dynamic Programming
Key Advantage: Policies are local (each only needs to perform well over a small portion of the state space)
[Figure: local policies π1 … π5 attached to their time steps along the trajectory]
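A minimal Python sketch of this backward pass, in the spirit of PSDP (the helpers sample_states, rollout_cost, and fit_policy are hypothetical placeholders, not the paper's implementation):

```python
# A PSDP-style backward pass over time steps (sketch).
def backward_dp(T, actions, sample_states, rollout_cost, fit_policy):
    """Learn policies pi_T, ..., pi_1, proceeding backwards in time.

    Hypothetical helpers (supplied by the caller):
      sample_states(t)               -- draw states from a guess of the
                                        state distribution at time t
      rollout_cost(s, a, later_pis)  -- simulate taking action a in state s,
                                        then following the already-learned
                                        later policies; return total cost
      fit_policy(states, actions)    -- fit a classifier mapping states to
                                        the best actions found
    """
    policies = [None] * (T + 1)          # policies[t] covers time step t
    for t in range(T, 0, -1):            # t = T, T-1, ..., 1
        states = sample_states(t)
        best = [min(actions, key=lambda a: rollout_cost(s, a, policies[t + 1:]))
                for s in states]
        policies[t] = fit_policy(states, best)
    return policies[1:]                  # [pi_1, ..., pi_T]
```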
Problems with Dynamic Programming
Problem #1: Policies from traditional dynamic
programming algorithms are time-indexed
Problems with Dynamic Programming
Suppose we learned policy π5 assuming this distribution over states.
[Figure: assumed distribution over states at t = 5, with policy π5]
Problems with Dynamic Programming
But, due to the natural stochasticity of the environment, the car may actually be far outside that distribution at t = 5, and the resulting policy π5 will perform very poorly there.
[Figure: car far from the assumed distribution at t = 5, still executing π5]
Problems with Dynamic Programming
Partial Solution: Re-indexing. Execute the policy learned for the location closest to the current state, regardless of time.
[Figure: policies π1 … π5 along the trajectory; the nearest one is executed]
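A small NumPy sketch of the re-indexing heuristic (the ref_points array, holding one reference trajectory state per policy, is our assumed representation):

```python
import numpy as np

def reindexed_action(s, policies, ref_points):
    """Re-indexing heuristic: execute the policy learned for the
    trajectory point nearest the current state, ignoring the clock.

    ref_points[t] is the reference state on the trajectory for step t
    (same order as policies)."""
    t = int(np.argmin(np.linalg.norm(ref_points - s, axis=1)))
    return policies[t](s)
```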
Problems with Dynamic Programming
Problem #2: Uncertainty over future states makes it hard to
learn any good policy
Problems with Dynamic Programming
Due to stochasticity, there is large uncertainty over states in the distant future, and DP algorithms require learning a policy that performs well over this entire distribution.
[Figure: wide distribution over states at time t = 5]
Space-Indexed Dynamic Programming
• Basic idea of Space-Indexed Dynamic Programming (SIDP): perform DP with respect to space indices (planes tangent to the trajectory)
Space-Indexed Dynamical Systems and Dynamic Programming
Difficulty with SIDP
• No guarantee that taking a single action will move the vehicle to the next plane along the trajectory
• We therefore introduce the notion of a space-indexed dynamical system
Time-Indexed Dynamical System
• Creating time-indexed dynamical systems:
ṡ = f(s, u)
where s is the current state, u is the control action, and ṡ is the time derivative of the state
• Euler integration:
s_{t+Δt} = s_t + f(s_t, u_t) Δt
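In code, one Euler step of the time-indexed system is a one-liner (a generic sketch, not tied to any particular vehicle model):

```python
import numpy as np

def euler_step(f, s, u, dt):
    """One Euler step: s_{t+dt} = s_t + f(s_t, u_t) * dt."""
    return s + f(s, u) * dt

# Toy example: a double-integrator "car" with state [position, velocity]
# and acceleration as the control.
f = lambda s, u: np.array([s[1], u])
s_next = euler_step(f, np.array([0.0, 1.0]), u=0.5, dt=0.01)
```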
Space-Indexed Dynamical Systems
• Creating space-indexed dynamical systems:
• Simulate forward until the vehicle hits the next tangent plane (sketched below)
[Figure: vehicle between the tangent planes at space index d and space index d+1]
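One way to realize this forward simulation is to take small Euler steps until the state crosses the plane, detected by the sign of the plane equation (a hedged sketch; function and argument names are ours):

```python
import numpy as np

def simulate_to_plane(f, s, u, s_star_next, sdot_star_next, dt=1e-3, max_steps=10**6):
    """Euler-integrate the state forward until it crosses the tangent plane
    at index d+1 (the plane through s_star_next with normal sdot_star_next)."""
    for _ in range(max_steps):
        if sdot_star_next @ (s - s_star_next) >= 0:   # on or past the plane
            return s
        s = s + f(s, u) * dt                          # small Euler step
    raise RuntimeError("vehicle never reached the next plane (no forward progress)")
```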
Space-Indexed Dynamical Systems
• Creating space-indexed dynamical systems:
ṡ = f(s, u)
s_{d+1} = s_d + f(s_d, u_d) Δt(s_d, u_d)
where the time to reach the plane at index d+1 is
Δt(s, u) = (ṡ⋆_{d+1})ᵀ (s⋆_{d+1} − s) / ((ṡ⋆_{d+1})ᵀ ṡ)
with s⋆_{d+1} and ṡ⋆_{d+1} the desired trajectory's state and velocity at that plane. (A positive solution exists as long as the controller makes some forward progress.)
[Figure: one step from the plane at space index d to the plane at space index d+1]
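A direct NumPy transcription of this one-step update (argument names are ours; the assertion encodes the forward-progress condition):

```python
import numpy as np

def space_indexed_step(f, s, u, s_star_next, sdot_star_next):
    """One step of the space-indexed dynamics: jump exactly to the tangent
    plane at index d+1, which passes through s_star_next with normal
    sdot_star_next (the desired trajectory's velocity there)."""
    sdot = f(s, u)                        # time derivative of the state
    dt = (sdot_star_next @ (s_star_next - s)) / (sdot_star_next @ sdot)
    assert dt > 0, "controller must make some forward progress"
    return s + sdot * dt                  # s_{d+1}
```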
Space-Indexed Dynamical Systems
• The result is a dynamical system indexed by a spatial-index variable d rather than time:
s_{d+1} = s_d + f(s_d, u_d) Δt(s_d, u_d)
• Space-indexed dynamic programming runs DP directly on this system
Space-Indexed Dynamic Programming
Divide the trajectory into discrete space planes.
[Figure: trajectory divided into planes d=1, d=2, d=3, d=4, d=5]
Space-Indexed Dynamic Programming
Proceeding backwards, learn policies for
d = D, D-1, …, 2, 1
[Figure: policies π5, π4, π3, π2, π1 learned in reverse order at planes d=5 … d=1]
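The backward pass itself is unchanged from the time-indexed case; only the index and the simulator differ. A sketch reusing the hypothetical helpers from the earlier backward-pass sketch:

```python
def space_indexed_dp(D, actions, sample_states_on_plane, rollout_cost, fit_policy):
    """Same backward pass as time-indexed PSDP, but over space indices.
    rollout_cost must simulate with the space-indexed dynamics
    s_{d+1} = s_d + f(s_d, u_d) * dt(s_d, u_d) rather than fixed time steps."""
    policies = [None] * (D + 1)
    for d in range(D, 0, -1):                 # d = D, D-1, ..., 1
        states = sample_states_on_plane(d)    # sampled states lying on plane d
        best = [min(actions, key=lambda a: rollout_cost(s, a, policies[d + 1:]))
                for s in states]
        policies[d] = fit_policy(states, best)
    return policies[1:]                       # [pi_1, ..., pi_D]
```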
Problems with Dynamic Programming
Problem #1: Policies from traditional dynamic
programming algorithms are time-indexed
Space-Indexed Dynamic Programming
Time-indexed DP: can execute a policy learned for a different location.
Space-indexed DP: always executes the policy for the current spatial index.
[Figure: time-indexed execution applying π5 in π4's region vs. space-indexed execution picking the policy by plane]
Problems with Dynamic Programming
Problem #2: Uncertainty over future states makes it hard to
learn any good policy
Space-Indexed Dynamic Programming
Time-indexed DP: wide distribution over future states.
Space-indexed DP: much tighter distribution over future states.
[Figure: distribution over states at time t = 5 vs. distribution over states at index d = 5]
Experiments
Experimental Domain
• Task: following a race-track trajectory with an RC car among randomly placed obstacles
Experimental Setup
• Implemented a space-indexed version of the PSDP algorithm
– Policy chooses the steering angle using an SVM classifier (velocity held constant)
– Used a simple textbook model of the car dynamics as the simulator for learning the policy
• Evaluated time-indexed PSDP, time-indexed PSDP with re-indexing, and space-indexed PSDP
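For concreteness, such a steering policy might be fit as follows (a scikit-learn sketch; the discretized steering set and state encoding are our assumptions, not the paper's exact setup):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical discretization of the steering command (radians); velocity
# is held constant, so steering is the only action the policy chooses.
STEERING_ANGLES = np.linspace(-0.5, 0.5, 7)

def fit_steering_policy(states, best_angle_idx):
    """Fit a multiclass SVM mapping car states to discrete steering angles."""
    clf = SVC(kernel="rbf")
    clf.fit(np.asarray(states), np.asarray(best_angle_idx))
    def policy(s):
        return STEERING_ANGLES[int(clf.predict(np.asarray(s).reshape(1, -1))[0])]
    return policy
```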
Time-Indexed PSDP
Time-Indexed PSDP w/ Re-indexing
Space-Indexed PSDP
Empirical Evaluation
Time-indexed PSDP: Cost: Infinite (no trajectory succeeds)
Time-indexed PSDP with re-indexing: Cost: 59.74
Space-indexed PSDP: Cost: 49.32
Additional Experiments
• In the paper: additional experiments on the Stanford Grand Challenge Car using space-indexed DDP, and on a simulated helicopter domain using space-indexed PSDP
Related Work
• Reinforcement learning / dynamic programming: Bagnell et al., 2004; Jacobson and Mayne, 1970; Lagoudakis and Parr, 2003; Langford and Zadrozny, 2005
• Differential Dynamic Programming: Atkeson, 1994; Tassa et al., 2008
• Gain Scheduling, Model Predictive Control: Leith and Leithead, 2000; Garcia et al., 1989
Summary
• Trajectory following uses non-stationary policies, but traditional DP / RL algorithms suffer because they are time-indexed
• In this paper, we introduce the notions of a space-indexed dynamical system and space-indexed dynamic programming
• Demonstrated the usefulness of these methods on real-world control tasks
Thank you!
Videos available online at http://cs.stanford.edu/~kolter/icml08videos