TRANSCRIPT
Space-Indexed Dynamic Programming: Learning to
Follow Trajectories
J. Zico Kolter, Adam Coates, Andrew Y. Ng, Yi Gu, Charles DuHadway
Computer Science Department, Stanford University
July 2008, ICML
Outline
• Reinforcement Learning and Following Trajectories
• Space-indexed Dynamical Systems and Space-indexed Dynamic Programming
• Experimental Results
Reinforcement Learning and Following Trajectories
Trajectory Following
• Consider the task of following a trajectory in a vehicle such as a car or helicopter
• The state space is too large to discretize, so tabular RL / dynamic programming cannot be applied
Trajectory Following
• Dynamic programming algorithms with non-stationary policies seem well-suited to the task:
– Policy Search by Dynamic Programming (Bagnell et al.)
– Differential Dynamic Programming (Jacobson and Mayne)
Dynamic Programming
Divide the control task into discrete time steps.
[Figure: trajectory divided into time steps t=1, t=2, t=3, t=4, t=5, …]
Dynamic Programming
Proceeding backwards in time, learn policies for
t = T, T-1, …, 2, 1
[Figure: policies π5, π4, π3, π2, π1 learned in reverse order along the time steps t=1 … t=5]
Dynamic Programming
Key Advantage: Policies are local (each only needs to perform well over a small portion of the state space)
[Figure: local policies π1 … π5 attached to their time steps along the trajectory]
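A minimal Python sketch of this backward pass, in the spirit of PSDP (the helpers sample_states, rollout_cost, and fit_policy are hypothetical placeholders, not the paper's implementation):

```python
# A PSDP-style backward pass over time steps (sketch).
def backward_dp(T, actions, sample_states, rollout_cost, fit_policy):
    """Learn policies pi_T, ..., pi_1, proceeding backwards in time.

    Hypothetical helpers (supplied by the caller):
      sample_states(t)               -- draw states from a guess of the
                                        state distribution at time t
      rollout_cost(s, a, later_pis)  -- simulate taking action a in state s,
                                        then following the already-learned
                                        later policies; return total cost
      fit_policy(states, actions)    -- fit a classifier mapping states to
                                        the best actions found
    """
    policies = [None] * (T + 1)          # policies[t] covers time step t
    for t in range(T, 0, -1):            # t = T, T-1, ..., 1
        states = sample_states(t)
        best = [min(actions, key=lambda a: rollout_cost(s, a, policies[t + 1:]))
                for s in states]
        policies[t] = fit_policy(states, best)
    return policies[1:]                  # [pi_1, ..., pi_T]
```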
Problems with Dynamic Programming
Problem #1: Policies from traditional dynamic
programming algorithms are time-indexed
Problems with Dynamic Programming
Suppose we learned policy π5 assuming this distribution over states.
[Figure: assumed distribution over states at t = 5, with policy π5]
Problems with Dynamic Programming
But, due to the natural stochasticity of the environment, the car may actually be far outside that distribution at t = 5, and the resulting policy π5 will perform very poorly there.
[Figure: car far from the assumed distribution at t = 5, still executing π5]
Problems with Dynamic Programming
Partial Solution: Re-indexing. Execute the policy learned for the location closest to the current state, regardless of time.
[Figure: policies π1 … π5 along the trajectory; the nearest one is executed]
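A small NumPy sketch of the re-indexing heuristic (the ref_points array, holding one reference trajectory state per policy, is our assumed representation):

```python
import numpy as np

def reindexed_action(s, policies, ref_points):
    """Re-indexing heuristic: execute the policy learned for the
    trajectory point nearest the current state, ignoring the clock.

    ref_points[t] is the reference state on the trajectory for step t
    (same order as policies)."""
    t = int(np.argmin(np.linalg.norm(ref_points - s, axis=1)))
    return policies[t](s)
```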
Problems with Dynamic Programming
Problem #2: Uncertainty over future states makes it hard to
learn any good policy
Problems with Dynamic Programming
Due to stochasticity, there is large uncertainty over states in the distant future, and DP algorithms require learning a policy that performs well over this entire distribution.
[Figure: wide distribution over states at time t = 5]
Space-Indexed Dynamic Programming
• Basic idea of Space-Indexed Dynamic Programming (SIDP): perform DP with respect to space indices (planes tangent to the trajectory)
Space-Indexed Dynamical Systems and Dynamic Programming
Difficulty with SIDP
• No guarantee that taking a single action will move the vehicle to the next plane along the trajectory
• We therefore introduce the notion of a space-indexed dynamical system
Time-Indexed Dynamical System
• Creating time-indexed dynamical systems:
ṡ = f(s, u)
where s is the current state, u is the control action, and ṡ is the time derivative of the state
• Euler integration:
s_{t+Δt} = s_t + f(s_t, u_t) Δt
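In code, one Euler step of the time-indexed system is a one-liner (a generic sketch, not tied to any particular vehicle model):

```python
import numpy as np

def euler_step(f, s, u, dt):
    """One Euler step: s_{t+dt} = s_t + f(s_t, u_t) * dt."""
    return s + f(s, u) * dt

# Toy example: a double-integrator "car" with state [position, velocity]
# and acceleration as the control.
f = lambda s, u: np.array([s[1], u])
s_next = euler_step(f, np.array([0.0, 1.0]), u=0.5, dt=0.01)
```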
Space-Indexed Dynamical Systems
• Creating space-indexed dynamical systems:
• Simulate forward until the vehicle hits the next tangent plane (sketched below)
[Figure: vehicle between the tangent planes at space index d and space index d+1]
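One way to realize this forward simulation is to take small Euler steps until the state crosses the plane, detected by the sign of the plane equation (a hedged sketch; function and argument names are ours):

```python
import numpy as np

def simulate_to_plane(f, s, u, s_star_next, sdot_star_next, dt=1e-3, max_steps=10**6):
    """Euler-integrate the state forward until it crosses the tangent plane
    at index d+1 (the plane through s_star_next with normal sdot_star_next)."""
    for _ in range(max_steps):
        if sdot_star_next @ (s - s_star_next) >= 0:   # on or past the plane
            return s
        s = s + f(s, u) * dt                          # small Euler step
    raise RuntimeError("vehicle never reached the next plane (no forward progress)")
```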
Space-Indexed Dynamical Systems
• Creating space-indexed dynamical systems:
ṡ = f(s, u)
s_{d+1} = s_d + f(s_d, u_d) Δt(s_d, u_d)
where the time to reach the plane at index d+1 is
Δt(s, u) = (ṡ⋆_{d+1})ᵀ (s⋆_{d+1} − s) / ((ṡ⋆_{d+1})ᵀ ṡ)
with s⋆_{d+1} and ṡ⋆_{d+1} the desired trajectory's state and velocity at that plane. (A positive solution exists as long as the controller makes some forward progress.)
[Figure: one step from the plane at space index d to the plane at space index d+1]
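A direct NumPy transcription of this one-step update (argument names are ours; the assertion encodes the forward-progress condition):

```python
import numpy as np

def space_indexed_step(f, s, u, s_star_next, sdot_star_next):
    """One step of the space-indexed dynamics: jump exactly to the tangent
    plane at index d+1, which passes through s_star_next with normal
    sdot_star_next (the desired trajectory's velocity there)."""
    sdot = f(s, u)                        # time derivative of the state
    dt = (sdot_star_next @ (s_star_next - s)) / (sdot_star_next @ sdot)
    assert dt > 0, "controller must make some forward progress"
    return s + sdot * dt                  # s_{d+1}
```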
Space-Indexed Dynamical Systems
• The result is a dynamical system indexed by a spatial-index variable d rather than time:
s_{d+1} = s_d + f(s_d, u_d) Δt(s_d, u_d)
• Space-indexed dynamic programming runs DP directly on this system
Space-Indexed Dynamic Programming
Divide the trajectory into discrete space planes.
[Figure: trajectory divided into planes d=1, d=2, d=3, d=4, d=5]
Space-Indexed Dynamic Programming
Proceeding backwards, learn policies for
d = D, D-1, …, 2, 1
[Figure: policies π5, π4, π3, π2, π1 learned in reverse order at planes d=5 … d=1]
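The backward pass itself is unchanged from the time-indexed case; only the index and the simulator differ. A sketch reusing the hypothetical helpers from the earlier backward-pass sketch:

```python
def space_indexed_dp(D, actions, sample_states_on_plane, rollout_cost, fit_policy):
    """Same backward pass as time-indexed PSDP, but over space indices.
    rollout_cost must simulate with the space-indexed dynamics
    s_{d+1} = s_d + f(s_d, u_d) * dt(s_d, u_d) rather than fixed time steps."""
    policies = [None] * (D + 1)
    for d in range(D, 0, -1):                 # d = D, D-1, ..., 1
        states = sample_states_on_plane(d)    # sampled states lying on plane d
        best = [min(actions, key=lambda a: rollout_cost(s, a, policies[d + 1:]))
                for s in states]
        policies[d] = fit_policy(states, best)
    return policies[1:]                       # [pi_1, ..., pi_D]
```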
Problems with Dynamic Programming
Problem #1: Policies from traditional dynamic
programming algorithms are time-indexed
Space-Indexed Dynamic Programming
Time-indexed DP: can execute a policy learned for a different location.
Space-indexed DP: always executes the policy for the current spatial index.
[Figure: time-indexed execution applying π5 in π4's region vs. space-indexed execution picking the policy by plane]
Problems with Dynamic Programming
Problem #2: Uncertainty over future states makes it hard to
learn any good policy
Space-Indexed Dynamic Programming
Time-indexed DP: wide distribution over future states.
Space-indexed DP: much tighter distribution over future states.
[Figure: distribution over states at time t = 5 vs. distribution over states at index d = 5]
Experiments
Experimental Domain
• Task: following a race-track trajectory with an RC car among randomly placed obstacles
Experimental Setup
• Implemented a space-indexed version of the PSDP algorithm
– Policy chooses the steering angle using an SVM classifier (velocity held constant)
– Used a simple textbook model of the car dynamics as the simulator for learning the policy
• Evaluated time-indexed PSDP, time-indexed PSDP with re-indexing, and space-indexed PSDP
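For concreteness, such a steering policy might be fit as follows (a scikit-learn sketch; the discretized steering set and state encoding are our assumptions, not the paper's exact setup):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical discretization of the steering command (radians); velocity
# is held constant, so steering is the only action the policy chooses.
STEERING_ANGLES = np.linspace(-0.5, 0.5, 7)

def fit_steering_policy(states, best_angle_idx):
    """Fit a multiclass SVM mapping car states to discrete steering angles."""
    clf = SVC(kernel="rbf")
    clf.fit(np.asarray(states), np.asarray(best_angle_idx))
    def policy(s):
        return STEERING_ANGLES[int(clf.predict(np.asarray(s).reshape(1, -1))[0])]
    return policy
```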
Time-Indexed PSDP
Time-Indexed PSDP w/ Re-indexing
Space-Indexed PSDP
Empirical Evaluation
Time-indexed PSDP: Cost: Infinite (no trajectory succeeds)
Time-indexed PSDP with re-indexing: Cost: 59.74
Space-indexed PSDP: Cost: 49.32
Additional Experiments
• In the paper: additional experiments on the Stanford Grand Challenge Car using space-indexed DDP, and on a simulated helicopter domain using space-indexed PSDP
Related Work
• Reinforcement learning / dynamic programming: Bagnell et al., 2004; Jacobson and Mayne, 1970; Lagoudakis and Parr, 2003; Langford and Zadrozny, 2005
• Differential Dynamic Programming: Atkeson, 1994; Tassa et al., 2008
• Gain Scheduling, Model Predictive Control: Leith and Leithead, 2000; Garcia et al., 1989
Summary
• Trajectory following uses non-stationary policies, but traditional DP / RL algorithms suffer because they are time-indexed
• In this paper, we introduce the notions of a space-indexed dynamical system and space-indexed dynamic programming
• Demonstrated the usefulness of these methods on real-world control tasks
Thank you!
Videos available online at http://cs.stanford.edu/~kolter/icml08videos