TRANSCRIPT

Inverse Reinforcement Learning: Model-Free Approaches

Olivier Pietquin ([email protected])
Supélec - UMI 2958 (GeorgiaTech - CNRS), France
MaLIS Research Group

GDR Robotique, September 6, 2012
Reinforcement Learning

Notions: Agent, Task, Environment
Imitation: Expert
Generalization

Inverse RL: reward inference
Given: 1) measurements of an agent's behaviour over time, in a variety of circumstances; 2) measurements of the sensory inputs to that agent; 3) a model of the physical environment (if available).
Determine: the reward function that the agent is optimizing.
Seminal paper by Russell [Russell, 1998] (and [Kalman, 1964])

Why?
- In RL, the reward is often a heuristic that might be wrong
- Understand human and animal behavior
- Most compact representation of the task (transfer, imitation)

Questions
- To what extent is a reward uniquely recoverable?
- What are appropriate error metrics for fitting?
- Computational complexity?
- Can we identify local inconsistencies?
- Can we determine the reward before learning?
- How many observations are required?
First IRL algorithms [Ng and Russell, 2000]

Inverting Bellman equations for finite MDPs:

$\forall a \neq a^*:\quad (P_{a^*} - P_a)(I - \gamma P_{a^*})^{-1} R \succeq 0$

Problems
- $R = 0$ is a solution
- There is an infinite set of solutions

Solutions: add constraints
- Penalize deviations w.r.t. $\pi^*$
- Linear representation of $R$ ($\hat{r} = \theta^T \psi(s)$)
- Solve with linear programming (a sketch follows below)
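To make the linear-programming step concrete, here is a minimal sketch of that LP for a small finite MDP; it is an illustration, not the authors' code. It assumes known transition matrices `P[a]` and an expert policy `pi_star`, and uses auxiliary variables `t` to linearize the inner min over non-expert actions and `u` to linearize an L1 penalty on `R` (standard LP tricks); names and default values are illustrative.

```python
# Sketch of the Ng & Russell (2000) LP for a small finite MDP (assumed inputs:
# P[a] is the |S| x |S| transition matrix of action a, pi_star the expert policy).
import numpy as np
from scipy.optimize import linprog

def irl_lp(P, pi_star, gamma=0.9, R_MAX=1.0, LAM=1.0):
    nA, nS = len(P), P[0].shape[0]
    # Row s of P_star is the transition distribution under the expert action.
    P_star = np.array([P[pi_star[s]][s] for s in range(nS)])
    inv_term = np.linalg.inv(np.eye(nS) - gamma * P_star)

    # Variables x = [R (nS), t (nS), u (nS)]; maximize sum(t) - LAM * sum(u).
    c = np.concatenate([np.zeros(nS), -np.ones(nS), LAM * np.ones(nS)])
    A_ub, b_ub = [], []
    for a in range(nA):
        G = (P_star - P[a]) @ inv_term      # one row per state
        for s in range(nS):
            if a == pi_star[s]:
                continue
            row = np.zeros(3 * nS)
            row[:nS] = -G[s]                # enforces t_s <= G[s] @ R
            row[nS + s] = 1.0
            A_ub.append(row); b_ub.append(0.0)
    # |R_s| <= u_s (linearization of the L1 penalty)
    for s in range(nS):
        for sign in (+1.0, -1.0):
            row = np.zeros(3 * nS)
            row[s] = sign; row[2 * nS + s] = -1.0
            A_ub.append(row); b_ub.append(0.0)
    bounds = ([(-R_MAX, R_MAX)] * nS        # R
              + [(0, None)] * nS            # t >= 0 also enforces G R >= 0
              + [(0, R_MAX)] * nS)          # u
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return res.x[:nS]                       # recovered reward per state
```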
Apprenticeship learning via IRL [Abbeel and Ng, 2004] I

Find a policy whose performance is close to the expert's on the unknown reward function $\hat{r}^* = \theta^{*T} \psi(s)$.

Feature expectation $\mu^\pi(s)$:

$V^\pi(s_t) = E\!\left[\sum_i \gamma^i r_{t+i} \,\Big|\, \pi\right] = E\!\left[\sum_i \gamma^i \theta^T \psi(s_{t+i}) \,\Big|\, \pi\right] = \theta^T \underbrace{E\!\left[\sum_i \gamma^i \psi(s_{t+i}) \,\Big|\, \pi\right]}_{\mu^\pi(s_t)} = \theta^T \mu^\pi(s_t)$
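As a quick sanity check of this identity, the sketch below (illustrative names, random finite MDP) computes both sides with the discounted occupancy operator $(I - \gamma P_\pi)^{-1}$: applied to the feature matrix it yields $\mu^\pi$, applied to $\Psi\theta$ it yields $V^\pi$.

```python
# Numeric check (illustrative) that V^pi = theta^T mu^pi when r = theta^T psi(s).
import numpy as np

rng = np.random.default_rng(0)
nS, p, gamma = 5, 3, 0.9
P_pi = rng.dirichlet(np.ones(nS), size=nS)   # transition matrix under a fixed policy
Psi = rng.normal(size=(nS, p))               # state features psi(s), one row per state
theta = rng.normal(size=p)

occ = np.linalg.inv(np.eye(nS) - gamma * P_pi)
mu = occ @ Psi                               # mu^pi(s) = E[sum_i gamma^i psi(s_{t+i})]
V = occ @ (Psi @ theta)                      # V^pi(s) for the reward r = Psi @ theta
assert np.allclose(V, mu @ theta)            # V^pi = theta^T mu^pi
```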
Apprenticeship learning via IRL [Abbeel and Ng, 2004] II

Fitness to performance:

If $\|\theta\|_2 \leq 1$: $\|\mu^\pi(s) - \mu^E(s)\|_2 < \epsilon \Rightarrow \|V^\pi(s) - V^E(s)\|_2 < \epsilon$

1. Initialize $\Pi$ with a random policy $\pi_0$ and compute $\mu_0 = \mu^{\pi_0}$
2. Compute $t$ and $\theta$ such that $t = \max_\theta \min_{\pi_i \in \Pi} \theta^T (\mu_E - \mu_i)$ s.t. $\|\theta\|_2 \leq 1$; if $t \leq \xi$, terminate
3. Train a new policy $\pi_i$ optimizing $R = \theta^T \psi(s)$
4. Compute $\mu_i$ for $\pi_i$; $\Pi \leftarrow \Pi \cup \{\pi_i\}$
5. Go to step 2
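A hedged sketch of this loop, using the simpler "projection" variant described in the same paper instead of the max-min program of step 2; `solve_mdp` (an RL solver returning a policy optimal for the reward $\theta^T \psi$) and `feat_exp` (a feature-expectation estimator) are assumed helpers, not part of the original pseudo-code.

```python
# Projection variant of Abbeel & Ng (2004) apprenticeship learning (sketch).
import numpy as np

def apprenticeship(mu_E, solve_mdp, feat_exp, eps=1e-3, max_iter=50):
    """mu_E: expert feature expectation; returns (theta, matched mu_bar)."""
    pi0 = solve_mdp(np.zeros_like(mu_E))      # arbitrary initial policy
    mu_bar = feat_exp(pi0)
    theta = mu_E - mu_bar
    for _ in range(max_iter):
        theta = mu_E - mu_bar                 # direction of maximal margin
        if np.linalg.norm(theta) <= eps:      # expert's features matched
            break
        pi = solve_mdp(theta)                 # RL step with reward theta^T psi
        mu = feat_exp(pi)
        d = mu - mu_bar                       # project mu_E onto the segment
        mu_bar = mu_bar + (d @ theta) / (d @ d) * d
    return theta, mu_bar
```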
Apprenticeship learning via IRL [Abbeel and Ng, 2004] III

Many methods are based on matching feature expectations:
- Game-theoretic point of view [Syed and Schapire, 2008]
- Linear programming (generates only one policy) [Syed et al., 2008]
- Maximum entropy [Ziebart et al., 2008]

All are iterative algorithms that require computing $\mu^\pi$ for many policies $\pi$.
Max-Margin Planning [Ratliff et al., 2006] I

Ideas
- Maximize the margin between the best policy obtained with the current estimate and the expert policy
- Allow the expert to be non-optimal (slack variables)

Quadratic program:

$\min_{\theta, \xi_i} \ \frac{1}{2}\|\theta\|^2 + \frac{\gamma}{n} \sum_i \beta_i \xi_i^2$
s.t. $\forall i: \ \theta^T \mu^{\pi_i^*}_{d_0, M_i} + \xi_i \geq \max_\pi \left(\theta^T \mu^\pi_{d_0, M_i} + L_{M_i, \pi}\right)$
Max-Margin Planning [Ratliff et al., 2006] II

Optimization problem: $\min J(\theta)$ with

$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[\max_\pi \left(\theta^T \mu^\pi_{d_0, M_i} + L_{M_i, \pi}\right) - \theta^T \mu^{\pi_i^*}_{d_0, M_i}\right] + \frac{\lambda}{2}\|\theta\|^2$

Problems
- max operator: sub-gradient descent (see the sketch below)
- Need to solve an MDP for each constraint: A*
- Need to estimate $\mu^\pi$ for any $\pi$
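The sketch below shows one sub-gradient step on $J(\theta)$; it is an illustration under stated assumptions, not Ratliff et al.'s implementation. The hypothetical helper `loss_augmented_plan(theta, i)` hides the inner MDP/A* solve and returns the feature expectation of the loss-augmented maximizing policy, while `mu_star[i]` holds $\mu^{\pi_i^*}_{d_0, M_i}$.

```python
# Sub-gradient descent on the MMP objective J(theta) (sketch).
import numpy as np

def mmp_subgradient(mu_star, loss_augmented_plan, p, lam=0.1, lr=0.01, iters=200):
    """mu_star[i]: expert feature expectation for training example i; p: feature dim."""
    theta = np.zeros(p)
    N = len(mu_star)
    for _ in range(iters):
        g = lam * theta                              # gradient of the regularizer
        for i in range(N):
            mu_plus = loss_augmented_plan(theta, i)  # mu of argmax_pi theta^T mu + L
            g += (mu_plus - mu_star[i]) / N          # sub-gradient of the max term
        theta -= lr * g
    return theta
```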
Max-Margin Planning [Ratliff et al., 2006] III

Classification-based [Ratliff et al., 2007][Taskar et al., 2005]:

$J(q) = \frac{1}{N} \sum_{i=1}^{N} \max_a \left(q(s_i, a) + l(s_i, a)\right) - q(s_i, a_i)$

Problems
- max operator: sub-gradient descent
- Does not take the structure into account (inconsistent behaviour)

Advantages
- $q(s, a)$ can be of any (differentiable) form
- No need to know the MDP
Structured Classification for IRL: SCIRL [Klein et al., 2012]

Best of both worlds
- No need for an MDP model
- Takes the structure into account

Contributions
- Classification with the structure of the MDP
- Only needs data from the expert
- Can use other data if available
Special case of classification

Score-function-based classifiers
- Classifier: map inputs $x \in X$ to labels $y \in Y$
- Data: $\{(x_i, y_i)\}_{1 \leq i \leq N}$
- Decision rule: $g \in Y^X$
- Score function: $g_s(x) \in \operatorname{argmax}_{y \in Y} s(x, y)$

Linearly parameterized score function:

$s(x, y) = \theta^T \phi(x, y) \quad (1)$
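For concreteness, a minimal decision rule with the linearly parameterized score of (1); the block-copy joint feature map `phi` is just one illustrative choice, not mandated by the framework.

```python
# Minimal score-function classifier with a linear score s(x, y) = theta^T phi(x, y).
import numpy as np

def phi(x, y, n_labels):
    """One illustrative joint feature map: x copied into the block of label y."""
    out = np.zeros(len(x) * n_labels)
    out[y * len(x):(y + 1) * len(x)] = x
    return out

def g_s(theta, x, n_labels):
    """Decision rule g_s(x) in argmax_y theta^T phi(x, y)."""
    return int(np.argmax([theta @ phi(x, y, n_labels) for y in range(n_labels)]))
```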
Let's compare

Linearly parametrized score-function-based classifiers:

$g_s(x) \in \operatorname{argmax}_{y \in Y} s(x, y)$, with $s(x, y) = \theta^T \phi(x, y)$
$\Rightarrow g_s(x) \in \operatorname{argmax}_{y \in Y} \theta^T \phi(x, y)$

Expert's decision rule:

$\pi_E(s) = \operatorname{argmax}_a Q^{\pi_E}(s, a)$, with $Q^\pi(s_t, a) = \theta^T \mu^\pi(s_t, a)$
$\Rightarrow \pi_E(s) = \operatorname{argmax}_a \theta^T \mu^{\pi_E}(s, a)$

Putting it all together: $X \equiv S$, $Y \equiv A$, $s \equiv Q^{\pi_E} \Rightarrow \phi \equiv \mu_E$
SCIRL Pseudo-code

Algorithm 1: SCIRL algorithm
- Given: a training set $D = \{(s_i, a_i = \pi_E(s_i))\}_{1 \leq i \leq N}$, an estimate $\hat{\mu}^{\pi_E}$ of the expert feature expectation $\mu^{\pi_E}$, and a classification algorithm;
- Compute the parameter vector $\theta_c$ by feeding the classification algorithm with the training set $D$, using the parameterized score function $\theta^T \hat{\mu}^{\pi_E}(s, a)$;
- Output the reward function $R_{\theta_c}(s) = \theta_c^T \psi(s)$.
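In code, Algorithm 1 is only a few lines once its two inputs exist. The sketch below uses hypothetical helper names (none are part of the paper's API); a concrete candidate for `train_classifier` is given with eq. (3) later.

```python
# Pipeline sketch of Algorithm 1 (SCIRL). Assumed helpers: mu_hat(s, a) estimates
# the expert feature expectation, psi(s) is the reward feature map, and
# train_classifier fits theta_c for the score theta^T mu_hat(s, a).
def scirl(D, mu_hat, psi, train_classifier):
    theta_c = train_classifier(D, score_features=mu_hat)   # classification step
    return lambda s: theta_c @ psi(s)                      # R_theta_c(s) = theta_c^T psi(s)
```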
Why is it nice?

Advantages
- Only $\mu_E$ is required; no need to compute $\mu$ for other policies
- Based on transitions (not trajectories or policies)
- Based on standard classification methods
- Bounds can be computed
- Returns a reward
Error bound

Definitions
- $C_f = (1-\gamma) \sum_{t \geq 0} \gamma^t c(t)$, with $c(t) = \max_{\pi_1, \ldots, \pi_t, s \in S} \frac{(\rho_E^T P_{\pi_1} \cdots P_{\pi_t})(s)}{\rho_E(s)}$
- $\epsilon_c = E_{s \sim \rho_E}\left[\mathbb{1}_{\{\pi_c(s) \neq \pi_E(s)\}}\right] \in [0, 1]$
- $\epsilon_\mu = \hat{\mu}^{\pi_E} - \mu^{\pi_E} : S \times A \to \mathbb{R}^p$ and $\epsilon_Q = \theta_c^T \epsilon_\mu : S \times A \to \mathbb{R}$
- $\bar{\epsilon}_Q = E_{s \sim \rho_E}\left[\max_{a \in A} \epsilon_Q(s, a) - \min_{a \in A} \epsilon_Q(s, a)\right] \geq 0$

Theorem

$0 \leq E_{s \sim \rho_E}\left[v^*_{R_{\theta_c}} - v^{\pi_E}_{R_{\theta_c}}\right] \leq \frac{C_f}{1-\gamma} \left(\bar{\epsilon}_Q + \epsilon_c \, \frac{2\gamma \|R_{\theta_c}\|_\infty}{1-\gamma}\right) \quad (2)$
Instantiation

Structured large-margin classifier [Taskar et al., 2005]:

$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[\max_a \left(\theta^T \hat{\mu}^{\pi_E}(s_i, a) + L(s_i, a)\right) - \theta^T \hat{\mu}^{\pi_E}(s_i, a_i)\right] + \frac{\lambda}{2}\|\theta\|^2 \quad (3)$

- $L(s_i, a) = 1$ if $a \neq a_i$, $0$ otherwise
- Minimized by sub-gradient descent (a sketch follows below)
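A minimal sketch of that sub-gradient descent on eq. (3), assuming `mu_hat(s, a)` returns the estimated expert feature expectation as a length-$p$ vector; the loss-augmented argmax gives a valid sub-gradient of the max term. Step sizes and iteration counts are illustrative.

```python
# Sub-gradient descent on the structured large-margin objective of eq. (3).
import numpy as np

def train_theta(D, mu_hat, n_actions, p, lam=0.01, lr=0.1, iters=500):
    """D: list of (s_i, a_i) expert pairs; mu_hat(s, a) -> length-p vector."""
    theta = np.zeros(p)
    for _ in range(iters):
        g = lam * theta                               # gradient of (lambda/2)||theta||^2
        for s, a_i in D:
            scores = [theta @ mu_hat(s, a) + (1.0 if a != a_i else 0.0)
                      for a in range(n_actions)]      # loss-augmented scores
            a_star = int(np.argmax(scores))
            g += (mu_hat(s, a_star) - mu_hat(s, a_i)) / len(D)
        theta -= lr * g                               # sub-gradient step
    return theta
```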
Computing $\mu_E$

LSTD-$\mu$ [?]
- Based on the well-known least-squares temporal differences (LSTD) method
- Characteristics: can be fed with mere transitions; no model needed; off- or on-policy evaluation

Monte Carlo with heuristics

$\hat{\mu}^\pi(s_0, a_0) = \frac{1}{M} \sum_{j=1}^{M} \sum_{i \geq 0} \gamma^i \phi(s_i^j)$
$\hat{\mu}^\pi(s_0, a \neq a_0) = \gamma \, \hat{\mu}^\pi(s_0, a_0)$

- Characteristics: only on-policy; seems more robust in dire conditions
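A sketch of the Monte Carlo estimator above, assuming `trajectories[j]` holds the state sequence of the $j$-th expert rollout starting in $s_0$ with action $a_0$, and `phi` maps a state to its feature vector; these names are illustrative.

```python
def mc_mu(trajectories, phi, gamma):
    """Monte Carlo estimate of mu_hat(s0, a0) averaged over M expert rollouts."""
    M = len(trajectories)
    return sum(sum(gamma**i * phi(s) for i, s in enumerate(traj))
               for traj in trajectories) / M

def mc_mu_other(mu_s0_a0, gamma):
    """Slide's heuristic for unobserved actions: mu_hat(s0, a != a0) = gamma * mu_hat(s0, a0)."""
    return gamma * mu_s0_a0
```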
Inverted Pendulum

[Figure: four heat maps over the pendulum state space, axes Position and Speed (both in [-3, 3]), each with its own color scale.]
Results on the driving problem

[Figure: two plots of $E_{s \sim U}[V^\pi_{R_E}(s)]$ against the number of samples from the expert (50 to 400).]

Description
- Goal of the expert: avoid other cars, do not go off-road, go fast
- Uses only data from the expert and natural features
- Non-trivial (the state of the art does not work here)
Future work

- No computation of $\mu_E$ everywhere
- Real-world problems
- Task transfer?
Thank you... for your attention.
References I
Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.

Rudolph Kalman. When is a linear control system optimal? Transactions of the ASME, Journal of Basic Engineering, Series D, 86:81-90, 1964.

Edouard Klein, Matthieu Geist, Bilal Piot, and Olivier Pietquin. Inverse reinforcement learning through structured classification. In Advances in Neural Information Processing Systems (NIPS 2012), Lake Tahoe (NV, USA), December 2012.

Andrew Y. Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning (ICML), 2000.

Nathan Ratliff, Andrew D. Bagnell, and Martin Zinkevich. Maximum margin planning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006.

Nathan Ratliff, Andrew D. Bagnell, and Martin Zinkevich. Imitation learning for locomotion and manipulation. In Proceedings of ICHR, 2007.
References II
Stuart Russell. Learning agents for uncertain environments (extended abstract). In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT), 1998.

Umar Syed and Robert Schapire. A game-theoretic approach to apprenticeship learning. In Advances in Neural Information Processing Systems 20 (NIPS), 2008.

Umar Syed, Michael Bowling, and Robert E. Schapire. Apprenticeship learning using linear programming. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005.

Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI), 2008.