inverse reinforcement learning: a review
INVERSE REINFORCEMENT
LEARNING: A REVIEW
Ghazal Zand
Department of Electrical Engineering and Computer Science
Cleveland State University
g.zand@vikes.csuohio.edu
Outline
• Quick review of RL
• IRL introduction
• IRL motivation
• IRL some sample applications
• IRL formulation
• IRL shortcomings
• IRL approaches
• Comparison
• Conclusion
Inverse Reinforcement Learning vs. Reinforcement Learning
(Figure: side-by-side diagrams of the IRL framework and the RL framework)
Inverse Reinforcement Learning Motivation
• IRL was originally posed by Andrew Ng and Stuart Russell
• Ng and Russell. "Algorithms for inverse reinforcement learning." ICML. 2000
• Bee foraging: reward at each flower
• RL assumes the reward is a known function of the flower's nectar content
• But in reality, other factors also influence it: e.g. distance, time, risk of wind or predators, …
IRL Applications
• Autonomous helicopter aerobatics through apprenticeship learning
• Abbeel, Coates, and Ng
• Enabling robots to communicate their objectives
• Huang, Held, Abbeel, and Dragan
• Apprenticeship learning for motion
planning with application to parking lot
navigation
• Abbeel, Dolgov, Ng and Thrun
• Try to mimic different driving styles
IRL Formulation
• We are given
• A standard MDP
• defined as a five-element tuple 𝑀 = (𝑆, 𝐴, 𝑃, 𝑅, 𝛾)
• The reward function 𝑅 is unknown, but it can be written as a linear combination of features: 𝑅∗ = 𝑊∗ · 𝐹
• A set of 𝑚 trajectories generated by an expert
• Goal is to
• Find a reward function 𝑅∗ that explains the expert's behavior
• Use this 𝑅∗ to find a policy whose performance is close to that of the expert
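As a concrete illustration of the linear reward assumption above, the sketch below evaluates 𝑅∗ = 𝑊∗ · 𝐹 for a single state. The feature values and weights are made-up placeholders, not taken from any of the cited papers:

```python
import numpy as np

def reward(state_features: np.ndarray, w: np.ndarray) -> float:
    """Reward as a weighted linear combination of state features: R = w . F."""
    return float(np.dot(w, state_features))

# Example: 3 hand-crafted features for one state (e.g. distance, speed, risk).
# Both vectors are illustrative placeholders.
F = np.array([0.5, 1.0, -0.2])
w = np.array([1.0, 0.5, 2.0])
print(reward(F, w))  # 0.5*1.0 + 1.0*0.5 + (-0.2)*2.0 = 0.6
```

IRL then amounts to recovering the weight vector 𝑊∗ from demonstrations, since the features 𝐹 are assumed known.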
IRL Shortcomings
• In the original IRL, the environment is modeled as an MDP
• But in practice, there is no access to the true global state of the environment
• The original IRL algorithm assumes the expert always performs optimally
• But in practice, the expert's demonstrations are usually imperfect, noisy, and incomplete
• Most IRL algorithms consider the reward function as a linear combination of features
• While the expert might act according to more complex reward functions
• The original IRL problem is ill-posed
• Since there are infinitely many reward functions consistent with the expert’s behavior
Improvements on IRL
• To address the limitations of the MDP model
• In: Choi and Kim. 2011. Inverse reinforcement learning in partially observable
environments. Journal of Machine Learning Research
• A generalization to a Partially Observable Markov Decision Process (POMDP) is
proposed
• A POMDP assumes the agent's sensors are limited, so the agent must estimate states through its observations
Improvements on IRL (BIRL)
• To model the uncertainty in the reward obtained in the original IRL problem
• Probability distributions were utilized to model this uncertainty
• Ramachandran and Amir. 2007. Bayesian inverse reinforcement learning. IJCAI
• BIRL assumes that, given the reward function, all of the expert's actions are independent
• This assumption allows us to write the likelihood of observing the expert's demonstration sequence 𝑂 as a product over the demonstrated state–action pairs: 𝑃(𝑂 | 𝑅) = ∏ᵢ 𝑃((𝑠ᵢ, 𝑎ᵢ) | 𝑅), where 𝑃((𝑠ᵢ, 𝑎ᵢ) | 𝑅) ∝ 𝑒^(𝛼 𝑄∗(𝑠ᵢ, 𝑎ᵢ; 𝑅))
Improvements on IRL (BIRL)
• According to Bayes' theorem, the posterior probability of the reward function is given by 𝑃(𝑅 | 𝑂) ∝ 𝑃(𝑂 | 𝑅) 𝑃𝑅(𝑅)
• where 𝑃𝑅(𝑅) is the prior knowledge on the reward function
• In Qiao and Beling. 2011. Inverse reinforcement learning with Gaussian process. IEEE ACC
• the authors assign a Gaussian prior to the reward function
• to deal with noisy observations, incomplete policies, and a small number of observations
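A minimal sketch of the BIRL posterior, with the per-action likelihood 𝑃(𝑎 | 𝑠, 𝑅) ∝ 𝑒^(𝛼𝑄∗). For brevity the optimal Q-values are replaced by immediate rewards of a one-step toy problem; a full implementation would solve the MDP for each candidate reward. All numbers below are illustrative, not from the cited papers:

```python
import numpy as np

def action_likelihood(q_values, action, alpha=1.0):
    """Softmax likelihood of one expert action given Q-values under R."""
    z = np.exp(alpha * q_values)
    return z[action] / z.sum()

def unnormalized_posterior(demos, q_of, prior, alpha=1.0):
    """BIRL posterior up to a constant: independent-action likelihood x prior."""
    lik = 1.0
    for s, a in demos:
        lik *= action_likelihood(q_of(s), a, alpha)
    return lik * prior

# Toy example: two candidate reward functions scored on the same demonstrations.
demos = [(0, 1), (1, 1)]                    # expert always takes action 1
R1 = np.array([[0.0, 1.0], [0.0, 1.0]])    # rewards favouring action 1
R2 = np.array([[1.0, 0.0], [1.0, 0.0]])    # rewards favouring action 0
p1 = unnormalized_posterior(demos, lambda s: R1[s], prior=0.5)
p2 = unnormalized_posterior(demos, lambda s: R2[s], prior=0.5)
print(p1 > p2)  # True: R1 explains the expert's behavior better
```

Sampling methods (e.g. the policy-walk MCMC of Ramachandran and Amir) explore this posterior rather than evaluating it on a fixed candidate set.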
Improvements on IRL (MaxEnt IRL)
• Similar to BIRL, Maximum Entropy Inverse Reinforcement Learning (MaxEnt IRL) uses a probabilistic approach
• Ziebart et al. 2008. Maximum Entropy Inverse Reinforcement Learning. AAAI
• The optimal value of 𝑊 is found by maximizing the likelihood of the observed trajectories under the maximum-entropy distribution, using gradient-based methods
• One solution to the large-state-space problem is to approximate MaxEnt IRL with graphs
• Shimosaka et al. 2017. Fast Inverse Reinforcement Learning with Interval Consistent Graph for Driving Behavior Prediction
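A hedged sketch of the MaxEnt IRL gradient step: the gradient of the log-likelihood with respect to 𝑊 is the expert's empirical feature expectation minus the model's expected feature count under the current reward. Here the model expectation comes from a softmax over a tiny enumerable trajectory set; real implementations compute it by dynamic programming over the MDP:

```python
import numpy as np

def maxent_gradient(w, traj_features, expert_features):
    """Gradient ascent direction for the MaxEnt IRL log-likelihood.

    traj_features: (n_traj, n_feat) feature counts of all trajectories.
    expert_features: empirical feature expectation of the demonstrations.
    """
    rewards = traj_features @ w
    p = np.exp(rewards - rewards.max())
    p /= p.sum()                          # P(tau) proportional to exp(w . f_tau)
    expected = p @ traj_features          # model feature expectation
    return expert_features - expected

# Toy problem with two trajectories; the expert always produces trajectory 0.
trajs = np.array([[1.0, 0.0], [0.0, 1.0]])
expert = np.array([1.0, 0.0])
w = np.zeros(2)
for _ in range(200):
    w += 0.5 * maxent_gradient(w, trajs, expert)
print(w[0] > w[1])  # True: weight on the expert's feature grows
```

The update drives the model's feature expectations toward the expert's, which is exactly the stationarity condition of the maximum-entropy objective.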
Improvements on IRL (MMP IRL)
• To address the uncertainty in the reward function obtained by the original IRL problem:
• Ratliff et al. 2006. Maximum margin planning. ICML
• Introduces loss functions
• in different forms
• to penalize choosing actions that differ from the expert's demonstration
• to penalize reaching states that the expert chooses not to enter
Improvements on IRL (MMP IRL)
• The differences between MMP and the original IRL:
• in MMP, the margin scales with these loss functions
• instead of returning policies, MMP reproduces the expert's behavior
• But MMP still assumes a linear form for the reward function
• Ratliff et al. 2009. Learning to search: Functional gradient techniques for imitation learning. Autonomous Robots
• Extended MMP to learn non-linear reward functions by introducing the LEARCH algorithm
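A hedged sketch of the maximum-margin idea, written as a reward-maximization variant: the expert's path must score higher than every alternative by a margin that scales with a loss counting how much the alternative deviates from the demonstration. The subgradient step and the tiny path set below are illustrative, not the MMP paper's planner-based formulation:

```python
import numpy as np

def mmp_subgradient(w, path_feats, expert_idx, loss):
    """Hinge subgradient: move w toward the expert's features and away
    from the current loss-augmented most-violating path."""
    scores = path_feats @ w + loss        # loss-augmented scores
    worst = int(np.argmax(scores))        # most violating alternative
    return path_feats[expert_idx] - path_feats[worst]

# Two candidate paths; path 1 deviates from the demonstration (loss = 1).
paths = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = np.array([0.0, 1.0])
w = np.zeros(2)
for _ in range(100):
    w += 0.1 * mmp_subgradient(w, paths, expert_idx=0, loss=loss)
print(w[0] > w[1])  # True: the expert's path now wins by the required margin
```

When the expert's path is itself the loss-augmented maximizer, the subgradient vanishes and the iteration stops, which is the margin condition being satisfied.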
Improvements on IRL (more complex reward functions)
• The IRL algorithm originally considers the reward as a weighted linear combination of features
• But to better capture the relationship between the feature vector and the expert demonstrations:
• Choi and Kim. 2013. Bayesian Nonparametric Feature Construction for Inverse Reinforcement Learning. IJCAI
• Considered the reward function as a weighted set of composite features
• Levine et al. 2011. Nonlinear inverse reinforcement learning with Gaussian processes. Advances in Neural Information Processing Systems
• Used a kernel machine for modelling the reward function
Improvements on IRL (more complex reward functions)
• Wulfmeier et al. 2015. Deep inverse reinforcement learning. CoRR
• Used a sufficiently large deep neural network with two layers and sigmoid activation
functions
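A minimal sketch of such a nonlinear reward model: a two-layer network with sigmoid activations mapping a feature vector to a scalar reward. The layer sizes and random weights are placeholders; training it inside an IRL loop (as in deep IRL) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RewardNet:
    """Two-layer sigmoid network producing a scalar reward in (0, 1)."""

    def __init__(self, n_features, n_hidden=8):
        # Illustrative random initialization, not trained weights.
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_features))
        self.w2 = rng.normal(scale=0.1, size=(n_hidden,))

    def __call__(self, features):
        h = sigmoid(self.W1 @ features)   # hidden layer
        return float(sigmoid(self.w2 @ h))

net = RewardNet(n_features=4)
r = net(np.array([0.2, -0.1, 0.5, 1.0]))
print(0.0 < r < 1.0)  # True: sigmoid output stays in (0, 1)
```

Replacing the linear map 𝑊 · 𝐹 with such a network is what lets these methods capture reward structure a linear combination cannot.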
Highway Driving Simulator
• To compare the efficiency of some of these algorithms, we applied them to the highway driving simulator problem set up in
• Levine et al. 2010. Feature construction for inverse reinforcement learning. Advances in Neural Information Processing Systems
• Goal: learn the reward function from human demonstrations
• The road color indicates the reward at the highest speed
• The agent is penalized for driving fast near the police vehicle
(Figure: Learned reward functions: (a) sample highway environment, (b) human demonstration, (c) MMP IRL results, (d) MaxEnt IRL results, (e) MWAL IRL results, (f) GPIRL results)
Comparison
• Expected Value Difference score:
• Presented in
• Levine et al. 2011. Nonlinear inverse reinforcement learning with Gaussian processes. Advances in Neural Information Processing Systems
• Measures how suboptimal the learned policy is under the true reward: the value of the optimal policy minus the value of the policy optimized for the learned reward, both evaluated under the true reward
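The score can be sketched as follows: evaluate the optimal policy and the learned policy under the true reward and report the gap. The two-state MDP below is a toy of my own construction, not the highway simulator:

```python
import numpy as np

def policy_value(P, policy, R, gamma=0.9):
    """Value of a deterministic policy via (I - gamma * P_pi) v = R_pi."""
    n = R.shape[0]
    P_pi = P[np.arange(n), policy]        # (n_states, n_states) under policy
    R_pi = R[np.arange(n), policy]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def evd(P, R_true, opt_policy, learned_policy, gamma=0.9):
    """Expected Value Difference: both policies scored on the TRUE reward."""
    v_opt = policy_value(P, opt_policy, R_true, gamma)
    v_learned = policy_value(P, learned_policy, R_true, gamma)
    return float(np.mean(v_opt - v_learned))

# Toy 2-state, 2-action MDP: action a deterministically moves to state a.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
R_true = np.array([[0.0, 1.0], [0.0, 1.0]])    # action 1 is always better
print(round(evd(P, R_true, opt_policy=np.array([1, 1]),
                learned_policy=np.array([0, 0])), 6))  # 10.0
```

A score of zero means the learned reward induces a policy as good as the expert's optimal one; larger values mean the recovered reward misleads the planner.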
Conclusion
• The IRL problem was introduced
• Some sample applications of IRL were presented
• The original IRL's shortcomings were discussed
• Then papers presenting different IRL approaches were reviewed
• Finally, the performance of some IRL algorithms on the highway driving simulator with varying numbers of human demonstrations was compared