A few reinforcement learning stories
R. Fonteneau
University of Liège, Belgium
May 5th, 2017, ULB - IRIDIA
Outline
Context: machine learning & (deep) reinforcement learning in brief
Batch Mode Reinforcement Learning
Synthesizing Artificial Trajectories
Estimating the Performances of Policies
Context: machine learning and (deep) reinforcement learning in brief
Machine Learning
Machine learning is about extracting {patterns, knowledge, information} from data.
Deep Learning
Machine learning algorithms have recently shown impressive results, in particular when the input data are images; this has led to the identification of a subfield of machine learning called deep learning.
(Deep) Reinforcement Learning
Reinforcement learning is an area of machine learning, originally inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
Deep reinforcement learning combines deep learning with reinforcement learning (and can consequently be used within DP / MPC schemes).
Recent (Deep) Reinforcement Learning Successes
Human-level control through deep reinforcement learning. Nature, 2015. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg & Demis Hassabis
Mastering the game of Go with deep neural networks and tree search. Nature, 2016. David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel & Demis Hassabis
Batch Mode Reinforcement Learning
Reinforcement Learning
● Reinforcement Learning (RL) aims at finding a policy maximizing received rewards by interacting with the environment
[Diagram: the agent sends actions to the environment and receives observations and rewards; the slide also lists examples of rewards.]
Batch Mode Reinforcement Learning
● All the available information is contained in a batch collection of data
● Batch mode RL aims at computing a (near-)optimal policy from this collection of data
[Diagram: batch mode RL maps a finite collection of trajectories of the agent to a near-optimal decision strategy, within the usual agent/environment loop of actions, observations and rewards.]
Batch Mode Reinforcement Learning
[Figure: p patients observed over time steps 0 to T; what is the 'optimal' treatment?]
Batch collection of trajectories of patients
Objectives
● Main goal: Finding a "good" policy
● Many associated subgoals:
– Evaluating the performance of a given policy
– Computing performance guarantees
– Computing safe policies
– Choosing how to generate additional transitions
– ...
Main Difficulties
Main difficulties of the batch mode setting:
● Dynamics and reward functions are unknown (and not accessible to simulation)
● The state space and/or the action space are large or continuous
● The environment may be highly stochastic
Usual Approach
To combine dynamic programming with function approximators (neural networks, regression trees, SVMs, linear regression over basis functions, etc.)
Function approximators have two main roles:
● To offer a concise representation of the state-action value function for deriving value / policy iteration algorithms
● To generalize information contained in the finite sample
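The recipe above can be sketched as a tiny fitted Q iteration (the idea behind the FQI algorithm mentioned later in the talk). Everything below is illustrative: the toy 1-D system, the k-nearest-neighbour regressor and the data format are assumptions, not the talk's setup.

```python
def knn_regressor(data, k=3):
    """Return a function (x, u) -> average target of the k nearest samples.
    `data` is a list of ((x, u), target) pairs; squared Euclidean distance."""
    def predict(x, u):
        nearest = sorted(data, key=lambda s: (s[0][0] - x) ** 2 + (s[0][1] - u) ** 2)[:k]
        return sum(t for _, t in nearest) / len(nearest)
    return predict

def fitted_q_iteration(transitions, actions, horizon, k=3):
    """Fitted Q iteration on a batch of one-step transitions (x, u, r, y):
    at each iteration, regress the targets r + max_u' Q_{N-1}(y, u') on (x, u)."""
    q = lambda x, u: 0.0  # Q_0 = 0
    for _ in range(horizon):
        data = [((x, u), r + max(q(y, up) for up in actions))
                for (x, u, r, y) in transitions]
        q = knn_regressor(data, k)
    return q

# Hypothetical toy batch: 1-D state, reward = -|x'|, actions move the state by +/- 0.1.
transitions = [(x / 10, u, -abs(x / 10 + u), x / 10 + u)
               for x in range(-10, 11) for u in (-0.1, 0.1)]
q3 = fitted_q_iteration(transitions, actions=(-0.1, 0.1), horizon=3)
# At x = 0.5, moving toward the origin should look better than moving away.
print(q3(0.5, -0.1) > q3(0.5, 0.1))  # → True
```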
Remaining Challenges
The black box nature of function approximators may have some unwanted effects:
● hazardous generalization
● difficulties in computing performance guarantees
● inefficient use of optimal trajectories
A proposition: synthesizing artificial trajectories
Synthesizing Artificial Trajectories
Formalization: Reinforcement Learning
System dynamics: x_{t+1} = f(x_t, u_t, w_t), t = 0, 1, ..., T−1
Reward function: r_t = ρ(x_t, u_t, w_t)
Performance of a policy h: J^h(x_0) = E[ Σ_{t=0}^{T−1} ρ(x_t, h(t, x_t), w_t) ], where u_t = h(t, x_t) and the disturbances w_t are drawn i.i.d. from p_W(·)
Formalization: Batch Mode Reinforcement Learning
The system dynamics, reward function and disturbance probability distribution are unknown.
Instead, we have access to a sample of n one-step system transitions (x^l, u^l, r^l, y^l), l = 1, ..., n, where y^l is the state observed after taking action u^l in state x^l and receiving reward r^l.
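A minimal sketch of what such a sample of one-step transitions can look like in code; the toy dynamics x' = x + u + w and reward r = -x² are hypothetical, chosen only to illustrate the (x, u, r, y) format.

```python
import random

def collect_transitions(f, rho, sample_states, sample_actions, noise=None):
    """Build a batch {(x_l, u_l, r_l, y_l)} of one-step transitions.

    f: dynamics y = f(x, u, w); rho: reward r = rho(x, u, w);
    noise: optional generator for the disturbance w (None = no disturbance)."""
    batch = []
    for x, u in zip(sample_states, sample_actions):
        w = noise() if noise else 0.0
        batch.append((x, u, rho(x, u, w), f(x, u, w)))
    return batch

# Hypothetical toy system: x' = x + u + w, r = -x^2.
random.seed(0)
f = lambda x, u, w: x + u + w
rho = lambda x, u, w: -x * x
states = [random.uniform(-1, 1) for _ in range(5)]
actions = [random.choice((-0.1, 0.1)) for _ in range(5)]
F = collect_transitions(f, rho, states, actions)
print(len(F))  # → 5 one-step transitions (x, u, r, y)
```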
Formalization: Artificial Trajectories
Artificial trajectories are (ordered) sequences of elementary pieces of trajectories, i.e., of one-step system transitions taken from the sample.
Artificial Trajectories: What For?
Artificial trajectories can help for:
● Estimating the performances of policies
● Computing performance guarantees
● Computing safe policies
● Choosing how to generate additional transitions
Estimating the Performances of Policies
Model-free Monte Carlo Estimation
If the system dynamics and the reward function were accessible to simulation, then Monte Carlo (MC) estimation would allow estimating the performance of h (a model or simulator is required).
We propose an approach that mimics MC estimation by rebuilding p artificial trajectories from one-step system transitions.
These artificial trajectories are built so as to minimize the discrepancy (using a distance metric ∆) with a classical MC sample that could be obtained by simulating the system with the policy h; each one-step transition is used at most once.
We average the cumulated returns over the p artificial trajectories to obtain the Model-free Monte Carlo estimator (MFMC) of the expected return of h (formula in the slide figure).
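A runnable sketch of the MFMC estimator just described. This is a simplified reading: 1-D states, a hand-picked metric ∆, and a greedy per-step nearest-transition choice; the exact rebuilding scheme is specified in the referenced paper.

```python
def mfmc_estimate(transitions, h, x0, T, p, dist):
    """Model-free Monte Carlo estimator (sketch).

    Rebuilds p artificial trajectories of length T from one-step transitions
    (x, u, r, y): at each step, the unused transition whose (x, u) is closest
    (under `dist`) to the current state and the action h(t, state) is consumed.
    Returns the average of the p cumulated returns."""
    used = set()
    returns = []
    for _ in range(p):
        state, ret = x0, 0.0
        for t in range(T):
            u = h(t, state)
            # Closest unused transition to (state, u) under the metric `dist`.
            l = min((i for i in range(len(transitions)) if i not in used),
                    key=lambda i: dist(transitions[i][0], transitions[i][1], state, u))
            used.add(l)
            _, _, r, y = transitions[l]
            ret += r
            state = y  # jump to the chosen transition's successor state
        returns.append(ret)
    return sum(returns) / p

# Toy check: 1-D system x' = x + u, reward r = x + u, policy u = 0.1.
dist = lambda xl, ul, x, u: abs(xl - x) + abs(ul - u)
trans = [(x / 10, 0.1, x / 10 + 0.1, x / 10 + 0.1) for x in range(-20, 21)]
est = mfmc_estimate(trans, h=lambda t, x: 0.1, x0=0.0, T=3, p=2, dist=dist)
print(round(est, 2))  # → 0.15
```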
The MFMC algorithm
Example with T = 3, p = 2, n = 8
Theoretical Analysis: Assumptions
Lipschitz continuity assumptions: the dynamics f, the reward function ρ and the policy h are assumed Lipschitz continuous, with constants L_f, L_ρ and L_h (formal statements in the slide figure).
Theoretical Analysis: Assumptions
Distance metric ∆
k-dispersion: defined from the distance of (x,u) to its k-th nearest neighbor (using the distance ∆) in the sample
Theoretical Analysis: Assumptions
The k-dispersion can be seen as the smallest radius such that all ∆-balls in X×U contain at least k elements from the sample.
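The k-dispersion can be illustrated numerically. The sketch below approximates the supremum over X×U by a finite grid of query points; the 1-D sample, the grid and the L1 metric are illustrative assumptions.

```python
def kth_nn_distance(sample, x, u, k, dist):
    """Distance from (x, u) to its k-th nearest neighbour in the sample."""
    return sorted(dist(xl, ul, x, u) for (xl, ul) in sample)[k - 1]

def k_dispersion(sample, grid, k, dist):
    """Approximate k-dispersion: the supremum over X x U of the k-th
    nearest-neighbour distance, here taken over a finite grid of query
    points (the true definition takes the supremum over all of X x U)."""
    return max(kth_nn_distance(sample, x, u, k, dist) for (x, u) in grid)

# Illustrative 1-D state, a single action value, L1 metric.
dist = lambda xl, ul, x, u: abs(xl - x) + abs(ul - u)
sample = [(i / 4, 0.0) for i in range(5)]   # sample points at x = 0, 0.25, ..., 1
grid = [(j / 20, 0.0) for j in range(21)]   # finer grid of query points on [0, 1]
alpha = k_dispersion(sample, grid, k=1, dist=dist)
print(alpha)  # → 0.1 (worst query point lies 0.1 away from its nearest sample)
```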
Theoretical Analysis: Theoretical Results
Expected value of the MFMC estimator
Theorem: the bias of the MFMC estimator is bounded in terms of the Lipschitz constants and the dispersion of the sample (full statement in the slide figure).
Theoretical Analysis: Theoretical Results
Variance of the MFMC estimator
Theorem: the variance of the MFMC estimator is bounded by that of the classical MC estimator plus a term controlled by the dispersion of the sample (full statement in the slide figure).
Experimental Illustration: Benchmark
Dynamics, reward function, policy to evaluate and other information: specified in the slide figure; the disturbance distribution pW(·) is uniform.
Experimental Illustration: Influence of n
Simulations for p = 10, n = 100 … 10 000, uniform grid, T = 15, x0 = −0.5
[Figure: Monte Carlo estimator vs. model-free Monte Carlo estimator as n grows]
Experimental Illustration: Influence of p
Simulations for p = 1 … 100, n = 10 000, uniform grid, T = 15, x0 = −0.5
[Figure: Monte Carlo estimator vs. model-free Monte Carlo estimator as p grows]
Experimental Illustration: MFMC vs FQI-PE
Comparison with the FQI-PE algorithm using k-NN, n = 100, T = 5.
Research map
[Diagram: in the stochastic setting, the MFMC estimator of the expected return (with a bias/variance analysis and an illustration) and an estimator of the VaR; in the deterministic setting, with continuous and finite action spaces, bounds on the return, the CGRL algorithm, a sampling strategy, and convergence results (plus additional properties), each with illustrations.]
Estimating the Performances of Policies: Risk-Sensitive Criterion
Consider again the p artificial trajectories that were rebuilt by the MFMC estimator. The Value-at-Risk of the policy h can then be estimated straightforwardly from the empirical distribution of the p cumulated returns (formula in the slide figure).
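A hedged sketch of such a Value-at-Risk estimate from the p cumulated returns; the slides' exact quantile convention is in a figure, so the ceil(b·p)-th smallest return below is just one common choice.

```python
import math

def empirical_var(returns, b):
    """Empirical Value-at-Risk at level b from the cumulated returns of the
    p artificial trajectories: the ceil(b * p)-th smallest return. This is
    one common quantile convention; the slides' exact formula is in a figure."""
    ordered = sorted(returns)
    return ordered[max(0, math.ceil(b * len(ordered)) - 1)]

# Hypothetical returns of p = 10 rebuilt artificial trajectories.
returns = [1, 5, 2, 9, 3, 7, 4, 8, 6, 10]
print(empirical_var(returns, b=0.2))  # → 2: roughly the worst-20% level
```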
Deterministic Case, Computing Bounds: Bounds from a Single Trajectory
Proposition: every artificial trajectory yields a lower bound on the return of h, equal to the sum of its rewards minus a Lipschitz penalty on the gaps between consecutive transitions (full statement in the slide figure).
Deterministic Case, Computing Bounds: Maximal Bounds
Maximal lower and upper bounds
Deterministic Case, Computing Bounds: Tightness of Maximal Bounds
Proposition (statement in the slide figure)
Inferring Safe Policies: From Lower Bounds to Cautious Policies
Consider the set of open-loop policies:
For such policies, bounds can be computed in a similar way.
We can then search for a specific policy for which the associated lower bound is maximized:
An O(T n²) algorithm does this: the CGRL algorithm (Cautious approach to Generalization in RL)
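The O(T n²) structure can be sketched as a Viterbi-style dynamic program over transitions. The slides give the exact lower bound in figures; the generic score Σ_t r − C·Σ_t |gap| with a single penalty constant C, and the 1-D toy batch, are simplifications for illustration only.

```python
def cgrl_sketch(transitions, x0, T, C):
    """Viterbi-style O(T n^2) dynamic program in the spirit of CGRL: pick a
    sequence of transitions (l_0, ..., l_{T-1}) maximizing a lower bound of
    the generic form  sum_t r_{l_t} - C * sum_t |x_{l_t} - y_{l_{t-1}}|,
    then return the corresponding open-loop action sequence.  The 1-D state,
    the metric and the single constant C are simplifications; the exact
    bound (with its Lipschitz constants) is in the slide figures."""
    n = len(transitions)
    xs = [t[0] for t in transitions]
    us = [t[1] for t in transitions]
    rs = [t[2] for t in transitions]
    ys = [t[3] for t in transitions]
    # best[l]: best score of a partial sequence ending with transition l
    best = [rs[l] - C * abs(xs[l] - x0) for l in range(n)]
    back = []
    for _ in range(1, T):
        prev = best
        step = [max(range(n), key=lambda lp: prev[lp] - C * abs(xs[l] - ys[lp]))
                for l in range(n)]
        best = [prev[step[l]] - C * abs(xs[l] - ys[step[l]]) + rs[l]
                for l in range(n)]
        back.append(step)
    # Recover the maximizing sequence of transitions and emit its actions.
    l = max(range(n), key=lambda i: best[i])
    seq = [l]
    for step in reversed(back):
        l = step[l]
        seq.append(l)
    seq.reverse()
    return [us[l] for l in seq]

# Hypothetical batch on a 1-D chain: reward equals the current state.
trans = [(0.0, 1, 0.0, 1.0), (1.0, 1, 1.0, 2.0), (0.0, -1, 0.0, -1.0),
         (-1.0, -1, -1.0, -2.0), (2.0, 1, 2.0, 3.0)]
print(cgrl_sketch(trans, x0=0.0, T=3, C=10.0))  # → [1, 1, 1]
```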
Inferring Safe Policies: Convergence
Theorem (statement in the slide figure)
Inferring Safe Policies: Experimental Results
● The puddle world benchmark
Inferring Safe Policies: Experimental Results
[Figure: CGRL vs. FQI (Fitted Q Iteration) on the puddle world, in two settings: the state space uniformly covered by the sample, and with the information about the puddle area removed.]
Inferring Safe Policies: Bonus
Theorem (statement in the slide figure)
Sampling Strategies: An Artificial Trajectories Viewpoint
Given a sample of system transitions, how can we determine where to sample additional transitions?
We define the set of candidate optimal policies (definition in the slide figure).
A transition is compatible with this set under a condition given in the slide figure, and we consider the set of all such compatible transitions.
Sampling Strategies: An Artificial Trajectories Viewpoint
Iterative scheme and conjecture: given in the slide figure.
Sampling Strategies: Illustration
Action space, dynamics and reward function, horizon, initial state, total number of policies, and number of transitions needed for discriminating: values given in the slide figure.
Thank you
References
"Batch mode reinforcement learning based on the synthesis of artificial trajectories". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Annals of Operations Research, Volume 208, Issue 1, pp. 383-416, 2013.
"Generating informative trajectories by using bounds on the return of control policies". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of the Workshop on Active Learning and Experimental Design 2010 (in conjunction with AISTATS 2010), 2-page highlight paper, Chia Laguna, Sardinia, Italy, May 16, 2010.
"Model-free Monte Carlo-like policy evaluation". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. In Proceedings of The Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), JMLR W&CP 9, pp. 217-224, Chia Laguna, Sardinia, Italy, May 13-15, 2010.
"A cautious approach to generalization in reinforcement learning". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of The International Conference on Agents and Artificial Intelligence (ICAART 2010), 10 pages, Valencia, Spain, January 22-24, 2010.
"Inferring bounds on the performance of a control policy from a sample of trajectories". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. In Proceedings of The IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009), 7 pages, Nashville, Tennessee, USA, 30 March-2 April, 2009.