On the value of Interaction and Function approximation in Imitation Learning
Nived Rajaraman, Yanjun Han, Lin F. Yang, Jingbo Liu, Jiantao Jiao, Kannan Ramchandran
NeurIPS 2021
Challenges in RL
Rewards for practical RL problems are often hard to specify (e.g., robotic manipulation; Popov et al. 2017).
Reward design must be consistent with counterfactual questions: “What would an expert have done?”
Need to correctly balance interpretability and sparsity.
Imitation learning over reward engineering
[Figure: the expert provides demonstrations to the learner]
“Learning from demonstrations in the absence of reward feedback”
Motivation
What are the theoretical limits of Imitation Learning (i) with interaction and (ii) in the presence of function approximation?
Notation: $J(\pi)$ is the expected total reward of policy $\pi$ in an episode of length $H$. The learner tries to minimize the difference in expected reward between the expert's policy $\pi^*$ and the learner's policy $\hat{\pi}$:

$$\mathsf{Suboptimality} \triangleq \mathbb{E}\left[J(\pi^*) - J(\hat{\pi})\right]$$
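As a concrete reading of this objective, here is a minimal Monte Carlo sketch of estimating the suboptimality. It assumes a hypothetical episodic environment with `reset()`/`step(action)` methods and policies called as `policy(t, state)`; none of these names come from the paper.

```python
import numpy as np

def episode_return(env, policy, horizon):
    """Roll out `policy` for one episode of length `horizon`; return total reward."""
    state = env.reset()
    total = 0.0
    for t in range(horizon):
        state, reward = env.step(policy(t, state))
        total += reward
    return total

def estimate_suboptimality(env, expert, learner, horizon, n_episodes=1000):
    """Monte Carlo estimate of E[J(pi*) - J(pi_hat)]."""
    j_star = np.mean([episode_return(env, expert, horizon) for _ in range(n_episodes)])
    j_hat = np.mean([episode_return(env, learner, horizon) for _ in range(n_episodes)])
    return j_star - j_hat
```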
Theoretical understanding of IL: Prior work

Theorem [RYJR20]: In the no-interaction and tabular setting, Behavior Cloning achieves

$$\mathsf{Suboptimality} \lesssim \frac{S H^2 \log(N)}{N}$$

This is the best achievable (up to log-factors) by any algorithm.

No interaction: the learner is only provided a dataset of $N$ expert demonstrations and cannot interact with the MDP.
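For intuition, here is a minimal sketch of tabular Behavior Cloning, assuming demonstrations arrive as lists of `(state, action)` pairs indexed by timestep (the data format is illustrative, not from the paper). States the expert never visited, where the learner has no information, are exactly what drives the $H^2$ compounding in the bound above.

```python
from collections import Counter, defaultdict

def behavior_cloning_tabular(demos, default_action=0):
    """Tabular BC: at each (t, s) seen in the expert demonstrations, replay
    the expert's most frequent action; fall back to an arbitrary fixed
    action at (t, s) pairs the expert never visited."""
    counts = defaultdict(Counter)            # (t, s) -> Counter over expert actions
    for trajectory in demos:                 # trajectory: [(s_0, a_0), ..., (s_{H-1}, a_{H-1})]
        for t, (s, a) in enumerate(trajectory):
            counts[(t, s)][a] += 1

    def policy(t, s):
        if counts[(t, s)]:
            return counts[(t, s)].most_common(1)[0][0]
        return default_action                # no expert data here: any guess may be wrong
    return policy
```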
Going beyond the no-interaction setting
Interactive expert: the learner can interact with the environment $N$ times and query the expert policy at visited states.
This setting is closely related to human-in-the-loop RL (Mandlekar et al. 2020).
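A sketch of the interaction protocol, under the same illustrative environment interface as above: the learner rolls out its own policy, but records the expert's action at every state it visits.

```python
def collect_interactive_queries(env, expert, learner, horizon, n_episodes):
    """Interactive-expert data collection: follow the *learner's* policy,
    querying the expert at each visited state. This is the query model
    underlying DAGGER-style algorithms."""
    dataset = []
    for _ in range(n_episodes):
        state = env.reset()
        for t in range(horizon):
            dataset.append((t, state, expert(t, state)))   # expert query at visited state
            state, _ = env.step(learner(t, state))         # but act with the learner
    return dataset
```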
IL with an interactive expert

Is it possible to improve the suboptimality of behavior cloning if the expert is interactive? In the worst case, no.

Hard instance: the Reset Cliff MDP, where all learners get stuck at a bad state.

For all algorithms, even with an interactive expert, in the worst case [RYJR20],

$$\mathsf{Suboptimality} \gtrsim \frac{S H^2}{N}$$
IL with an interactive expert

$\mu$-recoverability assumption [RB11]: for any state $s$, action $a'$, and time $t$,

$$\max_a Q^*_t(s, a) - Q^*_t(s, a') \leq \mu$$

Interpretation: the expert knows how to “recover” after a mistake at some time $t$, paying an expected cost of at most $\mu$.
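As a sanity check of the definition, a small sketch that verifies $\mu$-recoverability for a tabular $Q^*$, assuming it is stored as an `(H, S, A)` NumPy array (the layout is illustrative):

```python
import numpy as np

def is_mu_recoverable(q_star, mu):
    """Check max_a Q*_t(s, a) - Q*_t(s, a') <= mu for every t, s, a'.
    `q_star` is assumed to have shape (H, S, A)."""
    gaps = q_star.max(axis=2, keepdims=True) - q_star   # gap of each action to the best
    return bool((gaps <= mu).all())
```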
IL with an interactive expert

Under $\mu$-recoverability, is it possible to improve the suboptimality of behavior cloning with an interactive expert?
Theorem 1 [RHYLJR21]: Under $\mu$-recoverability, in the interactive and tabular setting, DAGGER (with FTRL) achieves

$$\mathsf{Suboptimality} \lesssim \frac{\mu S H \log(N)}{N}$$

This is the best achievable (up to log-factors) by any algorithm.
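A simplified sketch of the DAGGER loop in the tabular setting, reusing the illustrative interface from above. For brevity it replays the majority expert action at each state, a follow-the-leader step; the theorem analyzes the FTRL variant, which regularizes this update.

```python
from collections import Counter, defaultdict
import random

def dagger_tabular(env, expert, horizon, n_rounds, n_actions):
    """Roll out the current policy, query the expert everywhere, aggregate
    all queries so far, and replay the expert's majority action."""
    counts = defaultdict(Counter)

    def policy(t, s):
        if counts[(t, s)]:
            return counts[(t, s)].most_common(1)[0][0]
        return random.randrange(n_actions)     # no query yet at this (t, s)

    for _ in range(n_rounds):
        state = env.reset()
        for t in range(horizon):
            counts[(t, state)][expert(t, state)] += 1   # interactive expert query
            state, _ = env.step(policy(t, state))
    return policy
```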
IL with function approximation
How do approaches such as BC and Mimic-MD [RYJR20] perform in the presence of function approximation?
IL with linear function approximation
Linear expert: for every state $s$, the deterministic expert plays an action

$$\pi^*_t(s) \in \arg\max_a \langle \theta_t, \phi_t(s, a) \rangle$$

where $\phi_t(s, a) \in \mathbb{R}^d$ is a known representation of state-action pairs and $\theta_t$ is unknown to the learner.

Interpretation: the expert policy is realized by a linear multi-class classifier.
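In code, the linear expert is just an argmax over action scores; here `phi_t(state, action)` is the known feature map and `theta_t` the parameter vector hidden from the learner (names are illustrative):

```python
import numpy as np

def linear_expert_action(theta_t, phi_t, state, n_actions):
    """Deterministic linear expert: pi*_t(s) = argmax_a <theta_t, phi_t(s, a)>."""
    scores = np.array([theta_t @ phi_t(state, a) for a in range(n_actions)])
    return int(scores.argmax())
```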
Linear expert with no MDP interaction
Theorem 2 [RHYLJR21]: In the no-interaction and linear expert setting, Behavior Cloning achieves

$$\mathsf{Suboptimality} \lesssim \frac{d H^2 \log(N)}{N}$$

With $d = S$, this recovers the bounds in the tabular setting.
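BC in this setting reduces to fitting one linear multi-class classifier per timestep. A minimal sketch using a multiclass perceptron on the state-action features; the perceptron is an illustrative choice, since any linear classifier consistent with the demonstrations fits the BC reduction.

```python
import numpy as np

def bc_linear_timestep(demos_t, phi_t, d, n_actions, n_passes=10):
    """Fit theta_hat so that argmax_a <theta_hat, phi_t(s, a)> matches the
    expert's action on the demonstrations at time t."""
    theta = np.zeros(d)
    for _ in range(n_passes):
        for s, a_star in demos_t:                    # expert (state, action) pairs at time t
            feats = np.stack([phi_t(s, a) for a in range(n_actions)])
            a_hat = int((feats @ theta).argmax())
            if a_hat != a_star:                      # mistake-driven perceptron update
                theta += feats[a_star] - feats[a_hat]
    return theta
```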
Linear expert with known transition
Known transition: the learner is provided a dataset of $N$ expert demonstrations and knows the MDP transition.

Interpretation: carrying out Imitation Learning in a simulation environment.
Linear expert with known transition
Confidence set classification: consider classification over a family of hypotheses $\mathcal{H}$ from $\mathcal{X} \to \mathcal{Y}$. From a dataset $D$ of examples labeled by a classifier $h^*$, return the largest-measure set of points $x$ for which $h^*(x)$ is known without ambiguity.
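For linear hypotheses, “known without ambiguity” has a concrete meaning: every classifier consistent with the data assigns the point the same label. A minimal binary sketch that tests this with two LP feasibility problems over the version space; the margin `eps` and the box bounds are artifacts of the LP encoding, not part of the definition.

```python
import numpy as np
from scipy.optimize import linprog

def label_is_unambiguous(X, y, x_new, eps=1e-6):
    """X: (n, d) inputs; y: (n,) labels in {-1, +1} produced by some linear h*.
    Returns True iff all linear classifiers consistent with (X, y) agree on
    the sign of <theta, x_new>."""
    d = X.shape[1]

    def sign_feasible(sign):
        # Feasibility of: y_i <theta, x_i> >= eps for all i, and
        # sign * <theta, x_new> >= eps, rewritten as A_ub @ theta <= b_ub.
        A_ub = np.vstack([-y[:, None] * X, -sign * x_new[None, :]])
        b_ub = -eps * np.ones(A_ub.shape[0])
        res = linprog(np.zeros(d), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(-1.0, 1.0)] * d)
        return res.status == 0                # status 0: a consistent theta exists

    return sign_feasible(+1) != sign_feasible(-1)
```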
Linear expert with known transition
Theorem 3 [RHYLJR21]: For each $t$, consider the linear classifier $\pi^*_t : \mathcal{S} \to \mathcal{A}$. Given a confidence set classifier with expected loss $\ell_t$, there exists an IL algorithm such that

$$\mathsf{Suboptimality} \lesssim \frac{H^{3/2} d}{N} + H \sum_{t=1}^{H} \ell_t$$

Message: error compounding (the $H^2$ dependence) can be broken if confidence set linear classification is possible with expected loss $o_N(1)$.
Linear expert with known transition
Theorem 4 [RHYLJR21]: If the distribution over inputs is uniform over the unit sphere $\mathbb{S}^{d-1}$, the minimax loss of confidence set linear classification is $\Theta(d^{3/2}/N)$.

Message: confidence set linear classification is sample-efficient for the uniform distribution. Can it be extended to general distributions?
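The $d^{3/2}/N$ rate can be eyeballed numerically. A sketch that draws inputs uniformly from $\mathbb{S}^{d-1}$, labels them with a random linear classifier, and estimates the measure of points whose label the version space leaves ambiguous, reusing `label_is_unambiguous` from the earlier sketch:

```python
import numpy as np

def ambiguous_mass_on_sphere(d=5, n_train=200, n_test=2000, seed=0):
    """Monte Carlo estimate of the confidence-set classification loss when
    inputs are uniform on the unit sphere S^{d-1}."""
    rng = np.random.default_rng(seed)

    def sphere(n):
        v = rng.normal(size=(n, d))
        return v / np.linalg.norm(v, axis=1, keepdims=True)

    theta_star = sphere(1)[0]                    # the unknown linear classifier
    X, X_test = sphere(n_train), sphere(n_test)
    y = np.sign(X @ theta_star)
    n_ambiguous = sum(not label_is_unambiguous(X, y, x) for x in X_test)
    return n_ambiguous / n_test                  # should scale like d**1.5 / n_train
```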