Global Convergence of Policy Optimization
Tengyang Xie and Wenbin Wan
Outline
Background
Global Convergence in Tabular MDPs
Global Convergence w/ Function Approximation
Neural Policy Gradient Methods
Section 1
Background
Markov Decision Process (MDP)
An (infinite-horizon discounted) MDP [Sutton and Barto, 1998; Puterman, 2014] is a tuple (S, A, P, R, γ, d_0):
- state s ∈ S
- action a ∈ A
- transition function P : S × A → ∆(S)
- reward function R : S × A → [0, R_max]
- discount factor γ ∈ [0, 1)
- initial state distribution d_0 ∈ ∆(S)
(∆(·) denotes the probability simplex)
Notations Regarding Value Function and Policy
- policy π : S → ∆(A)
- π-induced random trajectory: (s_0, a_0, r_0, s_1, a_1, r_1, . . .), where s_0 ∼ d_0, a_t ∼ π(·|s_t), r_t = R(s_t, a_t), s_{t+1} ∼ P(·|s_t, a_t), ∀t ≥ 0
- (state-)value function V^π(s) := E[∑_{t=0}^∞ γ^t r_t | s_0 = s, π]
- Q-function Q^π(s, a) := E[∑_{t=0}^∞ γ^t r_t | s_0 = s, a_0 = a, π]
- advantage function A^π(s, a) := Q^π(s, a) − V^π(s)
- expected discounted return J(π) := E[∑_{t=0}^∞ γ^t r_t | s_0 ∼ d_0, π]
- optimal policy: π⋆; value function of π⋆: V⋆; Q-function of π⋆: Q⋆
- normalized discounted state occupancy: d^π(s) := (1 − γ) ∑_{t=0}^∞ γ^t Pr[s_t = s | s_0 ∼ d_0, π], and d^π(s, a) := d^π(s) π(a|s)
- Bellman optimality operator T: T V(s) := max_a { R(s, a) + γ E[V(s′) | s, a] }
Policy Parameterizations
- direct parameterization: π_θ(a|s) = θ_{s,a}, where θ ∈ ∆(A)^{|S|}.
- softmax parameterization: π_θ(a|s) = exp(θ_{s,a}) / ∑_{a′} exp(θ_{s,a′}), where θ ∈ R^{|S||A|}.

Example of function approximation: replace θ_{s,a} by θ · φ_{s,a}.
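Below is a minimal NumPy sketch (toy sizes, random numbers; not from any of the cited papers) of the parameterizations above; the feature map `phi` is hypothetical.

```python
import numpy as np

S, A = 3, 2
rng = np.random.default_rng(0)

# Direct parameterization: theta itself is a row-stochastic |S| x |A| table.
theta_direct = rng.random((S, A))
theta_direct /= theta_direct.sum(axis=1, keepdims=True)  # normalize each row onto the simplex
pi_direct = theta_direct                                  # pi(a|s) = theta[s, a]

# Softmax parameterization: theta is unconstrained in R^{|S||A|}.
theta_softmax = rng.normal(size=(S, A))
logits = theta_softmax - theta_softmax.max(axis=1, keepdims=True)  # subtract max for stability
pi_softmax = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Linear function approximation: replace theta[s, a] by theta . phi[s, a].
d = 4
phi = rng.normal(size=(S, A, d))   # hypothetical feature vectors phi_{s,a} in R^d
theta_lin = rng.normal(size=d)
scores = phi @ theta_lin           # shape (S, A)
pi_linear = np.exp(scores - scores.max(axis=1, keepdims=True))
pi_linear /= pi_linear.sum(axis=1, keepdims=True)
print(pi_direct, pi_softmax, pi_linear, sep="\n")
```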
Policy Gradient Theorem
Objective function:

max_θ J(π_θ)

Theorem ([Sutton et al., 2000])
∇_θ J(π_θ) = (1/(1 − γ)) E_{(s,a)∼d^{π_θ}}[ ∇_θ log π_θ(a|s) Q^{π_θ}(s, a) ]   (1)

Corollary
∇_θ J(π_θ) = (1/(1 − γ)) E_{(s,a)∼d^{π_θ}}[ ∇_θ log π_θ(a|s) A^{π_θ}(s, a) ]
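As a sanity check on Eq. (1), here is a minimal sketch (small random MDP, tabular softmax policy assumed) that computes Q^π, d^π, and the exact gradient in closed form; for the tabular softmax policy, Eq. (1) reduces entry-wise to d^π(s) π_θ(a|s) A^π(s, a) / (1 − γ).

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)   # P(s'|s,a)
R = rng.random((S, A))                                          # R(s,a) in [0, 1]
d0 = np.ones(S) / S

theta = rng.normal(size=(S, A))
pi = np.exp(theta - theta.max(axis=1, keepdims=True))
pi /= pi.sum(axis=1, keepdims=True)

# State-to-state kernel and expected reward under pi.
P_pi = np.einsum('sa,sat->st', pi, P)
r_pi = (pi * R).sum(axis=1)

V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)             # V^pi
Q = R + gamma * P @ V                                           # Q^pi(s,a)
d_pi = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, d0)  # d^pi(s)

# For softmax, grad_{theta_{s,a}} log pi(a'|s') = 1{s=s'} (1{a=a'} - pi(a|s)); plugging
# into Eq. (1) gives grad J = (1/(1-gamma)) * d^pi(s) * pi(a|s) * A^pi(s,a) entry-wise.
Adv = Q - V[:, None]
grad_J = d_pi[:, None] * pi * Adv / (1 - gamma)
print(grad_J)
```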
Section 2
Global Convergence in Tabular MDPs
Overview
The global convergence of policy gradient comes from its special structure.

There are two main ways of attaining global convergence:
- Policy Improvement — all stationary points are globally optimal [Bhandari and Russo, 2019]
- Bounding the Performance Difference — the performance difference between π and π⋆ can be bounded by (variants of) the policy gradient [Agarwal et al., 2019]
Ways to Attain Global Convergence: I. Policy Improvement

Warm-up: Policy Improvement Lemma
Let π be any policy and π^+ be the greedy policy w.r.t. Q^π. Then V^{π^+}(s) ≥ V^π(s) for any s ∈ S.

Proof. For any s ∈ S, we have

V^π(s) ≤ Q^π(s, π^+(s))
       = E[r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t ∼ π^+]
       ≤ E[r_{t+1} + γ Q^π(s_{t+1}, π^+(s_{t+1})) | s_t = s, a_t ∼ π^+]
       ≤ E[r_{t+1} + γ r_{t+2} + γ^2 V^π(s_{t+2}) | s_t = s, a_t ∼ π^+, a_{t+1} ∼ π^+]
       ...
       ≤ V^{π^+}(s)
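A small numerical check of the lemma (a sketch on a toy random MDP, not part of the original argument): compute Q^π exactly, form the greedy policy π^+, and verify V^{π^+} ≥ V^π state-wise.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))

def evaluate(pi):
    """Exact policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi, then Q = R + gamma * P V."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return V, R + gamma * P @ V

pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)   # arbitrary stochastic policy
V, Q = evaluate(pi)

pi_plus = np.zeros_like(pi)
pi_plus[np.arange(S), Q.argmax(axis=1)] = 1.0                  # greedy w.r.t. Q^pi

V_plus, _ = evaluate(pi_plus)
assert np.all(V_plus >= V - 1e-10)                             # policy improvement holds state-wise
```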
Ways to Attain Global Convergence: I. Policy Improvement

How about π′ := π + α(π^+ − π), where α ∈ (0, 1)?

The policy can be improved in this direction!

Theorem (No spurious local optima, Theorem 1 in [Bhandari and Russo, 2019])
Under Assumptions 1-4 in [Bhandari and Russo, 2019], let π_θ be a policy and π^+ be a policy iteration update of π_θ. Take u to satisfy

(d/dα) π_{θ+αu}(s) |_{α=0} = π^+(s) − π_θ(s), ∀s ∈ S.

Then,

(d/dα) J(θ + αu) |_{α=0} ≥ (1/(1 − γ)) ‖V^{π_θ} − T V^{π_θ}‖_{1, d^{π_θ}}.
Ways to Attain Global Convergence: I. Policy Improvement
How to prove the No-spurious-local-optima Theorem?
Lemma (Policy gradients for directional derivatives, Lemma 1 in [Bhandari and Russo, 2019])
For any θ and u, we have

(d/dα) J(θ + αu) |_{α=0} = (1/(1 − γ)) E_{s∼d^{π_θ}}[ (d/dα) Q^{π_θ}(s, π_{θ+αu}(s)) |_{α=0} ].
Then, by lower-bounding the RHS of the lemma above, we prove the No-spurious-local-optima Theorem.
Ways to Attain Global Convergence: I. Policy Improvement
What can we learn from the No-spurious-local-optima Theorem?

Recall the result of the No-spurious-local-optima Theorem:

(d/dα) J(θ + αu) |_{α=0} ≥ (1/(1 − γ)) ‖V^{π_θ} − T V^{π_θ}‖_{1, d^{π_θ}}.

The LHS is a directional derivative of J(θ), i.e., u^⊤ ∇_θ J(θ) (cf. the policy gradient, Eq. (1)), so it vanishes at any stationary point of J; the RHS = 0 if and only if V^{π_θ} = V⋆.

This implies that all stationary points of the policy gradient objective are globally optimal.
Ways to Attain Global Convergence: II. Bounding Performance Difference

Warm-up: Performance Difference Lemma

Lemma (The performance difference lemma [Kakade and Langford, 2002])
For all policies π, π′,

J(π) − J(π′) = (1/(1 − γ)) E_{(s,a)∼d^π}[ A^{π′}(s, a) ].

This lemma can be proved by directly simplifying the RHS using the definition of A^{π′}.
Ways to Attain Global Convergence: II. Bounding Performance Difference

Usage of the Performance Difference Lemma — Gradient Domination Lemma
(for directly parameterized policy classes and projected policy gradient)

Lemma (Gradient domination, Lemma 4.1 in [Agarwal et al., 2019])
For any policy π, we have

J(π⋆) − J(π) ≤ ‖d^{π⋆}/d^π‖_∞ max_{π̄} (π̄ − π)^⊤ ∇_π J(π)
            ≤ (1/(1 − γ)) ‖d^{π⋆}/d_0‖_∞ max_{π̄} (π̄ − π)^⊤ ∇_π J(π),   (2)

where the max is over the set of all possible policies π̄.
Ways to Attain Global Convergence: II. Bounding Performance Difference

Proof of the gradient domination lemma.

By the performance difference lemma,

J(π⋆) − J(π) = (1/(1 − γ)) ∑_{s,a} d^{π⋆}(s) π⋆(a|s) A^π(s, a)
             ≤ (1/(1 − γ)) ∑_s d^{π⋆}(s) max_a A^π(s, a)
             ≤ (1/(1 − γ)) ‖d^{π⋆}/d^π‖_∞ ∑_s d^π(s) max_a A^π(s, a)
Ways to Attain Global Convergence: II. Bounding Performance Difference

Proof of the gradient domination lemma (cont.)

∑_s d^π(s) max_a A^π(s, a)
  = max_{π̄} ∑_{s,a} d^π(s) π̄(a|s) A^π(s, a)
  = max_{π̄} ∑_{s,a} d^π(s) (π̄(a|s) − π(a|s)) A^π(s, a)     (since ∑_a π(a|s) A^π(s, a) = 0)
  = (1 − γ) max_{π̄} (π̄ − π)^⊤ ∇_π J(π)

Combining the two parts above completes the proof.
Ways to Attain Global Convergence: II. Bounding Performance Difference

How to use the gradient domination lemma?

Definition (First-order stationarity)
A policy π ∈ ∆(A)^{|S|} is ε-stationary with respect to the initial state distribution d_0 if

max_{π+δ ∈ ∆(A)^{|S|}, ‖δ‖_2 ≤ 1} δ^⊤ ∇_π J(π) ≤ ε,

where ∆(A)^{|S|} is the set of all possible policies.
Ways to Attain Global Convergence: II. Bounding Performance Difference

How to use the gradient domination lemma? (cont.)

We then have the following inequality:

max_{π̄} (π̄ − π)^⊤ ∇_π J(π) ≤ max_{π+δ ∈ ∆(A)^{|S|}, ‖δ‖_2 ≤ 1} δ^⊤ ∇_π J(π)   (3)

This connects the performance difference with the first-order stationarity condition (using the gradient domination lemma and Eq. (3)).

By applying classic first-order optimization results, we obtain the global convergence of (projected) policy gradient.
Ways to Attain Global Convergence: II. Bounding Performance Difference

Iteration complexity for the direct parameterization

Theorem
The projected gradient ascent algorithm (projecting onto the probability simplex after each gradient ascent step) on J(π_θ) with stepsize (1 − γ)^3 / (2γ|A|) satisfies

min_{t≤T} {J(π⋆) − J(π_{θ^{(t)}})} ≤ ε   whenever   T > (64γ|S||A| / ((1 − γ)^6 ε^2)) · ‖d^{π⋆}/d_0‖_∞^2.
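A minimal sketch of this algorithm (toy random MDP assumed): each iteration takes an exact gradient step on the policy table and projects every row back onto the probability simplex via Euclidean projection. The exact gradient dJ/dπ(a|s) = d^π(s) Q^π(s, a)/(1 − γ) for the direct parameterization is used; the iteration count is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
d0 = np.ones(S) / S

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    tau = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0)

def grad_and_return(pi):
    """Exact dJ/dpi(a|s) = d^pi(s) Q^pi(s,a) / (1-gamma), and J(pi) = d0 . V^pi."""
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(axis=1))
    Q = R + gamma * P @ V
    d_pi = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, d0)
    return d_pi[:, None] * Q / (1 - gamma), d0 @ V

pi = np.ones((S, A)) / A                        # uniform initialization
eta = (1 - gamma) ** 3 / (2 * gamma * A)        # stepsize from the theorem
for _ in range(1000):                           # illustrative; many more steps needed in general
    g, _ = grad_and_return(pi)
    pi = np.apply_along_axis(project_simplex, 1, pi + eta * g)
print("J(pi) after projected ascent:", grad_and_return(pi)[1])
```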
Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case: π_θ(a|s) = exp(θ_{s,a}) / ∑_{a′} exp(θ_{s,a′})

Challenge: attaining the optimal policy (which is deterministic) requires sending the parameters to ∞.

Three types of algorithms:
1. regular policy gradient
2. policy gradient w/ entropic regularization
3. natural policy gradient
Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case

1. Regular policy gradient only has asymptotic convergence (at this point)

Theorem (Global convergence for softmax parameterization, Theorem 5.1 in [Agarwal et al., 2019])
Assume we follow the gradient ascent update rule and that the distribution d_0 is strictly positive, i.e., d_0(s) > 0 for all states s. Suppose η ≤ (1 − γ)^2 / 5. Then, for all states s, V^{(t)}(s) → V⋆(s) as t → ∞.
Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case

2. Polynomial convergence with relative entropy regularization

The relative-entropy regularized objective:

L_λ(θ) := J(π_θ) + (λ / (|S||A|)) ∑_{s,a} log π_θ(a|s) + λ log |A|,

where λ is a regularization parameter.

Its benefit: it keeps the parameters from becoming too large, as a means to ensure adequate exploration.
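A short sketch of evaluating L_λ for a tabular policy (`J_pi` stands for any estimate of J(π_θ); the helper name is ours): the log-policy penalty tends to −∞ as any π_θ(a|s) → 0, which is what keeps the softmax parameters bounded.

```python
import numpy as np

def regularized_objective(J_pi, pi, lam):
    """L_lambda(theta) = J(pi_theta) + lam/(|S||A|) * sum_{s,a} log pi_theta(a|s) + lam * log|A|.

    J_pi: (an estimate of) J(pi_theta); pi: the |S| x |A| table pi_theta(a|s); lam: lambda.
    """
    S, A = pi.shape
    return J_pi + lam / (S * A) * np.log(pi).sum() + lam * np.log(A)
```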
Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case

2. Polynomial convergence with relative entropy regularization (cont.)

Theorem (Iteration complexity with relative entropy regularization, Corollary 5.4 in [Agarwal et al., 2019])
Let β_λ := 8γ / (1 − γ)^3 + 2λ / |S|. Starting from any initial θ^{(0)}, consider gradient ascent on L_λ with λ = ε(1 − γ) / (2 ‖d^{π⋆}/d_0‖_∞) and η = 1/β_λ. Then we have min_{t<T} {J(π⋆) − J(π_{θ^{(t)}})} ≤ ε whenever

T ≥ (320 |S|^2 |A|^2 / ((1 − γ)^6 ε^2)) · ‖d^{π⋆}/d_0‖_∞^2.
Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case

3. Natural policy gradient

Formulation:

F(θ) = E_{(s,a)∼d^{π_θ}}[ ∇_θ log π_θ(a|s) (∇_θ log π_θ(a|s))^⊤ ]
θ^{(t+1)} = θ^{(t)} + η F(θ^{(t)})^† ∇_θ J(π_{θ^{(t)}}),

where M^† denotes the Moore-Penrose pseudoinverse of the matrix M.
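A minimal sketch of one NPG step for the tabular softmax policy on a toy MDP (random numbers, illustrative stepsize): it forms F(θ) and the exact policy gradient from their definitions and applies the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
d0 = np.ones(S) / S
theta = np.zeros((S, A))

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def occupancy_Q(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(axis=1))
    Q = R + gamma * P @ V
    d_pi = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, d0)
    return d_pi, Q

pi = policy(theta)
d_pi, Q = occupancy_Q(pi)

# Score vectors: grad_theta log pi(a|s) for the (S*A)-dimensional softmax table.
scores = np.zeros((S, A, S * A))
for s in range(S):
    for a in range(A):
        g = np.zeros((S, A)); g[s] = -pi[s]; g[s, a] += 1.0
        scores[s, a] = g.ravel()

w = d_pi[:, None] * pi                                             # weights d^pi(s, a)
grad_J = (w[..., None] * scores * Q[..., None]).sum(axis=(0, 1)) / (1 - gamma)
F = (w[..., None, None] * scores[..., :, None] * scores[..., None, :]).sum(axis=(0, 1))

eta = 0.1
theta = (theta.ravel() + eta * np.linalg.pinv(F) @ grad_J).reshape(S, A)   # NPG step
```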
Ways to Attain Global Convergence: II. Bounding Performance Difference

The softmax parameterization case

3. Natural policy gradient (cont.)

Theorem (Global convergence for natural policy gradient ascent, Theorem 5.7 in [Agarwal et al., 2019])
Suppose we run the natural policy gradient updates with θ^{(0)} = 0 and a fixed stepsize η > 0. For all T > 0, we have:

J(π_{θ^{(T)}}) ≥ J(π⋆) − log |A| / (ηT) − 1 / ((1 − γ)^2 T).

In particular, setting η ≥ (1 − γ)^2 log |A|, NPG finds an ε-optimal policy in a number of iterations that is at most T ≤ 2 / ((1 − γ)^2 ε), which has no dependence on |S|, |A|, or ‖d^{π⋆}/d_0‖_∞^2.
Review of Tabular Results

What we covered:
- All first-order stationary points of the policy gradient objective are globally optimal.
- The exact (projected) policy gradient w/ direct parameterization has an O( γ|S||A| / ((1 − γ)^6 ε^2) · ‖d^{π⋆}/d_0‖_∞^2 ) iteration complexity.
- The exact policy gradient w/ softmax parameterization has asymptotic convergence.
- The exact policy gradient w/ relative entropy regularization and softmax parameterization has an O( γ|S|^2|A|^2 / ((1 − γ)^6 ε^2) · ‖d^{π⋆}/d_0‖_∞^2 ) iteration complexity.
- The exact natural policy gradient w/ softmax parameterization has a 2 / ((1 − γ)^2 ε) iteration complexity.
Review of Tabular Results
What we did not cover or future directions:
- Exploration w/ policy gradient (partially solved by Cai et al. [2019a])
- Stochastic policy gradient / sample-based results
- Actor-critic approaches
- A sharp analysis or improved algorithm regarding the distribution mismatch coefficient (e.g., Eq. (2))
- Sample complexity analysis
- Landscape of J(π)
Section 3
Global Convergence w/ Function Approximation
Overview
What we will cover:
- Natural Policy Gradient for Unconstrained Policy Classes
- Projected Policy Gradient for Constrained Policy Classes
Challenge: how to capture the approximation error properly?
Natural Policy Gradient for Unconstrained Policy Class
Let the policy class be parameterized by θ ∈ R^d, where d ≪ |S||A|.

The update rule still uses the exact natural policy gradient (see the tabular part). We also need to assume that log π_θ(a|s) is a β-smooth function for all θ, s, and a.

Example (Linear softmax policies)
For any state-action pair (s, a), suppose we have a feature mapping φ_{s,a} ∈ R^d with ‖φ_{s,a}‖_2^2 ≤ β. Consider the policy class

π_θ(a|s) = exp(θ · φ_{s,a}) / ∑_{a′∈A} exp(θ · φ_{s,a′}),

with θ ∈ R^d. Then log π_θ(a|s) is a β-smooth function.
Natural Policy Gradient for Unconstrained Policy Class
Tools for analyzing NPG
The NPG update rule can be written abstractly as (with u^{(t)} = F(θ^{(t)})^† ∇_θ J(π_{θ^{(t)}}))

θ^{(t+1)} = θ^{(t)} + η u^{(t)}.

We then leverage the connection between the NPG update and compatible function approximation [Sutton et al., 2000; Kakade, 2002]:

L_ν(w; θ) := E_{(s,a)∼ν}[ (A^{π_θ}(s, a) − w · ∇_θ log π_θ(a|s))^2 ],
L⋆_ν(θ) := min_w L_ν(w; θ).

With ν(s, a) = d^{π_θ}(s, a), we have u^{(t)} ∈ argmin_w L_ν(w; θ), and L⋆_ν(θ) is a measure of the approximation error of π_θ.

(We can verify that L⋆_ν(θ) = 0 ⇒ π_θ = π⋆.)
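A minimal sketch of the compatible function approximation step with synthetic data: given sampled score vectors ∇_θ log π_θ(a|s) and advantages, the NPG direction is the least-squares minimizer of L_ν(w; θ), and the residual estimates L⋆_ν(θ). All quantities here are placeholders, not outputs of an actual policy.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 500, 8
scores = rng.normal(size=(n, d))          # grad_theta log pi_theta(a_i|s_i), hypothetical samples
adv = scores @ rng.normal(size=d) + 0.1 * rng.normal(size=n)   # advantages, nearly realizable

# Empirical least squares: w_hat = argmin_w (1/n) sum_i (adv_i - scores_i . w)^2.
w_hat, *_ = np.linalg.lstsq(scores, adv, rcond=None)

residual = adv - scores @ w_hat
L_star = np.mean(residual ** 2)           # empirical estimate of L*_nu(theta), the approximation error
print(w_hat, L_star)
```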
Natural Policy Gradient for Unconstrained Policy Class
NPG results
Consider the update rule θ^{(t+1)} = θ^{(t)} + η u^{(t)}, where u^{(t)} ∈ argmin_w L_{ν^{(t)}}(w; θ^{(t)}) and ν^{(t)} = d^{π_{θ^{(t)}}}(s, a).

Theorem
Let π⋆ be the optimal policy in the class, η = √(2 log |A| / (β W^2 T)), L⋆_{ν^{(t)}}(θ^{(t)}) ≤ ε_approx, and ‖u^{(t)}‖_2 ≤ W. Then we have

min_{t≤T} {J(π⋆) − J(π_{θ^{(t)}})} ≤ (W √(2β log |A|) / (1 − γ)) · (1/√T) + √( (1/(1 − γ)^3) ‖d^{π⋆}/d_0‖_∞ ε_approx ).
Projected Policy Gradient for Constrained Policy Classes

Let the constrained policy class be Π = {π_θ : θ ∈ Θ}, where Θ ⊆ R^d is a convex set (the feasible set of policies).

Following intuition similar to the policy improvement part, we define the Bellman policy error in approximating π^+_θ (the greedy policy w.r.t. Q^{π_θ}) as

L_BPE(θ; w) = E_s[ ∑_{a∈A} | π^+_θ(a|s) − π_θ(a|s) − w^⊤ ∇_θ π_θ(a|s) | ].

Then, the approximation error can be captured by L_BPE(θ) := L_BPE(θ; w⋆(θ)), where

w⋆(θ) = argmin_{w ∈ R^d : w+θ ∈ Θ} L_BPE(θ; w).

(It is easy to verify that L_BPE(θ) = 0 ⇒ π_θ = π⋆.)
Projected Policy Gradient for Constrained Policy Classes
Theorem
Suppose π_θ is Lipschitz continuous and smooth for all θ ∈ Θ (with Lipschitz constant β_1 and smoothness constant β_2). Assume that for all t < T,

L_BPE(θ^{(t)}) ≤ ε_approx and ‖w⋆(θ^{(t)})‖_2 ≤ W⋆.

Let

β = β_2|A| / (1 − γ)^2 + 2γ β_1^2 |A|^2 / (1 − γ)^3.

Then, projected policy gradient ascent w/ stepsize η = 1/β satisfies

min_{t<T} {J(π⋆) − J(π_{θ^{(t)}})} ≤ (1/(1 − γ)^3) ‖d^{π⋆}/d_0‖_∞ ε_approx + (W⋆ + 1)ε,

for T ≥ (8β / ((1 − γ)^3 ε^2)) ‖d^{π⋆}/d_0‖_∞^2.
Section 4
Neural Policy Gradient Methods
Overparameterized Neural Policy
A two-layer neural network f((s, a); W, b) with input (s, a) and width m takes the form

f((s, a); W, b) = (1/√m) ∑_{r=1}^m b_r · ReLU((s, a)^⊤ [W]_r), ∀(s, a) ∈ S × A,

where
- (s, a) ∈ S × A ⊆ R^d
- ReLU : R → R is the rectified linear unit (ReLU) activation function, defined as ReLU(u) = 1{u > 0} · u
- {b_r}_{r∈[m]} and W = ([W]_1^⊤, . . . , [W]_m^⊤)^⊤ ∈ R^{md} are the parameters.
Overparameterized Neural Policy
Using the two-layer neural network, we define the neural policies as

π_θ(a | s) = exp[τ · f((s, a); θ)] / ∑_{a′∈A} exp[τ · f((s, a′); θ)], ∀(s, a) ∈ S × A,

and the feature mapping φ_θ = ([φ_θ]_1^⊤, . . . , [φ_θ]_m^⊤)^⊤ : R^d → R^{md} of a two-layer neural network f((·, ·); θ) as

[φ_θ]_r(s, a) = (b_r/√m) · 1{(s, a)^⊤ [θ]_r > 0} · (s, a), ∀(s, a) ∈ S × A, ∀r ∈ [m].
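A minimal sketch (random initialization, generic feature vectors for s and a assumed) of the two-layer network, its induced feature map, and the resulting softmax policy over a finite action set.

```python
import numpy as np

rng = np.random.default_rng(6)
d, m, tau = 10, 64, 1.0
b = rng.choice([-1.0, 1.0], size=m)                 # second-layer weights b_r, fixed at init
W = rng.normal(size=(m, d))                          # first-layer weights [W]_r in R^d

def f(x, W):
    """Two-layer ReLU network value for a single input x = (s, a) in R^d."""
    pre = W @ x                                      # (s,a)^T [W]_r for each neuron r
    return (b * np.maximum(pre, 0.0)).sum() / np.sqrt(m)

def phi(x, W):
    """Feature map [phi_theta]_r(s,a): the gradient of f w.r.t. W, of shape (m, d)."""
    active = (W @ x > 0).astype(float)               # 1{(s,a)^T [W]_r > 0}
    return (b * active)[:, None] * x[None, :] / np.sqrt(m)

def neural_policy(s_feats, action_feats, W):
    """pi_theta(a|s) proportional to exp(tau * f((s,a); W)) over a finite action set."""
    logits = np.array([tau * f(np.concatenate([s_feats, a]), W) for a in action_feats])
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

s = rng.normal(size=d // 2)
actions = [rng.normal(size=d - d // 2) for _ in range(4)]
print(neural_policy(s, actions, W))
```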
Overparameterized Neural Policy
Policy Gradient and Fisher Information Matrix (Proposition 3.1 in [Wang et al., 2019])

For π_θ defined previously, we have

∇_θ J(π_θ) = τ · E_{σ_{π_θ}}[ Q^{π_θ}(s, a) · (φ_θ(s, a) − E_{π_θ}[φ_θ(s, a′)]) ],
F(θ) = τ^2 · E_{σ_{π_θ}}[ (φ_θ(s, a) − E_{π_θ}[φ_θ(s, a′)]) (φ_θ(s, a) − E_{π_θ}[φ_θ(s, a′)])^⊤ ].
Neural Policy Gradient Methods: Actor Update

To update θ_i, we set

θ_{i+1} ← Π_B( θ_i + η · G(θ_i) · ∇̂_θ J(π_{θ_i}) ),

where
- B = {α ∈ R^{md} : ‖α − W_init‖_2 ≤ R}, where R > 1 and W_init is the initial parameter
- Π_B : R^{md} → B is the projection operator onto the parameter space B ⊆ R^{md}
- G(θ_i) = I_{md} for policy gradient and G(θ_i) = (F(θ_i))^{-1} for natural policy gradient
- η is the learning rate and ∇̂_θ J(π_{θ_i}) is an estimator of ∇_θ J(π_{θ_i}):

∇̂_θ J(π_{θ_i}) = (1/B) ∑_{ℓ=1}^B Q_{ω_i}(s_ℓ, a_ℓ) · ∇_θ log π_{θ_i}(a_ℓ | s_ℓ)
Neural Policy Gradient Methods: Actor Update

Sampling From the Visitation Measure

Recall the policy gradient (Proposition 3.1 in [Wang et al., 2019]):

∇_θ J(π_θ) = τ · E_{σ_{π_θ}}[ Q^{π_θ}(s, a) · (φ_θ(s, a) − E_{π_θ}[φ_θ(s, a′)]) ].

We need to sample from the visitation measure σ_{π_θ}. Define a new MDP (S, A, P̃, ζ, r, γ) with Markov transition kernel P̃ given by

P̃(s′ | s, a) = γ · P(s′ | s, a) + (1 − γ) · ζ(s′), ∀(s, a, s′) ∈ S × A × S,

where
- P is the Markov transition kernel of the original MDP
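A minimal sketch of one way to draw a sample from the visitation measure under this construction: follow π under the original kernel and terminate with probability 1 − γ at every step. The `reset`, `step`, and `policy` interfaces are assumptions, not part of [Wang et al., 2019].

```python
import numpy as np

def sample_visitation(reset, step, policy, gamma, rng):
    """Return one (s, a) pair distributed as the discounted visitation measure.

    reset() -> s0 samples the restart distribution zeta; step(s, a) -> s' is the
    original transition kernel P; policy(s) -> a. Terminating with probability
    1 - gamma at each step corresponds to one excursion of the mixed kernel P_tilde.
    """
    s = reset()
    while True:
        a = policy(s)
        if rng.random() > gamma:      # with prob. 1 - gamma, stop and emit the current pair
            return s, a
        s = step(s, a)                 # otherwise continue with the original kernel
```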
Neural Policy Gradient Methods: Actor Update

Inverting the Fisher Information Matrix

Recall that G(θ_i) = (F(θ_i))^{-1} for natural policy gradient.

Challenge: inverting an estimator F̂(θ_i) of F(θ_i) can be infeasible, since F̂(θ_i) is a high-dimensional matrix that is possibly not invertible.

To resolve this issue, we estimate the natural policy gradient G(θ_i) · ∇_θ J(π_{θ_i}) by solving

min_{α∈B} ‖F̂(θ_i) · α − τ_i · ∇̂_θ J(π_{θ_i})‖_2
Neural Policy Gradient Methods: Actor Update

Meanwhile, F̂(θ_i) is an unbiased estimator of F(θ_i), based on {(s_ℓ, a_ℓ)}_{ℓ∈[B]} sampled from σ_i, defined as

F̂(θ_i) = (τ_i^2 / B) ∑_{ℓ=1}^B (φ_{θ_i}(s_ℓ, a_ℓ) − E_{π_{θ_i}}[φ_{θ_i}(s_ℓ, a′_ℓ)]) (φ_{θ_i}(s_ℓ, a_ℓ) − E_{π_{θ_i}}[φ_{θ_i}(s_ℓ, a′_ℓ)])^⊤.

The actor update of neural natural policy gradient takes the form

τ_{i+1} ← τ_i + η,
τ_{i+1} · θ_{i+1} ← τ_i · θ_i + η · argmin_{α∈B} ‖F̂(θ_i) · α − τ_i · ∇̂_θ J(π_{θ_i})‖_2.
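A minimal sketch of this actor step with a synthetic batch (centered features and critic values are random placeholders): the natural gradient direction is obtained from the least-squares problem rather than by inverting F̂; the projection onto the ball B is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(7)
dim, batch, tau, eta = 20, 256, 1.0, 0.05

# Centered score vectors phi(s,a) - E_pi[phi(s,a')] for a sampled batch (hypothetical values).
centered = rng.normal(size=(batch, dim))
q_values = rng.normal(size=batch)                  # critic estimates Q_omega(s_l, a_l)

grad_hat = tau * (q_values[:, None] * centered).mean(axis=0)   # policy gradient estimate
F_hat = tau ** 2 * (centered.T @ centered) / batch             # Fisher information estimate

# Natural gradient direction: argmin_alpha ||F_hat @ alpha - tau * grad_hat||_2.
alpha, *_ = np.linalg.lstsq(F_hat, tau * grad_hat, rcond=None)

theta_scaled = rng.normal(size=dim)                # current scaled parameters tau_i * theta_i
theta_scaled_new = theta_scaled + eta * alpha      # tau_{i+1} * theta_{i+1}
tau_new = tau + eta
```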
Neural Policy Gradient Methods: Actor Update

To summarize, at the i-th iteration:

Neural policy gradient obtains θ_{i+1} via projected gradient ascent using ∇̂_θ J(π_{θ_i}), defined by

∇̂_θ J(π_{θ_i}) = (1/B) ∑_{ℓ=1}^B Q_{ω_i}(s_ℓ, a_ℓ) · ∇_θ log π_{θ_i}(a_ℓ | s_ℓ).

Neural natural policy gradient solves

min_{α∈B} ‖F̂(θ_i) · α − τ_i · ∇̂_θ J(π_{θ_i})‖_2

and obtains θ_{i+1} according to

τ_{i+1} ← τ_i + η,
τ_{i+1} · θ_{i+1} ← τ_i · θ_i + η · argmin_{α∈B} ‖F̂(θ_i) · α − τ_i · ∇̂_θ J(π_{θ_i})‖_2.
Neural Policy Gradient Methods: Critic Update

To compute ∇̂_θ J(π_{θ_i}), it remains to obtain the critic Q_{ω_i} in

∇̂_θ J(π_{θ_i}) = (1/B) ∑_{ℓ=1}^B Q_{ω_i}(s_ℓ, a_ℓ) · ∇_θ log π_{θ_i}(a_ℓ | s_ℓ).

For any policy π, the action-value function Q^π is the unique solution to the Bellman equation Q = T^π Q [Sutton and Barto, 1998]. Here T^π is the Bellman operator, which takes the form

T^π Q(s, a) = E[(1 − γ) · r(s, a) + γ · Q(s′, a′)], ∀(s, a) ∈ S × A.
Neural Policy Gradient Methods: Critic Update

Correspondingly, we aim to solve the following optimization problem:

ω_i ← argmin_{ω∈B} E_{ς_i}[ (Q_ω(s, a) − T^{π_{θ_i}} Q_ω(s, a))^2 ],

where
- ς_i is the stationary state-action distribution
- T^{π_{θ_i}} is the Bellman operator associated with π_{θ_i}
Neural Policy Gradient Methods: Critic Update

We adopt neural temporal-difference (TD) learning as studied in [Cai et al., 2019b], which solves the optimization problem above via stochastic semigradient descent [Sutton, 1988].

Specifically, an iteration of neural TD takes the form

ω(t + 1/2) ← ω(t) − η_TD · (Q_{ω(t)}(s, a) − (1 − γ) · r(s, a) − γ · Q_{ω(t)}(s′, a′)) · ∇_ω Q_{ω(t)}(s, a),
ω(t + 1) ← argmin_{α∈B} ‖α − ω(t + 1/2)‖_2.
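A minimal sketch of one such TD iteration for a generic differentiable critic (the `q_fn` and `grad_q_fn` interfaces are assumptions): a semigradient step on the TD error with the (1 − γ)-scaled reward, followed by projection onto the ball B around the initialization.

```python
import numpy as np

def neural_td_step(omega, q_fn, grad_q_fn, transition, eta_td, gamma, omega_init, radius):
    """One TD(0) update. transition = (s, a, r, s_next, a_next); q_fn(omega, s, a) -> float;
    grad_q_fn(omega, s, a) -> array shaped like omega. These interfaces are assumptions."""
    s, a, r, s_next, a_next = transition
    td_error = q_fn(omega, s, a) - (1 - gamma) * r - gamma * q_fn(omega, s_next, a_next)
    omega_half = omega - eta_td * td_error * grad_q_fn(omega, s, a)   # semigradient step
    # Projection onto B = {omega : ||omega - omega_init||_2 <= radius}.
    diff = omega_half - omega_init
    norm = np.linalg.norm(diff)
    return omega_init + diff * min(1.0, radius / norm) if norm > 0 else omega_half
```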
Neural Policy Gradient Methods: Critic Update

To summarize, we combine the actor updates and the critic update described by:

1. θ_{i+1} ← Π_B( θ_i + η · G(θ_i) · ∇̂_θ J(π_{θ_i}) )

2. τ_{i+1} ← τ_i + η,
   τ_{i+1} · θ_{i+1} ← τ_i · θ_i + η · argmin_{α∈B} ‖F̂(θ_i) · α − τ_i · ∇̂_θ J(π_{θ_i})‖_2

3. ω_i ← argmin_{ω∈B} E_{ς_i}[ (Q_ω(s, a) − T^{π_{θ_i}} Q_ω(s, a))^2 ].
References

Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.

Jalaj Bhandari and Daniel Russo. Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786, 2019.

Qi Cai, Zhuoran Yang, Chi Jin, and Zhaoran Wang. Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830, 2019a.

Qi Cai, Zhuoran Yang, Jason D. Lee, and Zhaoran Wang. Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems, pages 11312-11322, 2019b.

Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267-274, 2002.
Sham M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531-1538, 2002.

Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. ISBN 0-262-19398-1.

Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057-1063, 2000.

Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2019.