-
Randomized Prior Functions for Deep Reinforcement Learning
Ian Osband, John Aslanides, Albin Cassirer
bit.ly/rpf_nips @ianosband
http://bit.ly/rpf_nips
-
Reinforcement Learning
Data & Estimation = Supervised Learning
+ partial feedback = Multi-armed Bandit
+ delayed consequences = Reinforcement Learning
• “Sequential decision making under uncertainty.”
• Three necessary building blocks:
  1. Generalization
  2. Exploration vs exploitation
  3. Credit assignment
• As a field, we are pretty good at combining any two of these three… but we need practical solutions that combine all three.
⇒ We need effective uncertainty estimates for deep RL.
-
Estimating uncertainty in deep RL
• Dropout sampling: “Dropout sample ≈ posterior sample” (Gal & Ghahramani 2015).
  But the dropout rate does not concentrate with the data; even “concrete” dropout is not necessarily the right rate.
• Variational inference: apply VI to the Bellman error as if it were an i.i.d. supervised loss.
  Bellman error: Q(s, a) = r + γ max_α Q(s′, α).
  But uncertainty in Q ➡ correlated TD loss; VI on an i.i.d. model does not propagate uncertainty.
• Distributional RL: model the Q-value as a distribution rather than a point estimate.
  But this distribution != posterior uncertainty (aleatoric vs epistemic); it is not the right thing for exploration.
• Count-based density: estimate “visit counts” to each state, add an exploration bonus.
  But the “density model” has nothing to do with the actual task, and with generalization, state “visit count” != uncertainty.
• Bootstrap ensemble: train an ensemble on noisy data, a classic statistical procedure!
  But there is no explicit “prior” mechanism for “intrinsic motivation”: if you’ve never seen a reward, why would the agent explore?
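The bootstrap-ensemble idea above is easy to make concrete: fit each ensemble member to a resampled version of the data and read uncertainty off the spread across members. A minimal numpy sketch, with toy data and sizes that are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (hypothetical): a noisy line y = 2x observed at a handful of points.
X = rng.uniform(-1, 1, 15)
y = 2.0 * X + rng.normal(0, 0.1, X.shape)

K = 20  # ensemble size
slopes = []
for _ in range(K):
    idx = rng.integers(0, len(X), len(X))   # bootstrap resample with replacement
    Xb, yb = X[idx], y[idx]
    slopes.append((Xb @ yb) / (Xb @ Xb))    # least-squares slope (no intercept)

slopes = np.array(slopes)
# The spread across members is the ensemble's uncertainty estimate,
# but nothing here encodes a prior: with no reward signal ever observed,
# every member can agree, leaving the agent no reason to explore.
```

This is exactly the gap the next slide's prior functions are designed to fill.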
-
Randomized prior functions
• Key idea: add a random untrainable “prior” function to each member of the ensemble.
• Visualize the effect in 1D regression:
  • Training data (x, y): black points.
  • Prior function p(x): blue line.
  • Trainable function f(x): dotted line.
  • Output prediction Q(x) = f(x) + p(x): red line.
• Exact Bayes posterior for linear functions!
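A minimal sketch of the key idea in the 1D regression setting. To keep each trainable part a closed-form ridge fit, this uses a fixed random-feature map as the "network"; all names, sizes, and constants are illustrative, not from the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 50  # number of random features
W_feat = rng.normal(size=D) * 3.0
b_feat = rng.normal(size=D)

def features(x):
    # Fixed random-feature map, so each model is linear in its weights.
    return np.tanh(np.outer(x, W_feat) + b_feat)

# Toy 1D data (hypothetical): y = x^3 + noise, observed on two clusters.
X = np.concatenate([rng.uniform(-1.0, -0.5, 20), rng.uniform(0.5, 1.0, 20)])
y = X ** 3 + rng.normal(0.0, 0.05, X.shape)

K, beta, lam = 10, 1.0, 1e-2   # ensemble size, prior scale, ridge penalty
x_test = np.linspace(-2, 2, 101)
Phi, Phi_test = features(X), features(x_test)

preds = []
for k in range(K):
    # Fixed, untrainable "prior" p: random weights, never updated.
    w_prior = rng.normal(size=D)
    prior_train = beta * Phi @ w_prior
    prior_test = beta * Phi_test @ w_prior

    # Trainable part f fits the residual y - p (closed-form ridge solution).
    w_f = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ (y - prior_train))

    # Each member predicts Q(x) = f(x) + beta * p(x).
    preds.append(Phi_test @ w_f + prior_test)

preds = np.array(preds)
mean, std = preds.mean(axis=0), preds.std(axis=0)
# std is small near the data clusters and grows away from them:
# the untrainable prior dominates wherever there is no data.
```

The ensemble spread behaves like a posterior: agreement where data constrain the fit, prior-driven disagreement elsewhere.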
-
“Deep Sea” Exploration
• Stylized “chain” domain testing “deep exploration”:
  - State = N × N grid, observations one-hot.
  - Start in the top-left cell; fall one row each step.
  - Actions {0, 1} map to left/right in each cell.
  - “Left” has reward 0; “right” has reward -0.1/N…
  - … but if you make it to the bottom-right you get +1.
• Only one policy (out of more than 2^N) has positive return.
• ε-greedy / Boltzmann / policy gradient are useless here.
• Algorithms with deep exploration can learn fast! [1]
[1] “Deep Exploration via Randomized Value Functions”
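The bullets above pin down the dynamics enough for a minimal sketch; the class and method names are illustrative, not the original implementation:

```python
import numpy as np

class DeepSea:
    """Minimal sketch of the 'Deep Sea' chain domain described above.

    N x N grid with one-hot observations; the agent starts top-left,
    falls one row per step, action 1 ('right') costs 0.1/N, and reaching
    the bottom-right cell pays +1.
    """

    def __init__(self, size):
        self.size = size
        self.reset()

    def reset(self):
        self.row, self.col = 0, 0
        return self._obs()

    def _obs(self):
        obs = np.zeros((self.size, self.size))
        if self.row < self.size:        # terminal state: all-zero observation
            obs[self.row, self.col] = 1.0
        return obs.flatten()

    def step(self, action):
        reward = 0.0
        if action == 1:                 # "right": small penalty, move right
            reward -= 0.1 / self.size
            self.col = min(self.col + 1, self.size - 1)
        else:                           # "left": free, move left
            self.col = max(self.col - 1, 0)
        self.row += 1                   # fall one row every step
        done = self.row == self.size
        if done and self.col == self.size - 1:
            reward += 1.0               # treasure in the bottom-right
        return self._obs(), reward, done
```

Going right every step is the single rewarding policy: for N = 5 it earns 5 × (-0.02) + 1 = 0.9 per episode, while any dithering policy almost never reaches the treasure.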
-
Visualize BootDQN+prior exploration
• Compare DQN+ε-greedy vs BootDQN+prior.
• Define the ensemble average value: (1/K) Σ_{k=1}^{K} max_α Q_k(s, α).
• Heat map shows the estimated value of each state.
• Red line shows the exploration path taken by the agent.
• DQN+ε-greedy gets stuck on the left and gives up.
• BootDQN+prior hopes something is out there, keeps exploring potentially rewarding states… and learns fast!
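The two quantities on this slide can be sketched directly. A hypothetical ensemble of K Q-tables stands in for the trained networks; the heat-map statistic is the ensemble average above, and BootDQN-style deep exploration samples one member per episode and follows it greedily:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ensemble of K Q-tables for a small discrete task.
K, n_states, n_actions = 10, 25, 2
Q = rng.normal(size=(K, n_states, n_actions))

def ensemble_value(s):
    # Heat-map statistic from the slide: (1/K) * sum_k max_a Q_k(s, a).
    return Q[:, s, :].max(axis=1).mean()

def act(s, k):
    # Deep exploration: member k is sampled once per episode and
    # followed greedily for the whole episode (Thompson-sampling style).
    return int(np.argmax(Q[k, s]))

k = rng.integers(K)          # sample an ensemble member at episode start
action = act(0, k)
value = ensemble_value(0)
```

Committing to one member for a whole episode is what lets the agent carry an optimistic hypothesis deep into the state space, rather than dithering one step at a time like ε-greedy.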
-
[Video demonstration covers this slide.]
-
Come visit our poster!
Ian Osband, John Aslanides, Albin Cassirer
• Blog post: bit.ly/rpf_nips
• Demo code: bit.ly/rpf_nips
• Montezuma’s Revenge!