Randomized Prior Functions for Deep Reinforcement Learning
Ian Osband, John Aslanides, Albin Cassirer
bit.ly/rpf_nips @ianosband


TRANSCRIPT

  • Randomized Prior Functions for Deep Reinforcement Learning

    Ian Osband, John Aslanides, Albin Cassirer

    bit.ly/rpf_nips @ianosband



  • Reinforcement Learning

    • “Sequential decision making under uncertainty.”

      Data & Estimation = Supervised Learning

      + partial feedback = Multi-armed Bandit

      + delayed consequences = Reinforcement Learning

    • Three necessary building blocks:
      1. Generalization
      2. Exploration vs Exploitation
      3. Credit assignment

    • As a field, we are pretty good at combining any 2 of these 3…
      but we need practical solutions that combine them all.

    We need effective uncertainty estimates for Deep RL.


  • Estimating uncertainty in deep RL

    Dropout sampling
    - “Dropout sample ≈ posterior sample” (Gal & Ghahramani 2015).
    - Dropout rate does not concentrate with the data.
    - Even “concrete” dropout is not necessarily the right rate.

    Variational inference
    - Apply VI to the Bellman error as if it were an i.i.d. supervised loss.
    - Bellman error: Q(s, a) = r + γ max_α Q(s′, α)
    - Uncertainty in Q ➡ correlated TD loss; VI on an i.i.d. model does not propagate uncertainty.

    Distributional RL
    - Models the Q-value as a distribution rather than a point estimate.
    - This distribution != posterior uncertainty.
    - Aleatoric vs epistemic … it’s not the right thing for exploration.

    Count-based density
    - Estimate the number of “visit counts” to each state, add a bonus.
    - The “density model” has nothing to do with the actual task.
    - With generalization, state “visit count” != uncertainty.

    Bootstrap ensemble (a minimal sketch follows this slide)
    - Train an ensemble on noisy data: a classic statistical procedure!
    - No explicit “prior” mechanism for “intrinsic motivation”.
    - If you’ve never seen a reward, why would the agent explore?
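    A minimal sketch of the bootstrap-ensemble idea (illustrative only, not the authors'
    implementation): K tabular Q-functions, each updated only on the transitions its random
    bootstrap mask assigns to it, against the Bellman target Q(s, a) = r + γ max_α Q(s′, α).
    All names and constants here are assumptions for illustration.

        # Bootstrap ensemble of tabular Q-functions with Bernoulli(0.5) masking.
        import numpy as np

        rng = np.random.default_rng(0)
        K, n_states, n_actions, gamma, lr = 10, 50, 2, 0.99, 0.1
        Q = rng.normal(scale=0.1, size=(K, n_states, n_actions))  # one table per member

        def td_update(transitions):
            """One TD step per ensemble member on its bootstrapped share of the data."""
            for (s, a, r, s_next, done) in transitions:
                mask = rng.random(K) < 0.5          # which members see this transition
                for k in np.flatnonzero(mask):
                    # Bellman target: r + gamma * max_a' Q_k(s', a'); no bootstrap if terminal.
                    target = r + (0.0 if done else gamma * Q[k, s_next].max())
                    Q[k, s, a] += lr * (target - Q[k, s, a])

        # Example usage with one made-up transition (s=0, a=1, r=0.0, s'=1, non-terminal).
        td_update([(0, 1, 0.0, 1, False)])

    For exploration, an agent can sample one member k at the start of each episode and act
    greedily with respect to Q_k; the slide's point is that, with no prior mechanism, such an
    ensemble has little reason to disagree about states where it has never seen any reward.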


  • Randomized prior functions

    • Key idea: add a random untrainable “prior” function to each member of the ensemble
      (a minimal code sketch follows this slide).

    • Visualize effects in 1D regression:
      • Training data (x, y): black points.
      • Prior function p(x): blue line.
      • Trainable function f(x): dotted line.
      • Output prediction Q(x): red line.

    Exact Bayes posterior for linear functions!
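    To make the key idea concrete, here is a minimal 1D-regression sketch. Each ensemble
    member predicts Q(x) = f(x) + β·p(x), where the random prior p is fixed at initialization
    and only f is trained. The basis functions, β, the data, and the target-noise scheme are
    illustrative assumptions, not the authors' exact setup.

        # Randomized prior ensemble for 1D regression with models linear in fixed features.
        import numpy as np

        rng = np.random.default_rng(1)

        def features(x):
            # Fixed basis, so each member is linear in its parameters.
            return np.stack([np.ones_like(x), x, x**2, np.sin(3 * x)], axis=-1)

        x_train = np.array([-1.0, -0.5, 0.0, 0.4, 0.9])
        y_train = np.sin(2 * x_train) + 0.1 * rng.normal(size=x_train.shape)
        Phi, K, beta, noise = features(x_train), 20, 3.0, 0.1

        x_grid = np.linspace(-3.0, 3.0, 200)
        Phi_grid = features(x_grid)

        predictions = []
        for _ in range(K):
            w_prior = rng.normal(size=Phi.shape[1])                     # prior p: never trained
            y_tilde = y_train + noise * rng.normal(size=y_train.shape)  # perturbed targets
            # Fit the trainable part f so that f(x) + beta * p(x) matches the noisy targets.
            w_f, *_ = np.linalg.lstsq(Phi, y_tilde - beta * Phi @ w_prior, rcond=None)
            predictions.append(Phi_grid @ w_f + beta * Phi_grid @ w_prior)  # Q = f + beta*p

        spread = np.array(predictions).std(axis=0)
        print(spread[:5].round(2), spread[95:105].round(2))  # wide far from data, narrow near it

    Near the training points the members agree, while far away the fixed priors dominate, so
    the ensemble spread behaves like posterior uncertainty; for linear-Gaussian models this
    construction recovers the exact Bayes posterior, as the slide notes.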


  • “Deep Sea” Exploration

    • Stylized “chain” domain testing “deep exploration” (a minimal environment sketch
      follows this slide):
      - State = N x N grid, observations 1-hot.
      - Start in the top-left cell, fall one row each step.
      - Actions {0, 1} map to left/right in each cell.
      - “left” has reward = 0, “right” has reward = -0.1/N
      - … but if you make it to the bottom-right cell you get +1.

    • Only one policy (out of more than 2^N) has positive return.

    • ε-greedy / Boltzmann / policy gradient are useless.

    • Algorithms with deep exploration can learn fast!
      [1] “Deep Exploration via Randomized Value Functions”
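    For reference, a minimal sketch of the dynamics as described on this slide. The class name
    is made up, and actions map directly to left/right (the full benchmark environment may also
    randomize the action-to-direction mapping per cell, which this sketch omits).

        # Minimal "Deep Sea" environment: N x N grid, one-hot observations,
        # fall one row per step, small cost for "right", +1 at the bottom-right.
        import numpy as np

        class DeepSea:
            def __init__(self, size):
                self.size = size
                self.reset()

            def _obs(self):
                obs = np.zeros((self.size, self.size), dtype=np.float32)
                obs[self.row, self.col] = 1.0
                return obs.flatten()

            def reset(self):
                self.row, self.col = 0, 0              # start in the top-left cell
                return self._obs()

            def step(self, action):
                reward = 0.0
                if action == 1:                        # "right" costs -0.1 / N
                    reward = -0.1 / self.size
                    self.col = min(self.col + 1, self.size - 1)
                else:                                  # "left" is free
                    self.col = max(self.col - 1, 0)
                self.row += 1                          # fall one row every step
                done = self.row == self.size
                if done and self.col == self.size - 1:
                    reward += 1.0                      # reached the bottom-right: +1
                obs = np.zeros(self.size * self.size, dtype=np.float32) if done else self._obs()
                return obs, reward, done

    Only by choosing “right” at every step does the agent reach the +1, for an episode return
    of 1 - N · (0.1/N) = 0.9 > 0; any episode containing a “left” earns at most 0.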


  • Visualize BootDQN+prior exploration.

    • Compare DQN+ε-greedy vs BootDQN+prior.

    • Define the ensemble average: (1/K) ∑_{k=1}^{K} max_α Q_k(s, α)
      (computed as in the sketch after this slide).

    • Heat map shows the estimated value of each state.

    • Red line shows the exploration path taken by the agent.

    • DQN+ε-greedy gets stuck on the left, gives up.

    • BootDQN+prior hopes something is out there, keeps exploring potentially-rewarding
      states… learns fast!
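    A small sketch of the quantity plotted in the heat map, assuming the ensemble's Q-values
    for a state are available as an array of shape (K, num_actions):

        # Heat-map value of a state: the ensemble-average greedy value,
        # (1/K) * sum_k max_a Q_k(s, a).
        import numpy as np

        def ensemble_value(q_values: np.ndarray) -> float:
            return float(q_values.max(axis=1).mean())

        # Example: K = 3 members, 2 actions.
        print(ensemble_value(np.array([[0.1, 0.4], [0.0, 0.2], [0.3, 0.1]])))  # 0.3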

  • VIDEO TO COVER THIS WHOLE SLIDE



  • Come visit our poster!

    Ian Osband, John Aslanides, Albin Cassirer

    bit.ly/rpf_nips @ianosband

    Blog post: bit.ly/rpf_nips

    Montezuma’s Revenge!

    Demo code: bit.ly/rpf_nips