
Potential-based reward shaping as a tool to safely incorporate auxiliary information

Anna Harutyunyan, AI Lab, VU Brussel

work with Tim Brys, Peter Vrancx, Ann Nowé
(as well as Sam Devlin, Matt Taylor, Halit Bener Suay, Sonia Chernova)

SequeL seminar, INRIA Lille
March 10, 2017


AI Lab at VUB

- Founded in 1983 by Luc Steels as the first AI lab in mainland Europe
  - focus on evolution of language and construction grammars
- Later (late 90s?), our group split off as an ML offshoot
  - Led by Ann Nowé and Bernard Manderick
  - 11 PhD students, 5-7 postdocs
  - Multi-objective / multi-agent RL, deep RL, game theory, computational biology
  - Applications to wind turbine control, smart grids, HIV treatment strategies, etc.


This talk

will advocate for potential-based reward shaping as a suitable paradigm for integrating auxiliary information into RL, in particular when the information is in the form of behavioral advice, or expert demonstration data.


The world is full of (questionable) reward functions

- There are many sources of reward complementary to the main objective:
  - Transfer: old policies, similar policies
  - Unsupervised: novelty, auxiliary tasks
  - Supervised: domain knowledge, expert data, online feedback
- How to safely combine them with the native reward?


Reward shaping¹

In RL, a naive reward shaping scheme augments the MDP:

M = (S, A, P, γ, R) → M' = (S, A, P, γ, R + F),

where we hope that M' is somehow easier.

- F can be wrong
- F can be right, but have inadvertent consequences

So we want to solve the original M, but use F to solve it quicker.

¹ Skinner, B. F. (1938). The Behavior of Organisms: An Experimental Analysis.
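To make the augmentation concrete, here is a minimal sketch of R → R + F. The toy chain environment, its step interface, and the hand-crafted bonus are illustrative assumptions, not part of the original talk:

```python
# Minimal sketch of the naive augmentation R -> R + F on a toy 1-D chain.
# The chain environment and the hand-crafted bonus F are illustrative
# assumptions, not part of the original talk.

def chain_step(state, action):
    """Toy chain MDP: states 0..5, action +1/-1, reward 1 only at the goal."""
    next_state = min(5, max(0, state + action))
    reward = 1.0 if next_state == 5 else 0.0
    return next_state, reward, next_state == 5

def shaped_step(step_fn, F, state, action):
    """One step in the augmented MDP M' = (S, A, P, gamma, R + F)."""
    next_state, reward, done = step_fn(state, action)
    return next_state, reward + F(state, action, next_state), done

# A plausible-looking but naive bonus for moving right: exactly the kind of F
# that can be wrong, or right but with inadvertent consequences.
def progress_bonus(s, a, s_next):
    return 0.5 if s_next > s else 0.0

if __name__ == "__main__":
    print(shaped_step(chain_step, progress_bonus, 0, +1))  # (1, 0.5, False)
```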


Potential-based reward shaping (PBRS)²

Definition. F is potential-based if ∃ Φ : S → ℝ such that ∀ s, s' ∈ S, a ∈ A:

F(s, a, s') = γΦ(s') − Φ(s)

Theorem. That F is potential-based is sufficient and necessary (when M is completely unknown) to guarantee that the optimal policies of M = (S, A, P, γ, R) and M' = (S, A, P, γ, R + F) are the same.

² Ng, Harada and Russell. "Policy Invariance Under Reward Transformations". ICML, 1999.
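As a concrete illustration of how the definition is used in practice, here is a minimal tabular Q-learning sketch where the only change relative to plain Q-learning is adding γΦ(s') − Φ(s) to the reward. The corridor task, the heuristic potential, and the hyperparameters are illustrative assumptions:

```python
import random

# Minimal sketch: PBRS inside tabular Q-learning on a toy corridor.
# The environment, the heuristic potential, and the hyperparameters are
# illustrative assumptions; the only PBRS-specific lines compute F below.

N, GOAL, GAMMA, ALPHA, EPS = 10, 9, 0.95, 0.1, 0.1
ACTIONS = [-1, +1]

def step(s, a):
    s2 = min(N - 1, max(0, s + a))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def phi(s):
    # "Closer to the goal is better": a heuristic potential roughly shaped
    # like the optimal value function of this task.
    return GAMMA ** (GOAL - s)

Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}

for episode in range(200):
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS) if random.random() < EPS else max(ACTIONS, key=lambda b: Q[(s, b)])
        s2, r, done = step(s, a)
        phi_next = 0.0 if done else phi(s2)   # terminal potential conventionally 0
        F = GAMMA * phi_next - phi(s)         # potential-based shaping reward
        target = r + F + (0.0 if done else GAMMA * max(Q[(s2, b)] for b in ACTIONS))
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

print("greedy action in state 0:", max(ACTIONS, key=lambda a: Q[(0, a)]))  # typically +1
```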


Potential functions encode how good it is to be in a state

Policy invariance:

- No positive reward cycles
- The relationship between the value functions is just a constant shift:

  Q*_{M'}(s, a) = Q*_M(s, a) − Φ(s)
  V*_{M'}(s) = V*_M(s) − Φ(s)

  This allows us to extend Φ to be over state-action pairs.
- Ideal potential function = optimal value function
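The constant shift can be verified directly by telescoping the shaping rewards along any trajectory; a short derivation (not on the original slides, assuming bounded Φ and γ < 1) is:

```latex
% Return in M' along a trajectory s_0, a_0, s_1, a_1, ... under any policy:
\begin{aligned}
G'_0 &= \sum_{t \ge 0} \gamma^t \Big( r_t
        + \underbrace{\gamma \Phi(s_{t+1}) - \Phi(s_t)}_{F(s_t, a_t, s_{t+1})} \Big) \\
     &= \sum_{t \ge 0} \gamma^t r_t
        + \sum_{t \ge 0} \big( \gamma^{t+1} \Phi(s_{t+1}) - \gamma^t \Phi(s_t) \big)
      \;=\; G_0 - \Phi(s_0),
\end{aligned}
```

since the second sum telescopes and γ^T Φ(s_T) → 0. Every policy's return in M' is thus its return in M shifted by −Φ(s_0), which gives the relations above.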


PBRS vs. Q-value initialization

- PBRS fell out of fashion for a time, as it was discovered that it's exactly equivalent to Q-value initialization, given the same stream of experience.³
- In a tabular setting with a fixed potential function, initialization is feasible, and might be simpler,
- but it gets trickier with function approximation, and impossible with potential functions that change over time.
- PBRS still provides the only Bellman-consistent mechanism for combining value functions.

³ Wiewiora, E. (2003). Potential-based shaping and Q-value initialization are equivalent. J. Artif. Intell. Res. (JAIR), 19, 205-208.
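A quick numerical check of that equivalence (a sketch; the randomly generated transition stream and hyperparameters are assumptions for illustration): Q-learning with shaping from a zero table and plain Q-learning initialized at Φ, fed the same experience, stay offset by exactly Φ(s), so their greedy actions always coincide.

```python
import random

# Sketch of Wiewiora's (2003) equivalence between PBRS with a state potential
# Phi and initializing Q at Phi, given the same stream of experience.
# The random transition stream and hyperparameters are illustrative assumptions.

random.seed(0)
S, A, GAMMA, ALPHA = 5, 3, 0.9, 0.2
phi = [random.uniform(-1, 1) for _ in range(S)]

q_shaped = {(s, a): 0.0 for s in range(S) for a in range(A)}    # starts at 0
q_init = {(s, a): phi[s] for s in range(S) for a in range(A)}   # starts at Phi

for _ in range(10000):
    s, a = random.randrange(S), random.randrange(A)
    s2, r = random.randrange(S), random.uniform(-1, 1)          # same transition for both learners

    F = GAMMA * phi[s2] - phi[s]                                # shaping reward
    td_shaped = r + F + GAMMA * max(q_shaped[(s2, b)] for b in range(A)) - q_shaped[(s, a)]
    q_shaped[(s, a)] += ALPHA * td_shaped

    td_init = r + GAMMA * max(q_init[(s2, b)] for b in range(A)) - q_init[(s, a)]
    q_init[(s, a)] += ALPHA * td_init

# The two tables differ by exactly Phi(s) everywhere, so greedy actions coincide.
gap = max(abs(q_init[(s, a)] - q_shaped[(s, a)] - phi[s]) for s in range(S) for a in range(A))
print("max |Q_init - Q_shaped - Phi| =", gap)   # zero up to floating-point error
```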


This talk: From information to potentials

- PBRS is theoretically justified, but requires an extra abstraction: the potential function
- In the rest of the talk, we'll cover a few ways to obtain a potential function from whatever information is available. In particular:

Advice is trivial to express as a reward function. We describe a trick that, for an arbitrary reward function, obtains the corresponding potential function whose induced shaping reward reflects the desired one exactly.

Expert trajectories. We construct a potential function centered around the demonstrated state-action pairs, with generalization based on state similarity.


From rewards to potentials: Bellman equations to the rescue⁴

- By the definition of PBRS we have that the shaping reward

  F = γP^π Φ − Φ.

- Normally, we know Φ and calculate F. What if we know F and would like to obtain Φ?
- Well, Φ = (I − γP^π)^{-1}(−F), i.e. Φ is the value function, under π, of the reward −F, which is exactly something we can estimate in RL!
- The reason this helps is because:
  - evaluation is easier
  - Φ will be helpful long before convergence

⁴ Harutyunyan, A., Devlin, S., Vrancx, P., and Nowé, A. Expressing Arbitrary Reward Functions as Potential-Based Advice. In AAAI 2015.
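The following is a rough sketch of that idea, not the exact algorithm from the AAAI 2015 paper: a secondary tabular estimate Φ is learned by on-policy TD evaluation of the negated advice reward −R^A along the agent's own experience, and its (changing) values are used as a state-action potential for the main Q-learner. The corridor task, the advice function, and the hyperparameters are assumptions for illustration.

```python
import random

# Rough sketch (not the exact AAAI 2015 algorithm): turn an arbitrary advice
# reward R_A into a potential by learning Phi as the value of -R_A under the
# current behaviour, then shape the main learner with the induced
# F = gamma * Phi(s', a') - Phi(s, a).
# Environment, advice, and hyperparameters are illustrative assumptions.

random.seed(1)
N, GOAL, GAMMA, ALPHA, BETA, EPS = 10, 9, 0.95, 0.1, 0.1, 0.1
ACTIONS = [-1, +1]

def step(s, a):
    s2 = min(N - 1, max(0, s + a))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def advice(s, a):
    return 1.0 if a == +1 else 0.0   # "moving right is good", expressed as a reward

Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}     # main value estimate
Phi = {(s, a): 0.0 for s in range(N) for a in ACTIONS}   # secondary estimate, reward -advice

def greedy(s):
    return max(ACTIONS, key=lambda b: Q[(s, b)])

for episode in range(300):
    s = 0
    a = random.choice(ACTIONS) if random.random() < EPS else greedy(s)
    while True:
        s2, r, done = step(s, a)
        a2 = random.choice(ACTIONS) if random.random() < EPS else greedy(s2)

        # On-policy TD evaluation of Phi on the negated advice reward
        phi_target = -advice(s, a) + (0.0 if done else GAMMA * Phi[(s2, a2)])
        Phi[(s, a)] += BETA * (phi_target - Phi[(s, a)])

        # Main update, shaped by the induced potential-based reward
        F = (0.0 if done else GAMMA * Phi[(s2, a2)]) - Phi[(s, a)]
        q_target = r + F + (0.0 if done else GAMMA * max(Q[(s2, b)] for b in ACTIONS))
        Q[(s, a)] += ALPHA * (q_target - Q[(s, a)])

        s, a = s2, a2
        if done:
            break

print("greedy action in state 0:", greedy(0))   # typically +1, toward the goal
```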


Empirically it works

Figure: Left: The gridworld example. Right: Cartpole.



Learning from online feedback

A nice corollary is that it's possible to soundly learn from sporadic (e.g. online) feedback.

It wasn't possible before, because:

- it is very easy to create positive reward cycles with ad hoc feedback, so non-potential-based shaping is dangerous
- A naive translation Φ = R^A does not work either (see the worked check below):
  - Let R^A(s, a) = 1 at an approved action and 0 elsewhere
  - Then F(s, a, s', a') = γΦ(s', a') − Φ(s, a) = −1
  - The desired behavior got a negative shaping reward
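A tiny numeric check of the counterexample above (the two-step sequence and state names are assumptions chosen for illustration): the approved action itself is shaped negatively, while an unapproved action leading into it is shaped positively.

```python
# Worked check of why Phi = R_A fails as a potential: the approved action gets
# shaped negatively, and the step leading into it gets shaped positively.
# The two-step sequence below is an illustrative assumption.

GAMMA = 0.95

def R_A(s, a):
    return 1.0 if (s, a) == ("s1", "approved") else 0.0   # sporadic approval

Phi = R_A                                                 # the naive translation

def F(s, a, s2, a2):
    return GAMMA * Phi(s2, a2) - Phi(s, a)                # state-action PBRS

print(F("s1", "approved", "s2", "other"))   # -1.0  (approved behaviour punished)
print(F("s0", "other", "s1", "approved"))   #  0.95 (the step *before* is rewarded)
```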


Advising Mario online⁵

- 12 actions
- 7000 state features per action
- The advisor clicked a button if it approved of Mario's behavior, and did nothing otherwise (a very frustrating setting)

Variant      Advice phase   Cumulative
Baseline     -376 ± 51      470 ± 83
Non-expert    401 ± 54      677 ± 60
Expert        402 ± 62      774 ± 47

⁵ Harutyunyan, A., Brys, T., Vrancx, P., and Nowé, A. Shaping Mario with human advice (demonstration). In Proceedings of AAMAS 2015.


Discussion

- Works well when Φ isn't too hard to learn, e.g. when the advice is sparse and has the same sign
- Stability may be an issue, since the potential function must be learnt on-policy, while in control the behavior typically changes.
- A policy iteration setting, or updating on two timescales, is needed.


Learning from demonstration (LfD)

LfD setting: turn the learning problem into a classification task:

- Given examples {(s_i, a_i)}_{i=0}^K of actions in states, and assuming they are optimal, the goal is to learn the underlying mapping π : S → A, which by assumption will be optimal.

The assumption is practical, but problematic, as the agent may never surpass the teacher. We'd like to be able to improve through learning. But how do we incorporate the demonstrations?


Reinforcement learning from demonstration (RLfD) through shaping⁶

Let us consider tasks where both demonstrations and an environment reward signal are available.

- In any goal-directed task (however complex) this can be done by giving a reward of 1 at the goal and a reward of 0 elsewhere. This will be extremely difficult to learn, but optimizing it is sure to do the right thing.

In such tasks, we propose to incorporate the demonstration data into the reward function through PBRS.

⁶ Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Taylor, M. E., and Nowé, A. Reinforcement Learning from Demonstration through Shaping. In IJCAI 2015.


Similarity-based shaping

Construct a potential function centered around the demonstrated state-action pairs, with generalization w.r.t. a covariance matrix over states. Simple form:

Σ = σI

g(s, a, s^d, a^d, Σ) = exp(−½ (s − s^d)^T Σ^{−1} (s − s^d))  if a = a^d,  and 0 otherwise

Φ(s, a) = max_{(s^d, a^d)} g(s, a, s^d, a^d, Σ)
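A minimal NumPy sketch of that potential (the demonstration set and the width σ below are illustrative assumptions):

```python
import numpy as np

# Sketch of the similarity-based potential: a Gaussian bump around each
# demonstrated (state, action) pair, Phi(s, a) = max over matching demos.
# The demonstrations and sigma below are illustrative assumptions.

def make_potential(demos, sigma=0.1):
    """demos: list of (state_vector, action) pairs from the expert."""
    cov_inv = np.eye(len(demos[0][0])) / sigma        # Sigma = sigma * I, inverted

    def g(s, a, s_d, a_d):
        if a != a_d:
            return 0.0
        diff = np.asarray(s, dtype=float) - np.asarray(s_d, dtype=float)
        return float(np.exp(-0.5 * diff @ cov_inv @ diff))

    def phi(s, a):
        return max(g(s, a, s_d, a_d) for s_d, a_d in demos)

    return phi

# Usage: a couple of fake demonstrated pairs in a 2-D state space.
demos = [(np.array([0.0, 0.0]), "left"), (np.array([1.0, 1.0]), "right")]
phi = make_potential(demos, sigma=0.1)
print(phi([0.1, 0.0], "left"))    # close to a demo with the same action -> near 1
print(phi([0.1, 0.0], "right"))   # wrong action or far from a demo -> near 0
```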


Empirical studies: learning curves

Figure: Left: Cartpole, Right: Mario



Empirical studies: effect of demonstration length (left) and quality (right)


Discussion

- Works surprisingly well with a very small number of (consistent!) demonstrations
- Can be robust to the quality of the demonstrations
- Performance depends on the feature covariance kernel: in our experiments it was of fixed width, which was a parameter
- Computationally expensive (a search for a max over demonstration pairs at every iteration), but speedup tricks are possible


Inverse RL through PBRS⁷

Inverse RL: given examples {(s_i, a_i)}_{i=0}^K of expert actions, infer the reward function that the expert policy has optimized, and solve the MDP w.r.t. it.

- This work doesn't assume access to the model, and solves the MDP from samples.

Same problem of relying on the expert being optimal. We could try the same trick as before:

1. Infer the expert reward function
2. Learn the potential function that expresses it, as discussed
3. Use PBRS to incorporate it safely

⁷ Suay, H. B., Brys, T., Taylor, M. E., and Chernova, S. Learning from demonstration for shaping through inverse reinforcement learning. In AAMAS 2016.


Empirically (Mario)

SBS: Φ is similarity-based
HAT: another RLfD benchmark
SIS: Φ = R (as a function of state)
DIS: Φ is learnt w.r.t. −R

There is benefit to learning a value function, rather than treating R as a myopic potential.


Outro

- We advocate using potentials when trying to modify the reward function.
- It adds an extra safeguard abstraction between the agent and imperfect domain knowledge.
- We discussed a few ways to obtain this abstraction.

Philosophically

Dense reward functions are dangerous: we don't know what loopholes the agent will find. E.g. Atari games are full of these.⁸

Having a minimalistic reward signal, and many overlapping potential fields that guide the agent around the environment, seems like an appealing way to think about learning.

⁸ https://openai.com/blog/faulty-reward-functions/


What's missing

- A lot of room for theoretical work
- Can we have theoretically grounded prescriptions for "good" potentials that guarantee improvement in learning, given some assumptions on the reward structure?
- Can we quantify the effect of PBRS beyond it being an exploration boost?
  - E.g. there is intuition that it reduces the planning horizon

Theorem (PBRS reduces the planning horizon⁹). Let M be the MDP (S, A, T, γ, R). Let F be a potential-based reward function w.r.t. Φ, s.t. ||Φ − V*_M||_∞ < ε, and let π be the optimal policy for M' = (S, A, T, γ', R + F) for some γ' < γ. Then:

V^π_M ≥ V*_M − O_γ((γ − γ')ε)

⁹ Ng. Shaping and policy search in reinforcement learning. PhD thesis. 2003.


Thank you for your attention!

Questions?



References

- Ng, Harada and Russell. "Policy Invariance Under Reward Transformations". ICML, 1999.
- Wiewiora, E. (2003). Potential-based shaping and Q-value initialization are equivalent. J. Artif. Intell. Res. (JAIR), 19, 205-208.
- Harutyunyan, A., Devlin, S., Vrancx, P., and Nowé, A. Expressing Arbitrary Reward Functions as Potential-Based Advice. In AAAI 2015.
- Harutyunyan, A., Brys, T., Vrancx, P., and Nowé, A. Shaping Mario with human advice (demonstration). In AAMAS 2015.
- Brys, T., Harutyunyan, A., Suay, H. B., Chernova, S., Taylor, M. E., and Nowé, A. Reinforcement Learning from Demonstration through Shaping. In IJCAI 2015.
- Suay, H. B., Brys, T., Taylor, M. E., and Chernova, S. Learning from demonstration for shaping through inverse reinforcement learning. In AAMAS 2016.
- Ng. Shaping and policy search in reinforcement learning. PhD thesis. 2003.