Universal Value Function Approximators

Tom Schaul, Dan Horgan, Karol Gregor, Dave Silver

Motivation

Forecasts about the environment
• = temporally abstract predictions (questions)
• not necessarily related to reward
• conditioned on a behavior
• (aka GVFs, nexting)
• many of them

Why?
• better, richer representations (features)
• decomposition, modularity
• temporally abstract planning, long horizons

Concretely

Subgoal forecasts
• Reaching any of a set of states, then
  • the episode terminates (γ = 0)
  • and a pseudo-reward of 1 is given
• Various time-horizons induced by γ

Q-values
• for the subgoal-optimal policy

Neural networks as function approximators
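
The subgoal forecast above can be made concrete on a toy problem. The following is a minimal sketch, assuming a 1-D chain of states with two actions (left, right) and a single subgoal state; the chain world, the function name `subgoal_q_values`, and the sweep count are illustrative assumptions, not part of the talk. It computes tabular Q-values for the subgoal-optimal policy, with a pseudo-reward of 1 and termination on reaching the subgoal.

```python
def subgoal_q_values(n_states, goal, gamma, n_sweeps=100):
    """Tabular Q(s, a; g) on a 1-D chain; action 0 = left, action 1 = right.

    Hypothetical toy setup: reaching `goal` gives pseudo-reward 1 and ends
    the episode, so nothing is bootstrapped past the subgoal.
    """
    steps = (-1, +1)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(n_sweeps):
        for s in range(n_states):
            if s == goal:
                continue  # episode already terminated here; value stays 0
            for a, step in enumerate(steps):
                s_next = min(max(s + step, 0), n_states - 1)  # walls at both ends
                if s_next == goal:
                    q[s][a] = 1.0  # pseudo-reward, then termination
                else:
                    q[s][a] = gamma * max(q[s_next])  # greedy (subgoal-optimal) backup
    return q


q = subgoal_q_values(n_states=5, goal=4, gamma=0.9)
for s, (left, right) in enumerate(q):
    print(f"s={s}  Q(left)={left:.3f}  Q(right)={right:.3f}")
```

In this sketch a smaller γ makes the values decay faster with distance from the subgoal, which is one way to read "various time-horizons induced by γ".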

Combinatorial numbers of subgoals

Why?
• because the environment admits tons of forecasts
• any of them could be useful for the task

How?
• efficiency
• exploit shared structure in value space
• generalize to similar subgoals

Outline

• Motivation
  • learn values for forecasts
  • efficiently for many subgoals
• Approach
  • new architecture
  • one neat trick
• Results

Universal Value Function Approximator

• a single neural network producing Q(s, a; g)
  • for many subgoals g
  • generalize between subgoals
  • compact
• UVFA (“you-fah”)

UVFA architectures

• Vanilla (monolithic)
• Two-stream
  • separate embeddings φ and ψ for states and subgoals
  • Q-values = dot-product of embeddings
  • (works better)
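
The two-stream idea can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the architecture used in the talk: the layer sizes, the `tanh` nonlinearity, the random weights, and the helper names `init_mlp`, `mlp`, and `two_stream_q` are all hypothetical, and φ is assumed here to take state-action features. The point it shows is only that each stream produces a k-dimensional embedding and the Q-value is their dot product.

```python
import numpy as np


def init_mlp(rng, in_dim, hidden_dim, out_dim):
    """Random weights for a one-hidden-layer network (illustrative sizes only)."""
    return {
        "w1": rng.normal(0.0, 0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.1, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def mlp(params, x):
    """Forward pass: features -> embedding."""
    hidden = np.tanh(x @ params["w1"] + params["b1"])
    return hidden @ params["w2"] + params["b2"]


def two_stream_q(phi_params, psi_params, sa_features, goal_features):
    """Q(s, a; g) = phi(s, a) . psi(g) for every (state-action, subgoal) pair."""
    phi = mlp(phi_params, sa_features)    # shape (n_pairs, k)
    psi = mlp(psi_params, goal_features)  # shape (n_goals, k)
    return phi @ psi.T                    # shape (n_pairs, n_goals)


rng = np.random.default_rng(0)
phi_params = init_mlp(rng, in_dim=8, hidden_dim=16, out_dim=4)  # k = 4
psi_params = init_mlp(rng, in_dim=3, hidden_dim=16, out_dim=4)
sa_features = rng.normal(size=(6, 8))    # 6 made-up state-action feature vectors
goal_features = rng.normal(size=(5, 3))  # 5 made-up subgoal feature vectors
q = two_stream_q(phi_params, psi_params, sa_features, goal_features)
print(q.shape)
```

A monolithic ("vanilla") variant would instead concatenate the state-action and subgoal features and feed them through a single network.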

UVFA learning

• Method 1: bootstrapping
  • some stability issues
• Method 2:
  • build training set of subgoal values
  • train with supervised objective
  • like neuro-fitted Q-learning
  • (works better)
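
As a rough illustration of how the two methods differ, they can be written as follows. The notation is an assumption rather than something shown on the slides: α is a step size, r_g the pseudo-reward for subgoal g, γ_g the subgoal-dependent discount (0 on termination), θ the network parameters, and Q*_g the precomputed subgoal values in the training set.

```latex
% Method 1 (bootstrapping): Q-learning-style update toward a goal-conditioned target
Q(s, a; g) \leftarrow Q(s, a; g)
  + \alpha \Big[ r_g + \gamma_g \max_{a'} Q(s', a'; g) - Q(s, a; g) \Big]

% Method 2 (supervised): squared-error regression onto precomputed subgoal values
\mathcal{L}(\theta) = \sum_{(s, a, g)} \big( Q(s, a; g, \theta) - Q^{*}_{g}(s, a) \big)^{2}
```

In Method 1 the target itself depends on the current estimate of Q, which is one plausible source of the stability issues mentioned above; in Method 2 the targets are fixed.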

Outline

• Motivation
  • learn values for forecasts
  • efficiently for many subgoals
• Approach
  • new architecture: UVFA
  • one neat trick
• Results

Trick for supervised UVFA learning: FLE

Stage 1: Factorize + Stage 2: Learn Embeddings

Stage 1: Factorize (low-rank)

● target embeddings for states and goals

[Figure: low-rank factorization of the value table into state and goal embeddings]

Stage 2: Learn Embeddings

● regression from state/subgoal features to target embeddings

(optional Stage 3): end-to-end fine-tuning

[Figure: embedding network taking (s, a) as input]
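
The two FLE stages can be sketched on a toy value table. This is a minimal sketch under stated assumptions, not the procedure from the talk: truncated SVD stands in for the low-rank factorization, linear least squares stands in for the regression networks, the optional fine-tuning stage is omitted, and the toy table, one-hot features, and function names (`factorize`, `learn_embeddings`, `fle_predict`) are all hypothetical.

```python
import numpy as np


def factorize(q_table, rank):
    """Stage 1: q_table ~ phi_hat @ psi_hat.T, here via truncated SVD (an assumption)."""
    u, sing, vt = np.linalg.svd(q_table, full_matrices=False)
    scale = np.sqrt(sing[:rank])  # split singular values evenly between both factors
    return u[:, :rank] * scale, vt[:rank].T * scale


def learn_embeddings(features, targets):
    """Stage 2: regress features onto target embeddings (linear least squares here)."""
    weights, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return weights


def fle_predict(q_table, state_features, goal_features, rank):
    """Run both stages, then rebuild the table as a dot product of learned embeddings."""
    phi_hat, psi_hat = factorize(q_table, rank)
    w_phi = learn_embeddings(state_features, phi_hat)
    w_psi = learn_embeddings(goal_features, psi_hat)
    return (state_features @ w_phi) @ (goal_features @ w_psi).T


# Toy table: rows are states, columns are subgoals, entries decay with distance.
n = 8
idx = np.arange(n)
q_table = 0.9 ** np.abs(idx[:, None] - idx[None, :])
one_hot = np.eye(n)  # one-hot features, so Stage 2 can fit the targets exactly

errors = {
    rank: np.linalg.norm(fle_predict(q_table, one_hot, one_hot, rank) - q_table)
    for rank in (1, 3, n)
}
for rank, err in errors.items():
    print(f"rank={rank}  reconstruction error={err:.4f}")
```

With richer features than one-hot vectors, Stage 2 is where generalization to unseen states or subgoals would come from, since the regression maps features (not table indices) to embeddings.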

FLE vs end-to-end regression

● between 10x and 100x faster

Outline

• Motivation
  • learn values for forecasts
  • efficiently for many subgoals
• Approach
  • new architecture: UVFA
  • one neat trick: FLE
• Results

Results: Low-rank is enough

Results: Low-rank embeddings

Results: Generalizing to new subgoals

Results: Transfer to new subgoals

Refining UVFA is much faster than learning from scratch

Results: Pacman pellet subgoals

[Figure panels: training set, test set]

Results: pellet subgoal values (test set)

[Figure panels: “truth”, UVFA generalization]

Summary

• UVFA
  • compactly represent values
  • for many subgoals
  • generalization
  • transfer learning
• FLE
  • a trick for efficiently training UVFAs in 2 stages
  • side-effect: interesting embedding spaces
  • scales to complex domains (Pacman from raw vision)

Bonus results: Extrapolation

even to subgoals in unseen fourth room:

[Figure panels: truth, UVFA]