
Page 1: Welcome!

NIPS 2007 Workshop

Welcome!

Hierarchical organization of behavior

• Thank you for coming

• Apologies to the skiers…

• Why we will be strict about timing

• Why we want the workshop to be interactive

Page 2: RL: Decision making

Goal: maximize reward (minimize punishment)

• Rewards/punishments may be delayed
• Outcomes may depend on the sequence of actions
• Credit assignment problem

Page 3: RL in a nutshell: formalization

Components of an RL task:
states - actions - transitions - rewards - policy - long-term values

Policy: π(S,a)
State values: V(S)
State-action values: Q(S,a)

[Figure: two-step decision tree: S1 branches to S2 and S3; the four leaf outcomes carry rewards 4, 0, 2, 2]
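To make the formalization concrete, here is a minimal Python sketch of these components for the two-step task in the figure; the dictionaries T and R, the state and action names, and the illustrative policy pi are my own encoding of the slide's diagram, not the authors' code.

# Toy two-step task from the slide: S1 branches left/right to S2/S3,
# and the four leaf outcomes are worth 4, 0, 2, 2.

# Transitions T(S, a) -> next state (None = terminal)
T = {
    ("S1", "L"): "S2", ("S1", "R"): "S3",
    ("S2", "L"): None, ("S2", "R"): None,
    ("S3", "L"): None, ("S3", "R"): None,
}

# Rewards r(S, a)
R = {
    ("S1", "L"): 0, ("S1", "R"): 0,
    ("S2", "L"): 4, ("S2", "R"): 0,
    ("S3", "L"): 2, ("S3", "R"): 2,
}

# An illustrative deterministic policy pi(S) -> a, and the long-term
# state values V(S) it implies (reward now plus value of what follows)
pi = {"S1": "L", "S2": "L", "S3": "L"}

def V(S):
    if S is None:
        return 0
    a = pi[S]
    return R[(S, a)] + V(T[(S, a)])

print(V("S1"))  # 4: take L at S1, then L at S2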

Page 4: RL in a nutshell: forward search

[Figure: forward-search tree: from S1 choose L or R, then L or R again at S2 or S3, reaching outcomes worth 4, 0, 2, 2]

Model-based RL:
• learn the model through experience (a cognitive map)
• choosing actions is hard
• goal-directed behavior; cortical

Model = T(ransitions) and R(ewards)

[Figure: the two-step decision tree again, with states S1, S2, S3 and outcome rewards 4, 0, 2, 2]
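As an illustration of why choosing actions from a model means searching, here is a minimal sketch that plans by exhaustively expanding the tree and backing up the best value, reusing the T and R tables from the previous sketch; the function plan and its recursion are illustrative, not the slide's own code.

def plan(S):
    """Forward search: return (best value, best action) by look-ahead from S."""
    if S is None:                  # terminal: nothing left to collect
        return 0, None
    best_value, best_action = float("-inf"), None
    for a in ("L", "R"):
        value = R[(S, a)] + plan(T[(S, a)])[0]   # r(S,a) + value of what follows
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

print(plan("S1"))  # (4, 'L'): searching through the model recovers the best choice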

Page 5: RL in a nutshell: cached values

Model-free RL: temporal difference learning

Trick #1: Long-term values are recursive
Q(S,a) = r(S,a) + V(S_next)
Q(S,a) = r(S,a) + max_a' Q(S',a')

TD learning: start with initial (wrong) Q(S,a)
PE = r(S,a) + max_a' Q(S',a') - Q(S,a)
Q(S,a)_new = Q(S,a)_old + PE

[Figure: the two-step decision tree again, with states S1, S2, S3 and outcome rewards 4, 0, 2, 2]
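A minimal sketch of this TD update running on the same toy task (T and R as before). The learning rate alpha and the purely random exploration policy are assumptions of the sketch; the slide's update adds the prediction error directly.

import random

Q = {sa: 0.0 for sa in R}        # start with initial (wrong) cached values
alpha = 0.5                      # learning rate (assumed; the slide omits it)

for episode in range(500):
    S = "S1"
    while S is not None:
        a = random.choice(["L", "R"])                    # explore at random
        S_next, r = T[(S, a)], R[(S, a)]
        next_best = 0.0 if S_next is None else max(Q[(S_next, b)] for b in ("L", "R"))
        PE = r + next_best - Q[(S, a)]                   # prediction error
        Q[(S, a)] += alpha * PE                          # Q_new = Q_old + alpha * PE
        S = S_next

print({sa: round(v, 1) for sa, v in Q.items()})
# approaches Q(S1,L)=4, Q(S1,R)=2, Q(S2,L)=4, Q(S2,R)=0, Q(S3,L)=2, Q(S3,R)=2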

Page 6: RL in a nutshell: cached values

Model-free RL: temporal difference learning
• choosing actions is easy (but lots of practice is needed to learn)
• habitual behavior; basal ganglia

[Figure: the two-step decision tree again, with states S1, S2, S3 and outcome rewards 4, 0, 2, 2]

Trick #2: Can learn values without a model

Q(S1,L) = 4   Q(S1,R) = 2
Q(S2,L) = 4   Q(S2,R) = 0
Q(S3,L) = 2   Q(S3,R) = 2
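With values cached, choosing an action reduces to a table lookup and an argmax over this table; a minimal sketch using the Q dictionary from the TD sketch above (the helper act is illustrative).

def act(S, Q):
    """Habitual choice: pick the action with the highest cached value."""
    return max(("L", "R"), key=lambda a: Q[(S, a)])

print(act("S1", Q))  # 'L' -- chosen directly from cached values, no search needed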

Page 7: RL in real-world tasks…

Model-based vs. model-free learning and control

Q(S1,L) = 4   Q(S1,R) = 2
Q(S2,L) = 4   Q(S2,R) = 0
Q(S3,L) = 2   Q(S3,R) = 2

[Figure: the same two-step task shown both as a forward-search tree and as a cached-value table]

Scaling problem!

Page 8: Hierarchical RL: What is it?

Real-world behavior is hierarchical

Taking a shower:
1. set water temp
2. get wet
3. shampoo
4. soap
5. turn off water
6. dry off

[Figure: "set water temp" as a control loop: if too cold, add hot; if too hot, add cold; wait 5 sec and check again until just right (success)]

simplified control, disambiguation, encapsulation

Making coffee:
1. pour coffee
2. add sugar
3. add milk
4. stir
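To make the slide's point (simplified control, encapsulation) concrete, here is a minimal sketch of the shower routine as nested routines, where setting the water temperature is its own little control loop with its own termination condition; the temperature thresholds, readings, and function names are illustrative, not from the slide.

def set_water_temp(read_temp, too_cold=36.0, too_hot=40.0):
    """Subroutine: keep adjusting until the water is just right."""
    while True:
        t = read_temp()
        if t < too_cold:
            print("add hot")
        elif t > too_hot:
            print("add cold")
        else:
            return "success"          # the subroutine's termination condition

def shower(read_temp):
    """Top level: a short sequence of temporally extended steps."""
    set_water_temp(read_temp)         # one 'action' hiding a whole control loop
    for step in ("get wet", "shampoo", "soap", "turn off water", "dry off"):
        print(step)

temps = iter([34.0, 41.0, 38.0])      # illustrative readings: too cold, too hot, just right
shower(lambda: next(temps))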

Page 9: HRL: (in)formal framework

Hierarchical RL: What is it?

options - skills - macros - temporally abstract actions
(Sutton, McGovern, Dietterich, Barto, Precup, Singh, Parr…)

Termination condition = (sub)goal state
Option policy learning: via pseudo-reward (model-based or model-free)

Option: set water temperature
• initiation set: S1, S2, S8, …
• policy (action probabilities per state): S1: 0.8, 0.1, 0.1; S2: 0.1, 0.1, 0.8; S3: 0, 1, 0
• termination conditions: S1 (0.1), S2 (0.1), S3 (0.9)
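A minimal sketch of an option as this triple (initiation set, policy, termination conditions), echoing the example above; the Option dataclass is my own encoding, and the action names in the policy are hypothetical since the slide's table columns are unlabeled.

from dataclasses import dataclass
from typing import Dict, Set

@dataclass
class Option:
    initiation_set: Set[str]              # states in which the option may be invoked
    policy: Dict[str, Dict[str, float]]   # state -> probability of each action
    termination: Dict[str, float]         # state -> probability the option ends here

set_water_temperature = Option(
    initiation_set={"S1", "S2", "S8"},    # "S1, S2, S8, ..." on the slide
    policy={
        "S1": {"add_hot": 0.8, "add_cold": 0.1, "wait": 0.1},
        "S2": {"add_hot": 0.1, "add_cold": 0.1, "wait": 0.8},
        "S3": {"add_hot": 0.0, "add_cold": 1.0, "wait": 0.0},
    },
    termination={"S1": 0.1, "S2": 0.1, "S3": 0.9},
)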

Page 10: HRL: a toy example

Hierarchical RL: What is it?

S: start, G: goal
Options: going to doors
Actions: primitive moves + 2 door options

Page 11: Advantages of HRL

Hierarchical RL: What is it?

1. Faster learning (mitigates the scaling problem)
2. Transfer of knowledge from previous tasks (generalization, shaping)

RL: no longer 'tabula rasa'

Page 12: Disadvantages (or: the cost) of HRL

Hierarchical RL: What is it?

1. Need the 'right' options - how to learn them?
2. Suboptimal behavior ("negative transfer"; habits)
3. More complex learning/control structure

no free lunches…