
Page 1

Neural Reinforcement Learning

Peter Dayan, Gatsby Computational Neuroscience Unit

thanks to Yael Niv for some slides

Page 2

Marrian Conditioning

• Ethology
  – optimality
  – appropriateness

• Psychology
  – classical/operant conditioning

• Computation
  – dynamic programming
  – Kalman filtering

• Algorithm
  – TD/delta rules
  – simple weights

• Neurobiology
  – neuromodulators; midbrain; sub-cortical and cortical structures

prediction: of important events
control: in the light of those predictions

Page 3

Page 4

Plan

• prediction; classical conditioning
  – Rescorla-Wagner/TD learning
  – dopamine

• action selection; instrumental conditioning
  – direct/indirect immediate actors
  – trajectory optimization
  – multiple mechanisms
  – vigour

• Bayesian decision theory and computational psychiatry

Page 5

Animals learn predictions (Ivan Pavlov)

[figure: Pavlovian conditioning. CS = conditioned stimulus; US = unconditioned stimulus; UR = unconditioned response (reflex); CR = conditioned response (reflex)]

Page 6

Animals learn predictions (Ivan Pavlov)

[figure: acquisition and extinction curves; % CRs vs. blocks of 10 trials]

very general across species, stimuli, behaviors

Page 7

But do they really?

temporal contiguity is not enough - need contingency

1. Rescorla’s control

P(food | light) > P(food | no light)

Page 8

But do they really?

contingency is not enough either… need surprise

2. Kamin’s blocking

Page 9

But do they really?

seems like stimuli compete for learning

3. Reynolds' overshadowing

Page 10

Theories of prediction learning: Goals

• Explain how the CS acquires “value”
• When (under what conditions) does this happen?
• Basic phenomena: gradual learning and extinction curves
• More elaborate behavioral phenomena
• (Neural data)

P.S. Why are we looking at old-fashioned Pavlovian conditioning?

it is the perfect uncontaminated test case for examining prediction learning on its own

Page 11

error-driven learning: change in value is proportional to the difference between actual and predicted outcome

Assumptions:

1. learning is driven by error (formalizes notion of surprise)

2. summation of predictors is linear

A simple model - but very powerful!
  – explains: gradual acquisition & extinction, blocking, overshadowing, conditioned inhibition, and more…
  – predicted overexpectation

note: US as “special stimulus”

Rescorla & Wagner (1972)

$\Delta V_{CS_i} = \eta\,\big(r_{US} - \textstyle\sum_j V_{CS_j}\big)$

Page 12

• how does this explain acquisition and extinction?

• what would V look like with 50% reinforcement? e.g. 1 1 0 1 0 0 1 1 1 0 0

– what would V be on average after learning?

– what would the error term be on average after learning?

Rescorla-Wagner learning

$V_{t+1} = V_t + \eta\,(r_t - V_t)$

Page 13

how is the prediction on trial (t) influenced by rewards at times (t-1), (t-2), …?

Rescorla-Wagner learning

$V_{t+1} = V_t + \eta\,(r_t - V_t) = (1-\eta)\,V_t + \eta\,r_t$

$V_t = \sum_{i=1}^{t} \eta\,(1-\eta)^{t-i}\, r_i$

[figure: the weight given to the reward i trials back decays exponentially (trials −10 … −1; weights 0 … 0.6)]

recent rewards weigh more heavily

why is this sensible?

learning rate = forgetting rate!

the R-W rule estimates expected reward using a weighted average of past rewards
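To make the rule concrete, here is a minimal numerical sketch in Python (the learning rate η = 0.1 and the 50% reinforcement schedule are illustrative choices, echoing the question on the previous slide):

```python
import numpy as np

rng = np.random.default_rng(0)
eta = 0.1                                   # learning rate (illustrative)
rewards = rng.integers(0, 2, size=500)      # 50% reinforcement schedule

V = 0.0
values = []
for r in rewards:
    V += eta * (r - V)                      # Rescorla-Wagner / delta rule
    values.append(V)

values = np.array(values, dtype=float)
print(values[-5:])                               # V hovers around the reward probability (~0.5)
print(np.mean(rewards[-100:] - values[-100:]))   # average prediction error ~0 after learning
```

Because the update is equivalent to $V_{t+1} = (1-\eta)V_t + \eta r_t$, unrolling it gives exactly the exponentially weighted average of past rewards shown above.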

Page 14

Summary so far

Predictions are useful for behavior

Animals (and people) learn predictions (Pavlovian conditioning = prediction learning)

Prediction learning can be explained by an error-correcting learning rule (Rescorla-Wagner): predictions are learned from experiencing the world and comparing predictions to reality

Marr:

  – computational problem: predict the US, i.e. minimise the squared error $E\big[(r_{US} - \sum_j V_{CS_j})^2\big]$, whose solution is $\sum_j V_{CS_j} = E[r_{US}]$

  – algorithm: the delta rule $\Delta V_{CS_i} = \eta\,(r_{US} - \sum_j V_{CS_j})$

Page 15

But: second order conditioning

[figure: conditioned responding vs. number of phase 2 pairings]

animals learn that a predictor of a predictor is also a predictor of reward!

not interested solely in predicting immediate reward

phase 1:

phase 2:

test: ?

what do you think will happen?

what would Rescorla-Wagner learning predict here?

Page 16

let's start over: this time from the top

Marr's 3 levels:
• The problem: optimal prediction of future reward

• what’s the obvious prediction error?

• what’s the obvious problem with this?

$V_t = E\Big[\sum_{i=t}^{T} r_i\Big]$

want to predict expected sum of future reward in a trial/episode
(N.B. here t indexes time within a trial)

obvious prediction error: $\delta_t = \sum_{i=t}^{T} r_i - V_t$; compare Rescorla-Wagner: $\delta = r_{US} - V_{CS}$

Page 17

let's start over: this time from the top

Marr's 3 levels:
• The problem: optimal prediction of future reward

$V_t = E\Big[\sum_{i=t}^{T} r_i\Big]$

want to predict expected sum of future reward in a trial/episode

$V_t = E\big[r_t + r_{t+1} + r_{t+2} + \dots + r_T\big]$
$\phantom{V_t} = E[r_t] + E\big[r_{t+1} + r_{t+2} + \dots + r_T\big]$
$\phantom{V_t} = E\big[r_t + V_{t+1}\big]$

Bellman eqn for policy evaluation

Page 18

let's start over: this time from the top

Marr's 3 levels:
• The problem: optimal prediction of future reward
• The algorithm: temporal difference learning

$V_t = E\big[r_t + V_{t+1}\big]$

$V_t \leftarrow (1-\eta)\,V_t + \eta\,(r_t + V_{t+1})$

$V_t \leftarrow V_t + \eta\,(r_t + V_{t+1} - V_t)$

temporal difference prediction error: $\delta_t = r_t + V_{t+1} - V_t$

compare to Rescorla-Wagner: $V_{t+1} = V_t + \eta\,(r_t - V_t)$
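A minimal Python sketch of TD learning within a trial may help fix ideas (the trial length, learning rate and reward timing are illustrative assumptions: a predictor at t = 0 and a single reward at the last time step). It shows how the prediction error migrates backwards from the time of reward, as in the dopamine recordings on the next slides:

```python
import numpy as np

T = 10        # time steps per trial (assumed)
eta = 0.1     # learning rate (assumed)
V = np.zeros(T + 1)   # V[T] = 0: nothing is predicted after the trial ends

def run_trial(V):
    """One trial: predictor at t=0, reward of 1 at t=T-1. Returns the TD errors."""
    deltas = np.zeros(T)
    for t in range(T):
        r = 1.0 if t == T - 1 else 0.0
        deltas[t] = r + V[t + 1] - V[t]   # delta_t = r_t + V_{t+1} - V_t
        V[t] += eta * deltas[t]
    return deltas

early = run_trial(V)
for _ in range(500):
    late = run_trial(V)

print(np.round(early, 2))  # error sits at the time of reward
print(np.round(late, 2))   # error has propagated back and largely vanished;
                           # in the experiments the CS itself arrives unpredictably,
                           # so a positive error remains at CS onset
```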

Page 19

Dopamine

Page 20

Rewards rather than Punishments

TD error: $\delta(t) = r(t) + V(t+1) - V(t)$

[figure: recordings from dopamine cells in VTA/SNc (Schultz et al). Panels: no prediction, reward (R); prediction (CS), reward; prediction (CS), no reward]

Page 21

Risk Experiment

[figure: trial structure: stimulus < 1 sec; choice 0.5 sec; feedback "You won 40 cents"; 5 sec ISI; 2-5 sec ITI]

19 subjects (dropped 3 non-learners, N = 16); 3T scanner, TR = 2 sec, interleaved
234 trials: 130 choice, 104 single stimulus; randomly ordered and counterbalanced

5 stimuli: 40¢, 20¢, 0/40¢, 0¢, 0¢

Page 22

Neural results: Prediction Errors

what would a prediction error look like (in BOLD)?

Page 23

Neural results: Prediction errors in NAC

unbiased anatomical ROI in nucleus accumbens (marked per subject*)

* thanks to Laura deSouza

raw BOLD (avg over all subjects)

can actually decide between different neuroeconomic models of risk

Page 24

Plan

• prediction; classical conditioning
  – Rescorla-Wagner/TD learning
  – dopamine

• action selection; instrumental conditioning
  – direct/indirect immediate actors
  – trajectory optimization
  – multiple mechanisms
  – vigour

• Bayesian decision theory and computational psychiatry

Page 25

Action Selection

• Evolutionary specification
• Immediate reinforcement:
  – leg flexion
  – Thorndike puzzle box
  – pigeon; rat; human matching
• Delayed reinforcement:
  – these tasks
  – mazes
  – chess

Page 26

Page 27

Immediate Reinforcement

• stochastic policy: $P[L] = \sigma\big(\beta\,(m_L - m_R)\big) = \dfrac{1}{1 + e^{-\beta\,(m_L - m_R)}}$, $\;P[R] = 1 - P[L]$

• based on action values: $m_L,\; m_R$

Page 28

Indirect Actor

use RW rule: $m_a \leftarrow m_a + \eta\,(r_a - m_a)$ for the chosen action $a$

reward probabilities $p^r_L = 0.05$, $p^r_R = 0.25$; the reward contingencies for L and R switch every 100 trials
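A minimal Python sketch of this indirect actor (the sigmoidal policy, inverse temperature and the specific reward probabilities are illustrative assumptions consistent with the reconstruction above):

```python
import numpy as np

rng = np.random.default_rng(1)
eta, beta = 0.05, 2.0               # learning rate and inverse temperature (illustrative)
p_reward = {"L": 0.05, "R": 0.25}   # reward probabilities (swap every 100 trials)

m = {"L": 0.0, "R": 0.0}            # action values
choices = []
for t in range(1000):
    if t % 100 == 0 and t > 0:      # contingencies switch every 100 trials
        p_reward["L"], p_reward["R"] = p_reward["R"], p_reward["L"]
    p_L = 1.0 / (1.0 + np.exp(-beta * (m["L"] - m["R"])))   # sigmoidal policy
    a = "L" if rng.random() < p_L else "R"
    r = float(rng.random() < p_reward[a])
    m[a] += eta * (r - m[a])        # RW rule applied to the chosen action only
    choices.append(a)

# the actor should track the better lever, lagging a little after each switch
print("".join(choices[:120]))
```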

Page 29

Direct Actor

$E(\mathbf{m}) = P[L]\,r_L + P[R]\,r_R$

$\dfrac{\partial P[L]}{\partial m_L} = \beta\,P[L]\,P[R] = -\dfrac{\partial P[R]}{\partial m_L}$

$\dfrac{\partial E(\mathbf{m})}{\partial m_L} = \dfrac{\partial P[L]}{\partial m_L}\,r_L + \dfrac{\partial P[R]}{\partial m_L}\,r_R = \beta\,P[L]\,\big(r_L - E(\mathbf{m})\big)$

sampled version: $\Delta m_L \propto \beta\,\big(r_L - \bar{E}\big)$ if L is chosen

update after choosing action $a$: $m_b \leftarrow m_b + \varepsilon\,\big(r_a - \bar{E}\big)\big(\delta_{ab} - P[b]\big)$ for $b \in \{L, R\}$, so the chosen action's propensity gets the factor $(1 - P[a])$
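A minimal Python sketch of the direct actor update just derived (the reward probabilities, step size ε, inverse temperature β and the running-average baseline $\bar{E}$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
eps, beta = 0.1, 1.0               # step size and inverse temperature (illustrative)
p_reward = {"L": 0.2, "R": 0.8}    # illustrative reward probabilities

m = {"L": 0.0, "R": 0.0}           # action propensities
E_bar = 0.0                        # running estimate of the average reward
for t in range(2000):
    p_L = 1.0 / (1.0 + np.exp(-beta * (m["L"] - m["R"])))
    P = {"L": p_L, "R": 1.0 - p_L}
    a = "L" if rng.random() < p_L else "R"
    r = float(rng.random() < p_reward[a])
    for b in ("L", "R"):           # m_b <- m_b + eps*(r - E_bar)*(delta_ab - P[b])
        m[b] += eps * (r - E_bar) * ((1.0 if b == a else 0.0) - P[b])
    E_bar += 0.05 * (r - E_bar)    # slowly updated baseline

print(m)   # the propensity for the richer action (R here) should dominate
```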

Page 30

Direct Actor

Page 31

Action at a (Temporal) Distance

• learning an appropriate action at S1:
  – depends on the actions at S2 and S3
  – gains no immediate feedback
• idea: use prediction as surrogate feedback

[figure: tree of states, S1 branching to S2 and S3, with terminal rewards 0, 4, 2, 2]

Page 32

Direct Action Propensities

$\delta(t) = r(t) + V(t+1) - V(t)$

start with a policy $\pi(S; L)$ based on propensities $m(S; L),\, m(S; R)$
evaluate it: $V(S_1),\, V(S_2),\, V(S_3)$
improve it: update $m(S; L),\, m(S; R)$ in proportion to $\delta(t)$
thus choose L more frequently than R at S1: $m(S_1; L)$ grows

[figure: time courses of $\delta(t)$ and the propensities at S1, S2, S3 over learning; see the sketch below]
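A minimal actor-critic sketch of this evaluate-and-improve loop on the small tree from the previous slide (the state layout, the terminal rewards 0, 4, 2, 2 and all parameters are reconstructions/assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
eta, eps, beta = 0.1, 0.1, 1.0             # critic rate, actor rate, inverse temperature (assumed)
rewards = {"S2": {"L": 4.0, "R": 0.0},     # terminal rewards as reconstructed from the figure
           "S3": {"L": 2.0, "R": 2.0}}
V = {"S1": 0.0, "S2": 0.0, "S3": 0.0}      # critic
m = {s: {"L": 0.0, "R": 0.0} for s in V}   # actor propensities

def choose(s):
    p_L = 1.0 / (1.0 + np.exp(-beta * (m[s]["L"] - m[s]["R"])))
    return "L" if rng.random() < p_L else "R"

for episode in range(2000):
    # step 1: S1 -> S2 (via L) or S3 (via R), no immediate reward
    a1 = choose("S1")
    s2 = "S2" if a1 == "L" else "S3"
    delta = 0.0 + V[s2] - V["S1"]          # prediction as surrogate feedback
    V["S1"] += eta * delta
    m["S1"][a1] += eps * delta
    # step 2: terminal choice at S2/S3, reward delivered
    a2 = choose(s2)
    delta = rewards[s2][a2] + 0.0 - V[s2]
    V[s2] += eta * delta
    m[s2][a2] += eps * delta

print({s: round(v, 2) for s, v in V.items()})   # V(S2) > V(S3); V(S1) approaches V(S2)
print(m["S1"])                                  # L propensity at S1 dominates
```

The critic's predictions V(S2), V(S3) act as the surrogate feedback for the choice made at S1, which never receives immediate reward.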

Page 33

Policy

• spiraling links between striatum and dopamine system: ventral to dorsal

if $\delta > 0$:
  • value is too pessimistic
  • action is better than average

[figure: spiral of cortico-striatal loops (vmPFC; OFC/dACC; dPFC; SMA; Mx), alongside the critic V over S1, S2, S3]

Page 34

Variants: SARSA

$Q^*(1, C) = E\big[r_t + V^*(x_{t+1}) \,\big|\, x_t = 1,\, u_t = C\big]$

$Q(1, C) \leftarrow Q(1, C) + \eta\,\big(r_t + Q(2, u_{\mathrm{actual}}) - Q(1, C)\big)$

Morris et al, 2006

Page 35

Variants: Q learning

$Q^*(1, C) = E\big[r_t + V^*(x_{t+1}) \,\big|\, x_t = 1,\, u_t = C\big]$

$Q(1, C) \leftarrow Q(1, C) + \eta\,\big(r_t + \max_u Q(2, u) - Q(1, C)\big)$

Roesch et al, 2007
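The two variants differ only in the bootstrapping target. A minimal sketch (the state/action encodings, ε-greedy behaviour policy and learning rate are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
eta, epsilon, n_states, n_actions = 0.1, 0.1, 3, 2   # assumed sizes and parameters
Q_sarsa = np.zeros((n_states, n_actions))
Q_qlearn = np.zeros((n_states, n_actions))

def eps_greedy(Q, s):
    return rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))

def sarsa_update(Q, s, a, r, s_next, a_next):
    # on-policy: bootstrap from the action actually taken next
    Q[s, a] += eta * (r + Q[s_next, a_next] - Q[s, a])

def q_update(Q, s, a, r, s_next):
    # off-policy: bootstrap from the best available next action
    Q[s, a] += eta * (r + np.max(Q[s_next]) - Q[s, a])

# example transition (hypothetical): from state 1, action 0, reward 0, to state 2
s, a, r, s_next = 1, 0, 0.0, 2
a_next = eps_greedy(Q_sarsa, s_next)
sarsa_update(Q_sarsa, s, a, r, s_next, a_next)
q_update(Q_qlearn, s, a, r, s_next)
```

In the slides, Morris et al (2006) is paired with the SARSA-like target and Roesch et al (2007) with the Q-learning-like target: the question is which of the two the dopaminergic prediction error reflects.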

Page 36

Summary

• prediction learning
  – Bellman evaluation
• actor-critic
  – asynchronous policy iteration
• indirect method (Q learning)
  – asynchronous value iteration

$V^*(1) = E\big[r_t + V^*(x_{t+1}) \,\big|\, x_t = 1\big]$

$Q^*(1, C) = E\big[r_t + V^*(x_{t+1}) \,\big|\, x_t = 1,\, u_t = C\big]$

Page 37

Three Decision Makers

• tree search
• position evaluation
• situation memory

Page 38

Tree-Search/Model-Based System

• Tolmanian forward model
• forwards/backwards tree search
• motivationally flexible
• OFC; dlPFC; dorsomedial striatum; BLA?
• statistically efficient
• computationally catastrophic

Page 39

Habit/Model-Free System

• works by minimizing inconsistency:

$V(x(t)) \approx r(t) + r(t+1) + \dots \approx r(t) + V(x(t+1))$

$\delta(t) = r(t) + V(x(t+1)) - V(x(t))$

• model-free, cached
• dopamine; CEN?
• dorsolateral striatum
• motivationally insensitive
• statistically inefficient
• computationally congenial

Page 40

Need a Canary...

• if a → c and c → £££, then do more of a or b?
  – MB: b
  – MF: a (or even no effect)

[figure: actions a and b; state c]

Page 41

Behaviour

• action values depend on both systems:

$Q_{\mathrm{tot}}(x, a) = w\,Q_{\mathrm{MB}}(x, a) + (1 - w)\,Q_{\mathrm{MF}}(x, a)$

• expect that $w$ will vary by subject (but be fixed)

Page 42

Neural Prediction Errors

• note that MB RL does not use this prediction error – training signal?

[figure: R ventral striatum ROI (anatomical definition)]

Page 43

Pavlovian Control

Keay & Bandler, 2001

Page 44

Vigour

• Two components to choice:
  – what:
    • lever pressing
    • direction to run
    • meal to choose
  – when / how fast / how vigorously:
    • free operant tasks
• real-valued DP

Page 45

The model

[figure: semi-Markov chain S0 → S1 → S2 toward a goal. At each state the animal chooses (action, latency τ), e.g. (LP, τ1) then (LP, τ2); each choice incurs costs (a unit cost $C_u$ and a vigour cost $C_v/\tau$, so faster is costlier) and may yield rewards (probability $P_R$, utility $U_R$). Available actions: LP (lever press), NP (nose poke), Other. The free choices are which action and how fast.]

Page 46

The model

[figure: the same chain S0 → S1 → S2, with (action, latency) choices and their costs and rewards]

Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs per unit time)

⇒ average-reward RL (ARL)

Page 47

Average Reward RL

Compute differential values of actions.

Differential value of taking action L with latency τ when in state x:

$Q_{L,\tau}(x) = \text{Rewards} - \text{Costs} + \text{Future Returns} = P_R U_R - C_u - \dfrac{C_v}{\tau} - \tau\rho + V(x')$

ρ = average rewards minus costs, per unit time

• steady state behavior (not learning dynamics)

(Extension of Schwartz 1993)

Page 48

Average Reward Cost/Benefit Tradeoffs

1. Which action to take?
  ⇒ choose the action with the largest expected reward minus cost

2. How fast to perform it?
  • slower is less costly (vigour cost)
  • but slower delays (all) rewards; the net rate of rewards is the cost of delay (opportunity cost of time)
  ⇒ choose the rate that balances vigour and opportunity costs

explains faster (irrelevant) actions under hunger, etc.

masochism

Page 49

Optimal response rates

[figure: experimental data vs. model simulation (Niv, Dayan, Joel, unpublished). Top row: probability of the 1st nose poke vs. seconds since reinforcement (0 to 1.5 s); bottom row: LP rate per minute vs. seconds since reinforcement (0 to 40 s)]

Page 50

Effects of motivation (in the model): RR25

[figure: mean latency of LP and Other responses under low vs. high utility; higher utility shortens all latencies (the energizing effect)]

$Q(x, u, \tau) = p_r U_R - C_u - \dfrac{C_v}{\tau} - \tau\rho + V(x')$

$\dfrac{\partial Q(x, u, \tau)}{\partial \tau} = \dfrac{C_v}{\tau^2} - \rho = 0$

$\tau_{\mathrm{opt}} = \sqrt{\dfrac{C_v}{\rho}}$
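A minimal numerical check of this optimality condition (the cost coefficient and the two values of ρ are illustrative): the best latency falls as the net reward rate rises, which is the energizing effect of motivation in the model.

```python
import numpy as np

C_v = 1.0                     # vigour cost coefficient (illustrative)
taus = np.linspace(0.05, 5.0, 2000)

for rho in (0.5, 2.0):        # low vs. high net reward rate (e.g. low vs. high utility)
    # the tau-dependent part of Q(x, u, tau): -C_v/tau - tau*rho
    q_tau = -C_v / taus - taus * rho
    tau_best = taus[np.argmax(q_tau)]
    print(rho, round(tau_best, 3), round(float(np.sqrt(C_v / rho)), 3))  # numeric vs. analytic sqrt(C_v/rho)
```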

Page 51

Relation to Dopamine

Phasic dopamine firing = reward prediction error

What about tonic dopamine?

Page 52

Tonic dopamine hypothesis

[figure: firing rate vs. reaction time (Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992)]

…also explains effects of phasic dopamine on response times

Page 53

Plan

• prediction; classical conditioning
  – Rescorla-Wagner/TD learning
  – dopamine

• action selection; instrumental conditioning
  – direct/indirect immediate actors
  – trajectory optimization
  – multiple mechanisms
  – vigour

• Bayesian decision theory and computational psychiatry

Page 54

The Computational Level

• (approximate) Bayesian decision theory
  – states
    • ignorant: have to represent and infer
    • combine priors & likelihoods
    → sensory cortex, PFC
  – actions
    • catastrophic calculations
    • short-cuts from habitization; evolutionary programming
    → striatum; (pre-)motor cortex, PFC
  – utilities
    • state-dependent
    • some approximations can't keep up
    → hypothalamus; VTA/SNc; OFC

choice: action that maximises the expected utility over the states

Page 55

How Does the Brain Fail to Work?

• wrong problem
  – priors; likelihoods; utilities
• right problem, wrong inference
  – e.g., early/over-habitization
  – over-reliance on Pavlovian mechanisms
• right problem, right inference, wrong environment
  – learned helplessness
  – discounting from inconsistency

not completely independent: e.g., miscalibration

Page 56

Conditioning

• Ethology
  – optimality
  – appropriateness
• Psychology
  – classical/operant conditioning
• Computation
  – dynamic programming
  – Kalman filtering
• Algorithm
  – TD/delta rules
  – simple weights
• Neurobiology
  – neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum

prediction: of important events
control: in the light of those predictions