TRANSCRIPT
dopamine and prediction error
[Figure: dopamine firing in three conditions: no prediction; prediction, reward; prediction, no reward]
TD error: $\delta(t) = r_t + V_{t+1} - V_t$
Schultz 1997
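A minimal sketch (assuming a simple undiscounted tabular setting; not from the slides) of how this TD error drives value learning:

```python
import numpy as np

def td_update(V, s, s_next, r, alpha=0.1):
    """One temporal-difference update: delta = r + V[s_next] - V[s]."""
    delta = r + V[s_next] - V[s]  # prediction error (the putative dopamine signal)
    V[s] += alpha * delta         # move the prediction toward the target
    return delta

# Toy example: a cue (state 0) always followed by a reward of 1.
V = np.zeros(2)
for trial in range(100):
    td_update(V, s=0, s_next=1, r=1.0)
print(V[0])  # approaches 1.0: the cue comes to predict the reward
```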
humans are no different
• dorsomedial striatum/PFC – goal-directed control
• dorsolateral striatum – habitual control
• ventral striatum – Pavlovian control; value signals
• dopamine...
in humans…
[Trial timeline: stimulus < 1 sec; 0.5 sec; "You won 40 cents"; 5 sec ISI; 2-5 sec ITI]
19 subjects (dropped 3 non-learners, N = 16); 3T scanner, TR = 2 sec, interleaved
234 trials: 130 choice, 104 single-stimulus; randomly ordered and counterbalanced
5 stimuli, worth: 40¢, 20¢, 0/40¢, 0¢, 0¢
what would a prediction error look like (in BOLD)?
prediction errors in NAc
unbiased anatomical ROI in nucleus accumbens (marked per subject*)
* thanks to Laura deSouza
raw BOLD (averaged over all subjects)
can actually decide between different neuroeconomic models of risk
Polar Exploration
Peter Dayan, Nathaniel Daw, John O'Doherty, Ray Dolan
Exploration vs. exploitation
Classic dilemma in learned decision making
For unfamiliar outcomes, how to trade off learning about their values against exploiting knowledge already gained
Exploration vs. exploitation
• Exploitation
– Choose the action expected to be best
– May never discover something better
• Exploration
– Choose an action expected to be worse
– If it is worse, go back to the original; if it turns out better, exploit it in the future
– Balanced by the long-term gain if it turns out better (even for risk- or ambiguity-averse subjects)
– N.B.: learning is non-trivial when outcomes are noisy or changing
[Figure: schematic reward-vs-time traces illustrating the two strategies]
Bayesian analysis (Gittins 1972)
• Tractable dynamic program in a restricted class of problems – the "n-armed bandit"
• Solution requires balancing
– Expected outcome values
– Uncertainty (need for exploration)
– Horizon/discounting (time to exploit)
• Optimal policy: explore systematically
– Choose the best sum of value plus bonus
– Bonus increases with uncertainty
• Intractable in the general setting – various heuristics used in practice
[Figure: per-action value plus uncertainty bonus]
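A minimal sketch of the "value plus uncertainty bonus" policy described above (an illustrative approximation in the spirit of a Gittins index, not the exact index computation):

```python
import numpy as np

def choose_with_bonus(mu, sigma, bonus_weight=1.0):
    """Pick the arm maximizing estimated value plus an uncertainty bonus."""
    index = mu + bonus_weight * sigma  # bonus grows with posterior uncertainty
    return int(np.argmax(index))

mu = np.array([40.0, 55.0, 50.0, 45.0])
sigma = np.array([2.0, 3.0, 12.0, 5.0])
print(choose_with_bonus(mu, sigma))  # arm 2 wins: 50 + 12 beats 55 + 3
```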
Experiment
• How do humans handle the tradeoff?
• Computation: which strategies fit behavior?
– Several popular approximations
• Difference: what information influences exploration?
• Neural substrate: what systems are involved?
– PFC, high-level control
– Competitive decision systems (Daw et al. 2005)
– Neuromodulators: dopamine (Kakade & Dayan 2002); norepinephrine (Usher et al. 1999)
Task design
Subjects (14 healthy, right-handed) repeatedly choose between four slot machines for points ("money"), in the scanner.
Trial onset: slots revealed
+ ~430 ms: subject makes choice; chosen slot spins
+ ~3000 ms: outcome revealed ("obtained 57 points")
+ ~1000 ms: screen cleared; trial ends
Payoff structure
• Noisy, to require integration of data – subjects learn about payoffs only by sampling them
• Nonstationary, to encourage ongoing exploration (Gaussian drift with decay)
[Figure: example payoff timecourses for the four slots]
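A short simulation of the payoff process as described (Gaussian drift with decay toward a long-run mean; the parameter values here are illustrative, not the experiment's actual settings):

```python
import numpy as np

rng = np.random.default_rng(0)

LAMBDA = 0.98    # decay toward the long-run mean (illustrative)
THETA = 50.0     # long-run mean payoff (illustrative)
SD_DRIFT = 3.0   # diffusion noise on each slot's underlying mean
SD_PAYOFF = 4.0  # observation noise on each delivered payoff

def step_means(mu):
    """Gaussian drift with decay: each slot's mean reverts toward THETA."""
    return LAMBDA * mu + (1 - LAMBDA) * THETA + rng.normal(0, SD_DRIFT, size=mu.shape)

def payoff(mu, arm):
    """Noisy payoff sampled around the chosen slot's current mean."""
    return rng.normal(mu[arm], SD_PAYOFF)

mu = rng.uniform(20, 80, size=4)  # four slot machines
for t in range(300):
    mu = step_means(mu)           # means drift whether or not a slot is sampled
    r = payoff(mu, arm=0)         # e.g. sample slot 0 this trial
```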
Analysis strategy
• Behavior: fit an RL model to choices
– Find best-fitting parameters
– Compare different exploration models
• Imaging: use the model to estimate subjective factors (explore vs. exploit, value, etc.)
– Use these as regressors for the fMRI signal
– After Sugrue et al.
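A minimal sketch of this model-based regressor construction (in the spirit of the Sugrue et al. style analysis; the two-gamma HRF and the helper `build_regressor` are assumptions for illustration, not SPM's actual implementation):

```python
import numpy as np
from scipy.stats import gamma

TR = 3.24  # volume repetition time in seconds (from the imaging methods)

def hrf(t):
    """Simple canonical two-gamma hemodynamic response (an assumed form)."""
    return gamma.pdf(t, 6) - 0.35 * gamma.pdf(t, 16)

def build_regressor(onsets, values, n_scans, dt=0.1):
    """Place a model quantity (e.g. prediction error) at event onsets,
    convolve with the HRF, and resample at the scanner's TR."""
    t = np.arange(0, n_scans * TR, dt)
    stick = np.zeros_like(t)
    stick[(np.asarray(onsets) / dt).astype(int)] = values
    conv = np.convolve(stick, hrf(np.arange(0, 30, dt)))[: len(t)]
    scan_idx = (np.arange(n_scans) * TR / dt).astype(int)
    return conv[scan_idx]

# e.g. prediction errors of 5, -2, 8 at onsets 10 s, 25 s, 40 s over 20 scans:
reg = build_regressor([10.0, 25.0, 40.0], [5.0, -2.0, 8.0], n_scans=20)
```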
Behavior
Behavior model
1. Estimate payoffs: Kalman filter (error update, like TD; exact inference) → $\mu_{green}, \mu_{red}$, etc.; $\sigma_{green}, \sigma_{red}$, etc.
2. Derive choice probabilities: $P_{green}, P_{red}$, etc.; choose randomly according to these
Kalman filter update for the chosen slot (e.g. red):
$\mu_{red} \leftarrow \mu_{red} + \kappa_{red}\,(r - \mu_{red})$, with gain $\kappa_{red} = \sigma^2_{red} / (\sigma^2_{red} + \sigma^2_o)$
$\sigma^2_{red} \leftarrow (1 - \kappa_{red})\,\sigma^2_{red} + \sigma^2_d$
cf. Behrens et al. on volatility
[Figure: estimated payoff distribution propagated from trial t to t+1]
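A minimal sketch of a Kalman-filter payoff tracker implementing the updates above (the noise parameters are illustrative; SD_O is observation noise $\sigma_o$, SD_D is diffusion noise $\sigma_d$):

```python
import numpy as np

SD_O, SD_D = 4.0, 3.0  # observation and diffusion SDs (illustrative)

class KalmanBandit:
    def __init__(self, n_arms=4, mu0=50.0, sd0=20.0):
        self.mu = np.full(n_arms, mu0)        # posterior means
        self.var = np.full(n_arms, sd0 ** 2)  # posterior variances

    def update(self, arm, r):
        """Observe payoff r on the chosen arm; all arms' variances then drift."""
        kappa = self.var[arm] / (self.var[arm] + SD_O ** 2)  # Kalman gain
        self.mu[arm] += kappa * (r - self.mu[arm])  # error update, like TD
        self.var[arm] *= (1 - kappa)                # observation shrinks variance
        self.var += SD_D ** 2                       # drift inflates every variance

kb = KalmanBandit()
kb.update(arm=2, r=57.0)  # unchosen arms' uncertainty grows each trial
```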
Behavior model
2. Derive choice probabilities – compare rules: how is exploration directed? (from dumber to smarter)
• Randomly: "ε-greedy"
$P_{red} = 1 - \epsilon$ if $\mu_{red} = \max(\text{all } \mu)$; $\epsilon/3$ otherwise
• By value: "softmax"
$P_{red} \propto \exp(\beta\,\mu_{red})$
• By value and uncertainty: "uncertainty bonuses"
$P_{red} \propto \exp(\beta\,[\mu_{red} + \varphi\,\sigma_{red}])$
[Figure: choice probability as a function of action value under each rule]
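Minimal sketches of the three choice rules (assuming a four-armed bandit; $\epsilon$, $\beta$, $\varphi$ are free parameters fit to behavior):

```python
import numpy as np

def p_egreedy(mu, eps=0.1):
    """Random exploration: the best arm gets 1 - eps, the rest split eps."""
    p = np.full(len(mu), eps / (len(mu) - 1))
    p[np.argmax(mu)] = 1 - eps
    return p

def p_softmax(mu, beta=0.2):
    """Value-directed exploration: P(arm) proportional to exp(beta * mean)."""
    z = np.exp(beta * (mu - mu.max()))  # subtract max for numerical stability
    return z / z.sum()

def p_uncertainty_bonus(mu, sigma, beta=0.2, phi=1.0):
    """Uncertainty-directed: softmax over mean plus phi * posterior SD."""
    return p_softmax(mu + phi * sigma, beta)

mu = np.array([40.0, 55.0, 50.0, 45.0])
sigma = np.array([2.0, 3.0, 12.0, 5.0])
print(p_softmax(mu), p_uncertainty_bonus(mu, sigma))
```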
Model comparison
• Assess models based on likelihood of actual choices
– Product over subjects and trials of the modeled probability of each choice
– Find maximum-likelihood parameters
• Inference parameters, choice parameters
• Parameters yoked between subjects
• (… except choice noisiness, to model all heterogeneity)
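A sketch of the comparison criterion: the negative log likelihood of the observed choices under each model's per-trial choice probabilities (in the actual analysis these come from the fitted model and are summed over subjects as well as trials):

```python
import numpy as np

def neg_log_likelihood(probs, choices):
    """probs: (n_trials, n_arms) model choice probabilities per trial;
    choices: (n_trials,) indices of the arms actually chosen."""
    return -np.sum(np.log(probs[np.arange(len(choices)), choices]))

# Toy usage with made-up probabilities for 3 trials of a 4-armed bandit:
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.10, 0.60, 0.20, 0.10]])
choices = np.array([0, 2, 1])
print(neg_log_likelihood(probs, choices))  # smaller is better
```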
Behavioral results
• Strong evidence for exploration directed by value
• No evidence for direction by uncertainty – tried several variations

                        ε-greedy   softmax   uncertainty bonuses
-log likelihood          4208.3     3972.1    3972.1
(smaller is better)
# parameters             19         19        20
Imaging methods
• 1.5 T Siemens Sonata scanner
• Sequence optimized for OFC (Deichmann et al. 2003)
• 2 × 385 volumes; 36 slices; 3 mm thickness
• 3.24 sec TR
• SPM2 random-effects model
• Regressors generated using the fit model and the trial-by-trial sequence of actual choices/payoffs
Imaging results
• TD error: dopamine targets (dorsal and ventral striatum)
• Replicates previous studies, but weakly – graded payoffs?
[Figure: vStr (x,y,z = 9,12,-9) and dStr (x,y,z = 9,0,18) activations; thresholds p < 0.01, p < 0.001]
Value-related correlates
• Probability (or expected value) of the chosen action: vmPFC (x,y,z = -3,45,-18)
• Payoff amount: OFC (x,y,z = 3,30,-21)
[Figure: vmPFC and mOFC activations; % signal change vs. choice probability and payoff; thresholds p < 0.01, p < 0.001]
Exploration
• Non-greedy > greedy choices: exploration
• Frontopolar cortex, bilaterally (x,y,z = -27,48,4; 27,57,6)
• Survives whole-brain correction
[Figure: left and right frontopolar activations; thresholds p < 0.01, p < 0.001]
Timecourses: frontal pole, IPS
Checks
• Do other factors explain the differential BOLD activity better?
– Multiple regression vs. RT, actual reward, predicted reward, choice probability, stay vs. switch, uncertainty, and more
– Only explore/exploit is significant
– (But 5 additional putative explore areas were eliminated)
• Individual subjects: BOLD differences are stronger for better behavioral fit
Frontal poles
• Imaging: high-level control
– Coordinating goals/subgoals (Koechlin et al. 1999; Braver & Bongiolatti 2002; Badre & Wagner 2004)
– Mediating cognitive processes (Ramnani & Owen 2004)
– Nothing this computationally specific
• Lesions: task switching (Burgess et al. 2000); more generically, perseveration
• "One of the least well understood regions of the human brain"
• No cortical connections outside PFC ("PFC for PFC")
• Rostrocaudal hierarchy in PFC (Christoff & Gabrieli 2000; Koechlin et al. 2003)
Interpretation
• Cognitive decision to explore overrides habit circuitry? Via parietal?
– Higher FP response when exploration is chosen most against the odds
– Explore RTs are longer
• Exploration and exploitation are neurally distinct
• Computationally surprising, and especially bad for uncertainty-bonus schemes
– Proper exploration requires computational integration
– No behavioral evidence either
• Why softmax? It can misexplore
– Deterministic bonus schemes are bad in adversarial/multiagent settings
– Dynamic temperature control? (norepinephrine; Usher et al.; Doya)
Conclusions
• Subjects direct exploration by value but not by uncertainty
• Cortical regions differentially implicated in exploration
– Computational consequences
• Integrative approach: computation, behavior, imaging
– Quantitatively assess and constrain models using raw behavior
– Infer subjective states using the model; study their neural correlates
Open Issues
• Model-based vs. model-free vs. Pavlovian control
– Environmental priors vs. naive optimism vs. neophilic compulsion?
• Environmental priors and generalization
• Curiosity / "intrinsic motivation" from expected future reward