
Page 1:

dopamine and prediction error

[recordings: dopamine firing under three conditions: no prediction (reward); prediction, reward; prediction, no reward. Model schematic: value prediction V_t, reward R (RL)]

TD error: $\delta(t) = r_t + V_{t+1} - V_t$

Schultz 1997
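A minimal sketch of the TD computation behind this slide, with a tabular within-trial value function and a fixed end-of-trial reward; the learning rate and trial structure are illustrative assumptions, not Schultz's:

```python
import numpy as np

# TD prediction error as on the slide: delta(t) = r_t + V_{t+1} - V_t.
# alpha, n_steps, and the reward schedule are illustrative assumptions.
alpha = 0.1                 # learning rate
n_steps = 10                # time steps per trial
V = np.zeros(n_steps + 1)   # value estimates V_t; V[n_steps] stays 0

for trial in range(200):
    r = np.zeros(n_steps)
    r[-1] = 1.0                              # reward at the end of each trial
    for t in range(n_steps):
        delta = r[t] + V[t + 1] - V[t]       # TD error
        V[t] += alpha * delta                # error-driven value update
```

After learning, the error at the (now predicted) reward shrinks toward zero, and omitting the reward yields a negative error: the dip in the third panel of the recordings.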

Page 2:

humans are no different

• dorsomedial striatum/PFC – goal-directed control

• dorsolateral striatum – habitual control

• ventral striatum – Pavlovian control; value signals

• dopamine...

Page 3:

in humans…

[trial timeline: <1 sec; 0.5 sec; “You won 40 cents”; 5 sec ISI; 2-5 sec ITI]

19 subjects (dropped 3 non-learners, N=16); 3T scanner, TR = 2 sec, interleaved; 234 trials: 130 choice, 104 single stimulus; randomly ordered and counterbalanced

5 stimuli: 40¢, 20¢, 0/40¢, 0¢, 0¢

Page 4:

what would a prediction error look like (in BOLD)?

Page 5:

prediction errors in NAC

unbiased anatomical ROI in nucleus accumbens (marked per subject*)

* thanks to Laura deSouza

raw BOLD (avg over all subjects)

can actually decide between different neuroeconomic models of risk

Page 6:

Polar Exploration

Peter Dayan

Nathaniel Daw John O’Doherty Ray Dolan

Page 7:

Exploration vs. exploitation

Classic dilemma in learned decision making

For unfamiliar outcomes, how to trade off learning about their values against exploiting knowledge already gained

Page 8:

Exploration vs. exploitation

• Exploitation
  – Choose action expected to be best
  – May never discover something better

[plot: reward vs. time]

Page 9:

Exploration vs. exploitation

• Exploitation
  – Choose action expected to be best
  – May never discover something better

• Exploration:
  – Choose action expected to be worse

[plot: reward vs. time]

Page 10:

Exploration vs. exploitation

• Exploitation
  – Choose action expected to be best
  – May never discover something better

• Exploration:
  – Choose action expected to be worse
  – If it is, then go back to the original

[plot: reward vs. time]

Page 11:

Exploration vs. exploitation

• Exploitation
  – Choose action expected to be best
  – May never discover something better

• Exploration:
  – Choose action expected to be worse

[plot: reward vs. time]

Page 12:

Exploration vs. exploitation

• Exploitation
  – Choose action expected to be best
  – May never discover something better

• Exploration:
  – Choose action expected to be worse
  – If it is better, then exploit in the future

[plot: reward vs. time]

Page 13:

Exploration vs. exploitation

• Exploitation
  – Choose action expected to be best
  – May never discover something better

• Exploration:
  – Choose action expected to be worse
  – Balanced by the long-term gain if it turns out better
  – (Even for risk- or ambiguity-averse subjects)
  – nb: learning non-trivial when outcomes noisy or changing

[plot: reward vs. time]

Page 14:

Bayesian analysis (Gittins 1972)

• Tractable dynamic program in restricted class of problems
  – “n-armed bandit”

• Solution requires balancing
  – Expected outcome values
  – Uncertainty (need for exploration)
  – Horizon/discounting (time to exploit)

• Optimal policy: explore systematically
  – Choose best sum of value plus bonus
  – Bonus increases with uncertainty

• Intractable in general setting
  – Various heuristics used in practice

[plot: value by action]

Page 15:

Experiment

• How do humans handle the tradeoff?

• Computation: Which strategies fit behavior?
  – Several popular approximations

• Difference: what information influences exploration?

• Neural substrate: What systems are involved?
  – PFC, high-level control

• Competitive decision systems (Daw et al. 2005)
  – Neuromodulators
    • dopamine (Kakade & Dayan 2002)
    • norepinephrine (Usher et al. 1999)

Page 16:

Task design

Subjects (14 healthy, right-handed) repeatedly choose between four slot machines for points (“money”), in scanner.

[trial timeline: trial onset – slots revealed]

Page 17:

Task design

Subjects (14 healthy, right-handed) repeatedly choose between four slot machines for points (“money”), in scanner.

Subject makes choice; chosen slot spins.

[trial timeline: trial onset – slots revealed – +~430 ms: choice, chosen slot spins]

Page 18:

Task design

Subjects (14 healthy, right-handed) repeatedly choose between four slot machines for points (“money”), in scanner.

Subject makes choice; chosen slot spins.

[trial timeline: trial onset – slots revealed – +~430 ms: choice, chosen slot spins – +~3000 ms: outcome, payoff revealed (“obtained 57 points”)]

Page 19:

Task design

Subjects (14 healthy, right-handed) repeatedly choose between four slot machines for points (“money”), in scanner.

Subject makes choice; chosen slot spins.

[trial timeline: trial onset – slots revealed – +~430 ms: choice, chosen slot spins – +~3000 ms: outcome, payoff revealed (“obtained 57 points”) – +~1000 ms: screen cleared, trial ends]

Page 20:

Payoff structure

• Noisy, to require integration of data
• Subjects learn about payoffs only by sampling them

Page 21:

Payoff structure

• Noisy, to require integration of data
• Subjects learn about payoffs only by sampling them

Page 22:

Payoff structure

[plot: payoff over trials]

Page 23:

Payoff structure

• Nonstationary, to encourage ongoing exploration (Gaussian drift w/ decay)
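To make the nonstationarity concrete, here is a small simulation sketch of a Gaussian drift with decay of the kind described; all names and values (lambda_, theta, the noise SDs) are illustrative assumptions, not the task's actual settings:

```python
import numpy as np

# Nonstationary payoffs: a Gaussian random walk with decay toward a
# long-run mean, so machine values keep changing and must be re-sampled.
rng = np.random.default_rng(0)
lambda_, theta = 0.95, 50.0        # decay rate and long-run mean (assumed)
sd_drift, sd_payoff = 3.0, 4.0     # drift noise and payoff noise (assumed)

mu = np.full(4, theta)             # latent mean payoff of each slot machine
for t in range(300):
    mu = lambda_ * mu + (1.0 - lambda_) * theta \
         + rng.normal(0.0, sd_drift, size=4)     # decaying Gaussian drift
    payoffs = rng.normal(mu, sd_payoff)          # noisy payoff if sampled
```

Because the latent means keep diffusing, a machine that looked bad earlier may have become good, which is what keeps exploration worthwhile throughout the session.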

Page 24:

Analysis strategy

• Behavior: Fit an RL model to choices
  – Find best-fitting parameters
  – Compare different exploration models

• Imaging: Use model to estimate subjective factors (explore vs. exploit, value, etc.)
  – Use these as regressors for the fMRI signal
  – After Sugrue et al.

Page 25:

Behavior

Page 26:

Behavior

Page 27:

Behavior

Page 28:

Behavior

Page 29:

Behavior

Page 30:

Behavior model

1. Estimate payoffs → μgreen, μred, etc.; σgreen, σred, etc.

2. Derive choice probabilities → Pgreen, Pred, etc.; choose randomly according to these

Page 31:

Behavior model

1. Estimate payoffs: Kalman filter (error update, like TD; exact inference) → μgreen, μred, etc.; σgreen, σred, etc.

2. Derive choice probabilities → Pgreen, Pred, etc.; choose randomly according to these

Page 32:

Behavior model

1. Estimate payoffs: Kalman filter (error update, like TD; exact inference) → μgreen, μred, etc.; σgreen, σred, etc.

[plot: payoff samples across trials t, t+1]

2. Derive choice probabilities → Pgreen, Pred, etc.; choose randomly according to these

Page 33:

Behavior model

1. Estimate payoffs: Kalman filter (error update, like TD; exact inference) → μgreen, μred, etc.; σgreen, σred, etc.

[plot: payoff samples across trials t, t+1]

2. Derive choice probabilities → Pgreen, Pred, etc.; choose randomly according to these

Page 34:

Behavior model

1. Estimate payoffs: Kalman filter (error update, like TD; exact inference) → μgreen, μred, etc.; σgreen, σred, etc.

[plot: payoff samples across trials t, t+1]

2. Derive choice probabilities → Pgreen, Pred, etc.; choose randomly according to these

Page 35:

Behavior model

1. Estimate payoffs: Kalman filter (error update, like TD; exact inference) → μgreen, μred, etc.; σgreen, σred, etc.

[plot: payoff samples across trials t, t+1]

2. Derive choice probabilities → Pgreen, Pred, etc.; choose randomly according to these

Page 36:

Behavior model

1. Estimate payoffs: Kalman filter → μgreen, μred, etc.; σgreen, σred, etc.

[plot: payoff across trials t, t+1]

2. Derive choice probabilities → Pgreen, Pred, etc.; choose randomly according to these

Kalman updates for the chosen machine (e.g. red):

$\kappa = \sigma^2_{red} / (\sigma^2_{red} + \sigma^2_o)$
$\mu_{red} \leftarrow \mu_{red} + \kappa\delta$
$\sigma^2_{red} \leftarrow (1-\kappa)\,\sigma^2_{red} + \sigma^2_d$

Behrens & volatility
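A sketch of this per-machine Kalman filter, following the reconstructed updates above; the noise variances sigma_o2 and sigma_d2 are placeholder assumptions, not the task's true parameters:

```python
# Per-machine Kalman filter: error-driven mean update (like TD) with an
# exactly tracked posterior variance. Values below are assumptions.
sigma_o2 = 16.0   # observation (payoff) noise variance
sigma_d2 = 9.0    # diffusion (drift) noise variance

def kalman_update(mu, sigma2, payoff):
    """Observe a payoff from the chosen machine."""
    kappa = sigma2 / (sigma2 + sigma_o2)   # Kalman gain = learning rate
    delta = payoff - mu                    # prediction error
    return mu + kappa * delta, (1.0 - kappa) * sigma2

def drift_update(sigma2):
    """Between trials every machine's uncertainty grows with the drift,
    so unsampled machines become progressively more uncertain."""
    return sigma2 + sigma_d2

mu, sigma2 = 50.0, 100.0                   # prior for one machine
mu, sigma2 = kalman_update(mu, sigma2, payoff=57.0)
sigma2 = drift_update(sigma2)
```

Note the connection to the Behrens volatility work: the effective learning rate kappa is not fixed but tracks uncertainty, growing when a machine has not been sampled recently.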

Page 37:

Behavior model

1. Estimate payoffs: Kalman filter → μgreen, μred, etc.; σgreen, σred, etc.

2. Derive choice probabilities → Pgreen, Pred, etc.; choose randomly according to these

Compare rules: How is exploration directed?

Page 38:

Behavior model

2. Derive choice probabilities: μgreen, μred, etc.; σgreen, σred, etc. → Pgreen, Pred, etc.; choose randomly according to these

Compare rules: How is exploration directed?

Page 39:

Behavior model

2. Derive choice probabilities from μgreen, μred, etc.; σgreen, σred, etc.

Compare rules: How is exploration directed? (dumber → smarter)

[plot: value by action]

Page 40:

Behavior model

2. Derive choice probabilities from μgreen, μred, etc.; σgreen, σred, etc.

Compare rules: How is exploration directed? (dumber → smarter)

Randomly: “ε-greedy”
$P_{red} = 1-\epsilon$ if $\mu_{red} = \max(\text{all } \mu)$; $\epsilon/3$ otherwise

[plot: choice probability by action value]

Page 41:

Behavior model

2. Derive choice probabilities from μgreen, μred, etc.; σgreen, σred, etc.

Compare rules: How is exploration directed? (dumber → smarter)

Randomly: “ε-greedy”
$P_{red} = 1-\epsilon$ if $\mu_{red} = \max(\text{all } \mu)$; $\epsilon/3$ otherwise

By value: “softmax”
$P_{red} \propto \exp(\beta\mu_{red})$

[plot: choice probability by action value]

Page 42:

Behavior model

2. Derive choice probabilities from μgreen, μred, etc.; σgreen, σred, etc.

Compare rules: How is exploration directed? (dumber → smarter)

Randomly: “ε-greedy”
$P_{red} = 1-\epsilon$ if $\mu_{red} = \max(\text{all } \mu)$; $\epsilon/3$ otherwise

By value: “softmax”
$P_{red} \propto \exp(\beta\mu_{red})$

By value and uncertainty: “uncertainty bonuses”
$P_{red} \propto \exp(\beta[\mu_{red} + \phi\sigma_{red}])$

[plot: choice probability by action value]
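The three rules side by side, as a sketch over the Kalman estimates; ε, β, and φ are the free parameters fit to choices, and the values and estimates below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([52.0, 48.0, 60.0, 45.0])     # Kalman mean payoff estimates
sigma = np.array([5.0, 12.0, 4.0, 9.0])     # Kalman uncertainties

def epsilon_greedy(mu, epsilon=0.1):
    # best machine with prob 1 - epsilon, each other with epsilon/3
    p = np.full(mu.size, epsilon / (mu.size - 1))
    p[np.argmax(mu)] = 1.0 - epsilon
    return p

def softmax(values, beta=0.2):
    # P proportional to exp(beta * value); max subtracted for stability
    z = np.exp(beta * (values - values.max()))
    return z / z.sum()

def uncertainty_bonus(mu, sigma, beta=0.2, phi=1.0):
    # P proportional to exp(beta * [mu + phi * sigma])
    return softmax(mu + phi * sigma, beta)

choice = rng.choice(mu.size, p=softmax(mu))  # sample one softmax choice
```

The ordering on the slide is the point: ε-greedy ignores the value estimates except for the argmax, softmax explores in proportion to value, and the bonus rule additionally steers exploration toward uncertain machines, as Gittins-style analyses prescribe.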

Page 43:

Model comparison

• Assess models based on likelihood of actual choices
  – Product over subjects and trials of modeled probability of each choice
  – Find maximum-likelihood parameters

• Inference parameters, choice parameters
• Parameters yoked between subjects
• (… except choice noisiness, to model all heterogeneity)
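The comparison criterion is the summed negative log likelihood of the actual choices, i.e. the negative log of the product over subjects and trials of the modeled choice probabilities. A minimal sketch under an assumed (hypothetical) data layout:

```python
import numpy as np

def softmax_prob(mu, beta):
    """Modeled choice probabilities under the softmax rule."""
    z = np.exp(beta * (mu - np.max(mu)))
    return z / z.sum()

def neg_log_likelihood(beta, dataset):
    """dataset: per subject, a list of (mu_estimates, chosen_index) pairs.
    Returns -log of the product over subjects and trials of the modeled
    probability of each actual choice (smaller is better)."""
    nll = 0.0
    for trials in dataset:
        for mu, chosen in trials:
            p = softmax_prob(np.asarray(mu, dtype=float), beta)
            nll -= np.log(p[chosen])
    return nll

# Maximum-likelihood fitting would minimize this over the choice
# parameters, e.g. beta (and phi for the uncertainty-bonus rule).
```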

Page 44:

Behavioral results

• Strong evidence for exploration directed by value

• No evidence for direction by uncertainty
  – Tried several variations

                        ε-greedy   softmax   uncertainty bonuses
−log likelihood         4208.3     3972.1    3972.1
(smaller is better)
# parameters            19         19        20

Page 45:

Behavioral results

• Strong evidence for exploration directed by value

• No evidence for direction by uncertainty
  – Tried several variations

                        ε-greedy   softmax   uncertainty bonuses
−log likelihood         4208.3     3972.1    3972.1
(smaller is better)
# parameters            19         19        20

Page 46:

Imaging methods

• 1.5 T Siemens Sonata scanner
• Sequence optimized for OFC (Deichmann et al. 2003)
• 2×385 volumes; 36 slices; 3 mm thickness
• 3.24 s TR
• SPM2 random-effects model

• Regressors generated using the fitted model and the trial-by-trial sequence of actual choices/payoffs

Page 47:

Imaging results

• TD error: dopamine targets (dorsal and ventral striatum)

• Replicates previous studies, but weakish
  – Graded payoffs?

[maps, thresholded at p<0.01 and p<0.001: vStr (x,y,z = 9,12,−9), dStr (x,y,z = 9,0,18)]

Page 48:

Value-related correlates

• probability (or expected value) of chosen action: vmPFC (x,y,z = −3,45,−18)
• payoff amount: mOFC (x,y,z = 3,30,−21)

[maps, thresholded at p<0.01 and p<0.001: vmPFC, mOFC; plots of % signal change vs. probability of chosen action and vs. payoff]

Page 49:

Exploration

• Non-greedy > greedy choices: exploration
• Frontopolar cortex
• Survives whole-brain correction

[maps, thresholded at p<0.01 and p<0.001: left and right frontal pole (LFP, rFP); x,y,z = −27,48,4; 27,57,6]

Page 50:

Timecourses

[plots: frontal pole, IPS]

Page 51:

Checks

• Do other factors explain differential BOLD activity better?
  – Multiple regression vs. RT, actual reward, predicted reward, choice prob, stay vs. switch, uncertainty, more
  – Only explore/exploit is significant
  – (But 5 additional putative explore areas eliminated)

• Individual subjects: BOLD differences stronger for better behavioral fit

Page 52:

Frontal poles

• Imaging: high-level control
  – Coordinating goals/subgoals (Koechlin et al. 1999; Braver & Bongiolatti 2002; Badre & Wagner 2004)
  – Mediating cognitive processes (Ramnani & Owen 2004)
  – Nothing this computationally specific

• Lesions: task switching (Burgess et al. 2000)
  – more generic: perseveration

• “One of the least well understood regions of the human brain”

• No cortical connections outside PFC (“PFC for PFC”)

• Rostrocaudal hierarchy in PFC (Christoff & Gabrieli 2000; Koechlin et al. 2003)

Page 53:

Interpretation

• Cognitive decision to explore overrides habit circuitry? Via parietal?
  – Higher FP response when exploration chosen most against the odds
  – Explore RT longer

• Exploration/exploitation are neurally distinct

• Computationally surprising, esp. bad for uncertainty-bonus schemes
  – proper exploration requires computational integration
  – no behavioral evidence either

• Why softmax? Can misexplore
  – Deterministic bonus schemes bad in adversarial/multiagent settings
  – Dynamic temperature control? (norepinephrine; Usher et al.; Doya)

Page 54:

Conclusions

• Subjects direct exploration by value but not uncertainty

• Cortical regions differentially implicated in exploration
  – computational consequences

• Integrative approach: computation, behavior, imaging
  – Quantitatively assess & constrain models using raw behavior
  – Infer subjective states using the model, study neural correlates

Page 55:

Open Issues

• model-based vs. model-free vs. Pavlovian control
  – environmental priors vs. naive optimism vs. neophilic compulsion?

• environmental priors and generalization
  – curiosity/“intrinsic motivation” from expected future reward