TRANSCRIPT
dopamine and prediction error
[Figure: dopamine firing in three conditions: no prediction; prediction, reward; prediction, no reward]
TD error: $\delta(t) = r_t + V_{t+1} - V_t$
Schultz 1997
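A minimal sketch (assuming a simple undiscounted tabular setting; not from the slides) of how this TD error drives value learning:

```python
import numpy as np

def td_update(V, s, s_next, r, alpha=0.1):
    """One temporal-difference update: delta = r + V[s_next] - V[s]."""
    delta = r + V[s_next] - V[s]  # prediction error (the putative dopamine signal)
    V[s] += alpha * delta         # move the prediction toward the target
    return delta

# Toy example: a cue (state 0) always followed by a reward of 1.
V = np.zeros(2)
for trial in range(100):
    td_update(V, s=0, s_next=1, r=1.0)
print(V[0])  # approaches 1.0: the cue comes to predict the reward
```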
humans are no different
• dorsomedial striatum/PFC – goal-directed control
• dorsolateral striatum – habitual control
• ventral striatum – Pavlovian control; value signals
• dopamine...
in humans…
[Trial timeline: stimulus < 1 sec; 0.5 sec; "You won 40 cents"; 5 sec ISI; 2-5 sec ITI]
19 subjects (dropped 3 non-learners, N = 16); 3T scanner, TR = 2 sec, interleaved
234 trials: 130 choice, 104 single-stimulus; randomly ordered and counterbalanced
5 stimuli, worth: 40¢, 20¢, 0/40¢, 0¢, 0¢
what would a prediction error look like (in BOLD)?
prediction errors in NAc
unbiased anatomical ROI in nucleus accumbens (marked per subject*)
* thanks to Laura deSouza
raw BOLD (averaged over all subjects)
can actually decide between different neuroeconomic models of risk
Polar Exploration
Peter Dayan, Nathaniel Daw, John O'Doherty, Ray Dolan
Exploration vs. exploitation
Classic dilemma in learned decision making
For unfamiliar outcomes, how to trade off learning about their values against exploiting knowledge already gained
Exploration vs. exploitation
• Exploitation
– Choose the action expected to be best
– May never discover something better
• Exploration
– Choose an action expected to be worse
– If it is worse, go back to the original; if it turns out better, exploit it in the future
– Balanced by the long-term gain if it turns out better (even for risk- or ambiguity-averse subjects)
– N.B.: learning is non-trivial when outcomes are noisy or changing
[Figure: schematic reward-vs-time traces illustrating the two strategies]
Bayesian analysis (Gittins 1972)
• Tractable dynamic program in a restricted class of problems – the "n-armed bandit"
• Solution requires balancing
– Expected outcome values
– Uncertainty (need for exploration)
– Horizon/discounting (time to exploit)
• Optimal policy: explore systematically
– Choose the best sum of value plus bonus
– Bonus increases with uncertainty
• Intractable in the general setting – various heuristics used in practice
[Figure: per-action value plus uncertainty bonus]
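A minimal sketch of the "value plus uncertainty bonus" policy described above (an illustrative approximation in the spirit of a Gittins index, not the exact index computation):

```python
import numpy as np

def choose_with_bonus(mu, sigma, bonus_weight=1.0):
    """Pick the arm maximizing estimated value plus an uncertainty bonus."""
    index = mu + bonus_weight * sigma  # bonus grows with posterior uncertainty
    return int(np.argmax(index))

mu = np.array([40.0, 55.0, 50.0, 45.0])
sigma = np.array([2.0, 3.0, 12.0, 5.0])
print(choose_with_bonus(mu, sigma))  # arm 2 wins: 50 + 12 beats 55 + 3
```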
Experiment
• How do humans handle the tradeoff?
• Computation: which strategies fit behavior?
– Several popular approximations
• Difference: what information influences exploration?
• Neural substrate: what systems are involved?
– PFC, high-level control
– Competitive decision systems (Daw et al. 2005)
– Neuromodulators: dopamine (Kakade & Dayan 2002); norepinephrine (Usher et al. 1999)
Task design
Subjects (14 healthy, right-handed) repeatedly choose between four slot machines for points ("money"), in the scanner.
Trial onset: slots revealed
+ ~430 ms: subject makes choice; chosen slot spins
+ ~3000 ms: outcome revealed ("obtained 57 points")
+ ~1000 ms: screen cleared; trial ends
Payoff structure
• Noisy, to require integration of data – subjects learn about payoffs only by sampling them
• Nonstationary, to encourage ongoing exploration (Gaussian drift with decay)
[Figure: example payoff timecourses for the four slots]
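A short simulation of the payoff process as described (Gaussian drift with decay toward a long-run mean; the parameter values here are illustrative, not the experiment's actual settings):

```python
import numpy as np

rng = np.random.default_rng(0)

LAMBDA = 0.98    # decay toward the long-run mean (illustrative)
THETA = 50.0     # long-run mean payoff (illustrative)
SD_DRIFT = 3.0   # diffusion noise on each slot's underlying mean
SD_PAYOFF = 4.0  # observation noise on each delivered payoff

def step_means(mu):
    """Gaussian drift with decay: each slot's mean reverts toward THETA."""
    return LAMBDA * mu + (1 - LAMBDA) * THETA + rng.normal(0, SD_DRIFT, size=mu.shape)

def payoff(mu, arm):
    """Noisy payoff sampled around the chosen slot's current mean."""
    return rng.normal(mu[arm], SD_PAYOFF)

mu = rng.uniform(20, 80, size=4)  # four slot machines
for t in range(300):
    mu = step_means(mu)           # means drift whether or not a slot is sampled
    r = payoff(mu, arm=0)         # e.g. sample slot 0 this trial
```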
Analysis strategy
• Behavior: fit an RL model to choices
– Find best-fitting parameters
– Compare different exploration models
• Imaging: use the model to estimate subjective factors (explore vs. exploit, value, etc.)
– Use these as regressors for the fMRI signal
– After Sugrue et al.
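A minimal sketch of this model-based regressor construction (in the spirit of the Sugrue et al. style analysis; the two-gamma HRF and the helper `build_regressor` are assumptions for illustration, not SPM's actual implementation):

```python
import numpy as np
from scipy.stats import gamma

TR = 3.24  # volume repetition time in seconds (from the imaging methods)

def hrf(t):
    """Simple canonical two-gamma hemodynamic response (an assumed form)."""
    return gamma.pdf(t, 6) - 0.35 * gamma.pdf(t, 16)

def build_regressor(onsets, values, n_scans, dt=0.1):
    """Place a model quantity (e.g. prediction error) at event onsets,
    convolve with the HRF, and resample at the scanner's TR."""
    t = np.arange(0, n_scans * TR, dt)
    stick = np.zeros_like(t)
    stick[(np.asarray(onsets) / dt).astype(int)] = values
    conv = np.convolve(stick, hrf(np.arange(0, 30, dt)))[: len(t)]
    scan_idx = (np.arange(n_scans) * TR / dt).astype(int)
    return conv[scan_idx]

# e.g. prediction errors of 5, -2, 8 at onsets 10 s, 25 s, 40 s over 20 scans:
reg = build_regressor([10.0, 25.0, 40.0], [5.0, -2.0, 8.0], n_scans=20)
```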
Behavior
Behavior model
1. Estimate payoffs: Kalman filter (error update, like TD; exact inference) → $\mu_{green}, \mu_{red}$, etc.; $\sigma_{green}, \sigma_{red}$, etc.
2. Derive choice probabilities: $P_{green}, P_{red}$, etc.; choose randomly according to these
Kalman filter update for the chosen slot (e.g. red):
$\mu_{red} \leftarrow \mu_{red} + \kappa_{red}\,(r - \mu_{red})$, with gain $\kappa_{red} = \sigma^2_{red} / (\sigma^2_{red} + \sigma^2_o)$
$\sigma^2_{red} \leftarrow (1 - \kappa_{red})\,\sigma^2_{red} + \sigma^2_d$
cf. Behrens et al. on volatility
[Figure: estimated payoff distribution propagated from trial t to t+1]
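A minimal sketch of a Kalman-filter payoff tracker implementing the updates above (the noise parameters are illustrative; SD_O is observation noise $\sigma_o$, SD_D is diffusion noise $\sigma_d$):

```python
import numpy as np

SD_O, SD_D = 4.0, 3.0  # observation and diffusion SDs (illustrative)

class KalmanBandit:
    def __init__(self, n_arms=4, mu0=50.0, sd0=20.0):
        self.mu = np.full(n_arms, mu0)        # posterior means
        self.var = np.full(n_arms, sd0 ** 2)  # posterior variances

    def update(self, arm, r):
        """Observe payoff r on the chosen arm; all arms' variances then drift."""
        kappa = self.var[arm] / (self.var[arm] + SD_O ** 2)  # Kalman gain
        self.mu[arm] += kappa * (r - self.mu[arm])  # error update, like TD
        self.var[arm] *= (1 - kappa)                # observation shrinks variance
        self.var += SD_D ** 2                       # drift inflates every variance

kb = KalmanBandit()
kb.update(arm=2, r=57.0)  # unchosen arms' uncertainty grows each trial
```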
Behavior model
2. Derive choice probabilities – compare rules: how is exploration directed? (from dumber to smarter)
• Randomly: "ε-greedy"
$P_{red} = 1 - \epsilon$ if $\mu_{red} = \max(\text{all } \mu)$; $\epsilon/3$ otherwise
• By value: "softmax"
$P_{red} \propto \exp(\beta\,\mu_{red})$
• By value and uncertainty: "uncertainty bonuses"
$P_{red} \propto \exp(\beta\,[\mu_{red} + \varphi\,\sigma_{red}])$
[Figure: choice probability as a function of action value under each rule]
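Minimal sketches of the three choice rules (assuming a four-armed bandit; $\epsilon$, $\beta$, $\varphi$ are free parameters fit to behavior):

```python
import numpy as np

def p_egreedy(mu, eps=0.1):
    """Random exploration: the best arm gets 1 - eps, the rest split eps."""
    p = np.full(len(mu), eps / (len(mu) - 1))
    p[np.argmax(mu)] = 1 - eps
    return p

def p_softmax(mu, beta=0.2):
    """Value-directed exploration: P(arm) proportional to exp(beta * mean)."""
    z = np.exp(beta * (mu - mu.max()))  # subtract max for numerical stability
    return z / z.sum()

def p_uncertainty_bonus(mu, sigma, beta=0.2, phi=1.0):
    """Uncertainty-directed: softmax over mean plus phi * posterior SD."""
    return p_softmax(mu + phi * sigma, beta)

mu = np.array([40.0, 55.0, 50.0, 45.0])
sigma = np.array([2.0, 3.0, 12.0, 5.0])
print(p_softmax(mu), p_uncertainty_bonus(mu, sigma))
```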
Model comparison
• Assess models based on likelihood of actual choices
– Product over subjects and trials of the modeled probability of each choice
– Find maximum-likelihood parameters
• Inference parameters, choice parameters
• Parameters yoked between subjects
• (… except choice noisiness, to model all heterogeneity)
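A sketch of the comparison criterion: the negative log likelihood of the observed choices under each model's per-trial choice probabilities (in the actual analysis these come from the fitted model and are summed over subjects as well as trials):

```python
import numpy as np

def neg_log_likelihood(probs, choices):
    """probs: (n_trials, n_arms) model choice probabilities per trial;
    choices: (n_trials,) indices of the arms actually chosen."""
    return -np.sum(np.log(probs[np.arange(len(choices)), choices]))

# Toy usage with made-up probabilities for 3 trials of a 4-armed bandit:
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.25, 0.25, 0.25, 0.25],
                  [0.10, 0.60, 0.20, 0.10]])
choices = np.array([0, 2, 1])
print(neg_log_likelihood(probs, choices))  # smaller is better
```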
Behavioral results
• Strong evidence for exploration directed by value
• No evidence for direction by uncertainty – tried several variations

                        ε-greedy   softmax   uncertainty bonuses
-log likelihood          4208.3     3972.1    3972.1
(smaller is better)
# parameters             19         19        20
Imaging methods
• 1.5 T Siemens Sonata scanner
• Sequence optimized for OFC (Deichmann et al. 2003)
• 2 × 385 volumes; 36 slices; 3 mm thickness
• 3.24 sec TR
• SPM2 random-effects model
• Regressors generated using the fit model and the trial-by-trial sequence of actual choices/payoffs
Imaging results
• TD error: dopamine targets (dorsal and ventral striatum)
• Replicates previous studies, but weakly – graded payoffs?
[Figure: vStr (x,y,z = 9,12,-9) and dStr (x,y,z = 9,0,18) activations; thresholds p < 0.01, p < 0.001]
Value-related correlates
• Probability (or expected value) of the chosen action: vmPFC (x,y,z = -3,45,-18)
• Payoff amount: OFC (x,y,z = 3,30,-21)
[Figure: vmPFC and mOFC activations; % signal change vs. choice probability and payoff; thresholds p < 0.01, p < 0.001]
Exploration
• Non-greedy > greedy choices: exploration
• Frontopolar cortex, bilaterally (x,y,z = -27,48,4; 27,57,6)
• Survives whole-brain correction
[Figure: left and right frontopolar activations; thresholds p < 0.01, p < 0.001]
Timecourses: frontal pole, IPS
Checks
• Do other factors explain the differential BOLD activity better?
– Multiple regression vs. RT, actual reward, predicted reward, choice probability, stay vs. switch, uncertainty, and more
– Only explore/exploit is significant
– (But 5 additional putative explore areas were eliminated)
• Individual subjects: BOLD differences are stronger for better behavioral fit
Frontal poles
• Imaging: high-level control
– Coordinating goals/subgoals (Koechlin et al. 1999; Braver & Bongiolatti 2002; Badre & Wagner 2004)
– Mediating cognitive processes (Ramnani & Owen 2004)
– Nothing this computationally specific
• Lesions: task switching (Burgess et al. 2000); more generically, perseveration
• "One of the least well understood regions of the human brain"
• No cortical connections outside PFC ("PFC for PFC")
• Rostrocaudal hierarchy in PFC (Christoff & Gabrieli 2000; Koechlin et al. 2003)
Interpretation
• Cognitive decision to explore overrides habit circuitry? Via parietal?
– Higher FP response when exploration is chosen most against the odds
– Explore RTs are longer
• Exploration and exploitation are neurally distinct
• Computationally surprising, and especially bad for uncertainty-bonus schemes
– Proper exploration requires computational integration
– No behavioral evidence either
• Why softmax? It can misexplore
– Deterministic bonus schemes are bad in adversarial/multiagent settings
– Dynamic temperature control? (norepinephrine; Usher et al.; Doya)
Conclusions
• Subjects direct exploration by value but not by uncertainty
• Cortical regions differentially implicated in exploration
– Computational consequences
• Integrative approach: computation, behavior, imaging
– Quantitatively assess and constrain models using raw behavior
– Infer subjective states using the model; study their neural correlates
Open Issues
• Model-based vs. model-free vs. Pavlovian control
– Environmental priors vs. naive optimism vs. neophilic compulsion?
• Environmental priors and generalization
• Curiosity / "intrinsic motivation" from expected future reward