global plan reinforcement learning i: –prediction –classical conditioning –dopamine...
TRANSCRIPT
![Page 1: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/1.jpg)
Global plan
• Reinforcement learning I: – prediction
– classical conditioning
– dopamine
• Reinforcement learning II:– dynamic programming; action selection
– Pavlovian misbehaviour
– vigor
• Chapter 9 of Theoretical Neuroscience
(thanks to Yael Niv)
![Page 2: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/2.jpg)
2
Conditioning
• Ethology– optimality– appropriateness
• Psychology– classical/operant
conditioning
• Computation– dynamic progr.– Kalman filtering
• Algorithm– TD/delta rules– simple weights
• Neurobiologyneuromodulators; amygdala; OFCnucleus accumbens; dorsal striatum
prediction: of important eventscontrol: in the light of those predictions
![Page 3: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/3.jpg)
= Conditioned Stimulus
= Unconditioned Stimulus
= Unconditioned Response (reflex);Conditioned Response (reflex)
Animals learn predictions
Ivan Pavlov
![Page 4: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/4.jpg)
Animals learn predictions
Ivan Pavlov
Acquisition Extinction
020406080
100
1 2 3 4 5 6 7 8 9 10 11 12 13 14
Blocks of 10 Trials
% C
Rs
very general across species, stimuli, behaviors
![Page 5: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/5.jpg)
But do they really?
temporal contiguity is not enough - need contingencytemporal contiguity is not enough - need contingency
1. Rescorla’s control
P(food | light) > P(food | no light)
![Page 6: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/6.jpg)
But do they really?
contingency is not enough either… need surprisecontingency is not enough either… need surprise
2. Kamin’s blocking
![Page 7: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/7.jpg)
But do they really?
seems like stimuli compete for learningseems like stimuli compete for learning
3. Reynold’s overshadowing
![Page 8: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/8.jpg)
Theories of prediction learning: Goals
• Explain how the CS acquires “value”• When (under what conditions) does this happen?• Basic phenomena: gradual learning and extinction curves• More elaborate behavioral phenomena• (Neural data)
P.S. Why are we looking at old-fashioned Pavlovian conditioning?
it is the perfect uncontaminated test case for examining prediction learning on its own
![Page 9: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/9.jpg)
error-driven learning: change in value is proportional to the differencebetween actual and predicted outcome
Assumptions:
1. learning is driven by error (formalizes notion of surprise)
2. summations of predictors is linear
A simple model - but very powerful!– explains: gradual acquisition & extinction, blocking, overshadowing,
conditioned inhibition, and more..– predicted overexpectation
note: US as “special stimulus”
Rescorla & Wagner (1972)
jCSUSCS jiVrV
![Page 10: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/10.jpg)
• how does this explain acquisition and extinction?
• what would V look like with 50% reinforcement? eg. 1 1 0 1 0 0 1 1 1 0 0
– what would V be on average after learning?
– what would the error term be on average after learning?
Rescorla-Wagner learning
Vt1 Vt rt Vt
![Page 11: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/11.jpg)
how is the prediction on trial (t) influenced by rewards at times (t-1), (t-2), …?
Rescorla-Wagner learning
Vt (1 )t i rii1
t
tttt VrVV 1
0
0.1
0.2
0.3
0.4
0.5
0.6
-10 -9 -8 -7 -6 -5 -4 -3 -2 -1
Vt1 (1 )Vt rt
recent rewards weigh more heavily
why is this sensible?
learning rate = forgetting rate!
the R-W rule estimates expected reward using a weighted average of past rewards
![Page 12: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/12.jpg)
Summary so far
Predictions are useful for behavior
Animals (and people) learn predictions (Pavlovian conditioning = prediction learning)
Prediction learning can be explained by an error-correcting learning rule (Rescorla-Wagner): predictions are learned from experiencing the world and comparing predictions to reality
Marr:
VrV
EV
VrE
VV
USCS
CS
US
jCS
i
i
j
2
![Page 13: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/13.jpg)
But: second order conditioning
15
20
25
30
35
40
45
50
number of phase 2 pairings
con
dit
ion
ed r
esp
ond
ing
animals learn that a predictor of a predictor is also a predictor of reward!
not interested solely in predicting immediate reward
animals learn that a predictor of a predictor is also a predictor of reward!
not interested solely in predicting immediate reward
phase 1:
phase 2:
test: ?what do you think will happen?
what would Rescorla-Wagner learning predict here?
![Page 14: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/14.jpg)
lets start over: this time from the top
Marr’s 3 levels:• The problem: optimal prediction of future reward
• what’s the obvious prediction error?
• what’s the obvious problem with this?
Vt E riit
T
want to predict expected sum of
future reward in a trial/episode
(N.B. here t indexes time within a trial)
t
T
tiit Vr
CSVr RW
![Page 15: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/15.jpg)
lets start over: this time from the top
Marr’s 3 levels:• The problem: optimal prediction of future reward
Vt E riit
T
want to predict expected sum of
future reward in a trial/episode
tttt
tt
Tttt
Ttttt
VVrE
VrE
rrrErE
rrrrEV
1
1
21
21
...
...
Bellman eqnfor policyevaluation
![Page 16: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/16.jpg)
lets start over: this time from the top
Marr’s 3 levels:• The problem: optimal prediction of future reward• The algorithm: temporal difference learning
Vt E rt Vt1Vt (1 )Vt (rt Vt1)Vt Vt (rt Vt1 Vt )
temporal difference prediction error t
VT 1 VT rT VT compare to:
![Page 17: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/17.jpg)
17
prediction error
no prediction prediction, reward prediction, no reward
TD error
Vt
R
RL
tttt VVr 1
t
![Page 18: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/18.jpg)
Summary so far
Temporal difference learning versus Rescorla-Wagner
• derived from first principles about the future
• explains everything that R-W does, and more (eg. 2nd order conditioning)
• a generalization of R-W to real time
![Page 19: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/19.jpg)
Back to Marr’s 3 levels
• The problem: optimal prediction of future reward• The algorithm: temporal difference learning• Neural implementation: does the brain use TD learning?
![Page 20: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/20.jpg)
Dopamine
Dorsal Striatum (Caudate, Putamen)
Ventral TegmentalArea
Substantia Nigra
Amygdala
Nucleus Accumbens(Ventral Striatum)
Prefrontal CortexDorsal Striatum (Caudate, Putamen)
Ventral TegmentalArea
Substantia Nigra
Amygdala
Nucleus Accumbens(Ventral Striatum)
Prefrontal CortexDorsal Striatum (Caudate, Putamen)
Ventral TegmentalArea
Substantia Nigra
Amygdala
Nucleus Accumbens(Ventral Striatum)
Prefrontal CortexDorsal Striatum (Caudate, Putamen)
Ventral TegmentalArea
Substantia Nigra
Amygdala
Nucleus Accumbens(Ventral Striatum)
Prefrontal CortexParkinson’s Disease Motor control + initiation?
Intracranial self-stimulation;
Drug addiction;Natural rewards Reward pathway? Learning?
Also involved in:• Working memory• Novel situations• ADHD• Schizophrenia• …
![Page 21: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/21.jpg)
Role of dopamine: Many hypotheses
• Anhedonia hypothesis• Prediction error (learning, action selection)• Salience/attention• Incentive salience• Uncertainty• Cost/benefit computation • Energizing/motivating behavior
![Page 22: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/22.jpg)
22
dopamine and prediction error
no prediction prediction, reward prediction, no reward
TD error
Vt
R
RL
tttt VVr 1
)(t
![Page 23: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/23.jpg)
prediction error hypothesis of dopamine
Tobler et al, 2005
Fio
rillo
et a
l, 20
03
The idea: Dopamine encodes a reward prediction error
The idea: Dopamine encodes a reward prediction error
![Page 24: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/24.jpg)
prediction error hypothesis of dopamine
model prediction error
mea
sure
d fir
ing
rate
Bayer & Glimcher (2005)
at end of trial: t = rt - Vt (just like R-W)
Vt (1 )t i rii1
t
![Page 25: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/25.jpg)
what drives the dips?
• why an effect of reward at all?– Pavlovian
influence
Matsumoto & Hikosaka (2007)
![Page 26: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/26.jpg)
what drives the dips?
• rHab -> rSTN
• RMTg (predicted R/S)
Matsumoto & Hikosaka (2007)
Jhou et al, 2009
![Page 27: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/27.jpg)
Where does dopamine project to? Basal ganglia
Several large subcortical nuclei(unfortunate anatomical names follow structure rather than function, eg caudate + putamen + nucleus accumbens are all relatively similar pieces of striatum; but globus pallidus & substantia nigra each comprise two different things)
![Page 28: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/28.jpg)
Where does dopamine project to? Basal ganglia
inputs to BG are from all over the cortex (and topographically mapped)
Voorn et al, 2004
![Page 29: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/29.jpg)
striatal complexities
Cohen & Frank, 2009
![Page 30: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/30.jpg)
Risk Experiment
< 1 sec
0.5 sec
You won40 cents
5 secISI
19 subjects (dropped 3 non learners, N=16)3T scanner, TR=2sec, interleaved234 trials: 130 choice, 104 single stimulusrandomly ordered and counterbalanced
2-5secITI
5 stimuli:
40¢20¢
0/40¢0¢0¢
![Page 31: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/31.jpg)
Neural results: Prediction Errorswhat would a prediction error look like (in BOLD)?
![Page 32: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/32.jpg)
Neural results: Prediction errors in NAC
unbiased anatomical ROI in nucleus accumbens (marked per subject*)
* thanks to Laura deSouza
raw BOLD(avg over all subjects)
can actually decide between different neuroeconomic models of risk
![Page 33: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/33.jpg)
35
HighPain
LowPain
0.8 1.0
0.8 1.0
0.2
0.2
Prediction error
punishment prediction error
Value
TD error tttt VVr 1
![Page 34: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/34.jpg)
36
TD model
?
A – B – HIGH C – D – LOW C – B – HIGH A – B – HIGH A – D – LOW C – D – LOW A – B – HIGH A – B – HIGH C – D – LOW C – B – HIGH
Brain responses Prediction error
experimental sequence…..
MR scanner
Ben Seymour; John O’Doherty
punishment prediction error
![Page 35: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/35.jpg)
37
TD prediction error: ventral striatum
Z=-4 R
punishment prediction error
![Page 36: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/36.jpg)
38
right anterior insula
dorsal raphe (5HT)?
punishment prediction
![Page 37: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/37.jpg)
punishment
• dips below baseline in dopamine– Frank: D2 receptors
particularly sensitive– Bayer & Glimcher: length of
pause related to size of negative prediction error
• but: – can’t afford to wait that long– negative signal for such an
important event– opponency a more
conventional solution:• serotonin…
![Page 38: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/38.jpg)
40
generalization
![Page 39: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/39.jpg)
41
generalization
![Page 40: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/40.jpg)
other paradigms
• inhibitory conditioning• transreinforcer blocking• motivational sensitivities• backwards blocking
– Kalman filtering
• downwards unblocking• primacy as well as recency (highlighting)
– assumed density filtering
![Page 41: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/41.jpg)
Summary of this part: prediction and RL
Prediction is important for action selection
• The problem: prediction of future reward• The algorithm: temporal difference learning• Neural implementation: dopamine dependent learning in BG
Þ A precise computational model of learning allows one to look in the brain for “hidden variables” postulated by the model
Þ Precise (normative!) theory for generation of dopamine firing patterns
Þ Explains anticipatory dopaminergic responding, second order conditioning
Þ Compelling account for the role of dopamine in classical conditioning: prediction error acts as signal driving learning in prediction areas
![Page 42: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/42.jpg)
Striatum and learned values
Striatal neurons show ramping activity that precedes a reward (and changes with learning!)
(Schultz)
start food
(Daw)
![Page 43: Global plan Reinforcement learning I: –prediction –classical conditioning –dopamine Reinforcement learning II: –dynamic programming; action selection –Pavlovian](https://reader030.vdocuments.site/reader030/viewer/2022032703/56649d095503460f949db5d2/html5/thumbnails/43.jpg)
Phasic dopamine also responds to…
• Novel stimuli• Especially salient (attention grabbing) stimuli• Aversive stimuli (??)
• Reinforcers and appetitive stimuli induce approach behavior and learning, but also have attention functions (elicit orienting response) and disrupt ongoing behaviour.
→ Perhaps DA reports salience of stimuli (to attract attention; switching) and not a prediction error? (Horvitz, Redgrave)