Sequential & Temporally-Delayed Learning (ski.clps.brown.edu/cogsim/cogsim.8temporal.pdf)
TRANSCRIPT
Sequential & Temporally-Delayed Learning

1. The Problem
2. Sequential Learning & Context
3. Temporally-Delayed Learning & Reinforcement
The Problem

Error-driven + Hebbian: solve tasks, learn systematic representations, generalize to new stimuli.

What's left? ... Time!

Currently: networks learn the immediate consequence of a given input.

• What if the current input only makes sense as part of a sequence of inputs (e.g., language, social interactions)?
• What if the consequence of this input comes later (e.g., school/work, life)?
Sequence Learning

How do we do it? For example:

My favorite color is purple.
Purple my color favorite is.
Is my purple color favorite.
Is purple my color favorite.

The girl picked up the pen.
The pig raced around the pen.

We represent the context, not just the current input: in language, social interactions, driving (who goes at a 4-way stop?).
Representing Context for Sequence Learning

How does the brain do it? How would we get our models to do it?

Add layers to keep track of context (prefrontal cortex; hippocampus...).
An Example Task

BTXSE
BPVPSE
BTSXXTVVE
BPTVPSE

Which of the following sequences are allowed?
BTXXTTVVE, TSXSE, VVSXE, BSSXSE

Allowed: BTXSE, BPVPSE, BTSXXTVVE, BPTVPSE, BTXXTTVVE
Not allowed: TSXSE, VVSXE, BSSXSE

[Figure: finite-state grammar diagram, states 0–5 from start to end, with transitions labeled B, T, S, X, P, V, E.]

We implicitly learn such grammars (e.g., pressing buttons faster to letters that follow the grammar).
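As an illustration, such a grammar can be coded as a small transition table and checked mechanically. The table below is the classic Reber grammar, reconstructed so that it matches the slide's allowed/disallowed strings; the exact state numbering is an assumption, since the slide's diagram did not survive extraction.

```python
# Reconstructed Reber-style finite-state grammar (state numbering assumed).
# Each state maps an allowed letter to the next state; 7 is the accept state.
GRAMMAR = {
    0: {"B": 1},
    1: {"T": 2, "P": 3},
    2: {"S": 2, "X": 4},
    3: {"T": 3, "V": 5},
    4: {"X": 3, "S": 6},
    5: {"P": 4, "V": 6},
    6: {"E": 7},
}

def allowed(seq: str) -> bool:
    """Return True if seq is generated by the grammar (B ... E)."""
    state = 0
    for letter in seq:
        if letter not in GRAMMAR.get(state, {}):
            return False
        state = GRAMMAR[state][letter]
    return state == 7

# The slide's examples come out as expected under this reconstruction:
for s in ["BTXSE", "BPVPSE", "BTSXXTVVE", "BPTVPSE", "BTXXTTVVE"]:
    assert allowed(s)
for s in ["TSXSE", "VVSXE", "BSSXSE"]:
    assert not allowed(s)
```

Implicit learning experiments use strings generated this way: participants respond faster to letters that the grammar permits, without being able to state the rules.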
Time & Sequences

Currently: networks learn the immediate consequence of a given input.

What if the current input only makes sense as part of a temporally-extended sequence of inputs? (context)

What if the consequence of this input comes later in time? (next week)
Why Copy the Hidden Representation?

• Copying the input or output only lets the network hold onto one previous item.
• Copying the hidden layer lets the network hold onto an arbitrarily large number of items, even though it is always just copying the last hidden state at time t−1.
• The network learns how strongly to hold onto past items.
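A minimal sketch of this copy step in an Elman-style simple recurrent network (SRN). Layer sizes and random weights are arbitrary illustrative choices; only the forward pass is shown, with the context layer set to a verbatim copy of the previous hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes (arbitrary, for illustration)
n_in, n_hid, n_out = 5, 8, 5

W_ih = rng.normal(0, 0.5, (n_hid, n_in))    # input -> hidden
W_ch = rng.normal(0, 0.5, (n_hid, n_hid))   # context -> hidden (learned: how strongly to hold on)
W_ho = rng.normal(0, 0.5, (n_out, n_hid))   # hidden -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_sequence(inputs):
    """Forward pass over a sequence; context = copy of hidden(t-1)."""
    context = np.zeros(n_hid)
    outputs = []
    for x in inputs:
        hidden = sigmoid(W_ih @ x + W_ch @ context)
        outputs.append(sigmoid(W_ho @ hidden))
        context = hidden.copy()   # the "copy" step: hidden(t) is context at t+1
    return outputs

seq = [rng.random(n_in) for _ in range(4)]
outs = run_sequence(seq)
```

Because `W_ch` is trained like any other weight matrix, the network learns how strongly each piece of the previous hidden state influences the next one, which is the sense in which it learns how strongly to hold onto past items.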
The Simple SRN story is not flawless

• How is the hidden → "copy" function implemented biologically?
• During settling, context must be actively maintained (ongoing hidden activity has no effect on context).
• Assumes all context is relevant: what if distracting information is presented in the middle of a sequence? We want to hold onto only the relevant context.

→ Stay tuned for specialized biological/computational mechanisms for updating/gating vs. robust maintenance of context.
Motivating Motivation

Why does anyone go to university?

(Or, why do we ever do anything besides eat, sleep, have sex, etc.?)

e.g., Why am I here today, instead of lying on a beach in Mexico, drinking mojitos and reading a good book?

Challenge: make a responsible neural network!
The Motivational Bootstrap

• Some motivations must be built-in (else we would die).
• Where do art/science come from?
  – Need to learn on top of built-in drives.

Culture & social drives provide cumulative shaping of learning.

So, why does anyone go to university?
• Socially-mediated standards of success.
• Strong built-in desire to share with others.
• Strong built-in desire to learn (dopamine?)
What I'm Actually Talking About

Not "Skinnerian learning," but the basic stuff that every mammal has in common: the neural mechanisms of Pavlovian conditioning (from a computational perspective).

No supervised target signal is available: only good/bad outcomes.

This enables bootstrapping of new stimuli (CSs) onto built-in desires (USs): CS (money) → US (food, etc.)

But what if the consequence of a given input comes later in time?
Temporally-Delayed Learning & Reinforcement

Reinforcement is often delayed from the events that lead to it: we need to "span the gap."

Key idea:
• We want to predict future rewards consistently over time.
• This allows us to learn what events are associated with rewards, earlier and earlier back in time.

We use the Temporal Differences (TD) algorithm (Sutton & Barto).
Reinforcement learning and dopamine: prediction errors

[Figure: dopamine firing for positive and negative prediction errors.]

Schultz, Satoh, Roesch, Zaghloul, Glimcher, Hyland... and many more
Basic Data: VTA dopamine firing in Conditioning

Schultz, Montague & Dayan, 2007
Dopamine and Reward Probability

Burst/pause correlations with reward prediction errors (Bayer et al., 2007, J Neurophys)
Temporal Difference Learning: Equations

Value function, sum of discounted future rewards:
V(t) = 〈 γ⁰ r(t) + γ¹ r(t+1) + γ² r(t+2) + ... 〉   (1)

Recursive definition:
V(t) = 〈 r(t) + γ V(t+1) 〉   (2)

Error in predicted reward (from previous to next time-step):
δ(t) = ( r(t) + γ V̂(t+1) ) − V̂(t)   (3)

Update value estimate:
V̂(t) ← V̂(t) + α δ(t)   (4)

α = learning rate
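These equations can be run directly. Below is a minimal sketch with one V̂ entry per time step (a CSC-style code), a CS at t = 2 and a reward at t = 16; the episode length, learning rate, and trial count are illustrative choices, not the slides' exact simulation.

```python
import numpy as np

T, cs_t, rew_t = 20, 2, 16     # time steps; CS onset; reward time (illustrative)
gamma, alpha = 1.0, 0.1

V = np.zeros(T + 1)            # V[t]: predicted future reward at step t
r = np.zeros(T)
r[rew_t] = 1.0

for trial in range(200):
    delta = np.zeros(T)
    for t in range(T):
        delta[t] = r[t] + gamma * V[t + 1] - V[t]   # eq. (3)
        if t >= cs_t:                               # no stimulus before CS onset,
            V[t] += alpha * delta[t]                # so no weight to update there
```

Over trials, the error at the reward time shrinks toward zero while a persistent positive δ remains where the unpredicted CS arrives: the TD account of the dopamine response migrating from US to CS.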
TD and Dopamine Relationship

Schultz, Dayan & Montague, 1997, Science

Model: CS at t=2, US at t=16
[Figure: TD error vs. time (t = 0–20), panels a–c.]
Network Implementation

[Diagram: stimuli → hidden → V̂(t); δ(t) = ( r(t) + γ V̂(t+1) ) − V̂(t).]
Phase-based Implementation

[Diagram: stimulus timeline with value estimates V̂(1), V̂(2), V̂(3) and reward r(3); δ computed between minus and plus phases at each step.]

TDRewInteg = TDRewPred + ExtRew

Minus phase: TDRewInteg clamped to previous plus-phase value.

Plus phase: TDRewInteg settles via weights = expected reward at t+1, plus any ExtRew at time t.

Learning signal δ (= "TD") trains the prediction for the previous time step. (Eligibility traces needed.)
Exploration: [rl_cond.proj]

[Figure: input grid over time steps 0–19; stimuli: tone, light, odor; TDRewPred units.]

'Complete Serial Compound' (CSC) input representation: a unique unit for each stimulus at each time point (used in Sutton & Barto, Montague et al., etc.).

Not realistic, but good for demonstration. This assumption can be relaxed without changing the core ideas (e.g., Ludvig et al., 2008).
Exploration: [rl_cond.proj]

Standard TD: V̂(t) = Σ_i w_i x_i(t)   [x_i are inputs: tone, light]

Here: passed through an activation function; it has to surpass a threshold, subject to inhibitory competition from other value representations.
DA and Timing: Late and Early Rewards

Hollerman & Schultz, 1998
DA and Learning: Auditory Cortex

Bao et al., 2001, Nature
Learning Theory: Blocking (Behavior)

Learning Theory: Blocking (Dopamine)

Waelti et al., 2001, Nature
[Figure: blocked stimulus vs. control (not blocked) stimulus.]

Learning Theory: (Un)Blocking
TD prediction error and human functional imaging

O'Doherty et al., 2004, Science

Ventral striatum = DA-enriched, correlates with TD PE = Critic!
Optical phasic DA stimulation causally induces conditioning (Tsai et al., 2009, Science)

DA neuron spiking during a reinforcement task in humans (Zaghloul et al., 2009, Science)
How are dopamine-based RPE signals used to select actions?

We will consider the biological implementation in the basal ganglia later.
Q learning: extending prediction error learning to actions

Error in predicted reward:
δ_t = ( r_t + γ max_a Q_t(s_{t+1}, a) ) − Q_t(s, a)

Update value estimate:
Q_t(s, a) ← Q_t(s, a) + α δ_t

Select among Q values:
P_t(a) = e^{Q_t(s,a)/β} / Σ_{i=1}^{n} e^{Q_t(s,i)/β}

γ = discount, α = learning rate, β = "temperature"/exploration parameter

Watkins & Dayan, 1992
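The Q-learning and softmax equations above can be sketched on a tiny four-state chain world (move right to reach a reward); the environment, constants, and episode count are illustrative assumptions, not anything from the slides.

```python
import numpy as np

# Tabular Q-learning with softmax action selection on a 4-state chain MDP.
n_states, n_actions = 4, 2          # actions: 0 = left, 1 = right
gamma, alpha, beta = 0.9, 0.2, 0.1  # discount, learning rate, temperature

rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def softmax_policy(q_row):
    """P(a) proportional to exp(Q(s,a)/beta), as in the selection equation."""
    p = np.exp(q_row / beta)
    return p / p.sum()

for episode in range(300):
    s = 0
    while s != n_states - 1:                      # rightmost state is terminal
        a = rng.choice(n_actions, p=softmax_policy(Q[s]))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        delta = r + gamma * Q[s_next].max() - Q[s, a]   # prediction error
        Q[s, a] += alpha * delta                        # value update
        s = s_next
```

After training, the learned Q values prefer "right" in every non-terminal state, and the temperature β controls how sharply the softmax exploits that preference.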
DeepMind RL Network ("DQN") Plays Atari

For video: DQN space invaders.mov
Extra

The following slides describe a recently developed alternative to TD, called PVLV, which we think is more biologically plausible and computationally powerful. This material is optional for the course.
The Problem

Q: How do we learn to attach positive/negative valence to environmental stimuli?

A: The same way we learn lots of other stuff: the Delta Rule!

δ_pv = r − V̂_pv

V̂_pv: expected reward based on prior associations
r: reward
δ_pv: learning signal
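A tiny runnable sketch of this delta rule over a set of active stimuli (i.e., Rescorla-Wagner learning); the stimulus names and learning rate are illustrative.

```python
# V-hat is a weighted sum of currently active stimuli, and each active
# stimulus's weight moves in proportion to the prediction error.
eps = 0.2                       # learning rate (illustrative)
w = {"light": 0.0, "tone": 0.0}

def trial(stimuli, r):
    """One conditioning trial: return the prediction error delta_pv."""
    v_hat = sum(w[s] for s in stimuli)   # expected reward from prior associations
    delta = r - v_hat                    # delta_pv = r - V-hat_pv
    for s in stimuli:
        w[s] += eps * delta
    return delta

for _ in range(30):
    trial({"light"}, r=1.0)              # light reliably predicts reward
```

Over trials, δ shrinks toward zero as `w["light"]` approaches the reward value: once the stimulus fully predicts the reward, there is nothing left to learn.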
The Problem

Q: But what happens when the environmental stimulus occurs before reward?
Basic Data: VTA DA Neural Firing in Conditioning

[Figure: DA firing aligned to CS and reward: a) acquisition; b) trained, with reward omission.]

Dopamine spikes/dips are learning signals.

The delta rule fails to account for the predictive DA spike!
Standard Approach: TD

Predict all future rewards (discounted):
V_t = Σ_{τ=t+1}^{∞} γ^{τ−(t+1)} r_τ

Recursively:
V̂_{t−1} = r_t + γ V̂_t

Error = Temporal Difference = TD:
DA = δ_t = [ r_t + γ V̂_t ] − V̂_{t−1}
TD Illustrated

[Diagram: stimuli S1–S4 (tone) over time, with V̂ and δ computed at each step; over learning (initial → final), δ migrates back from the reward to tone onset.]
Problems with TD

• Great algorithm, developed in computer science/machine learning, but is this actually what the brain does?
• Even if so, it doesn't specify how these signals are computed by systems upstream of DA... it just predicts DA and δ but says nothing about V, etc.
• Current reward value is always relative to what happened just before. Too much temporal dependency?
• Chaining is not seen in neural recordings.
• What determines the "discount factor" γ, biologically?
Rat study: simultaneous CS and US DA spikes

Pan et al., 2005, Journal of Neuroscience

Inconsistent with standard TD!
The PVLV Alternative

PVLV = Primary Value, Learned Value (O'Reilly, Frank, Hazy & Watz, 2007, Behav Neurosci)

• No reward predictions, just associations!
• No temporal dependencies: DA depends only on the current state.
• Uses the same basic delta-rule learning as TD (Rescorla-Wagner).
PVLV: Two Separate Mechanisms (PV, LV)

[Diagram: stimuli (CS) drive the LV system (ventral striatum patch, NAc); stimuli (US) drive the PV system (LHA, PPT); excitatory (PVe, LVe) and inhibitory (PVi, LVi) pathways converge on DA (VTA/SNc); cerebellum provides timing.]

• PV (Primary Value): primary rewards (US), canceled.
• LV (Learned Value): learned associations (CS → DA).
PVLV: Two Separate Mechanisms (PV, LV)

PV: Primary Value
• Trained at each point in time on the actual reward value present:
δ_t = r_t − V̂_t
• This uses the immediate prediction (V̂_t) of the current reward value (r_t).
• Accounts for canceling of the DA spike at reward, and for DA dips when no reward is received.
• But this doesn't account for predictive DA spikes... (it actually results in predictive DA dips!)
PVLV: Two Separate Mechanisms (PV, LV)

LV: Learned Value
• Represents perceived values of stimuli even when there is no current reward expectation.
• Only gets a training signal at reward, or when PV expects some reward (i.e., learning is filtered by the primary PV system).
• → Learns at the time of reward, but not at CS onset.
• → Generalizes reward values to the CS...
• → Accounts for DA spikes for stimuli that have previously been associated with reward!
PVLV: Computationally Powerful

Comparison with TD on random delays (breaks TD chaining):

[Figure: Avg DA Value vs. Epochs (0–250) for delays 3, 6, 12 under random delay (p = .2): TD (discount .95, lrate .1) vs. PVLV (lrate .005).]

Enables a working memory model to learn complex WM tasks.
Similar to Brown, Bullock & Grossberg, '99

Differences:
Anatomical (CNA vs. VS; dorsal vs. ventral patch)
Functional (intrinsic timing? LV system cannot train itself).
PVLV accounts for timing data better than TD!

• Data: during the transient learning period, both rewards and CS elicit activation.
• This is accounted for by the PV and LV systems operating in parallel.
• TD predicts chaining back in time from reward to CS.
PVLV accounts for timing data better than TD!

• Data: delayed rewards cause dips at the usual time, then spikes.
• This is accounted for by both TD and PV.
• Data: early rewards cause spikes, then dips at the usual time.
• This is accounted for by PV (spike) and PV (dip), but TD only accounts for the spike.
More Key Predictions from PVLV

[Diagram: same PVLV anatomy as above: CNA, ventral striatum patch, VTA/SNc, PPT, LHA, NAc, cerebellar timing.]

• CNA = Pavlovian conditioning (e.g., Killcross et al., '97).
• NAc (patch/shell) = extinction (Ferry et al., '00; Annett et al., '89), blocking (data?).
• NAc (matrix/core) = basic actions (ORs, approach, avoid).
• CNA can't train itself: no 2nd-order conditioning!
• BLA = 2nd-order conditioning, uses DA-independent mechanisms (CNA/BLA double dissociation).
Conclusions

PVLV provides a computationally motivated architecture that seems to fit with biological & behavioral data.

These learning mechanisms enable arbitrary stimuli/goals to be plugged into our fixed set of built-in motivational drives.

Something motivates every generated mental state, always!
PVLV, WM, and DA

[Diagram: a) CS-driven DA causes updating; PFC spans the delay (maintenance in PFC). b) DA at US/r reinforces the BG Go pathway.]
PVLV: Two Separate Mechanisms (PV, LV)

PV learning:
δ_pv = r − V̂_pv   (or, equivalently: δ_pv = PVe − PVi)
Δw_i = ǫ x_i δ_pv

LV learning (filtered by PV):
Δw_i = ǫ (r_t − V̂_lv) x_i   if V̂_pv > θ_pv or r_t > 0
Δw_i = 0                     otherwise

Global DA (PV dominates):
δ_t = δ_pv   if V̂_pv > θ_pv or r_t > 0
δ_t = δ_lv   otherwise

δ_lv = LVe − LVi
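A runnable sketch of these updates on a one-CS conditioning problem. Here PV's expectation is driven by a timing unit active only at the usual reward time (standing in for the cerebellar timing input), LV is driven by the CS, and the inhibitory LVi term is omitted; all names and constants are illustrative assumptions rather than the published model's parameters.

```python
eps, theta_pv = 0.2, 0.1
w_pv = 0.0   # timing-unit -> PV weight
w_lv = 0.0   # CS -> LV weight

def step(cs, timing, r):
    """One time step; cs, timing, r are 0/1. Returns the global DA signal."""
    global w_pv, w_lv
    v_pv = w_pv * timing              # PV: expected reward right now
    v_lv = w_lv * cs                  # LV: learned CS value
    if v_pv > theta_pv or r > 0:      # PV dominates; LV learning passes the filter
        w_lv += eps * (r - v_lv) * cs
        d = r - v_pv                  # delta_pv
        w_pv += eps * d * timing
        return d
    return v_lv                       # delta_lv (LVe, with LVi omitted)

history = []
for trial in range(40):
    da_cs = step(cs=1, timing=0, r=0)   # CS onset
    da_us = step(cs=1, timing=1, r=1)   # reward delivery
    history.append((da_cs, da_us))
```

Early trials show a DA spike at reward and none at the CS; late trials show a DA spike at the CS while the reward response is canceled by PV, with no temporal chaining anywhere in between.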
LV Extras

• DA spikes are only observed at CS onset; they don't continue throughout the delay until reward. A problem for PV?
• Solution: the PV system has synaptic depression and accommodates to constant sensory inputs; it only perceives values of stimuli that were not present in the last time step.
• This is also important for PFC learning... (stay tuned)