Sequential & Temporally-Delayed Learning (ski.clps.brown.edu/cogsim/cogsim.8temporal.pdf)
TRANSCRIPT
Sequential & Temporally-Delayed Learning

1. The Problem
2. Sequential Learning & Context
3. Temporally-Delayed Learning & Reinforcement
The Problem

Error-driven + Hebbian: solve tasks, learn systematic representations, generalize to new stimuli.

What's left? ... Time!

Currently: networks learn the immediate consequence of a given input.

• What if the current input only makes sense as part of a sequence of inputs (e.g., language, social interactions)?
• What if the consequence of this input comes later (e.g., school/work, life)?
Sequence Learning

How do we do it? For example:

My favorite color is purple.
Purple my color favorite is.
Is my purple color favorite.
Is purple my color favorite.

The girl picked up the pen.
The pig raced around the pen.

We represent the context, not just the current input: in language, social interactions, driving (who goes at a 4-way stop?).
Representing Context for Sequence Learning

How does the brain do it? How would we get our models to do it?

Add layers to keep track of context (prefrontal cortex; hippocampus...).
An Example Task

BTXSE
BPVPSE
BTSXXTVVE
BPTVPSE

Which of the following sequences are allowed?
BTXXTTVVE, TSXSE, VVSXE, BSSXSE

Allowed: BTXSE, BPVPSE, BTSXXTVVE, BPTVPSE, BTXXTTVVE
Not allowed: TSXSE, VVSXE, BSSXSE

[Figure: finite-state grammar diagram, states 0–5 from start to end, with transitions labeled B, T, S, X, P, V, E.]

We implicitly learn such grammars (e.g., pressing buttons faster to letters that follow the grammar).
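As an illustration, such a grammar can be coded as a small transition table and checked mechanically. The table below is the classic Reber grammar, reconstructed so that it matches the slide's allowed/disallowed strings; the exact state numbering is an assumption, since the slide's diagram did not survive extraction.

```python
# Reconstructed Reber-style finite-state grammar (state numbering assumed).
# Each state maps an allowed letter to the next state; 7 is the accept state.
GRAMMAR = {
    0: {"B": 1},
    1: {"T": 2, "P": 3},
    2: {"S": 2, "X": 4},
    3: {"T": 3, "V": 5},
    4: {"X": 3, "S": 6},
    5: {"P": 4, "V": 6},
    6: {"E": 7},
}

def allowed(seq: str) -> bool:
    """Return True if seq is generated by the grammar (B ... E)."""
    state = 0
    for letter in seq:
        if letter not in GRAMMAR.get(state, {}):
            return False
        state = GRAMMAR[state][letter]
    return state == 7

# The slide's examples come out as expected under this reconstruction:
for s in ["BTXSE", "BPVPSE", "BTSXXTVVE", "BPTVPSE", "BTXXTTVVE"]:
    assert allowed(s)
for s in ["TSXSE", "VVSXE", "BSSXSE"]:
    assert not allowed(s)
```

Implicit learning experiments use strings generated this way: participants respond faster to letters that the grammar permits, without being able to state the rules.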
Time & Sequences

Currently: networks learn the immediate consequence of a given input.

What if the current input only makes sense as part of a temporally-extended sequence of inputs? (context)

What if the consequence of this input comes later in time? (next week)
Why Copy the Hidden Representation?

• Copying the input or output only lets the network hold onto one previous item.
• Copying the hidden layer lets the network hold onto an arbitrarily large number of items, even though it is always just copying the last hidden state at time t−1.
• The network learns how strongly to hold onto past items.
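A minimal sketch of this copy step in an Elman-style simple recurrent network (SRN). Layer sizes and random weights are arbitrary illustrative choices; only the forward pass is shown, with the context layer set to a verbatim copy of the previous hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes (arbitrary, for illustration)
n_in, n_hid, n_out = 5, 8, 5

W_ih = rng.normal(0, 0.5, (n_hid, n_in))    # input -> hidden
W_ch = rng.normal(0, 0.5, (n_hid, n_hid))   # context -> hidden (learned: how strongly to hold on)
W_ho = rng.normal(0, 0.5, (n_out, n_hid))   # hidden -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_sequence(inputs):
    """Forward pass over a sequence; context = copy of hidden(t-1)."""
    context = np.zeros(n_hid)
    outputs = []
    for x in inputs:
        hidden = sigmoid(W_ih @ x + W_ch @ context)
        outputs.append(sigmoid(W_ho @ hidden))
        context = hidden.copy()   # the "copy" step: hidden(t) is context at t+1
    return outputs

seq = [rng.random(n_in) for _ in range(4)]
outs = run_sequence(seq)
```

Because `W_ch` is trained like any other weight matrix, the network learns how strongly each piece of the previous hidden state influences the next one, which is the sense in which it learns how strongly to hold onto past items.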
The Simple SRN story is not flawless

• How is the hidden → "copy" function implemented biologically?
• During settling, context must be actively maintained (ongoing hidden activity has no effect on context).
• Assumes all context is relevant: what if distracting information is presented in the middle of a sequence? We want to hold onto only the relevant context.

→ Stay tuned for specialized biological/computational mechanisms for updating/gating vs. robust maintenance of context.
Motivating Motivation

Why does anyone go to university?

(Or, why do we ever do anything besides eat, sleep, have sex, etc.?)

e.g., Why am I here today, instead of lying on a beach in Mexico, drinking mojitos and reading a good book?

Challenge: make a responsible neural network!
The Motivational Bootstrap

• Some motivations must be built-in (else we would die).
• Where do art/science come from?
  – Need to learn on top of built-in drives.

Culture & social drives provide cumulative shaping of learning.

So, why does anyone go to university?
• Socially-mediated standards of success.
• Strong built-in desire to share with others.
• Strong built-in desire to learn (dopamine?)
What I'm Actually Talking About

Not "Skinnerian learning," but the basic stuff that every mammal has in common: the neural mechanisms of Pavlovian conditioning (from a computational perspective).

No supervised target signal is available: only good/bad outcomes.

This enables bootstrapping of new stimuli (CSs) onto built-in desires (USs): CS (money) → US (food, etc.)

But what if the consequence of a given input comes later in time?
Temporally-Delayed Learning & Reinforcement

Reinforcement is often delayed from the events that lead to it: we need to "span the gap."

Key idea:
• We want to predict future rewards consistently over time.
• This allows us to learn what events are associated with rewards, earlier and earlier back in time.

We use the Temporal Differences (TD) algorithm (Sutton & Barto).
Reinforcement learning and dopamine: prediction errors

[Figure: dopamine firing for positive and negative prediction errors.]

Schultz, Satoh, Roesch, Zaghloul, Glimcher, Hyland... and many more
Basic Data: VTA dopamine firing in Conditioning

Schultz, Montague & Dayan, 2007
Dopamine and Reward Probability

Burst/pause correlations with reward prediction errors (Bayer et al., 2007, J Neurophys)
Temporal Difference Learning: Equations

Value function, sum of discounted future rewards:
V(t) = 〈 γ⁰ r(t) + γ¹ r(t+1) + γ² r(t+2) + ... 〉   (1)

Recursive definition:
V(t) = 〈 r(t) + γ V(t+1) 〉   (2)

Error in predicted reward (from previous to next time-step):
δ(t) = ( r(t) + γ V̂(t+1) ) − V̂(t)   (3)

Update value estimate:
V̂(t) ← V̂(t) + α δ(t)   (4)

α = learning rate
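These equations can be run directly. Below is a minimal sketch with one V̂ entry per time step (a CSC-style code), a CS at t = 2 and a reward at t = 16; the episode length, learning rate, and trial count are illustrative choices, not the slides' exact simulation.

```python
import numpy as np

T, cs_t, rew_t = 20, 2, 16     # time steps; CS onset; reward time (illustrative)
gamma, alpha = 1.0, 0.1

V = np.zeros(T + 1)            # V[t]: predicted future reward at step t
r = np.zeros(T)
r[rew_t] = 1.0

for trial in range(200):
    delta = np.zeros(T)
    for t in range(T):
        delta[t] = r[t] + gamma * V[t + 1] - V[t]   # eq. (3)
        if t >= cs_t:                               # no stimulus before CS onset,
            V[t] += alpha * delta[t]                # so no weight to update there
```

Over trials, the error at the reward time shrinks toward zero while a persistent positive δ remains where the unpredicted CS arrives: the TD account of the dopamine response migrating from US to CS.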
TD and Dopamine Relationship

Schultz, Dayan & Montague, 1997, Science

Model: CS at t=2, US at t=16
[Figure: TD error vs. time (t = 0–20), panels a–c.]
Network Implementation

[Diagram: stimuli → hidden → V̂(t); δ(t) = ( r(t) + γ V̂(t+1) ) − V̂(t).]
Phase-based Implementation

[Diagram: stimulus timeline with value estimates V̂(1), V̂(2), V̂(3) and reward r(3); δ computed between minus and plus phases at each step.]

TDRewInteg = TDRewPred + ExtRew

Minus phase: TDRewInteg clamped to previous plus-phase value.

Plus phase: TDRewInteg settles via weights = expected reward at t+1, plus any ExtRew at time t.

Learning signal δ (= "TD") trains the prediction for the previous time step. (Eligibility traces needed.)
Exploration: [rl_cond.proj]

[Figure: input grid over time steps 0–19; stimuli: tone, light, odor; TDRewPred units.]

'Complete Serial Compound' (CSC) input representation: a unique unit for each stimulus at each time point (used in Sutton & Barto, Montague et al., etc.).

Not realistic, but good for demonstration. This assumption can be relaxed without changing the core ideas (e.g., Ludvig et al., 2008).
Exploration: [rl_cond.proj]

Standard TD: V̂(t) = Σ_i w_i x_i(t)   [x_i are inputs: tone, light]

Here: passed through an activation function; it has to surpass a threshold, subject to inhibitory competition from other value representations.
DA and Timing: Late and Early Rewards

Hollerman & Schultz, 1998
DA and Learning: Auditory Cortex

Bao et al., 2001, Nature
Learning Theory: Blocking (Behavior)

Learning Theory: Blocking (Dopamine)

Waelti et al., 2001, Nature
[Figure: blocked stimulus vs. control (not blocked) stimulus.]

Learning Theory: (Un)Blocking
TD prediction error and human functional imaging

O'Doherty et al., 2004, Science

Ventral striatum = DA-enriched, correlates with TD PE = Critic!
Optical phasic DA stimulation causally induces conditioning (Tsai et al., 2009, Science)

DA neuron spiking during a reinforcement task in humans (Zaghloul et al., 2009, Science)
How are dopamine-based RPE signals used to select actions?

We will consider the biological implementation in the basal ganglia later.
Q learning: extending prediction error learning to actions

Error in predicted reward:
δ_t = ( r_t + γ max_a Q_t(s_{t+1}, a) ) − Q_t(s, a)

Update value estimate:
Q_t(s, a) ← Q_t(s, a) + α δ_t

Select among Q values:
P_t(a) = e^{Q_t(s,a)/β} / Σ_{i=1}^{n} e^{Q_t(s,i)/β}

γ = discount, α = learning rate, β = "temperature"/exploration parameter

Watkins & Dayan, 1992
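The Q-learning and softmax equations above can be sketched on a tiny four-state chain world (move right to reach a reward); the environment, constants, and episode count are illustrative assumptions, not anything from the slides.

```python
import numpy as np

# Tabular Q-learning with softmax action selection on a 4-state chain MDP.
n_states, n_actions = 4, 2          # actions: 0 = left, 1 = right
gamma, alpha, beta = 0.9, 0.2, 0.1  # discount, learning rate, temperature

rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def softmax_policy(q_row):
    """P(a) proportional to exp(Q(s,a)/beta), as in the selection equation."""
    p = np.exp(q_row / beta)
    return p / p.sum()

for episode in range(300):
    s = 0
    while s != n_states - 1:                      # rightmost state is terminal
        a = rng.choice(n_actions, p=softmax_policy(Q[s]))
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        delta = r + gamma * Q[s_next].max() - Q[s, a]   # prediction error
        Q[s, a] += alpha * delta                        # value update
        s = s_next
```

After training, the learned Q values prefer "right" in every non-terminal state, and the temperature β controls how sharply the softmax exploits that preference.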
DeepMind RL Network ("DQN") Plays Atari

For video: DQN space invaders.mov
Extra

The following slides describe a recently developed alternative to TD, called PVLV, which we think is more biologically plausible and computationally powerful. This material is optional for the course.
The Problem

Q: How do we learn to attach positive/negative valence to environmental stimuli?

A: The same way we learn lots of other stuff: the Delta Rule!

δ_pv = r − V̂_pv

V̂_pv: expected reward based on prior associations
r: reward
δ_pv: learning signal
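A tiny runnable sketch of this delta rule over a set of active stimuli (i.e., Rescorla-Wagner learning); the stimulus names and learning rate are illustrative.

```python
# V-hat is a weighted sum of currently active stimuli, and each active
# stimulus's weight moves in proportion to the prediction error.
eps = 0.2                       # learning rate (illustrative)
w = {"light": 0.0, "tone": 0.0}

def trial(stimuli, r):
    """One conditioning trial: return the prediction error delta_pv."""
    v_hat = sum(w[s] for s in stimuli)   # expected reward from prior associations
    delta = r - v_hat                    # delta_pv = r - V-hat_pv
    for s in stimuli:
        w[s] += eps * delta
    return delta

for _ in range(30):
    trial({"light"}, r=1.0)              # light reliably predicts reward
```

Over trials, δ shrinks toward zero as `w["light"]` approaches the reward value: once the stimulus fully predicts the reward, there is nothing left to learn.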
The Problem

Q: But what happens when the environmental stimulus occurs before reward?
Basic Data: VTA DA Neural Firing in Conditioning

[Figure: DA firing aligned to CS and reward: a) acquisition; b) trained, with reward omission.]

Dopamine spikes/dips are learning signals.

The delta rule fails to account for the predictive DA spike!
Standard Approach: TD

Predict all future rewards (discounted):
V_t = Σ_{τ=t+1}^{∞} γ^{τ−(t+1)} r_τ

Recursively:
V̂_{t−1} = r_t + γ V̂_t

Error = Temporal Difference = TD:
DA = δ_t = [ r_t + γ V̂_t ] − V̂_{t−1}
TD Illustrated

[Diagram: stimuli S1–S4 (tone) over time, with V̂ and δ computed at each step; over learning (initial → final), δ migrates back from the reward to tone onset.]
Problems with TD

• Great algorithm, developed in computer science/machine learning, but is this actually what the brain does?
• Even if so, it doesn't specify how these signals are computed by systems upstream of DA... it just predicts DA and δ but says nothing about V, etc.
• Current reward value is always relative to what happened just before. Too much temporal dependency?
• Chaining is not seen in neural recordings.
• What determines the "discount factor" γ, biologically?
Rat study: simultaneous CS and US DA spikes

Pan et al., 2005, Journal of Neuroscience

Inconsistent with standard TD!
The PVLV Alternative

PVLV = Primary Value, Learned Value (O'Reilly, Frank, Hazy & Watz, 2007, Behav Neurosci)

• No reward predictions, just associations!
• No temporal dependencies: DA depends only on the current state.
• Uses the same basic delta-rule learning as TD (Rescorla-Wagner).
PVLV: Two Separate Mechanisms (PV, LV)

[Diagram: stimuli (CS) drive the LV system (ventral striatum patch, NAc); stimuli (US) drive the PV system (LHA, PPT); excitatory (PVe, LVe) and inhibitory (PVi, LVi) pathways converge on DA (VTA/SNc); cerebellum provides timing.]

• PV (Primary Value): primary rewards (US), canceled.
• LV (Learned Value): learned associations (CS → DA).
PVLV: Two Separate Mechanisms (PV, LV)

PV: Primary Value
• Trained at each point in time on the actual reward value present:
δ_t = r_t − V̂_t
• This uses the immediate prediction (V̂_t) of the current reward value (r_t).
• Accounts for canceling of the DA spike at reward, and for DA dips when no reward is received.
• But this doesn't account for predictive DA spikes... (it actually results in predictive DA dips!)
PVLV: Two Separate Mechanisms (PV, LV)

LV: Learned Value
• Represents perceived values of stimuli even when there is no current reward expectation.
• Only gets a training signal at reward, or when PV expects some reward (i.e., learning is filtered by the primary PV system).
• → Learns at the time of reward, but not at CS onset.
• → Generalizes reward values to the CS...
• → Accounts for DA spikes for stimuli that have previously been associated with reward!
PVLV: Computationally Powerful

Comparison with TD on random delays (breaks TD chaining):

[Figure: Avg DA Value vs. Epochs (0–250) for delays 3, 6, 12 under random delay (p = .2): TD (discount .95, lrate .1) vs. PVLV (lrate .005).]

Enables a working memory model to learn complex WM tasks.
Similar to Brown, Bullock & Grossberg, '99

Differences:
Anatomical (CNA vs. VS; dorsal vs. ventral patch)
Functional (intrinsic timing? LV system cannot train itself).
PVLV accounts for timing data better than TD!

• Data: during the transient learning period, both rewards and CS elicit activation.
• This is accounted for by the PV and LV systems operating in parallel.
• TD predicts chaining back in time from reward to CS.
PVLV accounts for timing data better than TD!

• Data: delayed rewards cause dips at the usual time, then spikes.
• This is accounted for by both TD and PV.
• Data: early rewards cause spikes, then dips at the usual time.
• This is accounted for by PV (spike) and PV (dip), but TD only accounts for the spike.
More Key Predictions from PVLV

[Diagram: same PVLV anatomy as above: CNA, ventral striatum patch, VTA/SNc, PPT, LHA, NAc, cerebellar timing.]

• CNA = Pavlovian conditioning (e.g., Killcross et al., '97).
• NAc (patch/shell) = extinction (Ferry et al., '00; Annett et al., '89), blocking (data?).
• NAc (matrix/core) = basic actions (ORs, approach, avoid).
• CNA can't train itself: no 2nd-order conditioning!
• BLA = 2nd-order conditioning, uses DA-independent mechanisms (CNA/BLA double dissociation).
Conclusions

PVLV provides a computationally motivated architecture that seems to fit with biological & behavioral data.

These learning mechanisms enable arbitrary stimuli/goals to be plugged into our fixed set of built-in motivational drives.

Something motivates every generated mental state, always!
PVLV, WM, and DA

[Diagram: a) CS-driven DA causes updating; PFC spans the delay (maintenance in PFC). b) DA at US/r reinforces the BG Go pathway.]
PVLV: Two Separate Mechanisms (PV, LV)

PV learning:
δ_pv = r − V̂_pv   (or, equivalently: δ_pv = PVe − PVi)
Δw_i = ǫ x_i δ_pv

LV learning (filtered by PV):
Δw_i = ǫ (r_t − V̂_lv) x_i   if V̂_pv > θ_pv or r_t > 0
Δw_i = 0                     otherwise

Global DA (PV dominates):
δ_t = δ_pv   if V̂_pv > θ_pv or r_t > 0
δ_t = δ_lv   otherwise

δ_lv = LVe − LVi
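A runnable sketch of these updates on a one-CS conditioning problem. Here PV's expectation is driven by a timing unit active only at the usual reward time (standing in for the cerebellar timing input), LV is driven by the CS, and the inhibitory LVi term is omitted; all names and constants are illustrative assumptions rather than the published model's parameters.

```python
eps, theta_pv = 0.2, 0.1
w_pv = 0.0   # timing-unit -> PV weight
w_lv = 0.0   # CS -> LV weight

def step(cs, timing, r):
    """One time step; cs, timing, r are 0/1. Returns the global DA signal."""
    global w_pv, w_lv
    v_pv = w_pv * timing              # PV: expected reward right now
    v_lv = w_lv * cs                  # LV: learned CS value
    if v_pv > theta_pv or r > 0:      # PV dominates; LV learning passes the filter
        w_lv += eps * (r - v_lv) * cs
        d = r - v_pv                  # delta_pv
        w_pv += eps * d * timing
        return d
    return v_lv                       # delta_lv (LVe, with LVi omitted)

history = []
for trial in range(40):
    da_cs = step(cs=1, timing=0, r=0)   # CS onset
    da_us = step(cs=1, timing=1, r=1)   # reward delivery
    history.append((da_cs, da_us))
```

Early trials show a DA spike at reward and none at the CS; late trials show a DA spike at the CS while the reward response is canceled by PV, with no temporal chaining anywhere in between.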
LV Extras

• DA spikes are only observed at CS onset; they don't continue throughout the delay until reward. A problem for PV?
• Solution: the PV system has synaptic depression and accommodates to constant sensory inputs; it only perceives values of stimuli that were not present in the last time step.
• This is also important for PFC learning... (stay tuned)