prefrontal cortex as a meta-reinforcement learning system · prefrontal cortex as a...

Prefrontal cortex as a meta-reinforcement learning system

Matthew Botvinick
DeepMind, London, UK
Gatsby Computational Neuroscience Unit, UCL

Mnih et al., Nature (2015)

Yamins & DiCarlo, 2016

Schultz et al., Science (1997)

Jaderberg et al., 2016

Mante et al., Nature (2013)

Song et al., eLife (2017)

Lake et al., BBS (2017)

“Learning to learn” (Harlow, Psychological Review, 1949)

[Figure: results plotted over training episodes]

Jaderberg et al., 2016

https://deepmind.com/blog/impala-scalable-distributed-deeprl-dmlab-30/

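[Figure: meta-RL architecture. A recurrent network (PFC) receives the observation o_t, the previous action a_{t-1} and the previous reward r_{t-1}, and outputs the action a_t and the value estimate v_t; δ is the reward prediction error signal (DA).]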

Wang et al., Nature Neuroscience (2018); Wang et al., Cog. Sci. (2016); Duan et al., arXiv (2016)
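The architecture can be read as a data flow: on each step the recurrent network sees o_t together with its own previous action and reward, and emits a_t and v_t, while δ is a reward prediction error. The minimal sketch below covers only the input assembly and a one-step temporal-difference form of δ; the recurrent network itself is omitted, and the one-hot action code, the discount γ = 0.9 and the function names are illustrative assumptions rather than the model's actual code.

```python
def build_input(obs, prev_action, prev_reward, n_actions):
    """Concatenate o_t, a one-hot code for a_{t-1}, and r_{t-1} into one input vector."""
    one_hot = [0.0] * n_actions
    if prev_action is not None:  # no previous action on the first step of an episode
        one_hot[prev_action] = 1.0
    return list(obs) + one_hot + [float(prev_reward)]


def td_errors(rewards, values, gamma=0.9, bootstrap=0.0):
    """One-step TD errors: delta_t = r_t + gamma * v_{t+1} - v_t.

    The value after the final step is taken to be `bootstrap` (0 at episode end).
    """
    next_values = list(values[1:]) + [bootstrap]
    return [r + gamma * v_next - v
            for r, v, v_next in zip(rewards, values, next_values)]
```

Feeding the previous action and reward back in as inputs is what lets the recurrent dynamics, rather than weight changes alone, carry information about the current task.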

[Figure: example bandit episodes; arm reward probabilities 0.7, 0.4, 0.6, 0.9, 0.3, 0.1, 0.8, 0.7]

Wang et al., Nature Neuroscience (2018); Wang et al., Cog. Sci. (2016)



[Figure: cumulative regret over trials 1 to 100, compared with Gittins indices, UCB and Thompson sampling; Left/Right choices by trial and episode]
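Cumulative regret sums, over trials, the gap between the best arm's reward probability and that of the arm actually chosen. As a reference point, the sketch below computes it for one of the named baselines, Thompson sampling with Beta(1, 1) priors on a Bernoulli bandit; the arm probabilities, trial count, seed and function name are illustrative assumptions. An index policy such as UCB would change only the arm-selection line.

```python
import random


def thompson_regret(probs, n_trials, seed=0):
    """Beta-Bernoulli Thompson sampling on arms with reward probabilities `probs`.

    Returns the cumulative expected regret after each trial.
    """
    rng = random.Random(seed)
    n_arms = len(probs)
    successes = [1] * n_arms  # Beta(1, 1) prior pseudo-counts
    failures = [1] * n_arms
    best = max(probs)
    regret, cumulative = 0.0, []
    for _ in range(n_trials):
        # Sample a plausible reward rate for each arm and choose the largest sample.
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        regret += best - probs[arm]  # expected regret of this choice
        cumulative.append(regret)
    return cumulative
```

For example, `thompson_regret([0.7, 0.4], 100)` returns a non-decreasing list of 100 regret values.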




[Figure: example bandit episodes; arm reward probabilities 0.7, 0.3, 0.6, 0.4, 0.3, 0.7, 0.8, 0.2]
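In this second set of arm probabilities (0.7, 0.3, 0.6, 0.4, 0.3, 0.7, 0.8, 0.2), consecutive pairs sum to 1, which is consistent with anticorrelated arms, p_right = 1 - p_left. Assuming that structure, a pull of either arm is informative about both, as the minimal Bayesian sketch below shows; the Beta(1, 1) prior, the string arm labels and the function name are illustrative assumptions.

```python
def posterior_mean_left(outcomes):
    """Posterior mean of p_left under a Beta(1, 1) prior, assuming p_right = 1 - p_left.

    `outcomes` is a list of (arm, reward) pairs with arm in {"left", "right"} and
    reward in {0, 1}. A left reward or a right non-reward is evidence for a high
    p_left; a left non-reward or a right reward is evidence against it.
    """
    a, b = 1, 1  # Beta(1, 1) prior pseudo-counts
    for arm, reward in outcomes:
        left_success = reward if arm == "left" else 1 - reward
        a += left_success
        b += 1 - left_success
    return a / (a + b)
```

With this structure, two rewarded right pulls alone lower the estimate of p_left from 0.5 to 0.25.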


[Figure: cumulative regret over trials 1 to 100, compared with Gittins indices; choices by trial and episode; training episodes]



Volkmann et al., Nature Reviews Neurology, 2010

[Figure: two panels plotting log2(C_R/C_L) against log2(R_R/R_L), both axes from -4 to 4]

Tsutsui et al., Nature Comms, 2016

Wang et al., Nature Neuroscience (2018)
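The axes log2(C_R/C_L) against log2(R_R/R_L) appear to be choice ratio against reward ratio, the usual coordinates for the generalized matching law, log2(C_R/C_L) = s · log2(R_R/R_L) + log2(b), where s = 1 and b = 1 is strict matching and s < 1 is undermatching. The sketch below fits s and b by ordinary least squares from per-block counts; reading C and R as choice and reward counts, the function name, and the requirement that all counts be positive are assumptions.

```python
import math


def fit_matching(choices_r, choices_l, rewards_r, rewards_l):
    """Least-squares fit of log2(C_R/C_L) = s * log2(R_R/R_L) + log2(b).

    Each argument is a list of per-block counts (all must be positive).
    Returns (s, b): the sensitivity s and the bias b.
    """
    xs = [math.log2(rr / rl) for rr, rl in zip(rewards_r, rewards_l)]
    ys = [math.log2(cr / cl) for cr, cl in zip(choices_r, choices_l)]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    s = sxy / sxx  # slope of the regression line
    log_b = my - s * mx  # intercept on the log2 scale
    return s, 2 ** log_b
```

Choices that exactly track reward ratios give s = 1 and b = 1.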



[Figure: two panels, “Proportion” and “Correlation”, each shown for the regressors a_{t-1}, r_{t-1}, a_{t-1} × r_{t-1} and v_t; y-axes 0.1 to 0.6]

Tsutsui et al., Nature Comms, 2016
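The regressors here are the previous action a_{t-1}, the previous reward r_{t-1}, their product a_{t-1} × r_{t-1}, and the current value estimate v_t. The sketch below shows one assumed way to build them for a two-choice task and correlate each with a single unit's activity; coding actions as +1/-1, using separate Pearson correlations rather than a joint regression, and the function names are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def regressor_correlations(actions, rewards, values, activity):
    """Correlate activity on trial t with a_{t-1}, r_{t-1}, a_{t-1} x r_{t-1} and v_t.

    Actions are coded +1 (right) / -1 (left); rewards are 0/1. The first trial
    has no predecessor, so it is dropped.
    """
    a_prev = actions[:-1]
    r_prev = rewards[:-1]
    regressors = {
        "a_prev": a_prev,
        "r_prev": r_prev,
        "a_prev_x_r_prev": [a * r for a, r in zip(a_prev, r_prev)],
        "v": values[1:],
    }
    y = activity[1:]
    return {name: pearson(x, y) for name, x in regressors.items()}
```

A unit whose activity simply repeats the previous reward correlates perfectly with `r_prev` under this sketch.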





[Figure: panels A and B plotted over steps 0 to 200 (y-axes 0 to 1): reward probability, inferred/decoded volatility, learning rate; action feedback]

Behrens et al., Nature Neuroscience, 2007; Wang et al., Nature Neuroscience (2018)
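The figure pairs inferred volatility with learning rate. Under a delta-rule reading, v_{t+1} = v_t + α_t (r_t - v_t), an effective per-trial learning rate can be recovered from successive value estimates as α_t = (v_{t+1} - v_t) / (r_t - v_t). The sketch below applies that identity and checks it against a fixed-rate learner; treating a network's value output this way, and both helper names, are assumptions rather than the analysis used for the figure.

```python
def effective_learning_rates(values, rewards, eps=1e-6):
    """Recover alpha_t = (v_{t+1} - v_t) / (r_t - v_t) for each trial t.

    Where the prediction error is near zero, alpha is undefined and None is returned.
    """
    rates = []
    for t in range(len(values) - 1):
        delta = rewards[t] - values[t]
        if abs(delta) < eps:
            rates.append(None)
        else:
            rates.append((values[t + 1] - values[t]) / delta)
    return rates


def delta_rule(rewards, alpha, v0=0.5):
    """Value estimates from a fixed-learning-rate delta rule (for checking recovery)."""
    values = [v0]
    for r in rewards:
        values.append(values[-1] + alpha * (r - values[-1]))
    return values
```

On a fixed-rate learner every recovered α_t equals the true α; an agent that adapts to volatility would instead show α_t rising when the environment changes quickly.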


Volkmann et al., Nature Reviews Neurology, 2010

Bromberg-Martin et al., J Neurophysiol, 2010

[Figure: REVERSAL; labels “Left rewarded” and “Right rewarded”]


Miller, Botvinick & Brody, Nat. Neurosci., 2017; Daw et al., Neuron, 2011

[Figure: two-step task; Meta-RL RPE compared with model-based RPE (axes -1 to 1) at “Stage 2” and at “Reward”; r² = 0.89]

Model-based RL (from model-free RL)
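In the two-step task (Daw et al., 2011), each first-stage action leads to one second-stage state through a common transition and to the other through a rare one. A model-based learner values a first-stage action by averaging second-stage values over that transition structure, and its RPEs on reaching the second stage and at reward follow from those values. The sketch below assumes 0.7/0.3 transition probabilities, one value per second-stage state (folding the second-stage choice into that value), and hypothetical names; it illustrates the model-based RPE, not the exact regression behind the r² = 0.89 comparison.

```python
# Assumed transition structure: action 0 commonly (p = 0.7) leads to state "A",
# action 1 commonly leads to state "B"; rare transitions have p = 0.3.
TRANSITIONS = {
    0: {"A": 0.7, "B": 0.3},
    1: {"A": 0.3, "B": 0.7},
}


def model_based_q(stage2_values):
    """Q_MB(a) = sum over s' of P(s' | a) * V(s') for each first-stage action a."""
    return {a: sum(p * stage2_values[s] for s, p in probs.items())
            for a, probs in TRANSITIONS.items()}


def model_based_rpes(action, state, reward, stage2_values):
    """Return (RPE on reaching the second-stage state, RPE at reward)."""
    q = model_based_q(stage2_values)
    rpe_stage2 = stage2_values[state] - q[action]
    rpe_reward = reward - stage2_values[state]
    return rpe_stage2, rpe_reward
```

With V(A) = 0.8 and V(B) = 0.2, choosing action 0 and landing in B via the rare transition yields a negative stage-2 RPE, even though no reward has yet been delivered.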


DA blocked upon food reward from large/risky option

DA blocked upon food reward from small/certain option

DA triggered upon food omission from large/risky option

Wang et al., arXiv, 2018; Stopper et al., Neuron, 2014

Optogenetic manipulation of dopamine

• Richer environments / abstractions (Espeholt et al., arXiv, 2018)

• Architectural biases (e.g., Raposo et al., NIPS, 2017)

• Complementary forms of meta-learning (e.g., Fernando et al., under review)

• Episodic reinstatement (Ritter et al., in press)

Current / Future Work

Neuroscience and AI: A virtuous circle

Collaborators

Jane Wang, Zeb Kurth-Nelson, Dharshan Kumaran, Chris Summerfield, Hubert Soyer, Joel Leibo, Sam Ritter

Adam Santoro, Tim Lillicrap, David Barrett, Dhruva Tirumala, Remi Munos, Charles Blundell, Demis Hassabis

DeepMind, London, UK
Gatsby Computational Neuroscience Unit, UCL
