Neural Activity of the Basal Ganglia and its Correlation with Behavior
Kazuyuki Samejima
ATR Human Information Science Laboratories &
"Creating the Brain" CREST, JST
Collaboration: Department of Physiology II, Kyoto Prefectural University of Medicine (Yasumasa Ueda, Research Associate; Prof. Minoru Kimura)
Kenji Doya, ATR Human Information Science Laboratories
Basal Ganglia
• Involved in the execution of movement; disorders (Parkinson's disease, Huntington's disease)
• Learning of movement sequences (Hikosaka et al., 1999)
• Reward prediction (Shidara et al., 1998; Kawagoe et al., 1998)
• Reward prediction error (Schultz et al., 1997)
Overview
1. Data analysis of recorded data – striatum activity in remembered sequential movement
2. Task design based on the reinforcement learning paradigm – does the striatum represent the "value function" used in decision making?
3. Model of dopamine neuron activity – how does the basal ganglia compute the temporal difference of reward expectation?
Data analysis of Striatum neurons
When this collaboration began, the basal ganglia were thought to be involved in learning movement sequences, and possibly in short-term memory of motor sequences.
– Hypothesis: visually guided -> PM - M1
  memory guided -> SMA - BG
Instructed – remembered motor sequence task
(Ueda and Kimura, 1999)
Information analysis
• Former method of analyzing neural activity – which types of neurons, and how many, were found?
• The neuron categories set by the experimenter are arbitrary.
• If a neuron is task-related but does not fit the categories, it is left as "unclassified" and its information is lost.
• Information analysis – total information across the whole set of recorded neurons – time course of information
(Sugase et al., 1999; Kitazawa et al., 1998)
Information analysis: definition
Information of movement sequence:
$$I(S;R) = H(S) - H(S|R)$$
$$H(S) = -\sum_s p(s)\log p(s), \qquad H(S|R) = -\sum_r p(r) \sum_s p(s|r)\log p(s|r)$$
Two types of sequence category are computed:
1. First movement direction (Left or Right)
2. Second movement type (Stick or Button)
Information analysis: example
Total distribution → H(S)
Conditional distribution → H(S|R)
I(S;R) = H(S) - H(S|R)
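For concreteness, a minimal Python sketch of this estimate from discrete trial labels and binned spike counts; `entropy`, `mutual_information`, and the input arrays are hypothetical names, not the original analysis code:

```python
# I(S;R) = H(S) - H(S|R) from trial labels S and discretized responses R.
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability bins."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(seq_labels, spike_counts):
    """seq_labels, spike_counts: 1-D integer arrays, one entry per trial."""
    _, s_counts = np.unique(seq_labels, return_counts=True)
    h_s = entropy(s_counts / s_counts.sum())         # total distribution: H(S)
    h_s_given_r = 0.0
    r_vals, r_counts = np.unique(spike_counts, return_counts=True)
    for r, n_r in zip(r_vals, r_counts):             # conditional: H(S|R)
        _, c = np.unique(seq_labels[spike_counts == r], return_counts=True)
        h_s_given_r += (n_r / len(spike_counts)) * entropy(c / c.sum())
    return h_s - h_s_given_r
```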
Result: first movement direction
[Figure: time course of L/R information in the INSTRUCTED task and the REMEMBERED task, aligned to the Start, Center, and Hold events; x-axis: time (ms); 300 ms scale bar; values 0.011 and 0.041 shown on the panels]
Instructed task:
• From about 200 ms after the first stimulus, the direction (L/R) information rose rapidly.
• After the first movement, a high level of L/R information was maintained.
Remembered task:
• The L/R information rose slowly, starting about 800 ms before the first stimulus.
• After the first movement, a high level of L/R information was maintained.
Result: Second movement type
[Figure: time course of S/B information in the instructed and remembered tasks, aligned to Hold Center, 1st Stim, 1st Move, 2nd Stim, 2nd Move, and Reward; x-axis: time (ms)]
Instructed task
Remembered task:
• The S/B information was suppressed during the holding period.
• After the first movement, the S/B information rose.
• From about 200 ms after the 2nd stimulus, the S/B information rose rapidly and was maintained after reward delivery.
Result: number of significant cells
1. The number of cells with significant L/R information increased before the first stimulus (B1) in the remembered task (orange bar, upper left).
2. The number of cells with significant S/B information decreased before the first stimulus and increased before the second stimulus in the remembered task (upper right).
Conclusion
• The activity of striatal neurons carries information about the direction and type of movement, not only during execution of the movement but also during the preparatory delay period of the REMEMBERED task. The activity carries information about the next movement, but little information about the movement two steps ahead.
What this analysis showed:
• The striatum shows predictive responses.
• However, the prediction is mostly about the immediately following movement element.
• Whether this is prediction of reward, or prediction of the stimulus or action, cannot be determined without varying the reward conditions.
→ This motivates changing the task conditions.
Overview
1. Data analysis of recorded data – striatum activity in remembered sequential movement
2. Task design based on the reinforcement learning paradigm – does the striatum represent the "value function" (reward expectation) used in decision making?
3. Model of the basal ganglia as a reinforcement learner – how does the basal ganglia compute the temporal difference of reward expectation?
Ventral striatum activity in successive stages approaching reward
(Shidara et al., 1998)
The approach-to-reward stage is essential for the ventral striatal activity
Reward expectation
(Kawagoe et al., 1998)
Reinforcement learning
• A model of how animal behavior is shaped toward obtaining reward.
[Diagram: the agent/animal interacts with the environment through state, action, and reward]
The agent maximizes the reward it obtains by acting on the environment.
Value function
• Expected future reward
• Learning the value function and the policy
– The value function indicates the goodness of a state or action.
– The policy selects actions by comparing the values of candidate actions or predicted states.
• Temporal difference error
$$V(x(t)) = E\left[\, r(t) + \gamma r(t+1) + \gamma^2 r(t+2) + \cdots \,\right]$$
$$\delta(t) = r(t) + \gamma V(t+1) - V(t)$$
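A small numerical illustration of the definitions above; the value and reward vectors are made up for illustration:

```python
# Illustration of delta(t) = r(t) + gamma * V(t+1) - V(t) with toy numbers.
import numpy as np

gamma = 0.9
V = np.array([0.0, 0.5, 1.0, 0.0])   # value estimates V(t) at four time steps
r = np.array([0.0, 0.0, 1.0])        # reward delivered at t = 2

delta = r + gamma * V[1:] - V[:-1]   # TD error at t = 0, 1, 2
print(delta)                         # [0.45, 0.4, 0.0]
```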
Substantia Nigra and Temporal Difference error
• This suggests that TD learning is implemented in the BG network.
(Schultz et al., 1997)
[Figure: dopamine neuron activity compared with the TD error]
Computational model of striatum
– Action value function Q = matrix compartment; state value function V = striosome compartment
– GP performs stochastic action selection.
– SNc dopamine neurons carry the evaluation error signal (temporal difference error).
(Doya, 2000)
Designing a new task inspired by the reinforcement learning model
• Does the striatum represent an "action value function"?
• If the neural activity represents a value function:
– it correlates with reward expectation;
– it does not merely represent a simple action;
– it does not merely represent a simple stimulus response.
• To show that the activity represents a value function, we need conditions with the same stimulus and the same action but different reward expectations.
• Action selection based on reward expectation.
Related work
– Reward prediction (Kawagoe et al., 1998) • ADR/1DR task – same stimulus and same action but different reward expectation. Stimulus and action are coupled (delayed saccade to the location indicated by the cue signal).
-> Dissociate stimulus and action.
– Progress through multiple stages toward reward (Shidara et al., 1998) • Only one action (a single lever release).
-> Multiple actions and decision making.
Stochastic reward and target task
• Decision making
• The monkey turns the lever left or right.
• Stochastic feedback (an LED indicates whether the target was reached, and reward).
• Reward is delivered with probability P(x), where x is the target position.
State transition diagram of task
States: N = goal position is hidden; R = right is the goal position; L = left is the goal position.
Actions: l = select a left turn; r = select a right turn.
Reward probability: P(R) when the right target is reached; P(L) when the left target is reached.
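Below is a minimal sketch of one possible reading of this state-transition structure, consistent with the later prediction Q(N,l) ≈ (1/2)P(L); the goal-assignment rule and function names are assumptions, not the original task code:

```python
# One lever turn under this reading: picking the goal side yields reward with
# probability P(goal) and hides the goal again; a miss reveals the goal side.
import random

def trial_step(state, action, goal, p_reward):
    """state: 'N' (goal hidden) or 'R'/'L' (goal known); action: 'l' or 'r';
    goal: current goal side; p_reward: dict of reward probabilities per side."""
    chose_goal = (goal == 'L' and action == 'l') or (goal == 'R' and action == 'r')
    if chose_goal:                        # LED feedback: target reached
        reward = 1.0 if random.random() < p_reward[goal] else 0.0
        return reward, 'N'                # next trial starts with the goal hidden
    return 0.0, goal                      # LED feedback reveals the goal side
```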
Does the monkey act on the basis of the hidden target?
Yes
Does the monkey change its action selection when the reward probabilities change?
Yes
Model: 3-state RL model
• Reinforcement learning (value update):
$$Q_{t+1}(S(t),a(t)) = Q_t(S(t),a(t)) + \alpha \left\{\, r(S(t),a(t)) - Q_t(S(t),a(t)) \,\right\}$$
• Action selection (softmax):
$$P(a \mid S) = \frac{e^{\beta Q(S,a)}}{\sum_{a'} e^{\beta Q(S,a')}}$$
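A minimal tabular sketch of these two formulas; the values of α and β and the data structures are illustrative assumptions:

```python
# Tabular Q update and softmax choice for states N, R, L and actions l, r.
import math
import random

alpha, beta = 0.1, 3.0
Q = {(s, a): 0.0 for s in 'NRL' for a in 'lr'}

def softmax_action(state):
    """P(a|S) = exp(beta Q(S,a)) / sum_a' exp(beta Q(S,a'))."""
    weights = [math.exp(beta * Q[(state, a)]) for a in 'lr']
    return random.choices('lr', weights=weights)[0]

def update(state, action, reward):
    """Q_{t+1}(S,a) = Q_t(S,a) + alpha {r - Q_t(S,a)}."""
    Q[(state, action)] += alpha * (reward - Q[(state, action)])
```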
Prediction: short term
• N-state activity versus R- or L-state activity reflects reward expectation:
$$Q(N,l) \approx \tfrac{1}{2} P(L), \qquad Q(R,r) \approx P(R)$$
– i.e., compare the internal states in which the target position is unknown (N) and known (R or L).
Prediction: Long term change of reward prediction
• Prediction from a model with fixed α (learning rate) and β (inverse temperature)
Estimating parameters and hyper-parameters from behavioral data
The predictions above assumed a reinforcement learner with a fixed learning rate and a fixed stochastic-behavior parameter, but these might change over time.
• We therefore have to estimate the parameters and hyper-parameters of the learning system: a sequential Monte Carlo method is used to estimate the Q values, the learning rate, and the stochastic-behavior parameter.
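A schematic particle-filter sketch of this idea, reduced to two actions; the priors, drift scale, and likelihood form are assumptions for illustration, not the estimator actually used:

```python
# Particle filter over a latent Q-value difference and a drifting learning
# rate, weighted by the likelihood of each observed choice.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
alpha = rng.uniform(0.0, 1.0, N)   # particles for the learning rate
dq = np.zeros(N)                   # particles for Q(right) - Q(left)
beta = 3.0                         # softmax inverse temperature (fixed here)

def smc_step(chose_right, reward):
    """One behavioral trial: propagate, weight by the choice, resample."""
    global alpha, dq
    alpha = np.clip(alpha + 0.02 * rng.standard_normal(N), 0.0, 1.0)
    target = reward if chose_right else -reward
    dq = dq + alpha * (target - dq)              # each particle's RL update
    p_right = 1.0 / (1.0 + np.exp(-beta * dq))   # two-action softmax
    w = p_right if chose_right else 1.0 - p_right
    idx = rng.choice(N, N, p=w / w.sum())        # importance resampling
    alpha, dq = alpha[idx], dq[idx]
```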
Overview
1. Data analysis of recorded data – striatum activity in remembered sequential movement
2. Task design based on the reinforcement learning paradigm
– Does the striatum represent the "value function" used in decision making?
3. Model of dopamine neural activity – how does the basal ganglia compute the temporal difference of reward expectation?
Substantia Nigra and Temporal Difference error
• This suggests that TD learning is implemented in the BG network.
(Schultz et al., 1997)
[Figure: dopamine neuron activity compared with the TD error]
Classical Conditioning
• Reward expectation (value function) V(t)
• Reward r(t)
• Temporal difference error
$$\delta(t) = r(t) + \gamma V(t) - V(t-1)$$
[Figure: time courses of the conditioned stimulus, reward r(t), reward prediction V(t), and TD error δ(t), before learning, after learning, and with the reward omitted]
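A minimal tabular TD simulation reproducing this qualitative picture; the trial layout and learning rate are illustrative assumptions:

```python
# Tabular TD(0) over time steps within a conditioning trial: after learning,
# the TD error appears at CS onset instead of at reward time.
import numpy as np

T, cs_t, r_t = 20, 5, 15          # trial length, CS time, reward time (steps)
gamma, alpha = 1.0, 0.1
V = np.zeros(T + 1)               # V(t) for each time step; V[T] stays 0

for trial in range(200):
    r = np.zeros(T)
    r[r_t] = 1.0                  # set r[r_t] = 0.0 to simulate reward omission
    for t in range(T):
        delta = r[t] + gamma * V[t + 1] - V[t]
        if t >= cs_t:             # states before the (unpredictable) CS carry no value
            V[t] += alpha * delta

# After learning: delta at t = cs_t - 1 is ~1 (response to CS), delta at r_t is
# ~0; with the reward omitted, delta at r_t becomes ~-1 (the dopamine dip).
```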
Computational model of BG
• TD model (Barto, 1995; Suri, 2000)
Delay!!
(Barto, 1995)
What is the biological implementation of the delay required for the temporal difference of reward expectation?
Direct-Indirect delay model
• Houk et al., 1995
– Direct inhibitory pathway: -V(t-1)
– Indirect disinhibitory pathway: +V(t)
[Diagram: cortical inputs (C) drive spiny projection neurons (SPs) and the subthalamic pathway (ST); the dopamine (DA) neuron combines primary reinforcement r(t), -V(t-1), and V(t)]
(Houk, Adams, & Barto, 1995)
TD models
How does the BG compute the TD of reward expectation?
• Network model of BG
– The value function V(s) is learned through cortico-striatal synaptic plasticity
  • dopamine-dependent plasticity in cortico-striatal synapses
– The temporal difference is computed from the difference in receptor delay or dynamics between the direct and indirect pathways.
• Integrate-and-fire type neuron model
Basal ganglia network
[Diagram: basal ganglia network. Cortex projects to the Striatum (matrix and striosome compartments); the striatum inhibits GPe, GPi, and SNr; STN provides excitation; GPi/SNr inhibit the Thalamus; SNc dopamine neurons receive a reward input and project to the striatum]
Basal ganglia model
Idea:
1. The cortico-striatal projection computes the value V(s(t)).
2. Two kinds of synaptic-current dynamics compute the temporal difference of the value V: GABA_A-mediated fast disinhibition carries V(t), while GABA_B-mediated slow inhibition carries V(t-Δt).
3. Dopamine-mediated LTP and LTD act on the cortico-striatal synapses.
[Diagram: Cortex -> Striatum (matrix and striosome); the striosome projects to SNc via a fast GABA_A and a slow GABA_B pathway; reward r(t) input to SNc; SNc dopamine projects back to the striatum; SNr also shown]
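A sketch of idea 2: two first-order synaptic filters with the fast and slow time constants from the slides, whose difference transiently approximates the temporal difference of V (the input signal is made up for illustration):

```python
# Fast (GABA_A-like) and slow (GABA_B-like) first-order filters of the same
# value signal V; their difference is a transient ~ temporal difference of V.
import numpy as np

dt = 1.0                                  # ms
tau_fast, tau_slow = 10.0, 100.0          # time constants from the slides
t = np.arange(0.0, 500.0, dt)
V = (t > 100.0).astype(float)             # toy value signal: step at t = 100 ms

fast = np.zeros_like(V)
slow = np.zeros_like(V)
for i in range(1, len(t)):
    fast[i] = fast[i-1] + dt / tau_fast * (V[i] - fast[i-1])
    slow[i] = slow[i-1] + dt / tau_slow * (V[i] - slow[i-1])

td_like = fast - slow                     # positive transient after the step in V
```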
Integrate-and-fire (IF) neuron model
• A single-neuron model
– Synaptic current: exponential function
– Spontaneous activity: random current injection
– Threshold dynamics and reset
Membrane potential:
$$C_m \frac{dV}{dt} = -(V(t) - V_{rest}) + I(t) + N(t)$$
where N(t) is a random injected current producing spontaneous activity.
Synaptic current (exponential):
$$I_k(t) = I_{max}\, e^{-(t - t_k)/\tau} \ \ (t > t_k), \qquad I_C(t) = \sum_k I_k(t)$$
with τ ≈ 10 ms for GABA_A and τ ≈ 100 ms for GABA_B input pulses.
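A minimal leaky integrate-and-fire sketch following these equations; all parameter values are assumptions:

```python
# Leaky IF neuron with exponentially decaying synaptic currents
# (GABA_A-like tau = 10 ms, GABA_B-like tau = 100 ms) and noise injection.
import numpy as np

dt, T = 0.1, 500.0                     # time step and duration (ms)
C_m, V_rest, V_th, V_reset = 1.0, -70.0, -54.0, -70.0
tau_fast, tau_slow = 10.0, 100.0       # synaptic time constants (ms)
pulse_step = int(100.0 / dt)           # one input pulse arriving at t = 100 ms

steps = int(T / dt)
V = np.full(steps, V_rest)
I_fast = I_slow = 0.0

for i in range(1, steps):
    if i == pulse_step:                # I_k jumps by I_max at pulse arrival...
        I_fast += 20.0
        I_slow += 20.0
    I_fast *= np.exp(-dt / tau_fast)   # ...then decays as exp(-(t - t_k)/tau)
    I_slow *= np.exp(-dt / tau_slow)
    noise = 2.0 * np.random.randn()    # random current for spontaneous activity
    dV = (-(V[i-1] - V_rest) + I_fast + I_slow + noise) / C_m
    V[i] = V[i-1] + dt * dV
    if V[i] >= V_th:                   # threshold crossing: spike and reset
        V[i] = V_reset
```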
Dopamine-dependent plasticity in the cortico-striatal connection
• SNc dopamine regulates synaptic plasticity in the rat neostriatum (Reynolds & Wickens, 2001)
• Dopamine-sensitive plasticity:
$$\Delta w_{ij}(t) = \hat{\delta}(t)\, x_i(t - T_{delay})\, y_j(t)$$
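A sketch of this three-factor update as reconstructed above; the learning rate η, array shapes, and helper names are assumptions:

```python
# Three-factor cortico-striatal update: dopamine (delta_hat) gates the product
# of the delayed presynaptic trace x and the postsynaptic activity y.
import numpy as np

eta, T_delay = 0.01, 10                  # learning rate and delay (time steps)

def weight_update(w, x_hist, y, delta_hat, t):
    """w[i, j] += eta * delta_hat(t) * x_i(t - T_delay) * y_j(t)."""
    x_delayed = x_hist[t - T_delay]      # presynaptic activity T_delay steps ago
    return w + eta * delta_hat * np.outer(x_delayed, y)
```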
Simulation
• Leaky integrate-and-fire neurons with glutamate, GABA_A, and GABA_B receptors
• Dopamine-mediated LTP and LTD
• Orthogonal traveling activity as the frontal-cortex input
[Diagram: simulation setup. 8 cortical input neurons provide CS-triggered inputs (100 ms pulses at 50 Hz; background 10 Hz) to the matrix and striosome; fast synapse τ = 10 ms, slow synapse τ = 100 ms; reward input R; the striosome projects to SNc, and SNc dopamine projects back to the striatum; SNr also shown]
Result: Shifting phasic activity
[Figure: model vs. experiment (Schultz et al., 1997): dopamine responses to CS and R before conditioning (reward only) and after conditioning (CS only), showing the phasic response shifting from the reward to the CS]
Result: Omitted reward
[Figure: model vs. experiment (Schultz et al., 1997): after conditioning, a response to the CS followed by omission of the reward (no R)]
Result: Weight change of cortico-striatal synapses
Prediction
• T_delay > 100 ms
• Activity shifts through learning
[Figure: response histograms over trial blocks 1-20, 21-40, 41-60, 61-80, and 81-100, aligned to CS and R]