The Role of Dopamine in
Information-Seeking
A thesis by Ethan S. Bromberg-Martin
Thesis Committee: Okihide Hikosaka, National Eye Institute, Advisor
Barry J. Richmond, National Institute of Mental Health, Chair
David L. Sheinberg, Brown University
Daeyeol Lee, Yale University, Outside Reader
Defended June 2, 2009
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Program in
the Department of Neuroscience at Brown University
May, 2010
Copyright 2009 by Ethan S. Bromberg-Martin. All rights reserved.
This dissertation by Ethan S. Bromberg-Martin is accepted in its present form by the Division of Biology and Medicine as satisfying the dissertation requirement of the degree of Doctor of Philosophy.

Date___________ ___________________________________
Okihide Hikosaka, M.D., Ph.D., Advisor
Laboratory of Sensorimotor Research
National Eye Institute
National Institutes of Health

Recommended to the Graduate Council

Date___________ ___________________________________
David L. Sheinberg, Ph.D., Reader
Department of Neuroscience
Brown University

Date___________ ___________________________________
Barry J. Richmond, M.D., Reader
Laboratory of Neuropsychology
National Institute of Mental Health
National Institutes of Health

Date___________ ___________________________________
Daeyeol Lee, Ph.D., Reader
Department of Neurobiology
Yale University

Approved by the Graduate Council

Date___________ ___________________________________
Sheila Bonde, Ph.D.
Dean of the Graduate School
Brown University
CURRICULUM VITAE Ethan S. Bromberg-Martin
Born April 27, 1983
Contact Information
Laboratory of Sensorimotor Research
49 Convent Drive, Room 2A50
Bethesda, MD 20892
Email: [email protected]
Education
Summer 2005 to Current
Candidate for Ph.D. in Neuroscience at Brown University,
as part of the Graduate Partnership Program with the
National Institutes of Health
Graduate Thesis: The Role of Dopamine in Information-Seeking
(Mentor: Dr. Okihide Hikosaka, National Eye Institute)
Fall 2001 to Spring 2005
Bachelor of Science in Computational Biology at Brown University,
magna cum laude
Honors Thesis: Partial-Order Alignment of RNA Structures
(Mentor: Dr. Franco Preparata, Computer Science Department)
Other Employment
Summer 2002, 2003, 2004
Internship at the Laboratory of Experimental and Computational Biology,
National Cancer Institute, at NCI-Frederick in Ft. Detrick, Maryland
Summer 2001
Internship at the Laboratory of Biochemical Genetics,
National Heart, Lung, and Blood Institute, at the NIH in Bethesda, Maryland.
Professional Memberships
Society for Neuroscience
Peer-Reviewed Publications
Bromberg-Martin ES, Hikosaka O. (2009) Midbrain dopamine neurons signal
preference for advance information about upcoming rewards. (Neuron, in press)
Hikosaka O, Bromberg-Martin E, Hong S, Matsumoto M. (2008) New insights on
the subcortical representation of reward. Curr. Opin. Neurobiol. 18, 203-208.
Conference Presentations
Bromberg-Martin E, Matsumoto M, Nakamura K, Nakahara H, Hikosaka O. (2009)
Multiple timescales of reward memory in lateral habenula and midbrain
dopamine neurons. Frontiers in Systems Neuroscience. Conference Abstract:
Computational and systems neuroscience.
Bromberg-Martin ES, Hikosaka O. Midbrain dopamine neurons signal preference for
advance information about upcoming rewards. (2008) Society for Neuroscience
Abstract, 34: 691.23.
Bromberg-Martin ES, Matsumoto M, Hikosaka O. Lateral habenula neurons respond
to the negative value of rewards and the positive value of elapsed time. (2007)
Society for Neuroscience Abstract, 33: 530.9.
Bromberg-Martin E, Mar Jonsson A, Marai GE, McGuire M. Hybrid Billboard
Clouds for Model Simplification. (2004) SIGGRAPH Poster Abstract.
PREFACE
Aristotle began his great work, Metaphysics, with the observation that "All men by
nature desire to know." In this thesis my aim is to take a first step toward understanding
how the brain motivates itself to seek information about the surrounding world. Chapter
1 introduces current knowledge about the behavioral phenomenon of information-seeking
and about the neural basis of reward seeking, in particular the reward-signaling midbrain
dopamine neurons that are the focus of this thesis. In chapter 2, I present the results of
experiments that show that macaque monkeys seek advance information about the size of
upcoming water rewards. Further, the same midbrain dopamine neurons that signal the
expected amount of water also signal the expectation of advance information, in a manner
that is correlated with behavioral preferences. In chapter 3, I present an analysis of an
existing dataset showing that in a task in which reward information is delivered at a
predictable time, dopamine neurons can use two distinct prediction strategies: one based
on a one-trial memory of past rewards, expressed at the start of the trial, and another
based on a long-timescale memory, expressed at the time when new reward information
is revealed. The same pattern is present in a major source of input to dopamine neurons,
reward-predicting cells of the lateral habenula. Chapter 4 summarizes the main results of
this thesis and suggests new avenues for future research.
ACKNOWLEDGEMENTS
This thesis was completed in the Laboratory of Sensorimotor Research in the
National Eye Institute (NEI) at the National Institutes of Health (NIH) and at Brown
University between June 2005 and May 2009. It was made possible by the generous
support of the NIH intramural research program, and by the Graduate Partnerships
Program (GPP) between the NIH and the Brown University Department of Neuroscience.
I thank my advisor, Okihide Hikosaka, for his continual support and excellent
mentorship, especially his keen judgment of experimental design and advice about
building a program of research. He is as good an example of a scientist and a gentleman
as a budding neuroscientist could hope to have as a mentor. Most of all, he makes the
process of scientific discovery enjoyable. Once a visitor asked me why I was in the lab so
late, and suggested that I should spend less time working. I replied, "It's not working, it's
FUNNING!" If it wasn't for Okihide, I wouldn't have been able to say that!
I thank my thesis committee members Barry J. Richmond and David L. Sheinberg
for always being available to discuss matters related to my thesis and for their insightful
input, and my outside reader Daeyeol Lee for valuable advice and a keen eye for detail. I
thank my colleagues at the NIH, especially Masayuki Matsumoto for his collaboration
and experimental and research advice, and Kae Nakamura, Long Ding, Masaki Isoda,
Simon Hong, Masaharu Yasuda, Shinya Yamamoto, and Yoshihisa Tachibana; fellow
fellows including Ilya Monosov, Alexander Lerchner, Giancarlo la Camera, Christian
Quaia, Seiji Tanabe, Ralf Haefner, Hendrikje Neinborg, Rebecca Berman, Jason Trageser,
Wilsaan Joiner, Kerry McAlonan, and James Cavanaugh; intrepid investigators including
Bruce Cumming, Kirk Thompson, and Lance Optican; and excellent experts including
Mitchell Smith, John McClurkin, Art Hays, Altah Nichols, Tom Ruffner, Leonard Jensen,
Ginger Tansey, Denise Parker, Mark Lawson, and Bethany Nagy. I also thank Veit
Stuphorn for advice about behavioral experiments, and I thank my voluminous
correspondent and insightful collaborator, Hiroyuki Nakahara.
I thank my colleagues in the Brown-NIH GPP program, especially Avinash Pujala
and Ilya Monosov for their scholarly and other communications, and Emmette Huchison,
Jodi Gilman, Lindsay Hayes, Cole Graydon, Caroline Graybeal and Shau-Ming Wei for
their support and advice. And my colleagues at Brown, especially Arjun Bansal, Luke
Woloszyn, Ryan Mruczek, Jedediah Singer, and Carlos Vargas-Irwin. I thank those who
supported me in the Brown-NIH Graduate Partnership Program, including Diane
Lipscombe, Jerome Sanes, John Isaac, Katherine Roche, Caroline Duffy, Mary DeLong,
and Sharon Milgram. I also thank my professors and teachers at Brown who showed me
how exciting cognitive brain science can be, including Michael Black, Frank Wood,
Lucien Elie Bienenstock, Mayank Mehta, Stuart Geman, John Stein, Michael Paradiso,
David Sheinberg, Michael Tarr, Fulvio Domini, and Steven Sloman.
Finally, I thank my parents, Rob and Laura; my brother, Brett; and my parents of
parents, Richard, Helen, Henry, and Arlene. For your love, support, advice,
understanding, and general awesomeness, too immense and profound to be contained
within words on a printed page.
TABLE OF CONTENTS
Curriculum Vitae iv
Preface vii
Acknowledgements viii
Table of Contents x
List of Figures xii
Chapter 1: Introduction .......................................................................................................... 1
1.1 Information-seeking in humans and animals 2
1.2 Reward-predictive signals of midbrain dopamine neurons 6
1.3 Reward-predictive signals of lateral habenula neurons 12
1.4 Summary 15
Chapter 2: Midbrain dopamine neurons signal preference for advance information about upcoming rewards .............................................................................................................................. 16
2.1 Introduction 17
2.2 Results 19
Behavioral preference for advance information 19
Dopamine neurons signal advance information 23
2.3 Discussion 29
Implications of information-seeking for attitudes toward uncertainty 31
Information signals in the dopaminergic reward system 32
Why do dopamine neurons treat information as a reward? 33
2.4 Methods 35
2.5 Acknowledgements 40
Note 2.A: Aversion to information by reinforcement learning algorithms based on TD(λ) 41
Chapter 3: Multiple timescales of memory in a network for reward prediction .......................... 52
3.1 Introduction 53
3.2 Results 55
Behavioral task and optimal timescale of memory 55
Behavioral and neural memory for a single previous reward outcome 58
Multiple timescales of memory 62
Multiple timescales of memory in tonic neural activity 67
Timecourse of the timescale of memory 71
3.3 Discussion 72
3.4 Methods 77
3.5 Acknowledgements 81
Note 3.A: Optimal reward prediction in the pseudorandom reward schedule 82
Note 3.B: Formal model of reward expectation during the task 85
Note 3.C: Single neurons accessed both short- and long-timescale memories 90
Note 3.D: Reward memory in lateral habenula response to predictable reward omission 95
Note 3.E: Rough independence of lateral habenula tonic and phasic memory effects 97
Note 3.F: V-Shaped timecourse of the timescale of reward memory 98
Note 3.G: No evidence for memory-change-induced prediction-errors 99
Chapter 4: Summary and Conclusions ............................................................................................. 104
4.1 Chapter 2: Summary and future directions 104
4.2 Chapter 3: Summary and future directions 115
4.3 Summary of contributions 123
Bibliography 125
LIST OF FIGURES
Figure 1.1: A dopamine neuron following the prediction-error theory 9
Figure 1.2: Lateral habenula neurons carry negative reward signals 14
Figure 2.1: Behavioral preference for advance information 20
Figure 2.2: Control for water volume of the informative and random options 21
Figure 2.3: Behavioral preference for immediate delivery of information 23
Figure 2.4: Dopamine neurons signal information 25
Figure 2.5: Analysis of the dopamine neuron population 26
Figure 2.6: Correlation between neural discrimination and behavioral preference 28
Figure 2.7: Aversion to information by a model using SARSA(λ) and softmax action selection 50
Figure 3.1: Reward-biased saccade task 56
Figure 3.2: Behavioral performance and memory for a single past reward outcome 58
Figure 3.3: Neural responses to task events and memory for a single previous outcome 60
Figure 3.4: Multiple timescales of memory 64
Figure 3.5: Quantifying the timescale of memory 66
Figure 3.6: Timescale of memory in tonic neural activity 69
Figure 3.7: Timecourse of the timescale of memory 72
Figure 3.8: Optimal reward prediction in the pseudorandom reward schedule 84
Figure 3.9: Formal model of reward expectation during the task 88
Figure 3.10: Single neurons accessed both short- and long-timescale memories 93
Figure 3.11: Memory effect in lateral habenula response to predictable reward omission 96
Figure 3.12: Rough independence of lateral habenula tonic and phasic memory effects 97
Figure 3.13: V-Shaped timecourse of the timescale of reward memory 98
Figure 3.14: No evidence for memory-change-induced prediction-errors during the fixation period 102
Figure 4.1: Hypothetical dopamine neuron encoding of information prediction-errors 107
Figure 4.2: Necessity for advance information in solving the temporal credit-assignment problem 111
Figure 4.3: Hypothetical measurement of the neural and behavioral change between short- and long-timescale memories 121
CHAPTER 1
Introduction
In recent years neuroscientists have learned much about how the brain represents
objects of desire, also known as rewards. We now know that rewards are represented by
neurons in practically every cortical and subcortical area, from motivational brain
systems where rewards are anticipated and predicted, to motor systems where rewards
influence action selection and movement control, to sensory systems where rewards
enhance the neural representation of desirable objects as they are seen, heard, touched,
smelled, and tasted. Unfortunately, a major limitation of this research is that it is very
rarely possible to record the activity of single neurons in behaving human beings, and
instead recordings are largely restricted to experimental animals. So, with a few notable
exceptions (e.g., Williams et al., 2004; Klein et al., 2008; Zaghloul et al., 2009), our
knowledge about the neuronal basis of rewarding experience is largely limited to basic
rewards like food and water (Schultz, 2000). The neural pathways that regulate eating
and drinking are highly specialized and evolutionarily ancient, a testament to their
importance (Lenard and Berthoud, 2008; Denton et al., 1996); on the other hand, this may
mean that these neural pathways are not flexible enough to represent the more abstract,
cognitive rewards that are at the heart of our mental lives. In this thesis my goal is to take
a first step toward understanding the neuronal basis of cognitive rewards by studying a
form of cognitive reward-seeking that is shared between humans and animals: the desire
for advance information about future events.
1.1 Information-seeking in humans and animals
In 350 B.C., Aristotle began his great work, Metaphysics, with the assertion that "All
men by nature desire to know." This desire arises not simply because knowledge can be put
to use in achieving one's goals, but because knowledge is intrinsically desirable in its
own right: "For not only with a view to action, but even when we are not going to do
anything, we prefer seeing (one might say) to everything else. The reason is that this,
most of all the senses, makes us know and brings to light many differences between
things" (Aristotle and Ross, 2006). This philosophical assertion appeals to the intuition:
surely it is better to seek knowledge than to seek ignorance! Yet it was more than two
millennia before this concept of information-seeking became the subject of scientific
study.
This topic was first introduced to experimental psychology by L. Benjamin Wyckoff
during his studies of discrimination learning in pigeons (Wyckoff, 1952). He coined the
term "observing response" to mean any action that brought an animal into contact with a
discriminative stimulus (a stimulus used to choose between a set of possible responses).
For example, a pigeon in a Skinner box might turn its head toward a colored light
(observing response), which would then tell the pigeon which button to peck to get a food
reward (discriminative response). In this case the information is obviously beneficial
because it helps the animal gain a greater amount of food. However, several researchers
found that rats and pigeons persistently observe reward-indicating cues even when their
observations have no effect on the reward rate (Prokasy, 1956; Hendry, 1969; Daly,
1992; Roper and Zentall, 1999). In a prototypical "observing behavior" experiment, a rat
is given a choice between two chambers, after which the rat is confined in the chamber
for 30 seconds and then given a food reward with 50% probability. In one chamber, the
color of the walls is informative, telling the rat whether the trial will be rewarded; in the
other chamber the color of the walls is uninformative, entirely randomized. Under these
conditions, rats express a strong preference for the informative chamber (Prokasy, 1956;
Daly, 1992). The preference for informative cues is strongest when the possibility of
receiving a reward is most attractive: for example, when the rats are hungry, when
rewards are scarce, and when the potential reward size is large (Wehling and Prokasy,
1962; McMichael et al., 1967; Mitchell et al., 1965). Rats also prefer cues that are
informative about upcoming punishments (aversive events), showing the strongest
preference when the intensity of the possible punishment is largest (Badia et al., 1979).
Thus, animals do not seek all kinds of information equally, but specifically seek
information about future events that have a large positive or negative value.
An independent line of inquiry1 into information-seeking began in the field of
economics with a theoretical paper by Kreps and Porteus, introducing a concept they
called temporal resolution of uncertainty (Kreps and Porteus, 1978). According to their
theory, decision-makers may express a preference between two options that have an
identical set of possible outcomes but differ in the time when the outcome becomes
1 I cannot find any paper from the literature on observing behavior that references any paper from the literature on temporal resolution of uncertainty, or vice versa. Apparently this thesis is the first publication to link these two lines of inquiry.
known; that is, the time when the decision-maker's uncertainty about the outcome is
resolved. According to this theory, decision-makers can prefer early resolution of
uncertainty (behaving like the rats and pigeons in the above experiments), or late
resolution of uncertainty (ignoring information for as long as possible). Later experiments
showed that humans generally express a strong preference for early resolution of
uncertainty, seeking advance information about monetary gains and losses even when
they cannot use this knowledge to change the outcome (Chew and Ho, 1994; Ahlbrecht
and Weber, 1996; Lovallo and Kahneman, 2000; Eliaz and Schotter, 2007; Luhmann et
al., 2008).
Thus, both humans and animals seek advance information about future rewards, as if
information was itself rewarding. Yet advance information is fundamentally different
from conventional forms of reward. Most rewards are spoken of as discrete physical
entities that can be delivered and consumed at a specific moment in time. A reward is
an object, like an apple that can be eaten, or an attractive picture that can be consumed
by visual inspection. But consider the behavioral experiment described above, in which
rats chose between informative and uninformative chambers that led to probabilistic
rewards. Both options presented exactly the same two physical stimuli: a chamber that
could be white or black, and a reward that could be present or absent. Thus, the animals'
preference cannot be explained in terms of a rewarding stimulus that was present for
one option and absent for the other. The only difference between the options was that for
one chamber the color indicated the reward size, whereas for the other chamber it did not.
Crucially, this informative relationship could not be detected on any single trial, but had
to be inferred through repeated experience. Thus, the differential reward value between
the two options did not originate from any single external event, but from an inference or
realization, an internal event or sequence of events in the brain. In this sense the reward
value of advance information has a highly abstract, cognitive origin.
How could such an abstract form of reward be generated by the brain? Is it created
by the same neural networks that assign value to basic rewards like food and water? Or is
it handled by an entirely separate system? Much is known about how the brain controls
the physical act of consuming food and water; but what about the analogous mental act
of consuming information?
In this thesis I will take a first step toward answering these questions by recording
the activity of dopamine-releasing neurons in the midbrain. Dopamine neurons are an
ideal candidate for this purpose because they are one of the brains major sources of food
and water reward signals and are theorized to have a central role in reward learning. A
crucial test of this theory will be whether dopamine neurons carry informational reward
signals as well (Chapter 2). Furthermore, dopamine neuron activity is thought to serve as
a reinforcement signal that teaches the brain which actions are good and which are
bad. Thus, if these neurons carry information-related signals, then they are likely to play
a causal role in creating the informational preference seen in behavior. Finally, dopamine
neurons do not act as simple reward detectors, but instead appear to respond to the first
new information about rewards in the future. Thus, they provide an ideal opportunity to
study how reward-signaling neurons in the brain adapt their activity at the time when new
reward information is expected and consumed (Chapter 3).
In the next section I will describe the nature of reward signals in dopamine neurons,
along with the theoretical framework which I will use for interpreting their activity
(Section 1.2). I will also describe the activity pattern of neurons in one of their major
input nuclei, the lateral habenula, which I will analyze in detail in Chapter 3 in order to
better understand the source of dopamine neuron reward signals (Section 1.3).
1.2 Reward-predictive signals of midbrain dopamine neurons
Two regions of the midbrain, the ventral tegmental area (VTA) and the substantia
nigra pars compacta (SNc), contain a large number of neurons that release the
neurotransmitter dopamine2. These neurons send axons to many structures involved in
motor control, cognition, emotion, and memory - SNc neurons project primarily to the
dorsal striatum, while VTA neurons project to many areas including the ventral striatum,
amygdala, hippocampus, and wide swaths of frontal cortex (Oades and Halliday, 1987;
Graybiel, 1990). These neurons are well-studied partly because they are involved in a
variety of neurological disorders; in particular, the degeneration of SNc dopamine
neurons is the major cause of Parkinson's disease (Dauer and Przedborski, 2003). In this
thesis I will focus on another aspect of dopamine neurons, their role in reward learning.
Manipulations of dopamine neurotransmission cause profound effects on reward-
seeking. For example, when animals are infused with dopamine antagonists they
gradually reduce the effort they put forth to get food and water rewards, behaving as
though the reward size had suddenly decreased (Wise, 2004). Even on later days when
the drug is no longer administered, animals still continue to put forth reduced effort, as if
the suppression of dopamine neurotransmission had caused them to learn that the task
2 For convenience, they will here be referred to as "dopamine neurons" rather than the more complete phrase "midbrain dopamine neurons." In fact there are several populations of dopamine neurons in the nervous system, including cells in the retina, olfactory bulb, and hypothalamus (Bjorklund, A., and Dunnett, S.B. (2007). Dopamine neuron systems in the brain: an update. Trends in Neurosciences 30, 194-202.).
had a low reward value. In some cases an animal's reward learning can be impaired or
even fully blocked by interfering with dopamine transmission in very localized regions
of the brain (e.g., Liu et al., 2004; Dalley et al., 2005; Nakamura and Hikosaka, 2006).
Dopamine release is not only linked to natural rewards but also seems to be crucial for
many 'unnatural' ones as well, as most addictive drugs potentiate dopamine release in one
way or another (Kauer, 2004) and humans taking dopamine agonists as a medical
treatment sometimes develop addiction-like traits like compulsive gambling and
hypersexuality (Ondo and Lai, 2008; Bostwick et al., 2009). Dopamine release appears to
act as a reward or reinforcement signal, as animals can be trained to perform simple tasks
in exchange for direct burst stimulation of dopamine neuron axons (Tsai et al., 2009).
Consistent with the importance of dopamine in reward learning, a seminal series of
studies conducted by Wolfram Schultz and his colleagues in Fribourg found that these
neurons emitted phasic bursts of spikes triggered by food and water rewards (Schultz,
1998). Neurons fired bursts when monkeys drank a squirt of juice, when they touched or
saw a morsel of food, and when they observed visual or auditory cues indicating the
availability of these rewards. Perplexingly, each of these responses was present in some
tasks but absent in others. Furthermore, as monkeys gained practice at a task, some of
these bursts became stronger while other responses entirely disappeared, or could even be
converted into brief inhibitions. A number of theories have been proposed to account for
this puzzling pattern of dopamine neuron activity.
In this thesis I will discuss dopamine neurons primarily in the framework of the most
successful of these theories, based on the computational theory of temporal-difference
learning (TD learning) (Sutton, 1988; Sutton and Barto, 1998). According to this theory,
we continuously predict the expected future reward value available from our present
situation. Whenever we receive new information about the world we re-evaluate our
situation, reacting to good news by revising our reward prediction upwards and reacting
to bad news by revising it downwards. This mental re-evaluation triggers what is known
as a "prediction-error" or "TD-error" signal, reflecting the difference between the new
predicted value and the old predicted value (the "temporal difference"). Dopamine
neuron activity is hypothesized to represent this prediction-error signal. Specifically, if
new information indicates that the situation is better than originally predicted, then it
triggers a positive prediction-error and dopamine neurons should burst; if the situation is
exactly as good as predicted, then there is zero prediction-error and dopamine neurons
should not respond; and if the situation takes a turn for the worse, it triggers a negative
prediction-error and dopamine neurons should pause (Figure 1.1). Thus dopamine
activity serves as the "first derivative" of value, signaling how the value of a situation
changes over time (Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997).
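The prediction-error computation described by this theory can be sketched in a few lines of Python. This is a toy illustration of a TD-style error signal, not code from the thesis; the function name and the example values are my own:

```python
def td_error(reward, new_value, old_value, discount=1.0):
    """Temporal-difference error: reward received plus the (discounted) new
    estimate of future value, minus the old estimate (the 'temporal difference')."""
    return reward + discount * new_value - old_value

# Good news: a cue raises the predicted value from 0.5 to 1.0 -> positive error (burst).
burst = td_error(reward=0.0, new_value=1.0, old_value=0.5)

# Fully predicted reward: the reward arrives just as the prediction is "used up" -> zero error.
silent = td_error(reward=1.0, new_value=0.0, old_value=1.0)

# Bad news: no reward, and the predicted value falls to zero -> negative error (pause).
pause = td_error(reward=0.0, new_value=0.0, old_value=0.5)

print(burst, silent, pause)  # 0.5 0.0 -0.5
```

The three cases mirror the three dopamine response patterns described above: burst for positive errors, no response for zero error, and a pause for negative errors.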
Figure 1.1. A dopamine neuron following the prediction-error theory. A monkey was presented with a visual cue and then 2.25 seconds later a big or a small water reward was delivered. Different cues indicated a 100%, 50%, or 0% probability of a big reward. This neuron was recorded for the study described in Chapter 2. Each plot shows the dopamine neuron's activity for a particular cue and reward outcome. Top: the neuron's mean firing rate at each time during the trial. Bottom: the neuron's spikes on the first 36 trials recorded in each condition. Each row is a trial, and each dot is a spike. When the cues indicated the upcoming reward size, the neuron fired a burst of spikes when the monkey predicted a big reward and paused when the monkey predicted a small reward (top two panels). The neuron then had little response to the predictable reward outcome. When the cues were uninformative about the upcoming reward size (50% cue), the neuron had little response to the cue but then fired a burst of spikes for an unpredicted big reward and paused for an unpredicted small reward. Thus, this neuron signaled changes in the animal's prediction about the trial's reward value.
This theory provides a simple explanation for a wide variety of findings about
dopamine neuron activity (Schultz et al., 1997; Schultz, 1998). It explains why dopamine
neuron responses to a reward are relative to the reward's predicted size; when two
different reward sizes are potentially available, neurons burst in response to the large
reward and pause in response to the small reward (Figure 1.1, bottom right). If a reward
has the same value as originally predicted, dopamine neurons have little or no response
(Figure 1.1, top right). Similarly, when two different reward-indicating cues could
potentially be shown, neurons burst for the big-reward cue and pause for the small-
reward cue (Figure 1.1, top left). In general, responses to reward-indicating cues are
positively related to the cued reward value, while responses to the later reward outcome
are negatively related to the cued value, as if the neurons were computing the prediction-
error: Actual Value − Predicted Value (Fiorillo et al., 2003; Satoh et al., 2003; Morris
et al., 2004; Nakahara et al., 2004). This neural prediction of reward value tracks
behavioral measures of reward prediction during a large number of experimental
situations, including learning and extinguishing cue-reward relationships (Schultz et al.,
1993; Hollerman and Schultz, 1998; Takikawa et al., 2004; Day et al., 2007), estimating
the reward delivery time (Fiorillo et al., 2008; Kobayashi and Schultz, 2008), judging the
reward probability using complex cue-combination rules (Waelti et al., 2001; Tobler et
al., 2003), and estimating the reward value on the current trial based on the outcomes of
previous trials (Satoh et al., 2003; Nakahara et al., 2004; Bayer and Glimcher, 2005).
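The way the error response migrates from the reward to the cue over the course of learning can be illustrated with a toy TD(0)-style simulation. The cue names, learning rate, and the simplification that the pre-cue baseline prediction is zero are all my own assumptions, not details from the experiments cited above:

```python
alpha = 0.2                                  # learning rate (arbitrary choice)
value = {"big_cue": 0.0, "small_cue": 0.0}   # learned value of each hypothetical cue

def run_trial(cue, reward):
    """One trial: the cue appears, then the reward is delivered. Returns the
    prediction-errors at cue onset and at reward delivery."""
    cue_error = value[cue] - 0.0             # baseline prediction before the cue: 0
    reward_error = reward - value[cue]       # actual value minus predicted value
    value[cue] += alpha * reward_error       # nudge the cue's value toward the outcome
    return cue_error, reward_error

# Early in training the cue is meaningless and the reward is a complete surprise.
first_cue_err, first_reward_err = run_trial("big_cue", reward=1.0)

# After many pairings, the error has transferred from the reward to the cue.
for _ in range(100):
    run_trial("big_cue", reward=1.0)
late_cue_err, late_reward_err = run_trial("big_cue", reward=1.0)

print(round(first_cue_err, 2), round(first_reward_err, 2))  # 0.0 1.0
print(round(late_cue_err, 2), round(late_reward_err, 2))    # 1.0 0.0
```

This reproduces, in miniature, the pattern in Figure 1.1: once the cue reliably predicts the reward, the burst occurs at the cue and the fully predicted reward itself evokes almost no error.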
The TD learning theory provides a very good account of reward-related dopamine
neuron activity, good enough that in this thesis I will use the theory as the major
framework for describing my results. However, I should emphasize that this theory is not
a perfect or ultimate description of dopamine neuron activity. Indeed, the most important
results in Chapters 2 and 3 are strikingly unexpected from a TD learning point of view.
Therefore, it is wise to bear in mind the limitations of this framework. In particular,
although there is a good match between theory and experiment in the realm of rewards,
the match is less good for neutral and aversive stimuli. Dopamine neurons can respond to
neutral but surprising or startling stimuli such as bright lights and loud tones (Steinfels et
al., 1983; Strecker and Jacobs, 1985; Horvitz et al., 1997; Horvitz, 2000; Redgrave and
Gurney, 2006). The function of this activity is unknown. In addition, some dopamine
neurons respond to aversive events by being strongly inhibited (as expected from the
theory), but other dopamine neurons are strongly excited (Trulson and Preussler, 1984;
Guarraci and Kapp, 1999; Ungless et al., 2004; Joshua et al., 2008; Brischoux et al.,
2009; Matsumoto and Hikosaka, 2009). In this thesis I cannot tell how many of the
neurons I studied belong to this aversive-excited subpopulation because the neurons were
only tested with reward-related events. A second limitation of the theory is that it is only
designed to account for the mean pattern of dopamine neuron activity, as if all dopamine
neurons carried the same signal. Several studies have reported that there is more neuron-
to-neuron variability in response patterns than this simple account would suggest (Schultz,
1986; Nishino et al., 1987; Magarinos-Ascone et al., 1992; Kiyatkin and Rebec, 1998;
Ravel and Richmond, 2006; Wilson and Bowman, 2006).
Finally, an important limitation of dopamine neuron recording in behaving animals is
that there is no definitive test to tell whether a recorded neuron truly releases the
neurotransmitter dopamine. In this thesis I analyzed neurons which met standard
electrophysiological criteria thought to be good indicators that a neuron is dopaminergic
(Schultz, 1986) and which also carried positive reward signals like those that have been
reported in previous studies (Schultz, 1998). So when I say "dopamine neurons did X," it should be strictly interpreted as "reward-signaling presumed dopamine neurons did X."
There is evidence that the standard criteria are not perfect: they may exclude certain
subpopulations of dopamine neurons, and may sometimes include nearby GABA-
releasing interneurons (Ungless et al., 2004; Margolis et al., 2006; Margolis et al., 2008;
Lammel et al., 2008). However, the major findings in this thesis are not based on any
single neuron, but on the predominant patterns of activity seen in the population as a
whole. Thus the new findings in this thesis can be interpreted as the typical response
properties of the majority of reward-responsive dopamine neurons.
1.3 Reward-predictive signals of lateral habenula neurons
Given that dopamine neurons have an important role in reward-seeking, it is crucial
to discover how dopamine neurons compute their reward-predictive signals. Dopamine
neurons receive input from many brain areas, including the pedunculopontine tegmental
nucleus, dorsal and ventral striatum, ventral pallidum, central amygdala, dorsal raphe,
lateral hypothalamus, and lateral habenula (Schultz, 1998; Fudge and Haber, 2000;
Geisler et al., 2007). In this thesis I will analyze the activity of neurons in one of these
areas, the lateral habenula, which is an especially strong candidate for sending value
signals to dopamine neurons.
The lateral habenula is a small nucleus in the habenular complex of the epithalamus,
located on the dorsal surface of the thalamus (Parent et al., 1981; Herkenham and Nauta,
1977, 1979). The habenula sends its outputs through the fasciculus retroflexus, a bundle
of fibers which travels anteriorly and ventrally before branching off into different regions
of the brain, including the VTA and SNc. The full mixture of neurotransmitters is
unknown, although at least some of the projection neurons are glutamatergic (Geisler et
al., 2007). The lateral habenula receives inputs from a number of brain regions, including
reciprocal connections from VTA dopamine neurons (Oades and Halliday, 1987), but
receives its most prominent anatomical input from the lateral hypothalamus and globus
pallidus (Parent et al., 1981; Sutherland, 1982; Hong and Hikosaka, 2008).
How does the lateral habenula influence midbrain dopamine neurons? In an early
study in anesthetized rats, Christoph and colleagues demonstrated that electrical
stimulation of the lateral habenula caused a powerful short-latency inhibition of more
than 85% of SNc and VTA dopamine neurons (Christoph et al., 1986). This inhibition is
likely to be disynaptic, such that lateral habenula neurons excite GABAergic interneurons
in the midbrain, which in turn inhibit dopamine neurons (Ji and Shepard, 2007). Single-
neuron recordings in behaving animals provide strong evidence that lateral habenula
neurons make use of this inhibitory influence over dopamine neurons in order to help
dopamine neurons compute their prediction-error signals (Matsumoto and Hikosaka,
2007). Lateral habenula and SNc dopamine neurons had mirror-image responses to
reward-related task events: whereas SNc dopamine neurons were excited by reward-
predicting cues and unexpected rewards, habenula neurons were inhibited; and whereas
dopamine neurons were inhibited by reward-omission cues and unexpected omissions,
habenula neurons were strongly excited (Figure 1.2). The response to reward-omission
cues occurred about 40 milliseconds earlier in habenula neurons than in dopamine
neurons, suggesting that habenula excitations could be the source of the dopamine neurons' negative reward signals. Indeed, the dopamine neurons that were most inhibited by reward-omission cues
were also most strongly inhibited by lateral habenula electrical stimulation. Thus, lateral
habenula neurons appear to carry a negative prediction-error signal that is likely to be the
source of the positive prediction-error signals of dopamine neurons, or at least of the inhibitory dopamine response to negative events.
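The sign-inverted, time-shifted relationship described above can be caricatured in a few lines. This is a toy linear model of my own, not from the thesis; the lag and unit gain are illustrative assumptions (the ~40 ms habenula lead motivates the default delay):

```python
import numpy as np

def dopamine_from_habenula(lhb_rate, delay_bins=4, gain=1.0):
    """Toy disynaptic-inhibition model: DA(t) = -gain * LHb(t - delay).

    lhb_rate   : habenula firing deviations from baseline, one value per time bin
    delay_bins : lag in bins (e.g. 4 bins of 10 ms, roughly the 40 ms lead
                 reported by Matsumoto and Hikosaka, 2007)
    """
    lhb_rate = np.asarray(lhb_rate, dtype=float)
    da = -gain * np.roll(lhb_rate, delay_bins)
    da[:delay_bins] = 0.0  # no habenula input has arrived in the first bins
    return da

# A habenula excitation (e.g. to a reward-omission cue) ...
lhb = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
da = dopamine_from_habenula(lhb, delay_bins=2)
# ... reappears two bins later as a dopamine inhibition
```

The point of the sketch is only the sign inversion and the latency, the two features the recording and stimulation data constrain.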
Figure 1.2. Lateral habenula neurons carry negative reward signals. A monkey was presented with a visual target (cue). The monkey made a saccade to the target, which was followed 200 ms later by a water reward or by no reward (outcome). The position of the target on the screen indicated whether the trial would be rewarded. On 4% of trials the position-reward mapping was reversed and so an unpredicted reward outcome was delivered (e.g. the cue indicated that a reward was coming but instead no reward was delivered). This plot shows the population average activity of reward-responsive lateral habenula and dopamine neurons from the dataset analyzed in Chapter 3. Arrows indicate cue or outcome onset. The firing rate was smoothed with a Gaussian kernel (σ = 15 ms). Lateral habenula neurons were inhibited by reward cues (left) and unpredicted reward deliveries (right); they were excited by no-reward cues (left) and unpredicted reward omissions (right); and they had less response to reward outcomes that were cued and highly predictable (middle). Dopamine neurons carried a mirror-image signal.
1.4 Summary
Humans and animals prefer to view informative cues that indicate the value of an
upcoming reward in advance of its delivery. Midbrain dopamine neurons signal changes
in an animal's expectation of food and water rewards. Their food- and water-predictive
activity may originate in lateral habenula neurons which carry a sign-reversed signal.
Dopamine release is thought to act as a reinforcer that teaches the brain which actions are
good and should be repeated and which are bad and should be avoided. If dopamine
neurons truly have such a fundamental role in reward learning, then they should not only
signal the value of basic rewards but should also signal the value of advance information
about those rewards. The information-predictive signals of dopamine neurons would then
be a major candidate for creating the informational preferences seen in behavior.
CHAPTER 2
Midbrain dopamine neurons signal preference for
advance information about upcoming rewards
SUMMARY
The desire to know what the future holds is a powerful motivator in everyday life,
but it is unknown how this desire is created by neurons in the brain. Here we show that
when macaque monkeys are offered a water reward of variable magnitude, they seek
advance information about its size. Furthermore, the same midbrain dopamine neurons
that signal the expected amount of water also signal the expectation of information, in a
manner that is correlated with the strength of the animal's preference. Our data shows that
single dopamine neurons process both primitive and cognitive rewards, and suggests that
current theories of reward-seeking must be revised to include information-seeking.3
3 The major portion of this chapter is from the paper "Midbrain dopamine neurons signal preference for advance information about upcoming rewards" by Ethan S. Bromberg-Martin and Okihide Hikosaka, currently accepted by Neuron.
2.1 INTRODUCTION
Dopamine-releasing neurons located in the substantia nigra pars compacta and
ventral tegmental area are thought to play a crucial role in reward learning (Wise, 2004).
Their activity bears a remarkable resemblance to "prediction errors" signaling changes in
a situation's expected value (Schultz et al., 1997; Montague et al., 2004). When a reward
or reward-predictive cue is more valuable than expected, dopamine neurons fire a burst of
spikes; if it has the same value as expected, they have little or no response; and if it is less
valuable than expected, they are briefly inhibited. Based on these findings, many theories
invoke dopamine neuron activity to explain human learning and decision-making
(Holroyd and Coles, 2002; Montague et al., 2004) and symptoms of neurological
disorders (Redish, 2004; Frank et al., 2004), inspired by the idea that these neurons could
encode the full range of rewarding experiences, from the primitive to the sublime.
However, their activity has almost exclusively been studied for basic forms of reward
such as food and water. It is unknown whether the same neurons that process these basic,
primitive rewards are involved in processing more abstract, cognitive rewards (Schultz,
2000).
We therefore chose to study a form of cognitive reward that is shared between
humans and animals. When people anticipate the possibility of a large future gain (such as an exciting new job, a generous raise, or having their research published in a prestigious scientific journal), they do not like to be held in suspense about their future
fate. They want to find out now. In other words, even when people cannot take any action
to influence the final outcome, they often prefer to receive advance information about
upcoming rewards. Related concepts have been arrived at independently in several fields
of study. Economists have studied "temporal resolution of uncertainty" (Kreps and
Porteus, 1978), and have shown that humans often prefer their uncertainty to be resolved
earlier rather than later (Chew and Ho, 1994; Ahlbrecht and Weber, 1996; Eliaz and
Schotter, 2007; Luhmann et al., 2008). Experimental psychologists have studied
"observing behavior" (Wyckoff, 1952), and have shown that a class of observing behavior that produces reward-predictive cues can be a powerful motivator for rats,
pigeons, and humans (Wyckoff, 1952; Prokasy, 1956; Daly, 1992; Lieberman et al.,
1997).
To date, however, there has not been a rigorous test of this preference in non-human
primates, the animals in which the reward-predicting activity of dopamine neurons has
been best described (Schultz, 2000; Schultz et al., 1997; Montague et al., 2004).
Unfortunately, previous non-human primate studies required animals to work for rewards
by making a costly physical response, so that information about when rewards were
available allowed animals to save considerable physical effort (Kelleher, 1958; Steiner,
1967; Steiner, 1970; Lieberman, 1972; Woods and Winger, 2002). Other studies used an
unbalanced design in which animals pulled a lever to view informative cues but were not
offered a control lever for viewing uninformative cues, leaving it ambiguous whether
animals pulled the lever because the cues conveyed information or because of some other
stimulus property (Steiner, 1967; Schrier et al., 1980). To perform a rigorous test, it is
best to use a symmetrical choice procedure in which both informative and uninformative
options are available, in which they are both selected using the same physical response,
and in which both are matched for the timing and physical properties of both cue stimuli
and rewards (Daly, 1992; Roper and Zentall, 1999).
To this end, we developed a simple decision task allowing rhesus macaque
monkeys to choose whether to receive advance information about the size of an
upcoming water reward. We found that monkeys expressed a strong behavioral
preference, preferring information to its absence and preferring to receive the information
as soon as possible. Furthermore, midbrain dopamine neurons which signaled the
monkeys' expectation of water rewards also signaled the expectation of advance information, in a manner that was correlated with the animals' preference. These results
show that the dopaminergic reward system processes both primitive and cognitive
rewards, and suggest that current theories of reward-seeking must be revised to include
information-seeking.
2.2 RESULTS
Behavioral preference for advance information
We trained two monkeys to perform a simple decision task (information choice
task, Figure 2.1A). On each trial two colored targets appeared on the left and right sides
of a screen, and the monkey had to choose between them by making a saccadic eye
movement. Then, after a delay of a few seconds, the monkey received either a big or a
small water reward. The monkey's choice had no effect on the reward size: both reward
sizes were always equally probable. However, choosing one of the colored targets
produced an informative cue - a cue whose shape indicated the size of the upcoming
reward. Choosing the other color produced a random cue - a cue whose shape was
randomized and therefore had no meaning. The positions of the targets were randomized
on each trial. To familiarize monkeys with the two options, we interleaved choice trials
with forced-information trials and forced-random trials, in which only one of the targets
was available.
After only a few days of training, both monkeys expressed a strong preference to
view informative cues (Figure 2.1B). Monkey Z chose information about 80% of the
time, and monkey V's choice rate was even higher, close to 100%. Their preference for
advance information cannot be explained by a difference in the amount of water reward,
because information did not allow monkeys to obtain extra water from the reward-
delivery apparatus and had little effect on whether they completed a trial successfully (<
2% error rate for each target, Figure 2.2).
Figure 2.1. Behavioral preference for advance information (A) Information choice task. Fractions represent probabilities of different trial types. (B) Percent choice of information for each monkey. Each dot represents a single day of training. The mean number of choice trials per session was 152 for monkey V (range: 71-203) and 161 for monkey Z (range: 39-285). The gray region is the Clopper-Pearson 95% confidence interval for each day.
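The Clopper-Pearson interval reported in this figure can be computed from the Beta distribution. A minimal sketch, assuming SciPy is available (the function name is mine):

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion.

    k : number of information choices; n : number of choice trials.
    Returns (lower, upper) bounds at confidence level 1 - alpha.
    """
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# e.g. 130 information choices out of 152 choice trials in one session
lo, hi = clopper_pearson(130, 152)
```

Because the interval is exact rather than a normal approximation, it behaves sensibly even on days with few choice trials or choice rates near 100%, both of which occur in these data.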
Figure 2.2. Control for water volume of the informative and random options (A) Each monkey was tested in six behavioral sessions, alternating between days when only random cues were available (random sessions) and days when only informative cues were available (information sessions). Each dot represents the amount of water delivered during a single session, expressed as a percentage of the theoretical amount that should have been delivered if monkeys were unable to influence the water volume (circles: Monkey V; squares: Monkey Z). Inset: the difference in water delivered between the two types of sessions was close to zero (unpaired t-test, P = 0.28, 95% CI = -3.2% to +1.0%). (B) Error rates are shown for each trial type, combining all behavioral sessions except the early learning stage (the first eight sessions for each monkey). Both monkeys made slightly more errors on random trials than information trials, compatible with the notion that monkeys were less motivated to complete random trials because they were the least desirable trial type in the task (a phenomenon seen in many other studies, e.g. (Shidara and Richmond, 2002; Lauwereyns et al., 2002a; Roesch and Olson, 2004; Kobayashi et al., 2006)). However, the overall error rate was very low (< 2%). A 2% difference in reward rate is far too small to explain the strong behavioral preferences we observed.
An important concern is that advance information might have allowed monkeys to
extract a greater amount of subjective value from the water reward by physically
preparing for its delivery: for instance, by tensing their cheek muscles to swish water
around in their mouths in a more pleasurable fashion (Perkins, 1955). We therefore
introduced a second task that equalized the opportunity for simple physical preparation
(Mitchell et al., 1965) (information delay task, Figure 2.3A). Monkeys again chose
between informative and random cues, but afterward a second cue appeared that was
always informative on every trial. Thus, information was always available well in
advance of reward delivery; the choice was between receiving the information
immediately, or after a delay.
Soon after being exposed to the new task, both monkeys expressed a clear preference
for immediate information, comparable to their preference in the original task (Figure 2.3B).
We then reversed the relationship between cue colors and information content, and
monkeys switched their choices to the newly informative color (Figure 2.3B). We
conclude that monkeys treated information about rewards as if it was a reward in itself,
preferring information to its absence and preferring to receive it as soon as possible.
Figure 2.3. Behavioral preference for immediate delivery of information (A) Information delay task. The fixation point and target configurations (not shown here) were the same as in the information choice task shown in Figure 2.1A. (B) Percent choice of immediate information. Conventions as in Figure 2.1B. The vertical line labeled "reversal" marks the time when the informative and random cue colors were switched. The mean number of choice trials per session was 151 for monkey V (range: 50-222) and 111 for monkey Z (range: 35-176). The behavioral preference started below 50% because the cue colors were re-used from a pilot experiment; the informative color had been previously trained as random, and vice versa (data not shown).
Dopamine neurons signal advance information
To understand the neural basis of the rewarding value of information, we recorded
the activity of 47 presumed midbrain dopamine neurons while monkeys performed the
information choice task shown in Figure 2.1. As in previous studies, we focused on
neurons that were presumed to be dopaminergic based on standard electrophysiological
criteria and that signaled the value of water rewards (henceforth referred to as "dopamine neurons") (Methods). Figure 2.4 shows an example neuron that carried a strong water
reward signal. On trials when the monkey viewed informative cues, the neuron was
phasically excited by the cue indicating a large reward and inhibited by the cue indicating
a small reward. In contrast, on trials when the monkey was forced to view uninformative
random cues, the neuron had little response to the cues but was strongly responsive to the
later reward outcome, excited when the reward was large and inhibited when it was
small. Thus, consistent with previous studies, this neuron signaled changes in the
monkey's expectation of water rewards.
The same neuron also responded to the targets indicating the availability of
information. On forced trials when only one target was available, the neuron was excited
by the informative-cue target and inhibited by the random-cue target. On choice trials
when both targets were available, the monkey always chose to receive information, and
the neuron responded much as it did when the informative-cue target was presented
alone. Thus, this dopamine neuron signaled changes in both the expectation of water and
the expectation of information.
Figure 2.4. Dopamine neurons signal information Top: firing rate of an example neuron. Trials are sorted separately for each task event, as follows. Target: forced-information (red), choice-information (pink), forced-random (blue). Cue: informative cues (red) indicating that the reward is big (solid) or small (dashed), random cues (blue) with the same shape as informative cues for big (solid, cross shape) or small (dashed, wave shape) rewards. Reward: informative (red) are the same trials as for the cue response, random (blue) trials where the reward was big (solid) or small (dashed). The firing rate was smoothed with a Gaussian kernel, σ = 20 ms. Bottom: rasters for individual trials. Each row is a trial, and each dot is a spike. Colors are the same as in the firing rate display, except that dark colors correspond to dashed lines.
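Kernel smoothing of the kind used in these firing-rate plots can be sketched as follows (a generic implementation for illustration, not the authors' analysis code):

```python
import numpy as np

def smoothed_rate(spike_times, t_grid, sigma=0.020):
    """Firing-rate estimate (spikes/s): a sum of unit-area Gaussians, one per spike.

    spike_times : spike times in seconds
    t_grid      : time points (s) at which to evaluate the rate
    sigma       : kernel width in seconds (0.020 = the 20 ms used in Figure 2.4)
    """
    t_grid = np.asarray(t_grid, dtype=float)
    rate = np.zeros_like(t_grid)
    for t_spike in spike_times:
        rate += np.exp(-0.5 * ((t_grid - t_spike) / sigma) ** 2)
    # dividing by sigma * sqrt(2*pi) makes each kernel integrate to one spike,
    # so the result is in spikes per second
    return rate / (sigma * np.sqrt(2.0 * np.pi))
```

Averaging such single-trial traces across trials, and then across neurons, gives population averages of the kind plotted here.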
This pattern of responses was quite common in dopamine neurons. We measured
each neuron's discrimination between targets, cues, and rewards using the area under the
receiver operating characteristic (Figure 2.5B-D, Methods). This measure ranges from
0.5 at chance levels to 0.0 or 1.0 for perfect discrimination. As in the example, neurons
discriminated strongly between informative reward-predicting cues and between
randomly-sized rewards, but only weakly between uninformative random cues and
between fully predictable rewards (Figure 2.5C,D). The same neurons also discriminated
between the targets, with clear preferential activation by the target that predicted advance
information (Figure 2.5B). The discrimination was highly similar when measured using
either forced-information or choice-information trials in independent data sets (rho =
0.68, P < 10^-4; Methods), indicating that the neural preference for information was
reproducible and consistent across different stimulus configurations. The same pattern
occurred in both monkeys (Figure 2.6) and could be seen in the population average firing
rate (Figure 2.5A).
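The ROC-area measure is equivalent to the probability that a spike count drawn from one condition exceeds one drawn from the other (the normalized Mann-Whitney statistic). A minimal sketch of that computation:

```python
def roc_area(counts_a, counts_b):
    """Area under the ROC curve for discriminating condition b from condition a.

    Equals P(b > a) + 0.5 * P(b == a) over all pairs of trials:
    0.5 = chance, 1.0 = every b count exceeds every a count, 0.0 = the reverse.
    """
    wins = sum((b > a) + 0.5 * (b == a)
               for a in counts_a for b in counts_b)
    return wins / (len(counts_a) * len(counts_b))
```

Here counts_a might be a neuron's spike counts on forced-random trials and counts_b its counts on forced-information trials, measured in the target window.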
Figure 2.5. Analysis of the dopamine neuron population (A) Population average firing rate. Conventions as in Figure 2.4. Gray bars indicate the time windows used for the ROC analysis. Colored bars indicate time points with a significant difference between selected pairs of task conditions (P < 0.01, Wilcoxon signed rank test), as follows. Target: force-info vs. force-rand (red), choice-info vs. force-rand (pink); Cue: info-big vs. info-small (red), rand-cross vs. rand-wave (blue); Reward: info-big vs. info-small (red), rand-big vs. rand-small (blue).
(B-D) Neural discrimination between task conditions in response to the targets (B), cues (C), and rewards (D). Each dot's (x,y) coordinates represent a single neuron's ROC area for discriminating between the pairs of task conditions listed on the x and y axes. A discrimination of 1 indicates perfect preference for the condition listed next to 1 (e.g. "Choice info"); discrimination of zero indicates perfect preference for the condition listed next to 0 (e.g. "Force rand"). Note that in (B) the x and y coordinates were both calculated using the same set of forced-random trials. Colored dots indicate neurons with significant discrimination between the conditions listed on the y-axis (red), x-axis (blue), or both axes (magenta) (P < 0.05, Wilcoxon rank-sum test).
There was also a tendency for neurons to have a weak initial excitation for each task
event (Figure 2.5A). This nonspecific response is probably due to the animal's initial
uncertainty about the stimulus identity (Kakade and Dayan, 2002; Day et al., 2007) or
stimulus timing (Fiorillo et al., 2008; Kobayashi and Schultz, 2008). We did not observe
a predominant tendency for neurons to have anticipatory tonic increases in activity before
the delivery of probabilistic rewards, a phenomenon that has been reported in one study
(Fiorillo et al., 2003) but not others (Satoh et al., 2003; Morris et al., 2006; Bayer and
Glimcher, 2005; Matsumoto and Hikosaka, 2007; Joshua et al., 2008). This may be due
to differences in task design such as the size of the reward or the manner in which the
reward was signaled (Fiorillo et al., 2003).
An important question is whether dopamine neurons signal the presence of
information per se, or whether they truly signal how much it is preferred. In the latter
case, there should be a correlation between the neural preference for information,
expressed as the neural discrimination between the informative-cue target and the
random-cue target, and the behavioral preference for information, expressed as a choice
percentage. Such correlations were indeed present, both between-monkeys and within-
monkey. Between-monkeys, monkey V expressed a stronger behavioral preference for
information than monkey Z (Figure 2.1B), and also expressed a stronger neural
preference (P = 0.02, Figure 2.6A). Within-monkey, during the sessions in which
monkey Z's behavioral preference was strongest, the neural preference was enhanced
(rho = 0.44, P = 0.02, Figure 2.6D). On the other hand, behavioral preferences for
information were not significantly correlated with neural discrimination between water-
related cues or water rewards (all P > 0.25, Figure 2.6B,C,E,F). Thus, consistent with
evidence that dopamine neurons signal the subjective value of liquid rewards (Morris et
al., 2006; Roesch et al., 2007; Kobayashi and Schultz, 2008), they may also signal the
subjective value of information.
Figure 2.6. Correlation between neural discrimination and behavioral preference (A) Histogram of single-neuron target response discrimination between all informative trials (choice and forced trials combined) versus forced-random trials, separately for monkey V (left) and monkey Z (right). Arrows, numbers, and horizontal lines indicate the
mean discrimination, and the width of the arrows represents the 95% bootstrap confidence interval. Red indicates statistical significance. (B,C) Same as (A), for discrimination between informative big-reward and small-reward cues (B) or between random big and small rewards (C). (D) Plot of behavioral choice percentage against single-neuron discrimination between all informative trials versus forced-random trials in response to the target. The line was fitted by least-squares regression. Text shows Spearman's rank correlation (rho), and red indicates statistical significance. The data is from monkey Z only, because monkey V almost exclusively chose the informative target and therefore had no behavioral variability. (E-F) Same as (D), but for discrimination between informative big-reward and small-reward cues (E) or between random big and small rewards (F).
2.3 DISCUSSION
Here we have shown that macaque monkeys prefer to receive advance information
about future rewards, and that their behavioral preference is paralleled by the neural
preference of midbrain dopamine neurons. Thus, the same dopamine neurons that signal
primitive rewards like food and water also signal the cognitive reward of advance
information.
Monkeys expressed a strong preference for advance information even though it had
no effect on the final reward outcome. This is consistent with the intuitive belief that, all
things being equal, it is better to seek knowledge than to seek ignorance. It also provides
an explanation for the puzzling fact that the brain devotes a great deal of neural effort to
processing reward information even when this is not required to perform the task at hand.
For example, many studies use passive classical conditioning tasks in which informative
cues are followed by rewards with no requirement for the subject to take any action. In
these tasks the brain could simply ignore the cues and wait passively for rewards to
arrive. Yet even after extensive training, many neurons continue to use the cue
information to predict the size, probability, and timing of reward delivery (e.g. (Tobler et
al., 2003; Joshua et al., 2008)). In other tasks, neurons persist in predicting rewards even
when the act of prediction is harmful, causing maladaptive behavior that interferes with
reward consumption (e.g. refusing to perform trials with low predicted value (Shidara and
Richmond, 2002; Lauwereyns et al., 2002a)). These observations suggest that the act of
prediction has a special status, an intrinsic motivational or rewarding value of its own.
Our data provides strong evidence for this hypothesis. When given an explicit choice,
monkeys actively sought out the advance information that was necessary to make
accurate reward predictions at the earliest possible opportunity.
A limitation of our study is that it does not determine the precise psychological
mechanism by which value is assigned to information. There are several possibilities.
Theories from experimental psychology suggest that in our task the value of viewing
informative cues would simply be the sum of the conditioned reinforcement generated by
the individual big-reward and small-reward cues. In this view, the preference for
information implies that the conditioned reinforcement is weighted nonlinearly, so that
the benefit of strong reinforcement from the big-reward cue outweighs the drawback of
weak reinforcement from the small-reward cue (Wyckoff, 1959; Fantino, 1977;
Dinsmoor, 1983), akin to the nonlinear weighting of rewards that produces risk seeking
(von Neumann and Morgenstern, 1944). On the other hand, theories in economics
suggest that preference is not due to independent contributions of individual cues but
instead comes from considering the full probability distribution of future events. In this
view, information-seeking is due to an explicit preference for early resolution of
uncertainty (Kreps and Porteus, 1978) or an implicit preference induced by psychological
factors such as anticipatory emotions (Caplin and Leahy, 2001). In addition, just as the
value assigned to conventional forms of reward (e.g. food) depends on the internal state
of the subject (e.g. hunger), the value assigned to information is likely to depend on
psychological factors such as personality (Miller, 1987), emotions like hope and anxiety
(Chew and Ho, 1994; Wu, 1999) and attitudes toward uncertainty (Lovallo and
Kahneman, 2000; Platt and Huettel, 2008).
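The nonlinear-weighting account of conditioned reinforcement can be illustrated with toy numbers (the cue values and the convex weighting function below are mine, chosen purely for illustration):

```python
def option_value(cue_values, w):
    """Value of an option = average weighted conditioned reinforcement of its cues."""
    return sum(w(v) for v in cue_values) / len(cue_values)

def convex_w(v):
    """A convex weighting function: overweights the big-reward cue."""
    return v ** 2

# Informative option: big-reward cue (value 1.0) or small-reward cue (0.0), 50/50.
info = option_value([1.0, 0.0], convex_w)
# Random option: one uninformative cue worth the expected reward (0.5).
rand = option_value([0.5], convex_w)
assert info > rand  # convex weighting favors the informative option
```

With a linear weighting the two options would be worth the same; the convexity is what makes the benefit of the big-reward cue outweigh the drawback of the small-reward cue, paralleling the nonlinear weighting of rewards that produces risk seeking.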
Implications of information-seeking for attitudes toward uncertainty
In the framework of decision-making under uncertainty, advance information
reduces the amount of reward uncertainty by narrowing down the set of potential reward
outcomes. Our data therefore suggests that in our task rhesus macaque monkeys preferred
to reduce their reward uncertainty at the earliest possible moment, as though the
experience of uncertainty was aversive.
Interestingly, several previous studies using similar saccadic decision tasks came to a
seemingly opposite conclusion: macaque monkeys appeared to prefer uncertainty,
choosing an uncertain, variable-size reward instead of a certain, fixed-size reward
(McCoy and Platt, 2005; Platt and Huettel, 2008). How can these results be reconciled?
One possibility is that they can be explained by a common principle; for instance,
perhaps monkeys treat the offer of a variable-size reward as a source of uncertainty to be
confronted and resolved. An important point, however, is that a preference for reward
variance can be caused by factors unrelated to uncertainty; most notably, it can be
caused by an explicit preference over the probability distribution of reward outcomes, for
instance due to disproportionate salience of large rewards (Hayden and Platt, 2007) or a
nonlinear utility function (Platt and Huettel, 2008). In contrast, the choice of information
has no influence on the reward outcome; it only affects the amount of time spent in a
state of uncertainty before the reward outcome is revealed. In this sense the preference
for advance information is a relatively pure measurement of attitudes toward uncertainty.
Information signals in the dopaminergic reward system
Dopamine neuron activity is thought to teach the brain to seek basic goals like food
and water, reinforcing and punishing actions by adjusting synaptic connections between
neurons in cortical and subcortical brain structures (Wise, 2004; Montague et al., 2004).
Our data suggests that the same neural system also teaches the brain to seek advance
information, selectively reinforcing actions that lead to knowledge about rewards in the
future. Thus, the behavioral preference for information could be created by the
dopaminergic reward system. At the neural level, neurons which gain sensitivity to
rewards through a dopamine-mediated reinforcement process would come to represent
both rewards and advance information about those rewards in a common currency,
particularly neurons involved in reward timing, conditioned reinforcement, and decision-
making under risk (Kim et al., 2008; Seo and Lee, 2009; Platt and Huettel, 2008). In turn,
these signals could ultimately feed back to dopamine neurons to influence their value
signals.
An important goal for future research will therefore be to discover how dopamine
neurons measure information and assign its rewarding value. One possibility is that
dopamine neurons receive information-related input from specialized brain areas, distinct
from those that compute the value of traditional rewards like food and water. Indeed,
signals encoding the amount and timing of reward information, and dissociated from
preference coding of traditional rewards, have been found in several cortical areas
(Nakamura, 2006; Behrens et al., 2007; Luhmann et al., 2008). How these information
signals could be translated into a behavioral preference, and whether they are
communicated to dopamine neurons, is unknown.
Another possibility is that dopamine neurons receive information signals from the
same brain areas that contribute to their food- and water-related signals, such as the
lateral habenula (Matsumoto and Hikosaka, 2007). In this case, dopamine neurons would
receive a highly processed input, with different forms of rewards already converted into a
common currency by upstream brain areas. We are currently testing this possibility in
further experiments.
Why do dopamine neurons treat information as a reward?
The preference for advance information, despite its intuitive appeal, is not predicted
by current computational models of dopamine neuron function (Schultz et al., 1997;
Montague et al., 2004), which are widely viewed as highly efficient algorithms for
reinforcement learning. This raises an important question: could the information-
predictive activity of dopamine neurons be a harmful bug that impairs the efficiency of
reward learning? Or is it a useful feature that improves over existing computational
models? Here we present our hypothesis that the positive value of advance information is
a feature with a fundamental role in reinforcement learning.
Specifically, modern theories of reinforcement learning recognize that animals learn
from two types of reinforcement: primary reinforcement generated by rewards
themselves, and secondary reinforcement generated predictively, by observing sensory
cues in advance of reward delivery. Predictive reinforcement greatly enhances the speed
and reliability of learning, as demonstrated most strikingly by temporal-difference
learning algorithms (Sutton and Barto, 1998) which have produced influential accounts of
animal behavior (Sutton and Barto, 1981) and dopamine neuron activity (Schultz et al.,
1997; Montague et al., 2004). This implies that animals should treat predictive
reinforcement as an object of desire, making an active effort to seek out environments
where reward-predictive sensory cues are plentiful. If an animal was trapped in an
impoverished environment where reward-predictive cues were unavailable, the
consequences would be devastating: the animal's sophisticated predictive reinforcement
learning algorithms would be reduced to impotence. This can be seen clearly in our
dopamine neuron data (Figure 2.5A). When an action produces informative cues,
dopamine neurons signal its value immediately, a predictive reinforcement signal; but
when an action produces uninformative cues, dopamine neurons must wait to signal its
value until the reward outcome arrives, acting as little more than a primitive reward
detector. Thus, predictive reinforcement depends entirely on obtaining advance
information about upcoming rewards.
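This contrast can be made concrete with a minimal tabular TD(0) simulation. The sketch below is purely illustrative (state names, learning rate, and reward values are assumptions, not the thesis's model): when the cue is informative, the prediction error appears at cue onset and discriminates reward size; when it is uninformative, the error is deferred to the outcome.

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible

ALPHA = 0.1  # learning rate (an assumption of this sketch)
V = {}       # tabular state values; unseen states default to 0

def td_update(s, s_next, r):
    """One TD(0) backup with gamma = 1; returns the prediction error."""
    delta = r + V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + ALPHA * delta
    return delta

def run_trial(informative):
    """start -> cue -> outcome, with a large (r = 1) or small (r = 0)
    reward at p = 0.5 each. Returns (was_large, TD error at cue onset)."""
    large = random.random() < 0.5
    start = 'start_info' if informative else 'start_rand'
    if informative:
        cue = 'cue_large' if large else 'cue_small'  # cue reveals the outcome
    else:
        cue = 'cue_random'                           # cue carries no information
    delta_cue = td_update(start, cue, 0.0)
    td_update(cue, 'end', 1.0 if large else 0.0)     # reward arrives at outcome
    return large, delta_cue

for _ in range(5000):  # train on both conditions
    run_trial(True)
    run_trial(False)
```

After training, the cue-time prediction error is positive for the large-reward cue, negative for the small-reward cue, and near zero for the random cue; with uninformative cues the learner's error signal must wait for the outcome itself, mirroring the dopamine data of Figure 2.5A.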
In light of these considerations, we propose that any learning system driven by the
"engine" of predictive reinforcement must actively seek out its "fuel" of advance
information. In this view, current models of neural reinforcement learning present a
curious paradox: their learning algorithms are vitally dependent on advance information,
but they treat information as valueless and make no effort to obtain it. These models do
include a form of knowledge-seeking by exploring unfamiliar actions, but they make no
effort to obtain informative cues that would maximize learning from these new
experiences. In fact, models using the popular TD(λ) algorithm (Sutton and Barto, 1998)
are actually averse to advance information (Note 2.A). Our data shows that a new class
of models is needed, one that assigns information a positive value, perhaps representing
the future reward the animal expects to receive as a result of obtaining better fuel for its
learning algorithm. This would be akin to the concept of intrinsically motivated
reinforcement learning (Barto et al., 2004), in that dopamine neurons would assign an
intrinsic value to information because it could help the animal learn to better predict and
control its environment (Barto et al., 2004; Redgrave and Gurney, 2006). Also, although
dopamine neurons have been best studied in the realm of rewards, they can also respond
to salient non-rewarding stimuli (Horvitz, 2000; Redgrave and Gurney, 2006; Joshua et
al., 2008; Matsumoto and Hikosaka, 2009). This suggests that dopamine neurons might
be able to signal the value of information about neutral and punishing events (Herry et
al., 2007; Badia et al., 1979; Fanselow, 1979; Tsuda et al., 1989), as part of a more
general role in motivating animals to learn about the world around them.
2.4 METHODS
Subjects. Subjects were two male rhesus macaque monkeys (Macaca mulatta), monkey
V (9.3 kg) and monkey Z (8.7 kg). All procedures for animal care and experimentation
were approved by the Institute Animal Care and Use Committee and complied with the
Public Health Service Policy on Humane Care and Use of Laboratory Animals. A plastic
head holder, scleral search coils, and plastic recording chambers were implanted under
general anesthesia and sterile surgical conditions.
Behavioral Tasks. Behavioral tasks were under the control of the REX program (Hays et
al., 1982) adapted for the QNX operating system. Monkeys sat in a primate chair, facing
a frontoparallel screen 31 cm from the monkey's eyes in a sound-attenuated and
electrically shielded room. Eye movements were monitored using a scleral search coil
system with 1 ms resolution. Stimuli generated by an active matrix liquid crystal display
projector (PJ550, ViewSonic) were rear-projected on the screen.
In the information choice task (Figure 2.1), each trial began with the appearance of a
central spot of light (1° diameter), which the monkey was required to fixate. After 800
ms, the spot disappeared and two colored targets appeared on the left and right sides of
the screen (2.5° diameter, 10-15° eccentricity). (On forced-information and forced-
random trials, only a single target appeared). The monkey had 710 ms to saccade to and
fixate the chosen target, after which the non-chosen target immediately disappeared. At
the end of the 710 ms response window, a cue (14° diameter) of the chosen color was
presented. For the informative color, the cue was a cross on large-reward trials or a wave
pattern on small-reward trials. For the random color, the cue's shape was chosen
pseudorandomly on each trial (see below). The colors were green and orange, chosen to
have similar luminance and counterbalanced across monkeys. Monkeys were not required
to fixate the cue. After 2250 ms of display time, the cue disappeared and simultaneously
a 200 ms tone sounded and reward delivery began. The inter-trial interval was 3850-4850
ms beginning from the disappearance of the cue. Water rewards were delivered using a
gravity-based system (Crist Instruments). Reward delivery lasted 50 ms on small-reward
trials (0.04 ml) and 700 ms (0.88 ml, monkey V) or 825 ms (1.05 ml, monkey Z) on
large-reward trials. To minimize the effects of physical preparation, licking the water
spout was not required to obtain rewards; water was delivered directly into the mouth.
The task proceeded in blocks of 24 trials, each block containing a randomized
sequence of all 3x2x2x2 combinations of choice type (forced-information, forced-
random, choice), reward size (large, small), random cue shape (cross, waves), and
informative target location (left, right). Thus, the random cues were actually quasi-
random and could theoretically yield a small amount of information about reward size,
but extracting that information would require a very difficult feat of working memory.
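The block structure described above can be sketched as follows (a hypothetical helper for illustration, not the actual REX task code):

```python
import itertools
import random

def make_block(rng=random):
    """One 24-trial block: a shuffled sequence of all 3 x 2 x 2 x 2
    combinations of choice type, reward size, random-cue shape, and
    informative-target location."""
    trials = list(itertools.product(
        ('forced-information', 'forced-random', 'choice'),
        ('large', 'small'),
        ('cross', 'waves'),
        ('left', 'right'),
    ))
    rng.shuffle(trials)
    return trials

block = make_block()
# Every combination appears exactly once per block, which is why the
# "random" cue shapes are only quasi-random: within a block their
# counts are perfectly balanced across reward sizes.
```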
If monkeys made an error (broke fixation on the central spot, or failed to choose a
target, or broke fixation on the chosen target before the cue appeared), then the trial
terminated, an error tone sounded, an additional 3 seconds were added to the inter-trial
interval, and the trial was repeated (correction trial). If the error occurred after the
choice, only the chosen target was available on the correction trial.
The information delay task (Figure 2.3) was identical to the information choice
task except the cue colors and shapes were different, and a third set of always-
informative gray cues lasting for 1500 ms was appended to the cue period. (There were
also minor differences in the task parameters for monkey Z: the duration of the first cue
was 2000 ms, and the big reward volume was ~1.29 ml). The 1500 ms duration of the
always-informative cue was chosen to allow near-optimal physical preparation for
rewards. With a shorter cue duration (e.g. < 750 ms), there might not be enough time to
discriminate the cue and make a physical response (e.g. compare to the latency of
anticipatory licking in (Tobler et al., 2003)). With a longer cue duration (e.g. > 2
seconds), physical preparation for reward delivery begins to be impaired by timing errors
(e.g. compare to the timecourse of anticipatory licking in (Fiorillo et al., 2008; Kobayashi
and Schultz, 2008)). To perform a reversal (vertical lines in Figure 2.3B), the colors of
the informative and random cues were switched.
Neural Recording. Midbrain dopamine neurons were recorded using techniques
described previously (Matsumoto and Hikosaka, 2007). A recording chamber was placed
over fronto-parietal cortex, tilted laterally by 35 degrees, and aimed at the substantia
nigra. The recording sites were determined using a grid system, which allowed recordings
at 1 mm spacing. Single-neuron recording was performed using tungsten electrodes
(Frederick Haer) that were inserted through a stainless steel guide tube and advanced by
an oil-driven micro-manipulator (MO-97A, Narishige). Single neurons were isolated on-
line using custom voltage-time window discrimination software (the MEX program
(Hays et al., 1982) adapted for the QNX operating system).
Neurons were recorded in and around the substantia nigra pars compacta and ventral
tegmental area. We targeted this region based on anatomical atlases and magnetic
resonance imaging (4.7T, Bruker). During recording sessions, we identified this region
based on recording depth and using landmarks including the somatosensory and motor
thalamus, subthalamic nucleus, substantia nigra pars reticulata, red nucleus, and
oculomotor nerve. Presumed dopamine neurons were identified by their irregular tonic
firing at 0.5-10 Hz and broad spike waveforms. We focused our recordings on presumed
dopamine neurons that responded to the task and appeared to carry positive reward
signals. Occasional dopamine-like neurons that upon examination showed no differential
response to the cues and no differential response to the reward outcomes were not
recorded further. We then analyzed all neurons that were recorded for at least 60 trials
and that had positive reward discrimination for both informative cues and random
outcomes, or positive reward discrimination for cues and no discrimination for outcomes,
or positive reward discrimination for outcomes and no discrimination for cues (P < 0.05,
Wilcoxon rank-sum test). We were able to examine the response properties of 108
neurons, 84 of which met our criteria for presumed dopaminergic firing rate, pattern, and
spike waveform, and 47 of which also met our criteria for trial count and significant
reward signals. This yielded 20 neurons from monkey V (right hemisphere) and 27
neurons from monkey Z (left hemisphere) for our analysis.
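The inclusion rule above can be sketched in code. This is a hypothetical reimplementation: it substitutes a permutation test on the difference of mean firing rates for the Wilcoxon rank-sum test used in the text, and all inputs are illustrative; only the P < 0.05 criterion and the 60-trial minimum come from the source.

```python
import random

def perm_p(x, y, n_perm=2000, rng=random.Random(0)):
    """Two-tailed permutation P-value for a difference in mean firing rate
    (standing in for the Wilcoxon rank-sum test of the original analysis)."""
    mean = lambda v: sum(v) / len(v)
    obs = abs(mean(x) - mean(y))
    pool = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pool)
        if abs(mean(pool[:len(x)]) - mean(pool[len(x):])) >= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def include_neuron(cue_lg, cue_sm, out_lg, out_sm, n_trials, alpha=0.05):
    """Selection rule: at least 60 trials, and positive reward discrimination
    for cues and/or outcomes, with the other epoch either also positive or
    showing no significant discrimination at all."""
    if n_trials < 60:
        return False
    mean = lambda v: sum(v) / len(v)
    cue_sig = perm_p(cue_lg, cue_sm) < alpha
    out_sig = perm_p(out_lg, out_sm) < alpha
    cue_pos = cue_sig and mean(cue_lg) > mean(cue_sm)
    out_pos = out_sig and mean(out_lg) > mean(out_sm)
    return ((cue_pos and out_pos)
            or (cue_pos and not out_sig)
            or (out_pos and not cue_sig))
```

Note that a neuron with significant but negative reward discrimination in either epoch fails all three clauses, matching the exclusion of non-positive reward signals described above.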
Data Analysis. All statistical tests were two-tailed. The neural analysis excluded error
trials and correction trials. We analyzed neural activity in time windows 150-500 ms after
target onset (targets), 150-300 ms after cue onset (cues), and 200-450 ms after cue offset
(rewards). These were chosen to include the major components of the average neural
response. The neural discrimination between a pair of task conditions was defined as the
area under the receiver operating characteristic (ROC), which can be interpreted as the
probability that a randomly chosen single-trial firing rate from the first condition was
greater than a randomly chosen single-trial firing rate from the second condition (Green
and Swets, 1966). We observed the same results using other measures of neural
discrimination such as the signal-to-baseline ratio and signal-to-noise ratio. Confidence
intervals and significance of the population averages of single-neuron ROC areas (Figure
2.6A-C) were computed using a bootstrap test with 200,000 resamples (Efron and
Tibshirani, 1993). Consistent with previous studies of reward coding (Schultz and Romo,
1990; Kawagoe et al., 2004; Roesch et al., 2007; Matsumoto and Hikosaka, 2007) we
observed similar neural coding of behavioral preferences for both of the target locations
on the screen (average ROC area, forced information vs. forced random: ipsilateral =
0.60, P < 10^-4, contralateral = 0.62, P < 10^-4; choice information vs. forced random:
ipsilateral = 0.58, P < 10^-4, contralateral = 0.62, P < 10^-4), so for all analyses the data were
combined. We could not analyze activity on choice-random trials due to their rarity (< 3
trials for most neurons). All correlations were computed using Spearman's rho (rank
correlation). To compare neural discrimination measured using either forced-information
or choice-information trials in independent data sets, we calculated the correlation
between two values, the discrimination between forced-information trials vs. even-
numbered forced-random trials, and the discrimination between choice-information trials
vs. odd-numbered forced-random trials (rho = 0.68, P < 10^-4). Significance of
correlations, and of the difference in mean ROC area between the two monkeys (Figure
2.6), were computed using permutation tests (200,000 permutations) (Efron and
Tibshirani, 1993).
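The two core statistics of this analysis can be sketched in a few lines. This is a hypothetical stdlib-only reimplementation, not the original analysis code:

```python
import random

def roc_area(a, b):
    """Area under the ROC curve for two samples of single-trial firing
    rates: the probability that a random draw from a exceeds a random
    draw from b, counting ties as 0.5 (Green and Swets, 1966)."""
    wins = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    return wins / (len(a) * len(b))

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05,
                 rng=random.Random(0)):
    """Percentile bootstrap confidence interval for the mean of a set of
    per-neuron statistics (e.g. single-neuron ROC areas)."""
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_resamples)
    )
    return (means[int(n_resamples * alpha / 2)],
            means[int(n_resamples * (1 - alpha / 2)) - 1])

# Identical distributions give chance-level discrimination (0.5);
# an ROC area of 1.0 means every trial in a exceeds every trial in b.
```

The ROC area used this way is the Mann-Whitney U statistic normalized by the number of trial pairs, which is why it admits the probabilistic interpretation given in the text.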
2.5 ACKNOWLEDGEMENTS
We thank M. Matsumoto, G. La Camera, B. J. Richmond, D. Sheinberg, and V. Stuphorn
for valuable discussions, and G. Tansey, D. Parker, M. Lawson, B. Nagy, J.W.
McClurkin, A.M. Nichols, T.W. Ruffner, L.P. Jensen, and M.K. Smith for technical
assistance. This work was supported by the intramural research program at the National
Eye Institute.
NOTE 2.A
Aversion to information by reinforcement learning algorithms based on TD(λ)
It may be surprising that models based on temporal-difference learning (TD
learning), which is formally indifferent to information, could show any preference in our
task. However, TD learning is only a method for reward prediction; it does not specify
how to use that knowledge to take action. When TD learning is coupled to a mechanism
for action-selection (as is necessary in models of animal behavior), new behavior can
emerge. Here we will describe the formalism of TD learning (Sutton and Barto, 1998),
explain a simple case in which it leads to biases in selection between risky and safe
actions, and extend this reasoning to our case of selection between informative and
uninformative actions.
TD learning formalism: definition of reward value
The goal of TD learning is to learn a value function that takes as input the current
state of the environment s and predicts the amount of expected reward r that will be
received in the future. We assume that at time t a learning agent observes the state of the
environment s_t, takes an action a_t, and then receives feedback in the form of a reward
r_{t+1} and a new state of the environment s_{t+1}. The agent-environment interaction can be
described as a Markov decision process, which is fully specified by three functions
(π, R, T): the agent's policy for choosing actions, π(s,a) = Pr(take action a | in state s),
and the environment's mapping from (state, action) pairs to rewards and new states: the
reward function R(s,a,r) = Pr(receive reward r | in state s and took action a) and the state
transition function T(s,a,s') = Pr(new state is s' | in state s and took action a). Given these
specifications, we can derive the full probability distribution over the agent's future
experiences starting from the current state s_t and action a_t and leading to any future
state s_{t+τ} and reward r_{t+τ}, e.g. written as p_{π,R,T}(s_{t+τ}, r_{t+τ} | s_t, a_t). We can
then define the value function V as the expectation of the sum of future rewards, starting
from the state s_t at the current moment in time t and considering all future states s_{t+τ}:

\[ V(s_t) \;=\; \mathbb{E}_{\pi,R,T}\!\left[\, \sum_{\tau=1}^{\infty} r_{t+\tau} \;\middle|\; s_t \right], \]
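As a concrete instance of this definition, the value of a state can be estimated by sampling episodes and averaging the summed future reward. The toy two-step chain below is an assumption made purely for illustration, not the task's Markov decision process:

```python
import random

def episode(rng):
    """One episode of a toy chain: state A pays r = 1 with probability 0.5,
    then state B pays r = 2 with probability 0.25, then the episode ends."""
    r1 = 1.0 if rng.random() < 0.5 else 0.0
    r2 = 2.0 if rng.random() < 0.25 else 0.0
    return r1 + r2

def v_start(n_episodes=200_000, seed=0):
    """Monte Carlo estimate of V(A) = E[ sum over tau >= 1 of r_{t+tau} ],
    the expectation of the summed future reward from the starting state."""
    rng = random.Random(seed)
    return sum(episode(rng) for _ in range(n_episodes)) / n_episodes

# Analytically V(A) = 0.5 * 1 + 0.25 * 2 = 1.0, and the Monte Carlo
# estimate converges to this value as n_episodes grows.
```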