The Role of Dopamine in
Information-Seeking
A thesis by Ethan S. Bromberg-Martin
Thesis Committee: Okihide Hikosaka, National Eye Institute, Advisor
Barry J. Richmond, National Institute of Mental Health, Chair
David L. Sheinberg, Brown University
Daeyeol Lee, Yale University, Outside Reader
Defended June 2, 2009
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Program in
the Department of Neuroscience at Brown University
May, 2010
Copyright 2009 by Ethan S. Bromberg-Martin. All rights reserved.
This dissertation by Ethan S. Bromberg-Martin is accepted in its present form by the Division of Biology and Medicine as satisfying the dissertation requirement of the degree of Doctor of Philosophy.

Date___________ ___________________________________
Okihide Hikosaka, M.D., Ph.D., Advisor
Laboratory of Sensorimotor Research
National Eye Institute
National Institutes of Health

Recommended to the Graduate Council

Date___________ ___________________________________
David L. Sheinberg, Ph.D., Reader
Department of Neuroscience
Brown University

Date___________ ___________________________________
Barry J. Richmond, M.D., Reader
Laboratory of Neuropsychology
National Institute of Mental Health
National Institutes of Health

Date___________ ___________________________________
Daeyeol Lee, Ph.D., Reader
Department of Neurobiology
Yale University

Approved by the Graduate Council

Date___________ ___________________________________
Sheila Bonde, Ph.D.
Dean of the Graduate School
Brown University
CURRICULUM VITAE Ethan S. Bromberg-Martin
Born April 27, 1983
Contact Information
Laboratory of Sensorimotor Research
49 Convent Drive, Room 2A50
Bethesda, MD 20892
Email: [email protected]
Education
Summer 2005 to Current
Candidate for Ph.D. in Neuroscience at Brown University,
as part of the Graduate Partnership Program with the
National Institutes of Health
Graduate Thesis: The Role of Dopamine in Information-Seeking
(Mentor: Dr. Okihide Hikosaka, National Eye Institute)
Fall 2001 to Spring 2005
Bachelor of Science in Computational Biology at Brown University,
magna cum laude
Honors Thesis: Partial-Order Alignment of RNA Structures
(Mentor: Dr. Franco Preparata, Computer Science Department)
Other Employment
Summer 2002, 2003, 2004
Internship at the Laboratory of Experimental and Computational Biology,
National Cancer Institute, at NCI-Frederick in Ft. Detrick, Maryland
Summer 2001
Internship at the Laboratory of Biochemical Genetics,
National Heart, Lung, and Blood Institute, at the NIH in Bethesda, Maryland.
Professional Memberships
Society for Neuroscience
Peer-Reviewed Publications
Bromberg-Martin ES, Hikosaka O. (2009) Midbrain dopamine neurons signal
preference for advance information about upcoming rewards. (Neuron, in press)
Hikosaka O, Bromberg-Martin E, Hong S, Matsumoto M. (2008) New insights on
the subcortical representation of reward. Curr. Opin. Neurobiol. 18, 203-208.
Conference Presentations
Bromberg-Martin E, Matsumoto M, Nakamura K, Nakahara H, Hikosaka O. (2009)
Multiple timescales of reward memory in lateral habenula and midbrain
dopamine neurons. Frontiers in Systems Neuroscience. Conference Abstract:
Computational and systems neuroscience.
Bromberg-Martin ES, Hikosaka O. Midbrain dopamine neurons signal preference for
advance information about upcoming rewards. (2008) Society for Neuroscience
Abstract, 34: 691.23.
Bromberg-Martin ES, Matsumoto M, Hikosaka O. Lateral habenula neurons respond
to the negative value of rewards and the positive value of elapsed time. (2007)
Society for Neuroscience Abstract, 33: 530.9.
Bromberg-Martin E, Mar Jonsson A, Marai GE, McGuire M. Hybrid Billboard
Clouds for Model Simplification. (2004) SIGGRAPH Poster Abstract.
PREFACE
Aristotle began his great work, Metaphysics, with the observation that "All men by
nature desire to know." In this thesis my aim is to take a first step toward understanding
how the brain motivates itself to seek information about the surrounding world. Chapter
1 introduces current knowledge about the behavioral phenomenon of information-seeking
and about the neural basis of reward seeking, in particular the reward-signaling midbrain
dopamine neurons that are the focus of this thesis. In chapter 2, I present the results of
experiments that show that macaque monkeys seek advance information about the size of
upcoming water rewards. Further, the same midbrain dopamine neurons that signal the
expected amount of water also signal the expectation of advance information, in a manner
that is correlated with behavioral preferences. In chapter 3, I present an analysis of an
existing dataset showing that in a task in which reward information is delivered at a
predictable time, dopamine neurons can use two distinct prediction strategies: one based
on a one-trial memory of past rewards, expressed at the start of the trial, and another
based on a long-timescale memory, expressed at the time when new reward information
is revealed. The same pattern is present in a major source of input to dopamine neurons,
reward-predicting cells of the lateral habenula. Chapter 4 summarizes the main results of
this thesis and suggests new avenues for future research.
ACKNOWLEDGEMENTS
This thesis was completed in the Laboratory of Sensorimotor Research in the
National Eye Institute (NEI) at the National Institutes of Health (NIH) and at Brown
University between June 2005 and May 2009. It was made possible by the generous
support of the NIH intramural research program, and by the Graduate Partnerships
Program (GPP) between the NIH and the Brown University Department of Neuroscience.
I thank my advisor, Okihide Hikosaka, for his continual support and excellent
mentorship, especially his keen judgment of experimental design and advice about
building a program of research. He is as good an example of a scientist and a gentleman
as a budding neuroscientist could hope to have as a mentor. Most of all, he makes the
process of scientific discovery enjoyable. Once a visitor asked me why I was in the lab so
late, and suggested that I should spend less time working. I replied, "It's not working, it's
FUNNING!" If it wasn't for Okihide, I wouldn't have been able to say that!
I thank my thesis committee members Barry J. Richmond and David L. Sheinberg
for always being available to discuss matters related to my thesis and for their insightful
input, and my outside reader Daeyeol Lee for valuable advice and a keen eye for detail. I
thank my colleagues at the NIH, especially Masayuki Matsumoto for his collaboration
and experimental and research advice, and Kae Nakamura, Long Ding, Masaki Isoda,
Simon Hong, Masaharu Yasuda, Shinya Yamamoto, and Yoshihisa Tachibana; fellow
fellows including Ilya Monosov, Alexander Lerchner, Giancarlo la Camera, Christian
Quaia, Seiji Tanabe, Ralf Haefner, Hendrikje Neinborg, Rebecca Berman, Jason Trageser,
Wilsaan Joiner, Kerry McAlonan, and James Cavanaugh; intrepid investigators including
Bruce Cumming, Kirk Thompson, and Lance Optican; and excellent experts including
Mitchell Smith, John McClurkin, Art Hays, Altah Nichols, Tom Ruffner, Leonard Jensen,
Ginger Tansey, Denise Parker, Mark Lawson, and Bethany Nagy. I also thank Veit
Stuphorn for advice about behavioral experiments, and I thank my voluminous
correspondent and insightful collaborator, Hiroyuki Nakahara.
I thank my colleagues in the Brown-NIH GPP program, especially Avinash Pujala
and Ilya Monosov for their scholarly and other communications, and Emmette Huchison,
Jodi Gilman, Lindsay Hayes, Cole Graydon, Caroline Graybeal and Shau-Ming Wei for
their support and advice. And my colleagues at Brown, especially Arjun Bansal, Luke
Woloszyn, Ryan Mruczek, Jedediah Singer, and Carlos Vargas-Irwin. I thank those who
supported me in the Brown-NIH Graduate Partnership Program, including Diane
Lipscombe, Jerome Sanes, John Isaac, Katherine Roche, Caroline Duffy, Mary DeLong,
and Sharon Milgram. I also thank my professors and teachers at Brown who showed me
how exciting cognitive brain science can be, including Michael Black, Frank Wood,
Lucien Elie Bienenstock, Mayank Mehta, Stuart Geman, John Stein, Michael Paradiso,
David Sheinberg, Michael Tarr, Fulvio Domini, and Steven Sloman.
Finally, I thank my parents, Rob and Laura; my brother, Brett; and my parents of
parents, Richard, Helen, Henry, and Arlene. For your love, support, advice,
understanding, and general awesomeness, too immense and profound to be contained
within words on a printed page.
TABLE OF CONTENTS
Curriculum Vitae iv
Preface vii
Acknowledgements viii
Table of Contents x
List of Figures xii
Chapter 1: Introduction .......................................................................................................... 1
1.1 Information-seeking in humans and animals 2
1.2 Reward-predictive signals of midbrain dopamine neurons 6
1.3 Reward-predictive signals of lateral habenula neurons 12
1.4 Summary 15
Chapter 2: Midbrain dopamine neurons signal preference for advance information about upcoming rewards .............................................................................................................................. 16
2.1 Introduction 17
2.2 Results 19
Behavioral preference for advance information 19
Dopamine neurons signal advance information 23
2.3 Discussion 29
Implications of information-seeking for attitudes toward uncertainty 31
Information signals in the dopaminergic reward system 32
Why do dopamine neurons treat information as a reward? 33
2.4 Methods 35
2.5 Acknowledgements 40
Note 2.A: Aversion to information by reinforcement learning algorithms based on TD(λ) 41
Chapter 3: Multiple timescales of memory in a network for reward prediction .......................... 52
3.1 Introduction 53
3.2 Results 55
Behavioral task and optimal timescale of memory 55
Behavioral and neural memory for a single previous reward outcome 58
Multiple timescales of memory 62
Multiple timescales of memory in tonic neural activity 67
Timecourse of the timescale of memory 71
3.3 Discussion 72
3.4 Methods 77
3.5 Acknowledgements 81
Note 3.A: Optimal reward prediction in the pseudorandom reward schedule 82
Note 3.B: Formal model of reward expectation during the task 85
Note 3.C: Single neurons accessed both short- and long-timescale memories 90
Note 3.D: Reward memory in lateral habenula response to predictable reward omission 95
Note 3.E: Rough independence of lateral habenula tonic and phasic memory effects 97
Note 3.F: V-Shaped timecourse of the timescale of reward memory 98
Note 3.G: No evidence for memory-change-induced prediction-errors 99
Chapter 4: Summary and Conclusions ............................................................................................. 104
4.1 Chapter 2: Summary and future directions 104
4.2 Chapter 3: Summary and future directions 115
4.3 Summary of contributions 123
Bibliography 125
LIST OF FIGURES
Figure 1.1: A dopamine neuron following the prediction-error theory 9
Figure 1.2: Lateral habenula neurons carry negative reward signals 14
Figure 2.1: Behavioral preference for advance information 20
Figure 2.2: Control for water volume of the informative and random options 21
Figure 2.3: Behavioral preference for immediate delivery of information 23
Figure 2.4: Dopamine neurons signal information 25
Figure 2.5: Analysis of the dopamine neuron population 26
Figure 2.6: Correlation between neural discrimination and behavioral preference 28
Figure 2.7: Aversion to information by a model using SARSA(λ) and softmax action selection 50
Figure 3.1: Reward-biased saccade task 56
Figure 3.2: Behavioral performance and memory for a single past reward outcome 58
Figure 3.3: Neural responses to task events and memory for a single previous outcome 60
Figure 3.4: Multiple timescales of memory 64
Figure 3.5: Quantifying the timescale of memory 66
Figure 3.6: Timescale of memory in tonic neural activity 69
Figure 3.7: Timecourse of the timescale of memory 72
Figure 3.8: Optimal reward prediction in the pseudorandom reward schedule 84
Figure 3.9: Formal model of reward expectation during the task 88
Figure 3.10: Single neurons accessed both short- and long-timescale memories 93
Figure 3.11: Memory effect in lateral habenula response to predictable reward omission 96
Figure 3.12: Rough independence of lateral habenula tonic and phasic memory effects 97
Figure 3.13: V-Shaped timecourse of the timescale of reward memory 98
Figure 3.14: No evidence for memory-change-induced prediction-errors during the fixation period 102
Figure 4.1: Hypothetical dopamine neuron encoding of information prediction-errors 107
Figure 4.2: Necessity for advance information in solving the temporal credit-assignment problem 111
Figure 4.3: Hypothetical measurement of the neural and behavioral change between short- and long-timescale memories 121
CHAPTER 1
Introduction
In recent years neuroscientists have learned much about how the brain represents
objects of desire, also known as rewards. We now know that rewards are represented by
neurons in practically every cortical and subcortical area, from motivational brain
systems where rewards are anticipated and predicted, to motor systems where rewards
influence action selection and movement control, to sensory systems where rewards
enhance the neural representation of desirable objects as they are seen, heard, touched,
smelled, and tasted. Unfortunately, a major limitation of this research is that it is very
rarely possible to record the activity of single neurons in behaving human beings, and
instead recordings are largely restricted to experimental animals. So, with a few notable
exceptions (e.g., Williams et al., 2004; Klein et al., 2008; Zaghloul et al., 2009), our
knowledge about the neuronal basis of rewarding experience is largely limited to basic
rewards like food and water (Schultz, 2000). The neural pathways that regulate eating
and drinking are highly specialized and evolutionarily ancient, a testament to their
importance (Lenard and Berthoud, 2008; Denton et al., 1996); on the other hand, this may
mean that these neural pathways are not flexible enough to represent the more abstract,
cognitive rewards that are at the heart of our mental lives. In this thesis my goal is to take
a first step toward understanding the neuronal basis of cognitive rewards by studying a
form of cognitive reward-seeking that is shared between humans and animals: the desire
for advance information about future events.
1.1 Information-seeking in humans and animals
In 350 B.C., Aristotle began his great work, Metaphysics, with the assertion that "All
men by nature desire to know." This desire arises not simply because knowledge can be put
to use in achieving one's goals, but because knowledge is intrinsically desirable in its
own right: "For not only with a view to action, but even when we are not going to do
anything, we prefer seeing (one might say) to everything else. The reason is that this,
most of all the senses, makes us know and brings to light many differences between
things" (Aristotle and Ross, 2006). This philosophical assertion appeals to the intuition:
surely it is better to seek knowledge than to seek ignorance! Yet it was more than two
millennia before this concept of information-seeking became the subject of scientific
study.
This topic was first introduced to experimental psychology by L. Benjamin Wyckoff
during his studies of discrimination learning in pigeons (Wyckoff, 1952). He coined the
term "observing response" to mean any action that brought an animal into contact with a
discriminative stimulus (a stimulus used to choose between a set of possible responses).
For example, a pigeon in a Skinner box might turn its head toward a colored light
(observing response), which would then tell the pigeon which button to peck to get a food
reward (discriminative response). In this case the information is obviously beneficial
because it helps the animal gain a greater amount of food. However, several researchers
found that rats and pigeons persistently observe reward-indicating cues even when their
observations have no effect on the reward rate (Prokasy, 1956; Hendry, 1969; Daly,
1992; Roper and Zentall, 1999). In a prototypical "observing behavior" experiment, a rat
is given a choice between two chambers, after which the rat is confined in the chamber
for 30 seconds and then given a food reward with 50% probability. In one chamber, the
color of the walls is informative, telling the rat whether the trial will be rewarded; in the
other chamber the color of the walls is uninformative, entirely randomized. Under these
conditions, rats express a strong preference for the informative chamber (Prokasy, 1956;
Daly, 1992). The preference for informative cues is strongest when the possibility of
receiving a reward is most attractive: for example, when the rats are hungry, when
rewards are scarce, and when the potential reward size is large (Wehling and Prokasy,
1962; McMichael et al., 1967; Mitchell et al., 1965). Rats also prefer cues that are
informative about upcoming punishments (aversive events), showing the strongest
preference when the intensity of the possible punishment is largest (Badia et al., 1979).
Thus, animals do not seek all kinds of information equally, but specifically seek
information about future events that have a large positive or negative value.
An independent line of inquiry1 into information-seeking began in the field of
economics with a theoretical paper by Kreps and Porteus, introducing a concept they
called temporal resolution of uncertainty (Kreps and Porteus, 1978). According to their
theory, decision-makers may express a preference between two options that have an
identical set of possible outcomes but differ in the time when the outcome becomes
1 I cannot find any paper from the literature on observing behavior that references any paper from the literature on temporal resolution of uncertainty, or vice versa. Apparently this thesis is the first publication to link these two lines of inquiry.
known; that is, the time when the decision-maker's uncertainty about the outcome is
resolved. According to this theory, decision-makers can prefer early resolution of
uncertainty (behaving like the rats and pigeons in the above experiments), or late
resolution of uncertainty (ignoring information for as long as possible). Later experiments
showed that humans generally express a strong preference for early resolution of
uncertainty, seeking advance information about monetary gains and losses even when
they cannot use this knowledge to change the outcome (Chew and Ho, 1994; Ahlbrecht
and Weber, 1996; Lovallo and Kahneman, 2000; Eliaz and Schotter, 2007; Luhmann et
al., 2008).
Thus, both humans and animals seek advance information about future rewards, as if
information was itself rewarding. Yet advance information is fundamentally different
from conventional forms of reward. Most rewards are spoken of as discrete physical
entities that can be delivered and consumed at a specific moment in time. A reward is
an object, like an apple that can be eaten, or an attractive picture that can be consumed
by visual inspection. But consider the behavioral experiment described above, in which
rats chose between informative and uninformative chambers that led to probabilistic
rewards. Both options presented exactly the same two physical stimuli: a chamber that
could be white or black, and a reward that could be present or absent. Thus, the animals'
preference cannot be explained in terms of a rewarding stimulus that was present for
one option and absent for the other. The only difference between the options was that for
one chamber the color indicated the reward size, whereas for the other chamber it did not.
Crucially, this informative relationship could not be detected on any single trial, but had
to be inferred through repeated experience. Thus, the differential reward value between
the two options did not originate from any single external event, but from an inference or
realization, an internal event or sequence of events in the brain. In this sense the reward
value of advance information has a highly abstract, cognitive origin.
How could such an abstract form of reward be generated by the brain? Is it created
by the same neural networks that assign value to basic rewards like food and water? Or is
it handled by an entirely separate system? Much is known about how the brain controls
the physical act of consuming food and water; but what about the analogous mental act
of consuming information?
In this thesis I will take a first step toward answering these questions by recording
the activity of dopamine-releasing neurons in the midbrain. Dopamine neurons are an
ideal candidate for this purpose because they are one of the brains major sources of food
and water reward signals and are theorized to have a central role in reward learning. A
crucial test of this theory will be whether dopamine neurons carry informational reward
signals as well (Chapter 2). Furthermore, dopamine neuron activity is thought to serve as
a reinforcement signal that teaches the brain which actions are good and which are
bad. Thus, if these neurons carry information-related signals, then they are likely to play
a causal role in creating the informational preference seen in behavior. Finally, dopamine
neurons do not act as simple reward detectors, but instead appear to respond to the first
new information about rewards in the future. Thus, they provide an ideal opportunity to
study how reward-signaling neurons in the brain adapt their activity at the time when new
reward information is expected and consumed (Chapter 3).
In the next section I will describe the nature of reward signals in dopamine neurons,
along with the theoretical framework which I will use for interpreting their activity
(Section 1.2). I will also describe the activity pattern of neurons in one of their major
input nuclei, the lateral habenula, which I will analyze in detail in Chapter 3 in order to
better understand the source of dopamine neuron reward signals (Section 1.3).
1.2 Reward-predictive signals of midbrain dopamine neurons
Two regions of the midbrain, the ventral tegmental area (VTA) and the substantia
nigra pars compacta (SNc), contain a large number of neurons that release the
neurotransmitter dopamine2. These neurons send axons to many structures involved in
motor control, cognition, emotion, and memory - SNc neurons project primarily to the
dorsal striatum, while VTA neurons project to many areas including the ventral striatum,
amygdala, hippocampus, and wide swaths of frontal cortex (Oades and Halliday, 1987;
Graybiel, 1990). These neurons are well-studied partly because they are involved in a
variety of neurological disorders; in particular, the degeneration of SNc dopamine
neurons is the major cause of Parkinson's disease (Dauer and Przedborski, 2003). In this
thesis I will focus on another aspect of dopamine neurons, their role in reward learning.
Manipulations of dopamine neurotransmission cause profound effects on reward-
seeking. For example, when animals are infused with dopamine antagonists they
gradually reduce the effort they put forth to get food and water rewards, behaving as
though the reward size had suddenly decreased (Wise, 2004). Even on later days when
the drug is no longer administered, animals still continue to put forth reduced effort, as if
the suppression of dopamine neurotransmission had caused them to learn that the task
2 For convenience, they will here be referred to as "dopamine neurons" rather than the more complete phrase "midbrain dopamine neurons." In fact there are several populations of dopamine neurons in the nervous system, including cells in the retina, olfactory bulb, and hypothalamus (Bjorklund, A., and Dunnett, S.B. (2007). Dopamine neuron systems in the brain: an update. Trends in Neurosciences 30, 194-202.).
had a low reward value. In some cases an animal's reward learning can be impaired or
even fully blocked by interfering with dopamine transmission in very localized regions
of the brain (e.g., Liu et al., 2004; Dalley et al., 2005; Nakamura and Hikosaka, 2006).
Dopamine release is not only linked to natural rewards but also seems to be crucial for
many 'unnatural' ones as well, as most addictive drugs potentiate dopamine release in one
way or another (Kauer, 2004) and humans taking dopamine agonists as a medical
treatment sometimes develop addiction-like traits like compulsive gambling and
hypersexuality (Ondo and Lai, 2008; Bostwick et al., 2009). Dopamine release appears to
act as a reward or reinforcement signal, as animals can be trained to perform simple tasks
in exchange for direct burst stimulation of dopamine neuron axons (Tsai et al., 2009).
Consistent with the importance of dopamine in reward learning, a seminal series of
studies conducted by Wolfram Schultz and his colleagues in Fribourg found that these
neurons emitted phasic bursts of spikes triggered by food and water rewards (Schultz,
1998). Neurons fired bursts when monkeys drank a squirt of juice, when they touched or
saw a morsel of food, and when they observed visual or auditory cues indicating the
availability of these rewards. Perplexingly, each of these responses was present in some
tasks but absent in others. Furthermore, as monkeys gained practice at a task, some of
these bursts became stronger while other responses entirely disappeared, or could even be
converted into brief inhibitions. A number of theories have been proposed to account for
this puzzling pattern of dopamine neuron activity.
In this thesis I will discuss dopamine neurons primarily in the framework of the most
successful of these theories, based on the computational theory of temporal-difference
learning (TD learning) (Sutton, 1988; Sutton and Barto, 1998). According to this theory,
we continuously predict the expected future reward value available from our present
situation. Whenever we receive new information about the world we re-evaluate our
situation, reacting to good news by revising our reward prediction upwards and reacting
to bad news by revising it downwards. This mental re-evaluation triggers what is known
as a "prediction-error" or "TD-error" signal, reflecting the difference between the new
predicted value and the old predicted value (the "temporal difference"). Dopamine
neuron activity is hypothesized to represent this prediction-error signal. Specifically, if
new information indicates that the situation is better than originally predicted, then it
triggers a positive prediction-error and dopamine neurons should burst; if the situation is
exactly as good as predicted, then there is zero prediction-error and dopamine neurons
should not respond; and if the situation takes a turn for the worse, it triggers a negative
prediction-error and dopamine neurons should pause (Figure 1.1). Thus dopamine
activity serves as the "first derivative" of value, signaling how the value of a situation
changes over time (Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997).
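The prediction-error computation described by this theory can be sketched in a few lines of Python. This is a toy illustration of a TD-style error signal, not code from the thesis; the function name and the example values are my own:

```python
def td_error(reward, new_value, old_value, discount=1.0):
    """Temporal-difference error: reward received plus the (discounted) new
    estimate of future value, minus the old estimate (the 'temporal difference')."""
    return reward + discount * new_value - old_value

# Good news: a cue raises the predicted value from 0.5 to 1.0 -> positive error (burst).
burst = td_error(reward=0.0, new_value=1.0, old_value=0.5)

# Fully predicted reward: the reward arrives just as the prediction is "used up" -> zero error.
silent = td_error(reward=1.0, new_value=0.0, old_value=1.0)

# Bad news: no reward, and the predicted value falls to zero -> negative error (pause).
pause = td_error(reward=0.0, new_value=0.0, old_value=0.5)

print(burst, silent, pause)  # 0.5 0.0 -0.5
```

The three cases mirror the three dopamine response patterns described above: burst for positive errors, no response for zero error, and a pause for negative errors.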
Figure 1.1. A dopamine neuron following the prediction-error theory. A monkey was presented with a visual cue and then 2.25 seconds later a big or a small water reward was delivered. Different cues indicated a 100%, 50%, or 0% probability of a big reward. This neuron was recorded for the study described in Chapter 2. Each plot shows the dopamine neuron's activity for a particular cue and reward outcome. Top: the neuron's mean firing rate at each time during the trial. Bottom: the neuron's spikes on the first 36 trials recorded in each condition. Each row is a trial, and each dot is a spike. When the cues indicated the upcoming reward size, the neuron fired a burst of spikes when the monkey predicted a big reward and paused when the monkey predicted a small reward (top two panels). The neuron then had little response to the predictable reward outcome. When the cues were uninformative about the upcoming reward size (50% cue), the neuron had little response to the cue but then fired a burst of spikes for an unpredicted big reward and paused for an unpredicted small reward. Thus, this neuron signaled changes in the animal's prediction about the trial's reward value.
This theory provides a simple explanation for a wide variety of findings about
dopamine neuron activity (Schultz et al., 1997; Schultz, 1998). It explains why dopamine
neuron responses to a reward are relative to the reward's predicted size; when two
different reward sizes are potentially available, neurons burst in response to the large
reward and pause in response to the small reward (Figure 1.1, bottom right). If a reward
has the same value as originally predicted, dopamine neurons have little or no response
(Figure 1.1, top right). Similarly, when two different reward-indicating cues could
potentially be shown, neurons burst for the big-reward cue and pause for the small-
reward cue (Figure 1.1, top left). In general, responses to reward-indicating cues are
positively related to the cued reward value, while responses to the later reward outcome
are negatively related to the cued value, as if the neurons were computing the prediction-
error: Actual Value − Predicted Value (Fiorillo et al., 2003; Satoh et al., 2003; Morris
et al., 2004; Nakahara et al., 2004). This neural prediction of reward value tracks
behavioral measures of reward prediction during a large number of experimental
situations, including learning and extinguishing cue-reward relationships (Schultz et al.,
1993; Hollerman and Schultz, 1998; Takikawa et al., 2004; Day et al., 2007), estimating
the reward delivery time (Fiorillo et al., 2008; Kobayashi and Schultz, 2008), judging the
reward probability using complex cue-combination rules (Waelti et al., 2001; Tobler et
al., 2003), and estimating the reward value on the current trial based on the outcomes of
previous trials (Satoh et al., 2003; Nakahara et al., 2004; Bayer and Glimcher, 2005).
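The way the error response migrates from the reward to the cue over the course of learning can be illustrated with a toy TD(0)-style simulation. The cue names, learning rate, and the simplification that the pre-cue baseline prediction is zero are all my own assumptions, not details from the experiments cited above:

```python
alpha = 0.2                                  # learning rate (arbitrary choice)
value = {"big_cue": 0.0, "small_cue": 0.0}   # learned value of each hypothetical cue

def run_trial(cue, reward):
    """One trial: the cue appears, then the reward is delivered. Returns the
    prediction-errors at cue onset and at reward delivery."""
    cue_error = value[cue] - 0.0             # baseline prediction before the cue: 0
    reward_error = reward - value[cue]       # actual value minus predicted value
    value[cue] += alpha * reward_error       # nudge the cue's value toward the outcome
    return cue_error, reward_error

# Early in training the cue is meaningless and the reward is a complete surprise.
first_cue_err, first_reward_err = run_trial("big_cue", reward=1.0)

# After many pairings, the error has transferred from the reward to the cue.
for _ in range(100):
    run_trial("big_cue", reward=1.0)
late_cue_err, late_reward_err = run_trial("big_cue", reward=1.0)

print(round(first_cue_err, 2), round(first_reward_err, 2))  # 0.0 1.0
print(round(late_cue_err, 2), round(late_reward_err, 2))    # 1.0 0.0
```

This reproduces, in miniature, the pattern in Figure 1.1: once the cue reliably predicts the reward, the burst occurs at the cue and the fully predicted reward itself evokes almost no error.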
The TD learning theory provides a very good account of reward-related dopamine
neuron activity, good enough that in this thesis I will use the theory as the major
framework for describing my results. However, I should emphasize that this theory is not
a perfect or ultimate description of dopamine neuron activity. Indeed, the most important
results in Chapters 2 and 3 are strikingly unexpected from a TD learning point of view.
Therefore, it is wise to bear in mind the limitations of this framework. In particular,
although there is a good match between theory and experiment in the realm of rewards,
the match is less good for neutral and aversive stimuli. Dopamine neurons can respond to
neutral but surprising or startling stimuli such as bright lights and loud tones (Steinfels et
al., 1983; Strecker and Jacobs, 1985; Horvitz et al., 1997; Horvitz, 2000; Redgrave and
Gurney, 2006). The function of this activity is unknown. In addition, some dopamine
neurons respond to aversive events by being strongly inhibited (as expected from the
theory), but other dopamine neurons are strongly excited (Trulson and Preussler, 1984;
Guarraci and Kapp, 1999; Ungless et al., 2004; Joshua et al., 2008; Brischoux et al.,
2009; Matsumoto and Hikosaka, 2009). In this thesis I cannot tell how many of the
neurons I studied belong to this aversive-excited subpopulation because the neurons were
only tested with reward-related events. A second limitation of the theory is that it is only
designed to account for the mean pattern of dopamine neuron activity, as if all dopamine
neurons carried the same signal. Several studies have reported that there is more neuron-
to-neuron variability in response patterns than this simple account would suggest (Schultz,
1986; Nishino et al., 1987; Magarinos-Ascone et al., 1992; Kiyatkin and Rebec, 1998;
Ravel and Richmond, 2006; Wilson and Bowman, 2006).
Finally, an important limitation of dopamine neuron recording in behaving animals is
that there is no definitive test to tell whether a recorded neuron truly releases the
neurotransmitter dopamine. In this thesis I analyzed neurons which met standard
electrophysiological criteria thought to be good indicators that a neuron is dopaminergic
(Schultz, 1986) and which also carried positive reward signals like those that have been
reported in previous studies (Schultz, 1998). So when I say "dopamine neurons did X," it should be strictly interpreted as "reward-signaling presumed dopamine neurons did X."
There is evidence that the standard criteria are not perfect: they may exclude certain
subpopulations of dopamine neurons, and may sometimes include nearby GABA-
releasing interneurons (Ungless et al., 2004; Margolis et al., 2006; Margolis et al., 2008;
Lammel et al., 2008). However, the major findings in this thesis are not based on any
single neuron, but on the predominant patterns of activity seen in the population as a
whole. Thus the new findings in this thesis can be interpreted as the typical response
properties of the majority of reward-responsive dopamine neurons.
1.3 Reward-predictive signals of lateral habenula neurons
Given that dopamine neurons have an important role in reward-seeking, it is crucial
to discover how dopamine neurons compute their reward-predictive signals. Dopamine
neurons receive input from many brain areas, including the pedunculopontine tegmental
nucleus, dorsal and ventral striatum, ventral pallidum, central amygdala, dorsal raphe,
lateral hypothalamus, and lateral habenula (Schultz, 1998; Fudge and Haber, 2000;
Geisler et al., 2007). In this thesis I will analyze the activity of neurons in one of these
areas, the lateral habenula, which is an especially strong candidate for sending value
signals to dopamine neurons.
The lateral habenula is a small nucleus in the habenular complex of the epithalamus,
located on the dorsal surface of the thalamus (Parent et al., 1981; Herkenham and Nauta,
1977, 1979). The habenula sends its outputs through the fasciculus retroflexus, a bundle
of fibers which travels anteriorly and ventrally before branching off into different regions
of the brain, including the VTA and SNc. The full mixture of neurotransmitters is
unknown, although at least some of the projection neurons are glutamatergic (Geisler et
al., 2007). The lateral habenula receives inputs from a number of brain regions, including
reciprocal connections from VTA dopamine neurons (Oades and Halliday, 1987), but
receives its most prominent anatomical input from the lateral hypothalamus and globus
pallidus (Parent et al., 1981; Sutherland, 1982; Hong and Hikosaka, 2008).
How does the lateral habenula influence midbrain dopamine neurons? In an early
study in anesthetized rats, Christoph and colleagues demonstrated that electrical
stimulation of the lateral habenula caused a powerful short-latency inhibition of more
than 85% of SNc and VTA dopamine neurons (Christoph et al., 1986). This inhibition is
likely to be disynaptic, such that lateral habenula neurons excite GABAergic interneurons
in the midbrain, which in turn inhibit dopamine neurons (Ji and Shepard, 2007). Single-
neuron recordings in behaving animals provide strong evidence that lateral habenula
neurons make use of this inhibitory influence over dopamine neurons in order to help
dopamine neurons compute their prediction-error signals (Matsumoto and Hikosaka,
2007). Lateral habenula and SNc dopamine neurons had mirror-image responses to
reward-related task events: whereas SNc dopamine neurons were excited by reward-
predicting cues and unexpected rewards, habenula neurons were inhibited; and whereas
dopamine neurons were inhibited by reward-omission cues and unexpected omissions,
habenula neurons were strongly excited (Figure 1.2). The response to reward-omission
cues occurred about 40 milliseconds earlier in habenula neurons than in dopamine
neurons, suggesting that habenula excitations could be the source of the dopamine neurons' negative reward signals. Indeed, the dopamine neurons that were most inhibited by reward-omission cues
were also most strongly inhibited by lateral habenula electrical stimulation. Thus, lateral
habenula neurons appear to carry a negative prediction-error signal that is likely to be the
source of the positive prediction-error signals of dopamine neurons, or at least of the inhibitory dopamine response to negative events.
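The sign-inverted, time-shifted relationship described above can be caricatured in a few lines. This is a toy linear model of my own, not from the thesis; the lag and unit gain are illustrative assumptions (the ~40 ms habenula lead motivates the default delay):

```python
import numpy as np

def dopamine_from_habenula(lhb_rate, delay_bins=4, gain=1.0):
    """Toy disynaptic-inhibition model: DA(t) = -gain * LHb(t - delay).

    lhb_rate   : habenula firing deviations from baseline, one value per time bin
    delay_bins : lag in bins (e.g. 4 bins of 10 ms, roughly the 40 ms lead
                 reported by Matsumoto and Hikosaka, 2007)
    """
    lhb_rate = np.asarray(lhb_rate, dtype=float)
    da = -gain * np.roll(lhb_rate, delay_bins)
    da[:delay_bins] = 0.0  # no habenula input has arrived in the first bins
    return da

# A habenula excitation (e.g. to a reward-omission cue) ...
lhb = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
da = dopamine_from_habenula(lhb, delay_bins=2)
# ... reappears two bins later as a dopamine inhibition
```

The point of the sketch is only the sign inversion and the latency, the two features the recording and stimulation data constrain.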
Figure 1.2. Lateral habenula neurons carry negative reward signals. A monkey was presented with a visual target (cue). The monkey made a saccade to the target, which was followed 200 ms later by a water reward or by no reward (outcome). The position of the target on the screen indicated whether the trial would be rewarded. On 4% of trials the position-reward mapping was reversed and so an unpredicted reward outcome was delivered (e.g. the cue indicated that a reward was coming but instead no reward was delivered). This plot shows the population average activity of reward-responsive lateral habenula and dopamine neurons from the dataset analyzed in Chapter 3. Arrows indicate cue or outcome onset. The firing rate was smoothed with a Gaussian kernel (σ = 15 ms). Lateral habenula neurons were inhibited by reward cues (left) and unpredicted reward deliveries (right); they were excited by no-reward cues (left) and unpredicted reward omissions (right); and they had less response to reward outcomes that were cued and highly predictable (middle). Dopamine neurons carried a mirror-image signal.
1.4 Summary
Humans and animals prefer to view informative cues that indicate the value of an
upcoming reward in advance of its delivery. Midbrain dopamine neurons signal changes
in an animal's expectation of food and water rewards. Their food- and water-predictive
activity may originate in lateral habenula neurons which carry a sign-reversed signal.
Dopamine release is thought to act as a reinforcer that teaches the brain which actions are
good and should be repeated and which are bad and should be avoided. If dopamine
neurons truly have such a fundamental role in reward learning, then they should not only
signal the value of basic rewards but should also signal the value of advance information
about those rewards. The information-predictive signals of dopamine neurons would then
be a major candidate for creating the informational preferences seen in behavior.
CHAPTER 2
Midbrain dopamine neurons signal preference for
advance information about upcoming rewards
SUMMARY
The desire to know what the future holds is a powerful motivator in everyday life,
but it is unknown how this desire is created by neurons in the brain. Here we show that
when macaque monkeys are offered a water reward of variable magnitude, they seek
advance information about its size. Furthermore, the same midbrain dopamine neurons
that signal the expected amount of water also signal the expectation of information, in a
manner that is correlated with the strength of the animal's preference. Our data shows that
single dopamine neurons process both primitive and cognitive rewards, and suggests that
current theories of reward-seeking must be revised to include information-seeking.3
3 The major portion of this chapter is from the paper "Midbrain dopamine neurons signal preference for advance information about upcoming rewards" by Ethan S. Bromberg-Martin and Okihide Hikosaka, currently accepted by Neuron.
2.1 INTRODUCTION
Dopamine-releasing neurons located in the substantia nigra pars compacta and
ventral tegmental area are thought to play a crucial role in reward learning (Wise, 2004).
Their activity bears a remarkable resemblance to "prediction errors" signaling changes in
a situation's expected value (Schultz et al., 1997; Montague et al., 2004). When a reward
or reward-predictive cue is more valuable than expected, dopamine neurons fire a burst of
spikes; if it has the same value as expected, they have little or no response; and if it is less
valuable than expected, they are briefly inhibited. Based on these findings, many theories
invoke dopamine neuron activity to explain human learning and decision-making
(Holroyd and Coles, 2002; Montague et al., 2004) and symptoms of neurological
disorders (Redish, 2004; Frank et al., 2004), inspired by the idea that these neurons could
encode the full range of rewarding experiences, from the primitive to the sublime.
However, their activity has almost exclusively been studied for basic forms of reward
such as food and water. It is unknown whether the same neurons that process these basic,
primitive rewards are involved in processing more abstract, cognitive rewards (Schultz,
2000).
We therefore chose to study a form of cognitive reward that is shared between
humans and animals. When people anticipate the possibility of a large future gain (such as an exciting new job, a generous raise, or having their research published in a prestigious scientific journal), they do not like to be held in suspense about their future
fate. They want to find out now. In other words, even when people cannot take any action
to influence the final outcome, they often prefer to receive advance information about
upcoming rewards. Related concepts have been arrived at independently in several fields
of study. Economists have studied "temporal resolution of uncertainty" (Kreps and
Porteus, 1978), and have shown that humans often prefer their uncertainty to be resolved
earlier rather than later (Chew and Ho, 1994; Ahlbrecht and Weber, 1996; Eliaz and
Schotter, 2007; Luhmann et al., 2008). Experimental psychologists have studied
"observing behavior" (Wyckoff, 1952), and have shown that a class of observing behavior that produces reward-predictive cues can be a powerful motivator for rats,
pigeons, and humans (Wyckoff, 1952; Prokasy, 1956; Daly, 1992; Lieberman et al.,
1997).
To date, however, there has not been a rigorous test of this preference in non-human
primates, the animals in which the reward-predicting activity of dopamine neurons has
been best described (Schultz, 2000; Schultz et al., 1997; Montague et al., 2004).
Unfortunately, previous non-human primate studies required animals to work for rewards
by making a costly physical response, so that information about when rewards were
available allowed animals to save considerable physical effort (Kelleher, 1958; Steiner,
1967; Steiner, 1970; Lieberman, 1972; Woods and Winger, 2002). Other studies used an
unbalanced design in which animals pulled a lever to view informative cues but were not
offered a control lever for viewing uninformative cues, leaving it ambiguous whether
animals pulled the lever because the cues conveyed information or because of some other
stimulus property (Steiner, 1967; Schrier et al., 1980). To perform a rigorous test, it is
best to use a symmetrical choice procedure in which both informative and uninformative
options are available, in which they are both selected using the same physical response,
and in which both are matched for the timing and physical properties of both cue stimuli
and rewards (Daly, 1992; Roper and Zentall, 1999).
To this end, we developed a simple decision task allowing rhesus macaque
monkeys to choose whether to receive advance information about the size of an
upcoming water reward. We found that monkeys expressed a strong behavioral
preference, preferring information to its absence and preferring to receive the information
as soon as possible. Furthermore, midbrain dopamine neurons which signaled the
monkeys' expectation of water rewards also signaled the expectation of advance information, in a manner that was correlated with the animals' preference. These results
show that the dopaminergic reward system processes both primitive and cognitive
rewards, and suggest that current theories of reward-seeking must be revised to include
information-seeking.
2.2 RESULTS
Behavioral preference for advance information
We trained two monkeys to perform a simple decision task (information choice
task, Figure 2.1A). On each trial two colored targets appeared on the left and right sides
of a screen, and the monkey had to choose between them by making a saccadic eye
movement. Then, after a delay of a few seconds, the monkey received either a big or a
small water reward. The monkey's choice had no effect on the reward size: both reward
sizes were always equally probable. However, choosing one of the colored targets
produced an informative cue - a cue whose shape indicated the size of the upcoming
reward. Choosing the other color produced a random cue - a cue whose shape was
randomized and therefore had no meaning. The positions of the targets were randomized
on each trial. To familiarize monkeys with the two options, we interleaved choice trials
with forced-information trials and forced-random trials, in which only one of the targets
was available.
After only a few days of training, both monkeys expressed a strong preference to
view informative cues (Figure 2.1B). Monkey Z chose information about 80% of the
time, and monkey V's choice rate was even higher, close to 100%. Their preference for
advance information cannot be explained by a difference in the amount of water reward,
because information did not allow monkeys to obtain extra water from the reward-
delivery apparatus and had little effect on whether they completed a trial successfully (<
2% error rate for each target, Figure 2.2).
Figure 2.1. Behavioral preference for advance information (A) Information choice task. Fractions represent probabilities of different trial types. (B) Percent choice of information for each monkey. Each dot represents a single day of training. The mean number of choice trials per session was 152 for monkey V (range: 71-203) and 161 for monkey Z (range: 39-285). The gray region is the Clopper-Pearson 95% confidence interval for each day.
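The Clopper-Pearson interval reported in this figure can be computed from the Beta distribution. A minimal sketch, assuming SciPy is available (the function name is mine):

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion.

    k : number of information choices; n : number of choice trials.
    Returns (lower, upper) bounds at confidence level 1 - alpha.
    """
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# e.g. 130 information choices out of 152 choice trials in one session
lo, hi = clopper_pearson(130, 152)
```

Because the interval is exact rather than a normal approximation, it behaves sensibly even on days with few choice trials or choice rates near 100%, both of which occur in these data.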
Figure 2.2. Control for water volume of the informative and random options (A) Each monkey was tested in six behavioral sessions, alternating between days when only random cues were available (random sessions) and days when only informative cues were available (information sessions). Each dot represents the amount of water delivered during a single session, expressed as a percentage of the theoretical amount that should have been delivered if monkeys were unable to influence the water volume (circles: Monkey V; squares: Monkey Z). Inset: the difference in water delivered between the two types of sessions was close to zero (unpaired t-test, P = 0.28, 95% CI = -3.2% to +1.0%). (B) Error rates are shown for each trial type, combining all behavioral sessions except the early learning stage (the first eight sessions for each monkey). Both monkeys made slightly more errors on random trials than information trials, compatible with the notion that monkeys were less motivated to complete random trials because they were the least desirable trial type in the task (a phenomenon seen in many other studies, e.g. (Shidara and Richmond, 2002; Lauwereyns et al., 2002a; Roesch and Olson, 2004; Kobayashi et al., 2006)). However, the overall error rate was very low (< 2%). A 2% difference in reward rate is far too small to explain the strong behavioral preferences we observed.
An important concern is that advance information might have allowed monkeys to
extract a greater amount of subjective value from the water reward by physically
preparing for its delivery: for instance, by tensing their cheek muscles to swish water
around in their mouths in a more pleasurable fashion (Perkins, 1955). We therefore
introduced a second task that equalized the opportunity for simple physical preparation
(Mitchell et al., 1965) (information delay task, Figure 2.3A). Monkeys again chose
between informative and random cues, but afterward a second cue appeared that was
always informative on every trial. Thus, information was always available well in
advance of reward delivery; the choice was between receiving the information
immediately, or after a delay.
Soon after being exposed to the new task, both monkeys expressed a clear preference
for immediate information, comparable to their preference in the original task (Figure 2.3B).
We then reversed the relationship between cue colors and information content, and
monkeys switched their choices to the newly informative color (Figure 2.3B). We
conclude that monkeys treated information about rewards as if it was a reward in itself,
preferring information to its absence and preferring to receive it as soon as possible.
Figure 2.3. Behavioral preference for immediate delivery of information (A) Information delay task. The fixation point and target configurations (not shown here) were the same as in the information choice task shown in Figure 2.1A. (B) Percent choice of immediate information. Conventions as in Figure 2.1B. The vertical line labeled "reversal" marks the time when the informative and random cue colors were switched. The mean number of choice trials per session was 151 for monkey V (range: 50-222) and 111 for monkey Z (range: 35-176). The behavioral preference started below 50% because the cue colors were re-used from a pilot experiment; the informative color had been previously trained as random, and vice versa (data not shown).
Dopamine neurons signal advance information
To understand the neural basis of the rewarding value of information, we recorded
the activity of 47 presumed midbrain dopamine neurons while monkeys performed the
information choice task shown in Figure 2.1. As in previous studies, we focused on
neurons that were presumed to be dopaminergic based on standard electrophysiological
criteria and that signaled the value of water rewards (henceforth referred to as "dopamine neurons") (Methods). Figure 2.4 shows an example neuron that carried a strong water
reward signal. On trials when the monkey viewed informative cues, the neuron was
phasically excited by the cue indicating a large reward and inhibited by the cue indicating
a small reward. In contrast, on trials when the monkey was forced to view uninformative
random cues, the neuron had little response to the cues but was strongly responsive to the
later reward outcome, excited when the reward was large and inhibited when it was
small. Thus, consistent with previous studies, this neuron signaled changes in the
monkey's expectation of water rewards.
The same neuron also responded to the targets indicating the availability of
information. On forced trials when only one target was available, the neuron was excited
by the informative-cue target and inhibited by the random-cue target. On choice trials
when both targets were available, the monkey always chose to receive information, and
the neuron responded much as it did when the informative-cue target was presented
alone. Thus, this dopamine neuron signaled changes in both the expectation of water and
the expectation of information.
Figure 2.4. Dopamine neurons signal information Top: firing rate of an example neuron. Trials are sorted separately for each task event, as follows. Target: forced-information (red), choice-information (pink), forced-random (blue). Cue: informative cues (red) indicating that the reward is big (solid) or small (dashed), random cues (blue) with the same shape as informative cues for big (solid, cross shape) or small (dashed, wave shape) rewards. Reward: informative (red) are the same trials as for the cue response, random (blue) trials where the reward was big (solid) or small (dashed). The firing rate was smoothed with a Gaussian kernel, σ = 20 ms. Bottom: rasters for individual trials. Each row is a trial, and each dot is a spike. Colors are the same as in the firing rate display, except that dark colors correspond to dashed lines.
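Kernel smoothing of the kind used in these firing-rate plots can be sketched as follows (a generic implementation for illustration, not the authors' analysis code):

```python
import numpy as np

def smoothed_rate(spike_times, t_grid, sigma=0.020):
    """Firing-rate estimate (spikes/s): a sum of unit-area Gaussians, one per spike.

    spike_times : spike times in seconds
    t_grid      : time points (s) at which to evaluate the rate
    sigma       : kernel width in seconds (0.020 = the 20 ms used in Figure 2.4)
    """
    t_grid = np.asarray(t_grid, dtype=float)
    rate = np.zeros_like(t_grid)
    for t_spike in spike_times:
        rate += np.exp(-0.5 * ((t_grid - t_spike) / sigma) ** 2)
    # dividing by sigma * sqrt(2*pi) makes each kernel integrate to one spike,
    # so the result is in spikes per second
    return rate / (sigma * np.sqrt(2.0 * np.pi))
```

Averaging such single-trial traces across trials, and then across neurons, gives population averages of the kind plotted here.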
This pattern of responses was quite common in dopamine neurons. We measured
each neuron's discrimination between targets, cues, and rewards using the area under the
receiver operating characteristic (Figure 2.5B-D, Methods). This measure ranges from
0.5 at chance levels to 0.0 or 1.0 for perfect discrimination. As in the example, neurons
discriminated strongly between informative reward-predicting cues and between
randomly-sized rewards, but only weakly between uninformative random cues and
between fully predictable rewards (Figure 2.5C,D). The same neurons also discriminated
between the targets, with clear preferential activation by the target that predicted advance
information (Figure 2.5B). The discrimination was highly similar when measured using
either forced-information or choice-information trials in independent data sets (rho =
0.68, P < 10^-4; Methods), indicating that the neural preference for information was
reproducible and consistent across different stimulus configurations. The same pattern
occurred in both monkeys (Figure 2.6) and could be seen in the population average firing
rate (Figure 2.5A).
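The ROC-area measure is equivalent to the probability that a spike count drawn from one condition exceeds one drawn from the other (the normalized Mann-Whitney statistic). A minimal sketch of that computation:

```python
def roc_area(counts_a, counts_b):
    """Area under the ROC curve for discriminating condition b from condition a.

    Equals P(b > a) + 0.5 * P(b == a) over all pairs of trials:
    0.5 = chance, 1.0 = every b count exceeds every a count, 0.0 = the reverse.
    """
    wins = sum((b > a) + 0.5 * (b == a)
               for a in counts_a for b in counts_b)
    return wins / (len(counts_a) * len(counts_b))
```

Here counts_a might be a neuron's spike counts on forced-random trials and counts_b its counts on forced-information trials, measured in the target window.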
Figure 2.5. Analysis of the dopamine neuron population (A) Population average firing rate. Conventions as in Figure 2.4. Gray bars indicate the time windows used for the ROC analysis. Colored bars indicate time points with a significant difference between selected pairs of task conditions (P < 0.01, Wilcoxon signed rank test), as follows. Target: force-info vs. force-rand (red), choice-info vs. force-rand (pink); Cue: info-big vs. info-small (red), rand-cross vs. rand-wave (blue); Reward: info-big vs. info-small (red), rand-big vs. rand-small (blue).
(B-D) Neural discrimination between task conditions in response to the targets (B), cues (C), and rewards (D). Each dot's (x,y) coordinates represent a single neuron's ROC area for discriminating between the pairs of task conditions listed on the x and y axes. A discrimination of 1 indicates perfect preference for the condition listed next to 1 (e.g. "Choice info"); discrimination of zero indicates perfect preference for the condition listed next to 0 (e.g. "Force rand"). Note that in (B) the x and y coordinates were both calculated using the same set of forced-random trials. Colored dots indicate neurons with significant discrimination between the conditions listed on the y-axis (red), x-axis (blue), or both axes (magenta) (P < 0.05, Wilcoxon rank-sum test).
There was also a tendency for neurons to have a weak initial excitation for each task
event (Figure 2.5A). This nonspecific response is probably due to the animal's initial
uncertainty about the stimulus identity (Kakade and Dayan, 2002; Day et al., 2007) or
stimulus timing (Fiorillo et al., 2008; Kobayashi and Schultz, 2008). We did not observe
a predominant tendency for neurons to have anticipatory tonic increases in activity before
the delivery of probabilistic rewards, a phenomenon that has been reported in one study
(Fiorillo et al., 2003) but not others (Satoh et al., 2003; Morris et al., 2006; Bayer and
Glimcher, 2005; Matsumoto and Hikosaka, 2007; Joshua et al., 2008). This may be due
to differences in task design such as the size of the reward or the manner in which the
reward was signaled (Fiorillo et al., 2003).
An important question is whether dopamine neurons signal the presence of
information per se, or whether they truly signal how much it is preferred. In the latter
case, there should be a correlation between the neural preference for information,
expressed as the neural discrimination between the informative-cue target and the
random-cue target, and the behavioral preference for information, expressed as a choice
percentage. Such correlations were indeed present, both between-monkeys and within-
monkey. Between-monkeys, monkey V expressed a stronger behavioral preference for
information than monkey Z (Figure 2.1B), and also expressed a stronger neural
preference (P = 0.02, Figure 2.6A). Within-monkey, during the sessions in which
monkey Z's behavioral preference was strongest, the neural preference was enhanced
(rho = 0.44, P = 0.02, Figure 2.6D). On the other hand, behavioral preferences for
information were not significantly correlated with neural discrimination between water-
related cues or water rewards (all P > 0.25, Figure 2.6B,C,E,F). Thus, consistent with
evidence that dopamine neurons signal the subjective value of liquid rewards (Morris et
al., 2006; Roesch et al., 2007; Kobayashi and Schultz, 2008), they may also signal the
subjective value of information.
Figure 2.6. Correlation between neural discrimination and behavioral preference (A) Histogram of single-neuron target response discrimination between all informative trials (choice and forced trials combined) versus forced-random trials, separately for monkey V (left) and monkey Z (right). Arrows, numbers, and horizontal lines indicate the
mean discrimination, and the width of the arrows represents the 95% bootstrap confidence interval. Red indicates statistical significance. (B,C) Same as (A), for discrimination between informative big-reward and small-reward cues (B) or between random big and small rewards (C). (D) Plot of behavioral choice percentage against single-neuron discrimination between all informative trials versus forced-random trials in response to the target. The line was fitted by least-squares regression. Text shows Spearman's rank correlation (rho), and red indicates statistical significance. The data is from monkey Z only, because monkey V almost exclusively chose the informative target and therefore had no behavioral variability. (E-F) Same as (D), but for discrimination between informative big-reward and small-reward cues (E) or between random big and small rewards (F).
2.3 DISCUSSION
Here we have shown that macaque monkeys prefer to receive advance information
about future rewards, and that their behavioral preference is paralleled by the neural
preference of midbrain dopamine neurons. Thus, the same dopamine neurons that signal
primitive rewards like food and water also signal the cognitive reward of advance
information.
Monkeys expressed a strong preference for advance information even though it had
no effect on the final reward outcome. This is consistent with the intuitive belief that, all
things being equal, it is better to seek knowledge than to seek ignorance. It also provides
an explanation for the puzzling fact that the brain devotes a great deal of neural effort to
processing reward information even when this is not required to perform the task at hand.
For example, many studies use passive classical conditioning tasks in which informative
cues are followed by rewards with no requirement for the subject to take any action. In
these tasks the brain could simply ignore the cues and wait passively for rewards to
arrive. Yet even after extensive training, many neurons continue to use the cue
information to predict the size, probability, and timing of reward delivery (e.g. (Tobler et
al., 2003; Joshua et al., 2008)). In other tasks, neurons persist in predicting rewards even
when the act of prediction is harmful, causing maladaptive behavior that interferes with
reward consumption (e.g. refusing to perform trials with low predicted value (Shidara and
Richmond, 2002; Lauwereyns et al., 2002a)). These observations suggest that the act of
prediction has a special status, an intrinsic motivational or rewarding value of its own.
Our data provides strong evidence for this hypothesis. When given an explicit choice,
monkeys actively sought out the advance information that was necessary to make
accurate reward predictions at the earliest possible opportunity.
A limitation of our study is that it does not determine the precise psychological
mechanism by which value is assigned to information. There are several possibilities.
Theories from experimental psychology suggest that in our task the value of viewing
informative cues would simply be the sum of the conditioned reinforcement generated by
the individual big-reward and small-reward cues. In this view, the preference for
information implies that the conditioned reinforcement is weighted nonlinearly, so that
the benefit of strong reinforcement from the big-reward cue outweighs the drawback of
weak reinforcement from the small-reward cue (Wyckoff, 1959; Fantino, 1977;
Dinsmoor, 1983), akin to the nonlinear weighting of rewards that produces risk seeking
(von Neumann and Morgenstern, 1944). On the other hand, theories in economics
suggest that preference is not due to independent contributions of individual cues but
instead comes from considering the full probability distribution of future events. In this
view, information-seeking is due to an explicit preference for early resolution of
uncertainty (Kreps and Porteus, 1978) or an implicit preference induced by psychological
factors such as anticipatory emotions (Caplin and Leahy, 2001). In addition, just as the
value assigned to conventional forms of reward (e.g. food) depends on the internal state
of the subject (e.g. hunger), the value assigned to information is likely to depend on
psychological factors such as personality (Miller, 1987), emotions like hope and anxiety
(Chew and Ho, 1994; Wu, 1999) and attitudes toward uncertainty (Lovallo and
Kahneman, 2000; Platt and Huettel, 2008).
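The nonlinear-weighting account of conditioned reinforcement can be illustrated with toy numbers (the cue values and the convex weighting function below are mine, chosen purely for illustration):

```python
def option_value(cue_values, w):
    """Value of an option = average weighted conditioned reinforcement of its cues."""
    return sum(w(v) for v in cue_values) / len(cue_values)

def convex_w(v):
    """A convex weighting function: overweights the big-reward cue."""
    return v ** 2

# Informative option: big-reward cue (value 1.0) or small-reward cue (0.0), 50/50.
info = option_value([1.0, 0.0], convex_w)
# Random option: one uninformative cue worth the expected reward (0.5).
rand = option_value([0.5], convex_w)
assert info > rand  # convex weighting favors the informative option
```

With a linear weighting the two options would be worth the same; the convexity is what makes the benefit of the big-reward cue outweigh the drawback of the small-reward cue, paralleling the nonlinear weighting of rewards that produces risk seeking.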
Implications of information-seeking for attitudes toward uncertainty
In the framework of decision-making under uncertainty, advance information
reduces the amount of reward uncertainty by narrowing down the set of potential reward
outcomes. Our data therefore suggests that in our task rhesus macaque monkeys preferred
to reduce their reward uncertainty at the earliest possible moment, as though the
experience of uncertainty was aversive.
Interestingly, several previous studies using similar saccadic decision tasks came to a
seemingly opposite conclusion: macaque monkeys appeared to prefer uncertainty,
choosing an uncertain, variable-size reward instead of a certain, fixed-size reward
(McCoy and Platt, 2005; Platt and Huettel, 2008). How can these results be reconciled?
One possibility is that they can be explained by a common principle; for instance,
perhaps monkeys treat the offer of a variable-size reward as a source of uncertainty to be
confronted and resolved. An important point, however, is that a preference for reward
variance can be caused by factors unrelated to uncertainty; most notably, it can be
caused by an explicit preference over the probability distribution of reward outcomes, for
instance due to disproportionate salience of large rewards (Hayden and Platt, 2007) or a
nonlinear utility function (Platt and Huettel, 2008). In contrast, the choice of information
has no influence on the reward outcome; it only affects the amount of time spent in a
state of uncertainty before the reward outcome is revealed. In this sense the preference
for advance information is a relatively pure measurement of attitudes toward uncertainty.
Information signals in the dopaminergic reward system
Dopamine neuron activity is thought to teach the brain to seek basic goals like food
and water, reinforcing and punishing actions by adjusting synaptic connections between
neurons in cortical and subcortical brain structures (Wise, 2004; Montague et al., 2004).
Our data suggests that the same neural system also teaches the brain to seek advance
information, selectively reinforcing actions that lead to knowledge about rewards in the
future. Thus, the behavioral preference for information could be created by the
dopaminergic reward system. At the neural level, neurons which gain sensitivity to
rewards through a dopamine-mediated reinforcement process would come to represent
both rewards and advance information about those rewards in a common currency,
particularly neurons involved in reward timing, conditioned reinforcement, and decision-
making under risk (Kim et al., 2008; Seo and Lee, 2009; Platt and Huettel, 2008). In turn,
these signals could ultimately feed back to dopamine neurons to influence their value
signals.
An important goal for future research will therefore be to discover how dopamine
neurons measure information and assign its rewarding value. One possibility is that
dopamine neurons receive information-related input from specialized brain areas, distinct
from those that compute the value of traditional rewards like food and water. Indeed,
signals encoding the amount and timing of reward information, and dissociated from
preference coding of traditional rewards, have been found in several cortical areas
(Nakamura, 2006; Behrens et al., 2007; Luhmann et al., 2008). How these information
signals could be translated into a behavioral preference, and whether they are
communicated to dopamine neurons, is unknown.
Another possibility is that dopamine neurons receive information signals from the
same brain areas that contribute to their food- and water-related signals, such as the
lateral habenula (Matsumoto and Hikosaka, 2007). In this case, dopamine neurons would
receive a highly processed input, with different forms of rewards already converted into a
common currency by upstream brain areas. We are currently testing this possibility in
further experiments.
Why do dopamine neurons treat information as a reward?
The preference for advance information, despite its intuitive appeal, is not predicted
by current computational models of dopamine neuron function (Schultz et al., 1997;
Montague et al., 2004), which are widely viewed as highly efficient algorithms for
reinforcement learning. This raises an important question: could the information-
predictive activity of dopamine neurons be a harmful bug that impairs the efficiency of
reward learning? Or is it a useful feature that improves over existing computational
models? Here we present our hypothesis that the positive value of advance information is
a feature with a fundamental role in reinforcement learning.
Specifically, modern theories of reinforcement learning recognize that animals learn
from two types of reinforcement: primary reinforcement generated by rewards
themselves, and secondary reinforcement generated predictively, by observing sensory
cues in advance of reward delivery. Predictive reinforcement greatly enhances the speed
and reliability of learning, as demonstrated most strikingly by temporal-difference
learning algorithms (Sutton and Barto, 1998) which have produced influential accounts of
animal behavior (Sutton and Barto, 1981) and dopamine neuron activity (Schultz et al.,
1997; Montague et al., 2004). This implies that animals should treat predictive
reinforcement as an object of desire, making an active effort to seek out environments
where reward-predictive sensory cues are plentiful. If an animal was trapped in an
impoverished environment where reward-predictive cues were unavailable, the
consequences would be devastating: the animal's sophisticated predictive reinforcement
learning algorithms would be reduced to impotence. This can be seen clearly in our
dopamine neuron data (Figure 2.5A). When an action produces informative cues,
dopamine neurons signal its value immediately, a predictive reinforcement signal; but
when an action produces uninformative cues, dopamine neurons must wait to signal its
value until the reward outcome arrives, acting as little more than a primitive reward
detector. Thus, predictive reinforcement depends entirely on obtaining advance
information about upcoming rewards.
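This contrast can be made concrete with a minimal tabular TD(0) simulation. The sketch below is purely illustrative (state names, learning rate, and reward values are assumptions, not the thesis's model): when the cue is informative, the prediction error appears at cue onset and discriminates reward size; when it is uninformative, the error is deferred to the outcome.

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible

ALPHA = 0.1  # learning rate (an assumption of this sketch)
V = {}       # tabular state values; unseen states default to 0

def td_update(s, s_next, r):
    """One TD(0) backup with gamma = 1; returns the prediction error."""
    delta = r + V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + ALPHA * delta
    return delta

def run_trial(informative):
    """start -> cue -> outcome, with a large (r = 1) or small (r = 0)
    reward at p = 0.5 each. Returns (was_large, TD error at cue onset)."""
    large = random.random() < 0.5
    start = 'start_info' if informative else 'start_rand'
    if informative:
        cue = 'cue_large' if large else 'cue_small'  # cue reveals the outcome
    else:
        cue = 'cue_random'                           # cue carries no information
    delta_cue = td_update(start, cue, 0.0)
    td_update(cue, 'end', 1.0 if large else 0.0)     # reward arrives at outcome
    return large, delta_cue

for _ in range(5000):  # train on both conditions
    run_trial(True)
    run_trial(False)
```

After training, the cue-time prediction error is positive for the large-reward cue, negative for the small-reward cue, and near zero for the random cue; with uninformative cues the learner's error signal must wait for the outcome itself, mirroring the dopamine data of Figure 2.5A.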
In light of these considerations, we propose that any learning system driven by the
"engine" of predictive reinforcement must actively seek out its "fuel" of advance
information. In this view, current models of neural reinforcement learning present a
curious paradox: their learning algorithms are vitally dependent on advance information,
but they treat information as valueless and make no effort to obtain it. These models do
include a form of knowledge-seeking by exploring unfamiliar actions, but they make no
effort to obtain informative cues that would maximize learning from these new
experiences. In fact, models using the popular TD(λ) algorithm (Sutton and Barto, 1998)
are actually averse to advance information (Note 2.A). Our data shows that a new class
of models is needed, one that assigns information a positive value, perhaps representing
the future reward the animal expects to receive as a result of obtaining better fuel for its
learning algorithm. This would be akin to the concept of intrinsically motivated
reinforcement learning (Barto et al., 2004), in that dopamine neurons would assign an
intrinsic value to information because it could help the animal learn to better predict and
control its environment (Barto et al., 2004; Redgrave and Gurney, 2006). Also, although
dopamine neurons have been best studied in the realm of rewards, they can also respond
to salient non-rewarding stimuli (Horvitz, 2000; Redgrave and Gurney, 2006; Joshua et
al., 2008; Matsumoto and Hikosaka, 2009). This suggests that dopamine neurons might
be able to signal the value of information about neutral and punishing events (Herry et
al., 2007; Badia et al., 1979; Fanselow, 1979; Tsuda et al., 1989), as part of a more
general role in motivating animals to learn about the world around them.
2.4 METHODS
Subjects. Subjects were two male rhesus macaque monkeys (Macaca mulatta), monkey
V (9.3 kg) and monkey Z (8.7 kg). All procedures for animal care and experimentation
were approved by the Institute Animal Care and Use Committee and complied with the
Public Health Service Policy on Humane Care and Use of Laboratory Animals. A plastic
head holder, scleral search coils, and plastic recording chambers were implanted under
general anesthesia and sterile surgical conditions.
Behavioral Tasks. Behavioral tasks were under the control of the REX program (Hays et
al., 1982) adapted for the QNX operating system. Monkeys sat in a primate chair, facing
a frontoparallel screen 31 cm from the monkey's eyes in a sound-attenuated and
electrically shielded room. Eye movements were monitored using a scleral search coil
system with 1 ms resolution. Stimuli generated by an active matrix liquid crystal display
projector (PJ550, ViewSonic) were rear-projected on the screen.
In the information choice task (Figure 2.1), each trial began with the appearance of a
central spot of light (1° diameter), which the monkey was required to fixate. After 800
ms, the spot disappeared and two colored targets appeared on the left and right sides of
the screen (2.5° diameter, 10-15° eccentricity). (On forced-information and forced-
random trials, only a single target appeared). The monkey had 710 ms to saccade to and
fixate the chosen target, after which the non-chosen target immediately disappeared. At
the end of the 710 ms response window, a cue (14° diameter) of the chosen color was
presented. For the informative color, the cue was a cross on large-reward trials or a wave
pattern on small-reward trials. For the random color, the cue's shape was chosen
pseudorandomly on each trial (see below). The colors were green and orange, chosen to
have similar luminance and counterbalanced across monkeys. Monkeys were not required
to fixate the cue. After 2250 ms of display time, the cue disappeared and simultaneously
a 200 ms tone sounded and reward delivery began. The inter-trial interval was 3850-4850
ms beginning from the disappearance of the cue. Water rewards were delivered using a
gravity-based system (Crist Instruments). Reward delivery lasted 50 ms on small-reward
trials (0.04 ml) and 700 ms (0.88 ml, monkey V) or 825 ms (1.05 ml, monkey Z) on
large-reward trials. To minimize the effects of physical preparation, licking the water
spout was not required to obtain rewards; water was delivered directly into the mouth.
The task proceeded in blocks of 24 trials, each block containing a randomized
sequence of all 3x2x2x2 combinations of choice type (forced-information, forced-
random, choice), reward size (large, small), random cue shape (cross, waves), and
informative target location (left, right). Thus, the random cues were actually quasi-
random and could theoretically yield a small amount of information about reward size,
but extracting that information would require a very difficult feat of working memory.
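The block structure described above can be sketched as follows (a hypothetical helper for illustration, not the actual REX task code):

```python
import itertools
import random

def make_block(rng=random):
    """One 24-trial block: a shuffled sequence of all 3 x 2 x 2 x 2
    combinations of choice type, reward size, random-cue shape, and
    informative-target location."""
    trials = list(itertools.product(
        ('forced-information', 'forced-random', 'choice'),
        ('large', 'small'),
        ('cross', 'waves'),
        ('left', 'right'),
    ))
    rng.shuffle(trials)
    return trials

block = make_block()
# Every combination appears exactly once per block, which is why the
# "random" cue shapes are only quasi-random: within a block their
# counts are perfectly balanced across reward sizes.
```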
If monkeys made an error (broke fixation on the central spot, or failed to choose a
target, or broke fixation on the chosen target before the cue appeared), then the trial
terminated, an error tone sounded, an additional 3 seconds were added to the inter-trial
interval, and the trial was repeated (correction trial). If the error occurred after the
choice, only the chosen target was available on the correction trial.
The information delay task (Figure 2.3) was identical to the information choice
task except the cue colors and shapes were different, and a third set of always-
informative gray cues lasting for 1500 ms was appended to the cue period. (There were
also minor differences in the task parameters for monkey Z: the duration of the first cue
was 2000 ms, and the big reward volume was ~1.29 ml). The 1500 ms duration of the
always-informative cue was chosen to allow near-optimal physical preparation for
rewards. With a shorter cue duration (e.g. < 750 ms), there might not be enough time to
discriminate the cue and make a physical response (e.g. compare to the latency of
anticipatory licking in (Tobler et al., 2003)). With a longer cue duration (e.g. > 2
seconds), physical preparation for reward delivery begins to be impaired by timing errors
(e.g. compare to the timecourse of anticipatory licking in (Fiorillo et al., 2008; Kobayashi
and Schultz, 2008)). To perform a reversal (vertical lines in Figure 2.3B), the colors of
the informative and random cues were switched.
Neural Recording. Midbrain dopamine neurons were recorded using techniques
described previously (Matsumoto and Hikosaka, 2007). A recording chamber was placed
over fronto-parietal cortex, tilted laterally by 35 degrees, and aimed at the substantia
nigra. The recording sites were determined using a grid system, which allowed recordings
at 1 mm spacing. Single-neuron recording was performed using tungsten electrodes
(Frederick Haer) that were inserted through a stainless steel guide tube and advanced by
an oil-driven micro-manipulator (MO-97A, Narishige). Single neurons were isolated on-
line using custom voltage-time window discrimination software (the MEX program
(Hays et al., 1982) adapted for the QNX operating system).
Neurons were recorded in and around the substantia nigra pars compacta and ventral
tegmental area. We targeted this region based on anatomical atlases and magnetic
resonance imaging (4.7T, Bruker). During recording sessions, we identified this region
based on recording depth and using landmarks including the somatosensory and motor
thalamus, subthalamic nucleus, substantia nigra pars reticulata, red nucleus, and
oculomotor nerve. Presumed dopamine neurons were identified by their irregular tonic
firing at 0.5-10 Hz and broad spike waveforms. We focused our recordings on presumed
dopamine neurons that responded to the task and appeared to carry positive reward
signals. Occasional dopamine-like neurons that upon examination showed no differential
response to the cues and no differential response to the reward outcomes were not
recorded further. We then analyzed all neurons that were recorded for at least 60 trials
and that had positive reward discrimination for both informative cues and random
outcomes, or positive reward discrimination for cues and no discrimination for outcomes,
or positive reward discrimination for outcomes and no discrimination for cues (P < 0.05,
Wilcoxon rank-sum test). We were able to examine the response properties of 108
neurons, 84 of which met our criteria for presumed dopaminergic firing rate, pattern, and
spike waveform, and 47 of which also met our criteria for trial count and significant
reward signals. This yielded 20 neurons from monkey V (right hemisphere) and 27
neurons from monkey Z (left hemisphere) for our analysis.
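The inclusion rule above can be sketched in code. This is a hypothetical reimplementation: it substitutes a permutation test on the difference of mean firing rates for the Wilcoxon rank-sum test used in the text, and all inputs are illustrative; only the P < 0.05 criterion and the 60-trial minimum come from the source.

```python
import random

def perm_p(x, y, n_perm=2000, rng=random.Random(0)):
    """Two-tailed permutation P-value for a difference in mean firing rate
    (standing in for the Wilcoxon rank-sum test of the original analysis)."""
    mean = lambda v: sum(v) / len(v)
    obs = abs(mean(x) - mean(y))
    pool = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pool)
        if abs(mean(pool[:len(x)]) - mean(pool[len(x):])) >= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def include_neuron(cue_lg, cue_sm, out_lg, out_sm, n_trials, alpha=0.05):
    """Selection rule: at least 60 trials, and positive reward discrimination
    for cues and/or outcomes, with the other epoch either also positive or
    showing no significant discrimination at all."""
    if n_trials < 60:
        return False
    mean = lambda v: sum(v) / len(v)
    cue_sig = perm_p(cue_lg, cue_sm) < alpha
    out_sig = perm_p(out_lg, out_sm) < alpha
    cue_pos = cue_sig and mean(cue_lg) > mean(cue_sm)
    out_pos = out_sig and mean(out_lg) > mean(out_sm)
    return ((cue_pos and out_pos)
            or (cue_pos and not out_sig)
            or (out_pos and not cue_sig))
```

Note that a neuron with significant but negative reward discrimination in either epoch fails all three clauses, matching the exclusion of non-positive reward signals described above.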
Data Analysis. All statistical tests were two-tailed. The neural analysis excluded error
trials and correction trials. We analyzed neural activity in time windows 150-500 ms after
target onset (targets), 150-300 ms after cue onset (cues), and 200-450 ms after cue offset
(rewards). These were chosen to include the major components of the average neural
response. The neural discrimination between a pair of task conditions was defined as the
area under the receiver operating characteristic (ROC), which can be interpreted as the
probability that a randomly chosen single-trial firing rate from the first condition was
greater than a randomly chosen single-trial firing rate from the second condition (Green
and Swets, 1966). We observed the same results using other measures of neural
discrimination such as the signal-to-baseline ratio and signal-to-noise ratio. Confidence
intervals and significance of the population averages of single-neuron ROC areas (Figure
2.6A-C) were computed using a bootstrap test with 200,000 resamples (Efron and
Tibshirani, 1993). Consistent with previous studies of reward coding (Schultz and Romo,
1990; Kawagoe et al., 2004; Roesch et al., 2007; Matsumoto and Hikosaka, 2007) we
observed similar neural coding of behavioral preferences for both of the target locations
on the screen (average ROC area, forced information vs. forced random: ipsilateral =
0.60, P < 10^-4, contralateral = 0.62, P < 10^-4; choice information vs. forced random:
ipsilateral = 0.58, P < 10^-4, contralateral = 0.62, P < 10^-4), so for all analyses the data were
combined. We could not analyze activity on choice-random trials due to their rarity (< 3
trials for most neurons). All correlations were computed using Spearman's rho (rank
correlation). To compare neural discrimination measured using either forced-information
or choice-information trials in independent data sets, we calculated the correlation
between two values, the discrimination between forced-information trials vs. even-
numbered forced-random trials, and the discrimination between choice-information trials
vs. odd-numbered forced-random trials (rho = 0.68, P < 10^-4). Significance of
correlations, and of the difference in mean ROC area between the two monkeys (Figure
2.6), were computed using permutation tests (200,000 permutations) (Efron and
Tibshirani, 1993).
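The two core statistics of this analysis can be sketched in a few lines. This is a hypothetical stdlib-only reimplementation, not the original analysis code:

```python
import random

def roc_area(a, b):
    """Area under the ROC curve for two samples of single-trial firing
    rates: the probability that a random draw from a exceeds a random
    draw from b, counting ties as 0.5 (Green and Swets, 1966)."""
    wins = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    return wins / (len(a) * len(b))

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05,
                 rng=random.Random(0)):
    """Percentile bootstrap confidence interval for the mean of a set of
    per-neuron statistics (e.g. single-neuron ROC areas)."""
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_resamples)
    )
    return (means[int(n_resamples * alpha / 2)],
            means[int(n_resamples * (1 - alpha / 2)) - 1])

# Identical distributions give chance-level discrimination (0.5);
# an ROC area of 1.0 means every trial in a exceeds every trial in b.
```

The ROC area used this way is the Mann-Whitney U statistic normalized by the number of trial pairs, which is why it admits the probabilistic interpretation given in the text.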
2.5 ACKNOWLEDGEMENTS
We thank M. Matsumoto, G. La Camera, B. J. Richmond, D. Sheinberg, and V. Stuphorn
for valuable discussions, and G. Tansey, D. Parker, M. Lawson, B. Nagy, J.W.
McClurkin, A.M. Nichols, T.W. Ruffner, L.P. Jensen, and M.K. Smith for technical
assistance. This work was supported by the intramural research program at the National
Eye Institute.
NOTE 2.A
Aversion to information by reinforcement learning algorithms based on TD(λ)
It may be surprising that models based on temporal-difference learning (TD
learning), which is formally indifferent to information, could show any preference in our
task. However, TD learning is only a method for reward prediction; it does not specify
how to use that knowledge to take action. When TD learning is coupled to a mechanism
for action-selection (as is necessary in models of animal behavior), new behavior can
emerge. Here we will describe the formalism of TD learning (Sutton and Barto, 1998),
explain a simple case in which it leads to biases in selection between risky and safe
actions, and extend this reasoning to our case of selection between informative and
uninformative actions.
TD learning formalism: definition of reward value
The goal of TD learning is to learn a value function that takes as input the current
state of the environment s and predicts the amount of expected reward r that will be
received in the future. We assume that at time t a learning agent observes the state of the
environment s_t, takes an action a_t, and then receives feedback in the form of a reward
r_{t+1} and a new state of the environment s_{t+1}. The agent-environment interaction can be
described as a Markov decision process, which is fully specified by three functions
(π, R, T): the agent's policy for choosing actions, π(s,a) = Pr(take action a | in state s),
and the environment's mapping from (state, action) pairs to rewards and new states: the
reward function R(s,a,r) = Pr(receive reward r | in state s and took action a) and the state
transition function T(s,a,s') = Pr(new state is s' | in state s and took action a). Given these
specifications, we can derive the full probability distribution over the agent's future
experiences starting from the current state s_t and action a_t and leading to any future
state s_{t+τ} and reward r_{t+τ}, e.g. written as p_{π,R,T}(s_{t+τ}, r_{t+τ} | s_t, a_t). We can
then define the value function V as the expectation of the sum of future rewards, starting
from the state s_t at the current moment in time t and considering all future states s_{t+τ}:

\[ V(s_t) \;=\; \mathbb{E}_{\pi,R,T}\!\left[\, \sum_{\tau=1}^{\infty} r_{t+\tau} \;\middle|\; s_t \right], \]
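As a concrete instance of this definition, the value of a state can be estimated by sampling episodes and averaging the summed future reward. The toy two-step chain below is an assumption made purely for illustration, not the task's Markov decision process:

```python
import random

def episode(rng):
    """One episode of a toy chain: state A pays r = 1 with probability 0.5,
    then state B pays r = 2 with probability 0.25, then the episode ends."""
    r1 = 1.0 if rng.random() < 0.5 else 0.0
    r2 = 2.0 if rng.random() < 0.25 else 0.0
    return r1 + r2

def v_start(n_episodes=200_000, seed=0):
    """Monte Carlo estimate of V(A) = E[ sum over tau >= 1 of r_{t+tau} ],
    the expectation of the summed future reward from the starting state."""
    rng = random.Random(seed)
    return sum(episode(rng) for _ in range(n_episodes)) / n_episodes

# Analytically V(A) = 0.5 * 1 + 0.25 * 2 = 1.0, and the Monte Carlo
# estimate converges to this value as n_episodes grows.
```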