Reinforcement Learning models of
neuropeptide-modulated human brain function
Luís Eduardo Moutinho Guerra
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisors: Prof. Manuel Fernando Cabido Peres Lopes
Co-supervisor: Dr. Diana Prata
Examination Committee
Chairperson: Prof. Luís Manuel Antunes Veiga
Supervisor: Prof. Manuel Fernando Cabido Peres Lopes
Member of the Committee: Prof. Pedro Tiago Gonçalves Monteiro
October 2018
Acknowledgments
Deciding to take my Master’s in this field was only due to my Bachelor’s teachers’ love for their craft.
Aspiring to branch out of my field due to my girlfriend’s advice ended up being very fulfilling.
Verifying my sanity was a task relegated to my mother and sister.
I couldn’t have done it without the support from all my friends at IST who soldiered alongside me.
Dampening my mood was only stopped by my hometown friends who bravely endured the dnd drought.
On-line friends made their support count and for that I must reward them. Pois.
Partaking in this journey really tested my limits and I couldn’t have made it without you.
E muito.
Abstract
The way alterations in the chemistry of the human brain affect social interactions is still not fully understood, and deepening our knowledge in this field could allow us to create novel medical therapeutics for a variety of diseases. Various Reinforcement Learning algorithms have been used to model learning processes in both animals and humans. This thesis studies the relation between the activation of the Reward Centers of the human brain and specific parameters of a Reinforcement Learning algorithm known as Q-learning, used here as a model for the learning process of an individual playing an iterated Prisoner's Dilemma-style social game for monetary rewards. This relation is tested and compared between subject groups administered, by means of intranasal spray, either a placebo, Oxytocin, or Vasopressin. Subjects are adults of both genders, aged 20 to 40 years, grouped by gender during the experiments.
Keywords
Q-learning; Prisoner’s Dilemma; fMRI; Reinforcement Learning;
Resumo
The way alterations in brain chemistry affect our social interactions is not yet fully understood, and deepening our knowledge in this area may allow us to create new therapeutics for various diseases. Several Reinforcement Learning algorithms have been used to model learning processes in both animals and humans. This thesis focuses on the study of the relation between the activation of the Reward Centers of the human brain and specific parameters of a Reinforcement Learning algorithm, known as Q-learning, used as a model for the learning process of an individual playing an iterated social game similar to the famous Prisoner's Dilemma. This relation is tested and compared between groups of participants who were administered, by means of intranasal spray, doses of a Placebo, Oxytocin, or Vasopressin. Participants are adults of both sexes, aged between twenty and forty years, who were grouped by sex during the experiments.
Keywords
Q-learning; Prisoner's Dilemma; fMRI; Reinforcement Learning;
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 4
2.1 Prisoner’s Dilemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Exploration Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.1 ε–greedy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.2 Boltzmann policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Two Sample T-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 One-way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.7 N-way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.8 Interior Point Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.9 FSL, FEAT and FeatQuery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.9.1 FEAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.9.2 FeatQuery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.10 Pearson’s Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.11 Spearman Rank Order Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3 Related Work 20
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Base Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 Effects of intranasal Oxytocin and Vasopressin on cooperative behavior and asso-
ciated brain activity in men . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 Sex differences in the neural and behavioral response to intranasal oxytocin and
vasopressin during human social interaction . . . . . . . . . . . . . . . . . . . . . . 25
3.3 The validity of modeling brain processes with Reinforcement Learning (RL) . . . . . . . . 25
3.3.1 Reinforcement learning in the brain . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.2 Actions, Policies, Values, and the Basal Ganglia . . . . . . . . . . . . . . . . . . . 27
3.3.3 Model-based fMRI and its application to Reward Learning and Decision making . 28
3.4 The importance of the striatum when dealing with reward prediction error . . . . . . . . . 28
3.4.1 Temporal prediction errors in a passive learning task activate human striatum . . 28
4 Methods 30
4.1 Data Processing (Extraction, Transformation, Loading (ETL)) . . . . . . . . . . . . . . . 31
4.1.1 Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.2 Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.3 Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Q-learning parameters estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Reward Prediction Error (RPE) estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.4 Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.5 Blood-oxygen-level dependent (BOLD)/RPE correlation and respective Analysis of vari-
ance (ANOVA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.6 Defining a Q-learning Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.7 Boundaries for the η parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.8 Empirical model testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.8.1 Artificial test subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.8.2 Generation of Action and Reward sequences . . . . . . . . . . . . . . . . . . . . . . 40
4.8.3 Estimation of subjects’ parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.8.4 Percentage Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.8.5 Test results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.9 Testing the Q-learning implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.10 Chi-square test to test confounding effects of Round Order . . . . . . . . . . . . . . . . . 44
4.11 ANOVA design to test effects of Sex, Drug and Opponent . . . . . . . . . . . . . . . . . . 46
5 Results 47
5.1 Effects on the Pearson’s Correlation between RPE and the BOLD response . . . . . . . . 48
5.2 Effects on the Pearson’s Correlation between Reward and the BOLD response . . . . . . . 50
5.3 Effects on the Spearman Correlation between RPE and the BOLD response . . . . . . . . 51
5.4 Effects on the Spearman Correlation between Reward and the BOLD response . . . . . . 52
5.5 Effects on the Pearson’s Correlation between Positive RPE’s and the BOLD response . . 54
5.6 Effects on the Spearman Correlation between Positive RPE’s and the BOLD response . . 56
5.7 Results overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6 Discussions 58
6.1 Main effect of Subject Sex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 Main effect of Opponent Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3 Effect of the (Opponent Type x Subject Sex) interaction . . . . . . . . . . . . . . . . . . . 59
6.4 Main effect of Drug . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
List of Figures
2.1 Prisoner’s Dilemma (PD) punishment distribution matrix . . . . . . . . . . . . . . . . . . 5
2.2 Mask that identifies a ROI around the left amygdala, one of the constituents of the reward
system of the subject’s brain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 FEAT’s Miscellaneous menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 FEAT’s Data menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 FEAT’s Stats menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 FEAT’s Post-stats menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 FEAT’s Full model setup menu, events tab . . . . . . . . . . . . . . . . . . . . . . . 15
2.8 FEAT’s Full model setup menu, contrasts tab . . . . . . . . . . . . . . . . . . . . . 16
2.9 The FeatQuery menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.10 Two associated variables (A and B) and their ranked versions . . . . . . . . . . . 18
2.11 Plot of variables A and B (unranked) and their respective trendline compared
to a plot of their ranked counterparts. . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1 Overview table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Time-line for one round of PD regardless of the nature of the opponent . . . . 22
3.3 PD payoff matrix for the game performed in Rilling’s study [1] . . . . . . . . . . 23
3.4 A monkey’s neuro-conditioning to a sound (CS) followed by being fed juice
(US) in instants a) b) and c) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5 Activation of the striatum when the subject receives an unexpected reward. . 29
4.1 ETL description scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Percentage of the search space with P (a) > 0.95 for each η . . . . . . . . . . . . . 36
4.3 Plot of the P(cooperation) over all possible Q-values with η=1.5 . . . . . . . . . . . . 37
4.4 Plot of the P(cooperation) over all possible Q-values with η=3 . . . . . . . . . . . . . 38
4.5 Plot of the P(cooperation) over all possible Q-values with η=20 . . . . . . . . . . . . 38
4.6 Experimental-real RPE correlation for various models . . . . . . . . . . . . . . . . 41
4.7 Graph showing the evolution of a subject’s Q-values for cooperation. . . . . . . 42
4.8 Graph showing the evolution of a subject’s Q-values for defection. . . . . . . . . 43
4.9 Graph showing the evolution of a hypothetical subject’s Q-values for cooperation. 44
4.10 Table showing the distribution of subjects across dependent variables. . . . . . 45
4.11 Figure showing the results of the Chi-squared test. . . . . . . . . . . . . . . . . . . 46
5.1 Table detailing interactions between the within subject factor and the between
subject factors for RPE/BOLD Pearson’s Correlation. . . . . . . . . . . . . . . . 48
5.2 Table detailing interactions between the between subject factors for RPE/BOLD
Pearson’s Correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3 Main effect of the independent variable Sex on the RPE/BOLD Pearson’s
Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.4 Table detailing interactions between the within subject factor and the between
subject factors for Reward/BOLD Pearson’s Correlation. . . . . . . . . . . . . . . 50
5.5 Table detailing interactions between the between subject factors for Reward/BOLD
Pearson’s Correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.6 Table detailing interactions between the within subject factor and the between
subject factors for RPE/BOLD Spearman Correlation. . . . . . . . . . . . . . . . 51
5.7 Table detailing interactions between the between subject factors for RPE/BOLD
Spearman Correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.8 Main effect of the independent variable Sex on RPE/BOLD Spearman Corre-
lation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.9 Table detailing interactions between the within subject factor and the between
subject factors for Reward/BOLD Spearman Correlation. . . . . . . . . . . . . . 52
5.10 Table detailing interactions between the between subject factors for Reward/BOLD
Spearman Correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.11 Effect of the (Sex x Opponent Type) interaction on Reward/BOLD Spearman
Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.12 Table detailing interactions between the within subject factor and the between
subject factors for Positive RPE/BOLD Pearson’s Correlation. . . . . . . . . . . 54
5.13 Table detailing interactions between the between subject factors for Positive
RPE/BOLD Pearson’s Correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.14 Main Effect of the independent variable Opponent Type on Positive RPE/BOLD
Pearson’s Correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.15 Table detailing interactions between the within subject factor and the between
subject factors for Positive RPE/BOLD Spearman Correlation. . . . . . . . . . 56
5.16 Table detailing interactions between the between subject factors for Positive
RPE/BOLD Spearman Correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.17 Results overview table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
List of Tables
2.1 Pearson’s Correlation Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1 Empirical test results for experiments with 30 rounds . . . . . . . . . . . . . . . . . . . . . 40
4.2 Empirical test results for 100 rounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Acronyms
RL Reinforcement Learning
IMM Instituto de Medicina Molecular
PD Prisoner’s Dilemma
ROI Region of Interest
PCC Pearson’s Correlation Coefficient
RPE Reward Prediction Error
OT Oxytocin
AVP Vasopressin
fMRI Functional Magnetic Resonance Imaging
BOLD Blood-oxygen-level dependent
ETL Extraction, Transformation, Loading
ANOVA Analysis of variance
1 Introduction
Contents
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1 Motivation
Our goal is to model the learning and decision processes that occur in the human brain during a Prisoner's Dilemma style game, using a Reinforcement Learning (RL) algorithm known as Q-learning. These models then let us correlate the intensity of the Blood-oxygen-level dependent (BOLD) response registered in the reward centers of each individual's brain with various measures taken from the RL models.
1.2 Introduction
As a social species, we constantly interact with one another to achieve our goals. Such interactions allow us to coordinate efforts toward goals we could not achieve alone. This puts us, however, under the constant burden of resisting the temptation to engage in anti-social behavior that could bring us great momentary gains at the expense of our peers, who would, from that moment on, be less inclined to interact positively with us. Pro-social behavior brings trust and long-term stability, while anti-social behavior breeds distrust and creates problems that are often only noticeable in the distant future. It is therefore vital for the long-term success of any society that cooperation be seen as the default behavior. To that end, we collectively praise individuals who interact in cooperative ways (charity donors, social volunteers, etc.) while looking down on those who seek to further their ambitions at the expense of others (criminals in general).
In a society with such a social setup, it is very important for each individual to be capable of judging others' actions and acting accordingly. Failing to do so might lead that individual to overestimate negative actions directed at him, and react in an over-aggressive manner, or to underestimate positive actions, making him look unappreciative. While these errors in judgment affect everyone, regardless of mental health, patients who suffer from certain conditions can be unable to understand the intentions of the people they interact with to a point that it negatively impacts their lives. It is known that the amount of stimulation certain areas of the brain receive strongly influences the way an individual responds in situations where that part of the brain is used. It is also known that certain drugs can be effective at influencing the way certain regions of the brain interact with certain chemicals, and this is the basis for most work in the field of neurology. As such, a lot of work has been put into determining which substances could affect the parts of the brain responsible for our social behavior. Studies by James Rilling [1] [2] found links between increased activity in various regions of the brain that are part of our reward systems and the administration of various drugs that made the test subjects act in more or less cooperative ways. This aligns with the idea that our actions during social interaction are at least partially dictated by the rewards we perceive we are getting from them, which means that pathologies deriving from a diminished ability to correctly perceive received rewards, or to realistically estimate potential rewards, could be corrected by applying these substances in a specific way.
As shown by Yael Niv [3], RL models are good approximations of the decision-making processes that occur in our brain, as they emulate the reward-based and model-free nature of how we discover our environment in certain situations. This means that any drug-induced changes to an individual's decision process should manifest in the RL models estimated by observing the evolution of their actions in different situations.
This work was conducted in parallel with another study, performed by a research team at Instituto de Medicina Molecular (IMM), that seeks to relate the administration of certain substances to human individuals' performance in a social game. The team is working on a paper, currently being written [4], that relates RL models of human behavior during a Prisoner's Dilemma styled game to the types of strategies individuals tend to follow when playing.
We analyzed data resulting from prior brain examinations and tried to determine whether certain regions of the brain, constituents of the reward center, play a relevant part in the way we perceive and act on the current state of a social interaction.
We analyzed data from both of Rilling's studies [1] [2]. This data relates to an experiment conducted by the author's team in which human test subjects were examined with Functional Magnetic Resonance Imaging (fMRI) while they played repeated rounds of a Prisoner's Dilemma style game against computers and against computers that were perceived as humans. This involved:
• Extracting all the data produced in both of Rilling's studies [1] [2].
• Transforming the data so that the association between each action sequence and each reward-system activation sequence becomes evident for each subject.
• Processing the fMRI data so that individual reward center stimuli can be matched with specific events or actions along the experiment time frame.
• Using parameter fitting to fit the parameters of a Q-learning model to the action sequences performed by the test subjects.
• Studying the correlation between the amount of stimulation found in the reward centers of the brain and measures resulting from the Q-learning models.
2 Background
Contents
2.1 Prisoner’s Dilemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Exploration Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Two Sample T-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 One-way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.7 N-way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.8 Interior Point Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.9 FSL, FEAT and FeatQuery . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.10 Pearson’s Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.11 Spearman Rank Order Correlation . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1 Prisoner’s Dilemma
The PD is a thought experiment that places two prisoners in separate interrogation rooms. Each prisoner is told that if he incriminates his partner in the crime they are both accused of committing together, he walks free, provided the partner does not incriminate him back; in that case, the partner serves three years in jail. Conversely, if the prisoner remains silent while his partner incriminates him, he serves the three years and the partner walks free. If both incriminate one another, they serve two years each, but if both remain silent they serve only one year each.
Figure 2.1: PD punishment distribution matrix
The punishment distribution (Figure 2.1) across all possible outcomes is what makes this problem so interesting: it creates a Nash Equilibrium (a state in which the two participants in a game follow strategies from which neither is incentivised to deviate) around the outcome of mutual defection, since no matter what the partner decides to do, the prisoner contemplating the choice always stands to gain an immediate advantage by defecting. This suggests that if two rational players were to face each other in a PD-style game, they would always defect. However, in the real world we observe many instances of rational individuals identifying cooperation as the most advantageous move, since they find they can cultivate trust in others. The PD and other dilemmas like it have inspired new fields of study that seek to explain and model real-world behaviors, as well as to find strategies that encourage people to make more socially responsible choices.
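The best-response argument above can be written out in a few lines of Python. This is only an illustrative sketch: the jail terms come from the dilemma as described, while the encoding of the choices and the dictionary layout are our own.

```python
# Years in jail for (my_choice, partner_choice); lower is better.
# Values taken from the dilemma described above.
SILENT, DEFECT = 0, 1
years = {
    (SILENT, SILENT): 1,  # both stay silent
    (SILENT, DEFECT): 3,  # I stay silent, partner defects
    (DEFECT, SILENT): 0,  # I defect, partner stays silent
    (DEFECT, DEFECT): 2,  # both defect
}

def best_response(partner_choice):
    """Return the choice that minimises my jail time given the partner's choice."""
    return min((SILENT, DEFECT), key=lambda my: years[(my, partner_choice)])

# Defecting is the best response no matter what the partner does,
# so mutual defection is the Nash Equilibrium.
assert best_response(SILENT) == DEFECT
assert best_response(DEFECT) == DEFECT
```

Since defecting minimises jail time against either partner choice, neither prisoner benefits by unilaterally deviating from mutual defection.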
2.2 Reinforcement Learning
Reinforcement learning is a machine learning technique used to build agents that learn as they interact with their environment, becoming progressively better at performing their task. The learning process is iterative: in each iteration, the agent chooses an action based on the information gathered from previous interactions with the section of the environment it is currently in, so the agent learns the intricacies of the environment in a "piece-by-piece" fashion. These environments, or state spaces, are suited to being modeled as Markov Decision Processes (MDPs), since most RL algorithms expect discrete state spaces and agent-environment interactions that follow the Markov Property.
Upon choosing an action for the current state, the agent observes "feedback" from the environment regarding the quality of the action taken. This feedback may or may not align with the agent's expectations. If the agent was expecting a different outcome, it changes its expectations so that, in future interactions, it is better able to choose the most advantageous action.
After a sufficient number of interactions, the agent will ideally have visited every possible state in the state space often enough to have tried all available actions a sufficient number of times. This should give it the ability to decide which action to take no matter which state it finds itself in.
2.3 Q-Learning
Q-learning is an off-policy RL algorithm that learns the optimal policy for navigating any finite state space (usually stored in the form of a matrix), provided it has the opportunity to reach each state a sufficient number of times. The update of the Q-values is given by equation 2.1: the Q-value for state x and action a is obtained by adding to its current value a learning rate α multiplied by the sum of the recently acquired reward, the estimate of the reward that would be acquired by following a greedy policy from the next state (weighted by the discount factor γ), and the negative of the Q-value's current value.

Q(x,a) = Q(x,a) + α(r(x,a) + γ max_b Q(y,b) − Q(x,a)),   α, γ ∈ ]0, 1]   (2.1)

σ = r(x,a) + γ max_b Q(y,b) − Q(x,a)   (2.2)

The expression in equation 2.2 is often referred to as the RPE (Reward Prediction Error), as it represents the error between the reward received and the one the agent expected to receive. In summary, Q-learning works by iteratively adding the RPE observed for each state-action pair to the previously estimated Q-value, weighting the RPE and the maximum future action reward by an α, γ ∈ ]0, 1] pair.
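A minimal sketch of the update rule in equations 2.1 and 2.2, assuming a single-state game with cooperate/defect actions; the variable names and data layout are illustrative, not taken from the thesis's actual implementation:

```python
def q_update(Q, x, a, r, y, alpha, gamma):
    """One Q-learning step: Q(x,a) += alpha * RPE, with
    RPE = r + gamma * max_b Q(y,b) - Q(x,a)  (equations 2.1 and 2.2)."""
    rpe = r + gamma * max(Q[y].values()) - Q[x][a]
    Q[x][a] += alpha * rpe
    return rpe

# Single-state game 's' with actions 'C' (cooperate) and 'D' (defect).
Q = {'s': {'C': 0.0, 'D': 0.0}}
rpe = q_update(Q, 's', 'C', r=2.0, y='s', alpha=0.5, gamma=0.9)
# With all Q-values at zero, the whole reward is "surprising": RPE = 2.0,
# and Q('s','C') moves halfway toward it (alpha = 0.5), becoming 1.0.
```

Repeated calls shrink the RPE as the Q-value converges toward the expected return, which is exactly the iterative error-correction described above.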
2.4 Exploration Methods
To guarantee that the Q-learning algorithm visits a sufficiently diverse set of states, the policy it follows must not be purely greedy. Many methods, some simpler than others, exist to guarantee such diversity.
2.4.1 ε–greedy
In the ε–greedy policy, the agent executes the action indicated by a greedy policy with probability 1 − ε. With probability ε, the agent performs an exploration step, selecting one of the non-optimal actions at random with a uniform probability distribution over them. This ensures that the algorithm balances exploring the world for new knowledge against exploiting the already accumulated knowledge to generate value.
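As a sketch, the ε–greedy rule as described above (exploring uniformly among the non-optimal actions) could look like the following; the dictionary-of-Q-values interface is our own choice:

```python
import random

def epsilon_greedy(Q_state, eps):
    """Pick the greedy action with probability 1 - eps; otherwise explore by
    choosing uniformly among the remaining (non-optimal) actions."""
    greedy = max(Q_state, key=Q_state.get)
    if random.random() < eps:
        others = [a for a in Q_state if a != greedy]
        return random.choice(others) if others else greedy
    return greedy

# eps = 0 always exploits; eps = 1 always explores.
assert epsilon_greedy({'C': 1.0, 'D': 0.2}, eps=0.0) == 'C'
assert epsilon_greedy({'C': 1.0, 'D': 0.2}, eps=1.0) == 'D'
```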
2.4.2 Boltzmann policy
The Boltzmann policy functions similarly to ε–greedy in that it deviates from choosing the optimal action with a certain probability. Its sophistication lies in the fact that the probability of choosing any specific action is proportional to the value the RL algorithm currently assigns to it (in this case a Q-value).

P(a|x) = e^(ηQ(x,a)) / Σ_{b∈A} e^(ηQ(x,b)),   η ∈ [0, +∞[   (2.3)

As can be seen in 2.3, the probability that an agent following the Boltzmann policy chooses action a in state x is given by a fraction whose numerator e^(ηQ(x,a)) grows with the Q-value of (a | x) and whose denominator Σ_{b∈A} e^(ηQ(x,b)) sums e^(ηQ(x,b)) over all actions available in x. In this way, the probability of taking a certain action increases as its Q-value increases, making the balance between exploration and exploitation change dynamically as exploration progresses. The value of η controls how greedily the policy behaves: the higher the value of η, the greedier the policy.
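Equation 2.3 can be sketched directly as a softmax over Q-values. The two-action example below (our own toy values) illustrates how η interpolates between uniform exploration (η = 0) and near-greedy behavior (large η):

```python
import math

def boltzmann_probs(Q_state, eta):
    """P(a|x) = exp(eta*Q(x,a)) / sum_b exp(eta*Q(x,b))  (equation 2.3)."""
    weights = {a: math.exp(eta * q) for a, q in Q_state.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

Q_state = {'C': 1.0, 'D': 0.0}
p_uniform = boltzmann_probs(Q_state, eta=0.0)   # eta = 0: ignores Q-values
p_greedy = boltzmann_probs(Q_state, eta=20.0)   # large eta: near-greedy
assert abs(p_uniform['C'] - 0.5) < 1e-9
assert p_greedy['C'] > 0.999
```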
2.5 Two Sample T-Test
In most branches of scientific work it is necessary to test the validity of experimental results. For this purpose, statistical testing tools are used; one of the best known is the Two Sample T-test [5]. The Two Sample T-test determines whether the means of the two populations from which two samples were extracted (one from each population) actually differ, or whether the difference observed in the samples is caused by statistical noise. As in other hypothesis testing techniques, a null hypothesis H0 must be defined such that rejecting (or failing to reject) it provides relevant information about the statistical validity of the results.
In cases where a comparison between populations must be made (e.g. a population that took a placebo against a population that took some experimental medication), a Two Sample T-test serves us well by indicating whether the two populations are significantly different from each other (meaning the medication had an effect). For this, H0 must state that both populations have equal means, as in 2.4.
H0 : µ1 = µ2 (2.4)
The Two Sample T-test defines a test statistic, given in 2.5, that is used to determine whether H0 should or should not be rejected, where x̄ and ȳ are the means of the samples taken from populations 1 and 2 respectively, s_x and s_y are the corresponding sample standard deviations, and n and m are the respective sample sizes.

T = (x̄ − ȳ) / √(s_x²/n + s_y²/m)   (2.5)
After determining the value of T, one can determine whether H0 should be rejected by following 2.6, where t(1 − α/2, v) corresponds to the critical value of a t-distribution [6] at significance level α (usually equal to 0.05) with v degrees of freedom, where v is given by 2.7.

|T| > t(1 − α/2, v)   (2.6)

v = (s_x²/n + s_y²/m)² / [ (s_x²/n)² / (n − 1) + (s_y²/m)² / (m − 1) ]   (2.7)
If 2.6 turns out to be true, then H0 is rejected, meaning the two distributions are likely to have different means and the results are statistically significant. If 2.6 turns out to be false, then H0 is not rejected, meaning the results are not statistically significant.
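As an illustration, the statistic in 2.5 and the degrees of freedom in 2.7 can be computed with Python's standard library; the sample data below are arbitrary toy values, not data from the study.

```python
from statistics import mean, variance

def welch_t(x, y):
    """Two Sample T statistic and degrees of freedom v (equations 2.5 and 2.7);
    variance() is the sample variance, i.e. s^2."""
    n, m = len(x), len(y)
    vx, vy = variance(x) / n, variance(y) / m   # s_x^2/n and s_y^2/m
    T = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    v = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return T, v

T, v = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
# H0 is rejected when |T| exceeds the critical value t(1 - alpha/2, v).
```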
2.6 One-way ANOVA
When comparing independent samples from three or more populations, One-way ANOVA [7], or One-way Analysis Of Variance, provides the capability of performing statistical tests on the validity of experimental results. As with the Two Sample T-test, a null hypothesis H0 must be defined, stating that the means of all populations are equal, as in 2.8.
H0 : µ1 = µ2 = ... = µk (2.8)
The One-way ANOVA defines its test statistic 2.9 as a function of the values of MSC and MSE.

F = MSC / MSE   (2.9)
MSE, the Mean of the Squared Errors between samples of the same sample group, is computed
as in 2.10, where SSE corresponds to the Sum of Square Errors between all samples of each
sample group, N corresponds to the total number of samples across all sample groups and k corresponds to
the number of sample groups. Similarly, MSC, the Mean of the Squared Errors between
the means of all the sample groups, is computed as in 2.11, where SSC, the Sum of Square Errors
between the means of all the sample groups, must first be determined.
MSE = SSE / (N − k)    (2.10)

MSC = SSC / (k − 1)    (2.11)
SSE can be computed as in 2.12, where x_ij corresponds to the value of the j-th sample of the i-th
sample group and x̄_i corresponds to the mean of the i-th sample group. Similarly, SSC can be computed as
in 2.13, where x̄ corresponds to the mean value of all the samples across all the sample groups, as in 2.14.
SSE = Σ_{i=1..k} Σ_{j=1..n_i} (x_ij − x̄_i)²    (2.12)

SSC = Σ_{i=1..k} Σ_{j=1..n_i} (x̄_i − x̄)²    (2.13)

x̄ = (1/N) Σ_{i=1..k} Σ_{j=1..n_i} x_ij    (2.14)
After computing F the statistical test can be performed as in 2.15, where F(1−α, k−1, N−k) corresponds
to the critical value of an F-distribution at significance level α (usually equal to 0.05) with k−1
and N−k degrees of freedom.
F > F(1−α, k−1, N−k)    (2.15)
If 2.15 turns out to be true then H0 is rejected, meaning that at least one of the populations is likely
(with confidence 1−α) to be significantly different from the others.
2.7 N-way ANOVA
When analyzing a problem where one dependent variable may or may not be influenced by a group
of independent variables and their interactions, the N-way ANOVA [8] is one of the most used tools to
study these possibilities. This scenario generates numerous null hypotheses, since not only is the effect
of each independent variable by itself on the dependent variable studied, but the interactions among
the independent variables might also have an effect on the dependent variable. For a set of
independent variables α, β, γ the null hypothesis for the effect of each variable by itself
is defined as in 2.16, while the interactions between them are defined as in 2.17 (for an interaction
between independent variables α and β) or as in 2.18 (for an interaction between all three independent
variables).
H0 : α1 = α2 = ... = αi (2.16)
H0 : α1β1 = α1β2 = ... = αiβj (2.17)
H0 : α1β1γ1 = α1β1γ2 = ... = αiβjγk (2.18)
The N-way ANOVA is a specific case of a General Linear Model and can be defined (for the three
variables described) as in 2.19, where µ represents the overall mean of the dependent variable across all
groups, αi, βj and γk represent the effect of group i, j or k of the respective independent variable
on the overall mean, interaction terms such as (βjγk) represent the effects of interactions between
the independent variables involved, and εijk represents the error associated with groups i, j and k when
present in a sample at the same time.
yijkr = µ+ αi + βj + γk + (αiβj) + (αiγk) + (βjγk) + (αiβjγk) + εijk (2.19)
All the independent-variable parameters in 2.19 are subject to a constraint that forces the sum
of the parameters across all groups to be 0 (e.g. equation 2.20 shows this constraint for the independent
variable α).
Σ_{i=1..I} α_i = 0    (2.20)
The estimation of the independent-variable group parameters is usually done through an Iteratively
Re-weighted Least Squares method that iterates through the provided samples and finds the parameters
that best fit the General Linear Model to the data.
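For a balanced design with no missing cells, the main-effect parameters under the sum-to-zero constraint 2.20 reduce to the deviation of each level's marginal mean from the overall mean; in that special case the iterative fit and this closed form coincide. A minimal sketch with made-up data for two factors:

```python
import numpy as np

# Made-up balanced 2x3 design: y[i, j, r] holds replicate r for factor-A
# level i and factor-B level j (2 A-levels, 3 B-levels, 2 replicates).
rng = np.random.default_rng(0)
y = rng.normal(size=(2, 3, 2)) + np.array([1.0, -1.0])[:, None, None]

mu = y.mean()                         # overall mean of the dependent variable
alpha = y.mean(axis=(1, 2)) - mu      # main effect of each A level
beta = y.mean(axis=(0, 2)) - mu       # main effect of each B level

# The sum-to-zero constraint (eq. 2.20) holds by construction.
assert np.isclose(alpha.sum(), 0.0) and np.isclose(beta.sum(), 0.0)
```

For unbalanced designs these marginal-mean estimates no longer equal the least-squares fit, which is when the iterative method the thesis mentions becomes necessary.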
2.8 Interior Point Algorithm
The Interior Point algorithm [9] is a local optimization algorithm that minimizes a given function
while obeying certain constraints on its domain, through the addition of barrier functions to the function
being minimized. In the context of this study all the constraints applied to the parameters of the function
are constant, as in 2.21. According to barrier function theory [10], when dealing with a function
f(x) subject to the constraint x > b (2.21), we can dismiss the inequality constraint by adding µ c(x) to f(x)
(2.22). The restriction is maintained so long as c(x) = +∞ whenever x < b and µ is a free parameter that,
as it approaches zero, allows proposed solutions of the minimization of 2.21 to approach b.
f(x), x > b (2.21)
f(x) + µ c(x), c(x) = +∞ if x < b (2.22)
One commonly used barrier function is −log(x) since it tends to infinity when x tends to zero 2.23.
f(x) − µ log(x−b) (2.23)
And so, by minimizing 2.23 iteratively while decreasing µ, we find a local minimum of 2.21, depending
on the initial value for x (x0). When dealing with functions with multiple local minima the Interior Point
algorithm may fail to find the global minimum. One way to circumvent this is to run the method multiple
times with different x0 and keep the best solution, thus increasing the probability that the global minimum
is found.
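The barrier iteration 2.23 can be sketched for a one-dimensional objective; the objective f(x) = (x − 2)² and the bound b = 1 are invented for illustration, and the bounded 1-D minimizer stands in for the full interior-point machinery:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative objective (made up): f(x) = (x - 2)^2, constrained to x > b = 1.
f = lambda x: (x - 2.0) ** 2
b = 1.0

# Log-barrier iteration (eq. 2.23): minimize f(x) - mu*log(x - b) while
# shrinking mu toward zero.
x_opt = None
for mu in [1.0, 1e-2, 1e-4, 1e-6]:
    barrier = lambda x, mu=mu: f(x) - mu * np.log(x - b)
    # 1-D bounded minimization strictly inside the feasible region x > b.
    res = minimize_scalar(barrier, bounds=(b + 1e-9, 10.0), method="bounded")
    x_opt = res.x

# The unconstrained minimum x = 2 already satisfies x > 1, so the iterates
# converge to it. For non-convex f one would restart from several x0 values
# and keep the best solution, as described above.
assert abs(x_opt - 2.0) < 1e-3
```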
2.9 FSL, FEAT and FeatQuery
FSL [11] is a library of tools that can be used to analyze Functional Magnetic Resonance Imaging
(fMRI) and other neuro-imaging exams. In particular, two tools are relevant to this project:
FEAT and FeatQuery.
FEAT is responsible for analyzing fMRI data that has been preprocessed from its initial raw state into
an analyzable one using BET (another FSL tool). For that, the user must identify each event that
they wish to have analyzed. When using FEAT, one should consider an event as a group of occurrences
that will be averaged before being analyzed. This means that it is valid to define an event containing
all the occurrences of a certain type during an experiment, in order to understand what
happens in the brain during that type of event (on average), as well as to define each event as a single
occurrence, to get data on that specific occurrence.
After analyzing the data with FEAT, the user can resort to FeatQuery to read the files produced by
FEAT. For that it is important for the user to define what is called a Region of Interest (ROI). A ROI
is defined by a mask that delimits a region of the brain that the user wishes to have data on.
Figure 2.2: Mask that identifies a ROI around the left amygdala, one of the constituents of the reward system of the subject’s brain.
After applying the mask and defining the ROI, FeatQuery will take the data of each voxel inside the
ROI and average it. Note that if the user defined events in FEAT composed of various occurrences, each
voxel will hold the average stimulation present in that region of the brain across all occurrences.
After the whole procedure the user should have the amount of stimulus a certain area of the brain was
subjected to during an event.
2.9.1 FEAT
FEAT is the first tool we will use in the BOLD response data extraction. It is responsible for
extracting the parameters of the function that describes the BOLD values through time, taking into
account that certain time points are tagged with certain events (for example, the beginning or end of a
round).
Figure 2.3: FEAT’s Miscellaneous menu
Figure 2.4: FEAT’s Data menu
In the Miscellaneous menu, as seen in figure 2.3, the ”Brain/Background threshold” field
allows the regulation of the threshold for background noise in the fMRI. Only stimulation above a
certain value will be considered relevant for the analysis, while values below it will be treated as noise.
Using the ”Z threshold” option we can define the threshold that separates relevant BOLD responses from
weaker ones that may still be registered. These weaker BOLD responses might result from the lingering
effects of a past activation that still registers in the brain.
In FEAT’s Data menu (figure 2.4) we selected FEAT directories as the format of our input
(since that was the format in which data from previous processing phases was delivered to us) and chose
which files to input. Then we set the total number of volumes (brain images) along with the time each of
those volumes took to collect. Finally, we chose the cutoff value for a high-pass filter, to reduce noise in
the extraction.
Figure 2.5: FEAT’s Stats menu
Figure 2.5 shows FEAT’s Stats menu. This menu contains options to, for example, compensate for
motion during scans (motion correction), but since Rilling’s team did not use this functionality, neither
did we.
Figure 2.6: FEAT’s Post-stats menu
Figure 2.7: FEAT’s Full model setup menu, events tab
In figure 2.6 we can see FEAT’s Post-stats menu. In this menu we instructed FSL to look for clusters
of voxels (in the Thresholding option) that show average activation values above the value defined in the
Z threshold field, while the P threshold defines the p-value used to decide whether the Z threshold was
passed or not.
By selecting the option Full Model Setup in FEAT’s Stats menu we reach the menu shown in figure
2.7. In this menu we selected the number of events each experiment contained. In our case we considered
60 events, 30 of them being the beginning of each of the 30 rounds, while the other 30 corresponded to
the moment rewards were received in each round.
We gave each event a name and chose a function to model the hemodynamic response felt in
the brain during its occurrence. We chose the sinusoidal function as it is a staple when modelling this
type of brain activity.
Figure 2.8: FEAT’s Full model setup menu, contrasts tab
In the Contrasts tab of the Full Model Setup menu (fig. 2.8) we defined the matrix that relates the
events to the baseline. We could also have related events to one another, but that did not make sense due
to the design of the experiment and the singular nature of each of our events. No F-tests were performed
since our events are singular instances and not averages of many occurrences.
After concluding all the steps necessary to process one subject’s data into a FEAT folder, we used a
FEAT functionality to export the whole configuration to a script file, so that it could later be
adapted to other subjects via a script.
2.9.2 FeatQuery
Figure 2.9 shows the FeatQuery menu. We used FeatQuery to apply the mask that defined our ROI.
First, we selected the folder containing the data processed by FEAT relative to one subject, as subjects
were processed one at a time. Then we chose the file containing the mask of that specific subject.
Figure 2.9: The FeatQuery menu
Due to the original mask being in standard space (a standardized set of coordinates used in
neuroscience) it had to be transformed to the space each subject’s measurements were performed in
(highres space). This transformation was conducted with the command line script detailed below:
flirt -in [Standard Space Mask] -ref [Output of Highres File] -applyxfm -init [Standard
to Highres transformation file] -datatype float -out [Transformed Mask Location]
The transformed mask obtained from this command was the one fed to FeatQuery.
2.10 Pearson’s Correlation Coefficient
Determining the correlation factor between two characteristics of a patient’s brain (like the stimulation
data from an fMRI and the data from a Q-learning model) will be very important during the course of
this work. When performing correlation analysis, Pearson’s Correlation Coefficient (PCC) proves to be
a very powerful tool in quickly determining the linear correlation between two features or variables.
ρ(X,Y) = (1/(N−1)) Σ_{i=1..N} ((X_i − µ_X)/σ_X)((Y_i − µ_Y)/σ_Y)    (2.24)
The PCC can be calculated as seen in 2.24, where µ_X is the average of all the values of X and σ_X
their standard deviation; its value varies between -1 and 1. Most problems require the analysis of
the PCC between various features, and a matrix format is adopted to display the correlation coefficients
for quick analysis. An example can be seen in Table 2.1.
        A        B        C
A    ρ(A,A)   ρ(A,B)   ρ(A,C)
B    ρ(B,A)   ρ(B,B)   ρ(B,C)
C    ρ(C,A)   ρ(C,B)   ρ(C,C)
Table 2.1: Pearson’s Correlation Matrix
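Equation 2.24 and the matrix layout of Table 2.1 can be sketched as follows (the three feature vectors are invented for illustration); `np.corrcoef` produces the same matrix:

```python
import numpy as np

# Three illustrative feature vectors (made-up values).
A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
B = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
C = np.array([5.0, 3.0, 4.0, 1.0, 2.0])

def pcc(x, y):
    """Pearson's correlation coefficient as in eq. 2.24 (sample std, N-1)."""
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return (zx * zy).sum() / (n - 1)

# Correlation matrix as in Table 2.1; np.corrcoef gives identical entries.
matrix = np.corrcoef(np.vstack([A, B, C]))
assert np.isclose(pcc(A, B), matrix[0, 1])
```

The diagonal entries ρ(A,A), ρ(B,B), ρ(C,C) are always 1, since every variable is perfectly correlated with itself.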
2.11 Spearman Rank Order Correlation
As an alternative to Pearson’s correlation we considered studying the Spearman correlation between
our parameters, as it is able to capture types of relationships between variables that the PCC cannot.
Computing the Spearman Rank Order Correlation Coefficient is very similar to the process required
to determine the PCC. There is only one extra step, which consists in ranking the vectors
containing the two variables being correlated.
Figure 2.10: Two associated variables (A and B) and their ranked versions
Figure 2.10 shows two variables A and B and their respective ranked variables. They are ranked in
integers from lowest to highest. In figure 2.11 we can see the unranked variables A and B plotted along
with the line that would define their Pearson’s correlation compared to a similar plot of their ranked
counterparts.
Figure 2.11: Plot of variables A and B (unranked) and their respective trendline compared to a plot of their ranked counterparts.
We can infer by visual inspection that, were the PCC to be determined for the unranked variables,
its value would not be very high. On the other hand, the linear correlation between the ranked variables
is perfect. By ranking the variables we expose a relationship between them that was not captured by the
PCC. This operation is equivalent to fitting the unranked variables to any monotonic function. The
Spearman correlation has the benefit of being less affected by outliers than Pearson’s correlation,
while being more prone to identifying spurious relationships due to its flexibility.
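The rank-then-correlate procedure can be sketched on an invented monotonic but non-linear pair of variables, where Spearman reaches 1 while Pearson stays below it:

```python
import numpy as np
from scipy import stats

# Monotonic but non-linear relationship (made-up values): B = A cubed.
A = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
B = A ** 3

# Spearman = Pearson applied to the ranked variables.
rho_spearman = np.corrcoef(stats.rankdata(A), stats.rankdata(B))[0, 1]
rho_pearson = np.corrcoef(A, B)[0, 1]

# Ranking exposes the perfect monotonic relationship the PCC understates,
# and matches SciPy's built-in Spearman computation.
assert np.isclose(rho_spearman, 1.0)
assert rho_pearson < rho_spearman
assert np.isclose(rho_spearman, stats.spearmanr(A, B)[0])
```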
3 Related Work
Contents
3.1 Overview
3.2 Base Studies
3.3 The validity of modeling brain processes with RL
3.4 The importance of the striatum when dealing with reward prediction error
3.1 Overview
Figure 3.1 presents an overview of papers studying changes in behaviour and brain functioning as their
subjects execute reward oriented tasks, along with their methodologies and results, with ↑ indicating increases and
↓ indicating decreases in whichever metric is referenced in front of it. Some of these studies warrant
further inspection in subsequent sub-chapters.
Figure 3.1: Overview table
3.2 Base Studies
The studies described below served as the main basis for this work. Both were performed on humans,
with the same methodology. The data analyzed in this work originated from the studies below.
3.2.1 Effects of intranasal Oxytocin and Vasopressin on cooperative behavior
and associated brain activity in men
Oxytocin (OT) and Vasopressin (AVP) are substances known to affect social behavior in humans. In
”Effects of intranasal Oxytocin and Vasopressin on cooperative behavior and associated brain activity
in men” [1] the researchers investigated the effects of these substances in human male subjects through
a series of matches of an iterated version of the PD game, played against computers and against computers
that the subjects perceived as human beings.
In this variant of the PD game, in each round, the player that goes first (Player 1) chooses to either
cooperate or defect. After he makes his choice, a variable amount of time passes and his opponent (Player
2) sees the choice made by Player 1 and makes his own. After both players choose their actions,
they see the outcome of the round, after which another round starts.
Figure 3.2: Time-line for one round of PD regardless of the nature of the opponent
When both players defect (DD) both will be rewarded with 1$. When both players cooperate (CC)
both will receive 2$ instead. When one player cooperates but the other defects (CD or DC), meaning
when one player “cheats” the other, the defecting player will get 3$ while the cooperating player gets
nothing.
Figure 3.3: PD payoff matrix for the game performed in Rilling’s study [1]
As is usual in Prisoner’s Dilemma-like games, there exists a Nash Equilibrium, a state where no
participant is motivated to switch strategies, in the situation of mutual defection (DD), as either
player loses 1$ by unilaterally deviating from his strategy. At the same time, mutual cooperation (CC)
is unstable, since each player stands to gain 1$ more by defecting if he assumes the other
player will continue to cooperate.
In the adapted PD game the players play thirty rounds in a row, and each subject plays two games
against each opponent type (human or computer), one as Player 1 and the other as Player 2. In reality, all
games were played against a computer algorithm, regardless of whether players believed they were playing
a human or a computer. This algorithm behaved in a way that human players would not easily suspect
its nature when passing as human. When playing as Player 2, the algorithm would defect back on a
defecting Player 1 90% of the time, but would only cooperate with a cooperating Player 1 67% of the time.
In this way the algorithm acts in a self-preserving way when facing a defector, rarely allowing itself to
be exploited, while using its position of power to sometimes (33% of the time) defect on a cooperating
Player 1 in order to increase its gains. When playing as Player 1 the algorithm plays a probabilistic
tit-for-two-tats. It starts by always cooperating in the first round, since defecting would result in a virtually
assured mutual defection that could set a negative trend for the whole game. If cooperated back (CC) the
algorithm will always continue cooperating. When defected back (CD) the algorithm will try to cut losses
by defecting 90% of the time in the next round. This would likely lead to a mutual defection scenario (DD)
in subsequent rounds, in which case the algorithm would try to resume cooperation by cooperating 33% of
the time. If at any time Player 2 decided to return to cooperation the algorithm would reciprocate 100%
of the time.
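The opponent policy described above can be sketched as follows; the probabilities are those reported in the study, while the class name, state encoding and RNG handling are our own illustrative choices:

```python
import random

class RillingOpponent:
    """Sketch of the computer opponent described above ('C' = cooperate,
    'D' = defect). The probabilities are taken from the text; everything
    else about this implementation is an assumption."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    def as_player2(self, p1_action):
        # Self-preserving response: defect back on a defector 90% of the
        # time, cooperate back with a cooperator only 67% of the time.
        if p1_action == 'D':
            return 'D' if self.rng.random() < 0.90 else 'C'
        return 'C' if self.rng.random() < 0.67 else 'D'

    def as_player1(self, last_outcome=None):
        # last_outcome is (own_action, opponent_action) from the previous
        # round, or None on the first round.
        if last_outcome is None:
            return 'C'          # always open with cooperation
        own, opp = last_outcome
        if own == 'C' and opp == 'C':
            return 'C'          # sustain mutual cooperation
        if own == 'C' and opp == 'D':
            # Cut losses after being defected on: defect 90% of the time.
            return 'D' if self.rng.random() < 0.90 else 'C'
        if opp == 'C':
            return 'C'          # always reciprocate a returning cooperator
        # Mutual defection: attempt to resume cooperation 33% of the time.
        return 'C' if self.rng.random() < 0.33 else 'D'
```

Simulating a game then reduces to calling `as_player1` (or `as_player2`) once per round with the previous round's outcome.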
The human subjects were monitored through fMRI during the whole game. The researchers also
registered the time stamps of specific events, like round starts, Player 1 discovering that he was defected
on by his opponent, etc., allowing them to extract brain information associated with relevant occurrences.
After analyzing the data, the study made some interesting findings:
• Players as Player 1 across all drug groups were 9% more likely to cooperate after a CC outcome
when facing perceived humans (89%) than when facing a computer (80%). (p = 0.0002)
• Players as Player 1 across all drug groups were 9% less likely to cooperate after a CD outcome when
facing perceived humans (24%) than when facing a computer (33%). (p = 0.005)
• Players as Player 1 across all drug groups were 13% less likely to cooperate after a DD outcome
when facing perceived humans (38%) than when facing a computer (51%). (p = 0.0002)
• There were differences in behavior between players, playing as Player 1, that took OT and ones
that took AVP but when comparing each drug group with the placebo takers the differences were
not statistically significant.
• Players as Player 2 across all drug groups were 14% more likely to cooperate after a Cooperative
play when facing perceived humans (88%) than when facing a computer (74%). (p = 3 x 10^-7)
• Players as Player 2 that were administered AVP were 10% more likely to cooperate after a Co-
operative play when facing perceived humans (96%) than subjects treated with OT (86%). (p =
0.01)
• Players as Player 2 that were administered AVP were 21% more likely to cooperate after a Co-
operative play when facing a computer (88%) than subjects treated with placebos (67%). (p =
0.007)
This seems to indicate that all subjects in general were:
• Less likely to betray cooperative humans than computers.
• More likely to lose trust in their opponent when facing humans that defected them.
• More likely to try to escape the DD Nash Equilibrium when facing computers.
• Less likely to betray players that trusted them when those players were human.
Also, players that were administered AVP were:
• More likely to reward Cooperative actions with cooperation of their own than OT administered
subjects when facing humans.
• More likely to reward Cooperative actions with cooperation of their own than placebo takers when
facing computers.
3.2.2 Sex differences in the neural and behavioral response to intranasal oxy-
tocin and vasopressin during human social interaction
The first study by Rilling’s team on the effects of OT and AVP in the brain [1] only contained male
subjects, due to certain hormones showing widely variable levels across the menstrual cycle of women.
One of those hormones, estradiol, is known to affect OT receptors. Because of this, in their new study
‘Sex differences in the neural and behavioural response to intranasal oxytocin and vasopressin during
human social interaction’ [2], estradiol levels were measured by taking blood samples from all subjects
and were compensated for in the models built.
The rest of the process was the same as in their previous study [1]. The gathered
data was aggregated and studied. While playing as Player 1 the results seem to suggest that:
• Women administered OT seem to be less likely to maintain cooperation when playing computers.
• Women administered either OT or AVP seem to be more likely to maintain defection when playing
computers.
• Both OT and AVP seem to make women differentiate less between computers and humans when
deciding to cooperate or not after mutual defection.
• Comparisons between male and female behavior as player 1 did not yield statistical significance.
While playing as player 2 the results seem to suggest that:
• Women administered OT seem to be less prone to defect on cooperative human players.
• Women administered AVP seem to be more likely to cooperate with individuals after they have
defected them.
• Men administered AVP seem to be more cooperative towards players that were cooperative in the
past both computer and human.
• Men administered OT seem to be more defective when playing computers that were defective in
the past.
3.3 The validity of modeling brain processes with RL
Previous studies [3, 12–14] have modelled, or hinted at the possibility of modelling, reward based learning
tasks with RL. The papers described below present some of the existing scientific work that
can be taken as a basis to validate the use of RL models to represent brain processes in situations
where individuals learn optimal behaviours by receiving rewards associated with their choices.
3.3.1 Reinforcement learning in the brain
In ‘Reinforcement learning in the brain’ [3], Niv presents evidence of the way RL models the function of
dopaminergic neurons in the brains of mammals and humans. The author starts by referencing previous
experiments with animals which showed that when a sound was played before a monkey was fed, the sound
would trigger an increase in activity in the monkey’s reward centers. However, this increase in stimulation
would diminish with time as the sound was played more and more. The author suggests that
this might indicate that, since the animal was getting used to being fed after the sound, its reward
centers were not firing up as much as they were before the habituation had set in.
Figure 3.4: A monkey’s neuro-conditioning to a sound (CS) followed by being fed juice (US) in instants a), b) and c)
Figure 3.4 shows a measure of a monkey’s reward center stimulation over time, with two events.
The first, CS, corresponds to the playing of a sound, while the second, US, corresponds to the feeding
of a tasty juice. In (a), it can be observed that as the trials advance the monkey’s brain starts having
its activity peaks before the juice is fed to it, and these peaks also become less intense. In (b) it can be
seen that the monkey is now used to the reward of the juice, but when the sound is introduced its brain
still shows activity. In (c) the sound is played but no juice is given, so when failing to receive the juice the
monkey experiences a negative stimulus. The author emphasizes the similarities between Temporal
Difference Learning, a type of RL model, and the behavior observed both in the monkey and in its brain.
The author then elaborates on the way the necessity of using non-invasive brain activity measuring
techniques impacts tests with humans. According to the author, Functional Magnetic Resonance
Imaging (fMRI) is the most widely used technique, since it is non-invasive, but it presents problems due to
the amount of noise it produces and the amount of data users must filter. Despite its flaws, it
remains a very popular technique for measuring the BOLD response in areas of the brain.
The paper points to other works in the neurosciences that identify various neural controllers in the
brain which, the scientific community hypothesizes, are used both for their specific tasks and in conjunction
with each other to perform more complex decisions. This can prove problematic when performing
experiments in this area, as one specific task might use neural controllers that can easily be modeled by
RL algorithms, while for other tasks of the same complexity this might be impossible. The author
also points out that many of these neural controllers work in the absence of dopamine. This further
supports the idea that neural controllers are varied and that, for now, it is unrealistic to try to find a model
that can explain all the decision processes made by an animal in different situations. It also indicates
that exploring substances other than dopamine might yield important results.
The author concludes by reiterating the parallels between the RPE measured in many RL algorithms
and the brain stimulation measured in the moments where animals and humans encounter discrepancies
between their expected and received rewards, and by pointing out that RL models have had, so
far, unprecedented success in modeling neural controllers, which, due to their relative simplicity, can be
very useful in furthering our knowledge of the inner workings of the brain.
3.3.2 Actions, Policies, Values, and the Basal Ganglia
In ‘Actions, Policies, Values and the Basal Ganglia’ [12] the authors propose a model for the decision
processes that occur when an individual decides which of the actions available to him is optimal.
The authors argue that there are two main types of behavior: habitual and goal-directed behavior.
They label habitual behavior as devaluation-insensitive and goal-directed behavior as devaluation-sensitive.
Devaluation-insensitive behavior is behavior that does not adapt to changes in the environment
(i.e. subjects may continue eating food despite being satiated), while devaluation-sensitive behavior
responds to these changes by realigning the subject’s priorities when selecting a new action. The authors
argue that both model-based and model-free RL fit the devaluation-insensitive category, as these models
must relearn their values once the environment has changed, while devaluation-sensitive systems
immediately detect changes in the environment and adapt accordingly.
The authors point out that the Basal Ganglia is an area of the brain closely related to the decision
processes of habitual behavior. As evidence they point to various studies on lesions in this area and the
effects they had on the habitual behaviors of the patients. They note that when injured, only the habitual
behavior control system seemed to be affected, which suggests that, despite many situations demanding
joint action from both habitual and goal-oriented systems (and despite there being no physical evidence
of arbitration between these systems), they can still act independently of each other.
The authors conclude by reiterating the duality of their proposed model, in which one controller
decides on habitual behavior while another decides on goal-oriented behavior. Since RL has been shown
to have more success modelling habitual behavior, this serves as further evidence that only certain parts
of the decision process of humans and animals can be explained by RL.
3.3.3 Model-based fMRI and its application to Reward Learning and Deci-
sion making
In ’Model-based fMRI and its application to Reward Learning and Decision making’ [15] we find
a detailed analysis of the technique of model-based fMRI, its advantages over more traditional fMRI
applications and some work performed in this area. The analysis describes the technique as the study
of correlations between data from fMRI analyses that look into the changes in the activity of regions of
interest in the brain, and data collected from computer models that describe whatever task the subject
was performing at the time of the fMRI. If a correlation between these two factors is
found, one can then ascertain that the brain areas focused on during the fMRI analysis are relevant
to the subject’s performance of the task that has been computer modeled. This is a tried and proven
framework for this type of work, and was therefore chosen to test our hypothesis.
3.4 The importance of the striatum when dealing with reward
prediction error
The paper ‘Temporal prediction errors in a passive learning task activate human striatum’ [16] de-
scribes a study that concluded that increased brain activity in the striatum region of the brain is associated
with unexpected rewards.
3.4.1 Temporal prediction errors in a passive learning task activate human
striatum
In ‘Temporal prediction errors in a passive learning task activate human striatum’ [16] the authors
document an experiment where humans were fed either water or juice, at predictable intervals in one
phase and at unpredictable intervals in another.
Figure 3.5 shows the striatum of a subject registering a positive BOLD response when faced with an
unexpected reward. In an RL model this situation would equate to a positive reward prediction error
which, when associated with a positive BOLD response, could be indicative of a correlation between these
two occurrences.
Figure 3.5: Activation of the striatum when the subject receives an unexpected reward.
4 Methods
Contents
4.1 Data Processing (ETL)
4.2 Q-learning parameters estimation
4.3 RPE estimation
4.4 Correlation Analysis
4.5 BOLD/RPE correlation and respective ANOVA
4.6 Defining a Q-learning Model
4.7 Boundaries for the η parameter
4.8 Empirical model testing
4.9 Testing the Q-learning implementation
4.10 Chi-square test to test confounding effects of Round Order
4.11 ANOVA design to test effects of Sex, Drug and Opponent
This section details the tasks conducted to study the postulated hypothesis. Section 4.1 describes
the ETL process used for data processing in this thesis. Section 4.2 describes the process by which
the parameters for the Q-learning models were estimated, while section 4.3 details the steps necessary
to estimate the RPE for each interaction. Sections 4.4 and 4.5 explain the correlation analysis that
was performed and how it was examined. Sections 4.6, 4.7, 4.8 and 4.9 describe the definition and
testing process of the Q-learning model used (and its parameters). Finally, section 4.10 details a
Chi-Square test that was performed and the way it affected the design of the ANOVA tests.
4.1 Data Processing (ETL)
To study the postulated hypothesis, it was necessary to process the fMRI data produced by Rilling’s
work [1] [2]. For that we used an ETL-like process.
Figure 4.1: ETL description scheme.
4.1.1 Extraction
The initial, raw data relevant to this project consisted of numerous .tar files. After a successful
extraction, all files were ordered by subject number, identifying the type of opponent played against
(human/computer) and whether the round was the subject’s first or second. After
this, the files were ready to be processed by FEAT.
4.1.2 Transformation
The transformation is the lengthiest part of the ETL process, as several operations are conducted during this step. First, checks were performed to verify that the event structure of each file was consistent with the rest. Violating this structure, for example by having events ordered differently, would
cause problems in later stages of the transformation process. Since in the original study the research team analyzed events that are outside the scope of this investigation (such as the reaction of a subject's brain to seeing that it is his turn to play), these had to be filtered out. For a later stage of the transformation process, it was also necessary to analyze the event files for relevant events and build the outcome sequence of each experiment (e.g. subject 002 in round 1 first observed a mutual cooperation outcome, followed by a mutual defection outcome, etc.), which determines the amount of reward a subject received in each round.
Rilling's research team grouped all occurrences of the same type (i.e. user notices mutual defection), effectively averaging the BOLD responses for each event. Since each occurrence of this type (an instance of a subject noticing the outcome of a round) produces a different RPE, by communicating to the subject a certain reward corresponding to his previous action, these events had to be broken down into individual occurrences.
Next, we needed to automate the application of FEAT over the data. To this end, FEAT allows the creation and reading of scripts through the command line that obey a specific text file format. This required creating a script that generates a FEAT script for each experiment. Each of those FEAT scripts analyzes all the relevant events from the experiment and maps the stimulation that the subject's brain was experiencing at each moment.
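Generating a design script per experiment amounts to filling a text template. The sketch below is illustrative: the `set fmri(...)` keys shown are only a small subset of a real FEAT design (.fsf) file, and the values are placeholders rather than the actual design used in this work.

```python
# Minimal .fsf-style template; a real FEAT design file contains many more
# fields (EV definitions, registration options, etc.).
FSF_TEMPLATE = """set fmri(outputdir) "{output_dir}"
set feat_files(1) "{input_file}"
set fmri(evs_orig) {n_events}
"""

def make_feat_script(output_dir, input_file, n_events):
    """Produce the text of one FEAT design script for one experiment."""
    return FSF_TEMPLATE.format(
        output_dir=output_dir, input_file=input_file, n_events=n_events
    )
```

One such script would be written per experiment and fed to FEAT from the command line.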
After the processing of the data by FEAT is concluded, we run a FeatQuery command that rotates and reshapes the mask to fit the brain structure of each specific subject. This mask is used in the next processing step to filter out all brain activity originating from areas of the brain that are not relevant to our hypothesis.
Finally, another script was created that takes all the data processed by FEAT and calls the necessary FeatQuery commands to extract the amount of stimulus each subject felt in their reward centers during each event. This information is presented in an HTML report from which the relevant information is extracted during the loading phase.
4.1.3 Loading
The loading phase starts by extracting, from the HTML reports and the action sequences produced for each subject during the extraction phase, the pairs of stimulus felt and occurrence that caused said stimulus. This information is then aggregated into a single .csv file (for each subject) along with the remaining, previously available information (subject number, round number, player order and opponent type). All the .csv files produced this way are then concatenated into a single .csv with the information from all participants.
With regard to extracting the BOLD response from the relevant areas of the brain during the experiment, we can say that at this point the data is fully processed. It is, however, still necessary
to identify each subject's gender and the drug they were administered. For this we crossed the subject information present in the aforementioned .csv file with two Excel files created by the team at IMM that contained each subject's gender, drug, round and action sequence (but not identity).
Some validations occurred at this stage. For example, the action sequence of a subject in one file must match the action sequence of the subject with the same number in the other. This held for all but a few subjects. The ones that showed a mismatch between the action sequences in the two files were analyzed, and most of them had an action sequence that matched that of a subject with a different number in the other file (e.g. subj002 from the fMRI data had the same action sequence as subj282 from the Excel file). After contacting a former IMM researcher who had also worked with Rilling's team, we were able to determine that some of these mistakes were caused by manual transcription of data and to correct them. Other subjects' data had to be removed due to inconsistencies that couldn't be corrected.
After validating and aggregating all the relevant data for all subjects and encoding the data numerically so that it can be more easily imported into MATLAB, the data loading process is concluded.
4.2 Q-learning parameters estimation
After importing the data to MATLAB we were ready to estimate the parameters for the Q-learning
algorithm that best define each subject. To that end we created a function that receives a set of Q-
learning parameters (section 4.6 goes further into these) and a sequence of actions and respective rewards,
outputting the negative log likelihood of that sequence being produced by those parameters (as in 4.1, where n is the number of rounds, 30 in the case of this study).

\[ -\log L = -\sum_{t=1}^{n} \log P(a_t) \tag{4.1} \]
Then, using a nonlinear solver (an Interior-Point algorithm enhanced by a MultiStart procedure), we find the Q-learning parameters that minimize the negative log likelihood. These are the parameters most likely to resemble the ones that produced the provided action sequence. This, again, assumes that the Q-learning algorithm is a good model for the decision process of subjects in this situation, which seems to be supported by other studies [3].
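The likelihood computation of 4.1 can be sketched as below. This is an illustrative Python version (the thesis implementation was written in MATLAB), assuming the one-α model of section 4.6 with actions encoded as 0 (cooperate) and 1 (defect):

```python
import math

def negative_log_likelihood(params, actions, rewards):
    """Negative log likelihood (eq. 4.1) of an action sequence under a
    one-alpha Q-learning model with a Boltzmann policy.
    `params` is (alpha, eta, q0_cooperate, q0_defect)."""
    alpha, eta, q0_c, q0_d = params
    q = [q0_c, q0_d]
    nll = 0.0
    for action, reward in zip(actions, rewards):
        # Boltzmann probability (eq. 4.4) of the action actually taken
        z = math.exp(eta * q[0]) + math.exp(eta * q[1])
        p = math.exp(eta * q[action]) / z
        nll -= math.log(p)
        # single-state Q-value update (eq. 4.5)
        q[action] += alpha * (reward - q[action])
    return nll
```

In the pipeline this value is minimized over the parameter space; in MATLAB we used an Interior-Point solver with MultiStart, while in Python one could hand this function to a general nonlinear optimizer.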
4.3 RPE estimation
After estimating the parameters for the Q-learning model of a subject, we are able to estimate the RPE sequence by running the algorithm as if it were learning for the first time, now with the estimated parameters. By doing this, the initial Q-values get updated at each step, simulating the updates that occurred in the subject's brain during the experiment when receiving that specific reward sequence.
Later in the process, this sequence of RPE’s will be correlated with the sequence of BOLD responses in
the brain to test the postulated hypothesis.
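As a sketch, the replay can be written as below (illustrative Python; it assumes a one-α model with parameters (α, η, Q0(c), Q0(d)) and actions encoded as 0 for cooperation and 1 for defection):

```python
def rpe_sequence(params, actions, rewards):
    """Replay a fitted Q-learning model over a subject's observed actions
    and rewards, collecting the reward prediction error of each round."""
    alpha, _eta, q0_c, q0_d = params
    q = [q0_c, q0_d]
    rpes = []
    for action, reward in zip(actions, rewards):
        rpe = reward - q[action]   # RPE as in eq. 4.6
        rpes.append(rpe)
        q[action] += alpha * rpe   # Q-value update as in eq. 4.5
    return rpes
```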
4.4 Correlation Analysis
Obtaining a positive correlation between the BOLD response and the RPE (or another Q-learning related metric) would indicate that there might be a link between brain activity in the reward centers and Q-learning models derived from the parameter fitting process described in the previous sections. A strong negative correlation would be unexpected but could also indicate the possible existence of a relation between brain activity and Q-learning models with parameters fitted to represent said activity.
If no correlation is found the experiment will not verify the postulated hypothesis. Studies show that
for certain tasks, regions of the brain like the left putamen might respond to positive RPE’s while not
responding to negative RPE’s [16]. Due to this, studying the correlation between only positive RPE’s and
the BOLD response was considered worthwhile. There is also the possibility of studying the correlation
between the rewards received and brain activity, since this would circumvent the estimated Q-learning
models and provide us with a different insight.
4.5 BOLD/RPE correlation and respective ANOVA
After completing all the previous steps, we are now able to calculate the correlation between the array of BOLD responses of each subject and its array of RPE's. For this we tested both Pearson's correlation and the Spearman correlation, since we don't necessarily expect a specifically linear correlation to be present, and another, more flexible relation may exist.
As shown in 4.2, the PCC is obtained by correlating the array R of RPE's experienced by the subject with the array F containing the BOLD responses of the reward centers recorded at the corresponding indices. There will be NN' samples if N represents the number of subjects and N' represents the number of rounds played by each subject.

\[ \rho(R, F) = \frac{1}{NN' - 1} \sum_{i=1}^{NN'} \left( \frac{R_i - \mu_R}{\sigma_R} \right) \left( \frac{F_i - \mu_F}{\sigma_F} \right) \tag{4.2} \]
Considering the definition of Pearson's correlation in 4.2, we can define our Spearman correlation as in 4.3, provided we define $R_R$ and $F_R$ as the ranked forms of arrays R and F, respectively, where the values of these arrays are ranked from lowest to highest (i.e. all values in both arrays are replaced by their ordinal values with respect to the array they belong to).
\[ r_s = \rho(R_R, F_R) \tag{4.3} \]
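A minimal Python sketch of definitions 4.2 and 4.3 follows (illustrative; the thesis computations were done in MATLAB, and ties in the ranking are broken by position here, which is a simplification of the usual midrank convention):

```python
import math

def pearson(r, f):
    """Sample Pearson correlation of two equal-length sequences (eq. 4.2)."""
    n = len(r)
    mu_r, mu_f = sum(r) / n, sum(f) / n
    sd_r = math.sqrt(sum((x - mu_r) ** 2 for x in r) / (n - 1))
    sd_f = math.sqrt(sum((y - mu_f) ** 2 for y in f) / (n - 1))
    cov = sum((x - mu_r) * (y - mu_f) for x, y in zip(r, f))
    return cov / ((n - 1) * sd_r * sd_f)

def ranks(values):
    """Ordinal ranks, lowest value ranked 1 (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(r, f):
    """Spearman correlation as the Pearson correlation of ranks (eq. 4.3)."""
    return pearson(ranks(r), ranks(f))
```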
After obtaining the correlation factor for each participant, we can now run an ANOVA in order to determine the effect of various independent variables on our correlation. The independent variables considered were the subject's sex, administered drug and opponent type.
4.6 Defining a Q-learning Model
Different types of Q-learning models were considered for the modelling of the choice process of the
participants. Due to the nature of the task we considered the Boltzmann Policy (as in 4.4) as an adequate
policy to represent the way humans choose between exploration of new possibilities and exploitation of
current knowledge.
\[ P(a) = \frac{e^{\eta Q(a)}}{\sum_{b \in A} e^{\eta Q(b)}}, \quad \eta \in [0, +\infty[ \tag{4.4} \]
Due to the nature of the PD setting, we modeled the problem as having only one state, since the environment does not change as the rounds progress. As a result, the estimate of the optimal next-action Q-value of the future state ($\gamma \max_b Q(y, b)$) is removed from the model's equation, which takes the simpler form noted in 4.5, while the RPE, the measure that we hypothesize may be related to the BOLD response in the striatum and that has been identified as a strong modeler of brain activity responsible for reward estimation [12], is defined as in 4.6.

\[ Q(a) = Q(a) + \alpha \, (r(a) - Q(a)), \quad \alpha \in \left]0, 1\right] \tag{4.5} \]

\[ \sigma = r(a) - Q(a) \tag{4.6} \]
As supported by other work [17], processes modelled successfully by Q-learning algorithms can sometimes fit behaviour patterns better if they allow for different α values for interactions with the environment that bring positive or negative RPE values (α+ and α− respectively). Due to this, the possibility of incorporating a second α value was taken into consideration.

Since we expected that the model fitting process might over-fit its parameters to the observed action sequences of the participants, we also considered assuming that each individual would start with initial Q-values equal to their optimal values (i.e. with perfect knowledge of the inner workings of the algorithm that controlled their opponent), in order to reduce model complexity.
The considered models are as follows:
• One α: Standard model with one α, one η to control the amount of exploratory behaviour and two
initial Q-values for both cooperation and defection.
• Two α: Model with one α+ for positive RPE’s, one α− for negative RPE’s, one η to control the
amount of exploratory behaviour and two initial Q-values for both cooperation and defection.
• One α, optimal initial Q-values: Same parameters as the one α model, except the initial Q-values
are assumed to be optimal.
• Two α, optimal initial Q-values: Same parameters as the two α model, except the initial Q-values
are assumed to be optimal.
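The difference between the one-α and two-α variants reduces to which learning rate is applied at each update, as the following illustrative Python sketch shows (Q-values stored as a two-element list indexed by action; setting both rates equal recovers the one-α model):

```python
def q_update(q, action, reward, alpha_pos, alpha_neg):
    """Update the Q-value of `action` in place, using alpha_pos for
    positive RPE's and alpha_neg for negative RPE's; returns the RPE."""
    rpe = reward - q[action]
    alpha = alpha_pos if rpe >= 0 else alpha_neg
    q[action] += alpha * rpe
    return rpe
```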
4.7 Boundaries for the η parameter
Some initial tests showed that the Negative Log Likelihood minimization algorithm would sometimes abuse the η parameter by setting very large values for it in order to increase determinism, thus creating models that overfit the available action sequences. Due to this, we analyzed the outcome of the likelihood function for different values and tried to determine a measure of “reasonable determinism” for it. We considered that having P(a) > 0.95 for any [Q(c), Q(d)] would constitute a close-to-deterministic situation, as the probability of choosing one action over the other is very high.
Then, comparing the percentage of deterministic outcomes that one value of η produces against another over the relevant domain of Q-values allowed us to compare the “determinism” of two policies with different η values.
Figure 4.2: Percentage of the search space with P (a) > 0.95 for each η
Figure 4.2 shows the percentage of the area of the likelihood function with a likelihood over 0.95 in the relevant domain (Q(c) ∈ [0, 2], Q(d) ∈ [1, 3]) for all η ∈ [0, 20].
We wanted to allow the algorithm to attribute a certain degree of determinism to a participant without allowing very deterministic policies in situations where both Q-values are close to one another. Taking this into account, we decided to search for an η value that created an area of deterministic decisions of around 50%. The surface area of deterministic outcomes for η = 3 is approximately 50.07%, so we defined the maximum value of the η parameter as 3.
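This determinism measure can be reproduced with a grid search over the Q-value domain, as in the Python sketch below (illustrative; the grid resolution is arbitrary and the exact percentage depends on it):

```python
import math

def deterministic_fraction(eta, p_threshold=0.95, steps=200):
    """Fraction of a grid over Q(c) in [0, 2] and Q(d) in [1, 3] where the
    two-action Boltzmann policy picks one action with probability above
    `p_threshold`."""
    count = 0
    for i in range(steps):
        for j in range(steps):
            qc = 2.0 * i / (steps - 1)
            qd = 1.0 + 2.0 * j / (steps - 1)
            # the two-action Boltzmann policy reduces to a sigmoid
            p_coop = 1.0 / (1.0 + math.exp(-eta * (qc - qd)))
            if max(p_coop, 1.0 - p_coop) > p_threshold:
                count += 1
    return count / steps ** 2
```

With η = 3 the fraction comes out near one half, in line with the ≈50% target discussed above.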
Figures 4.3 through 4.5 show the plots of P(cooperation) in the relevant domain for η ∈ {1.5, 3, 20}.
Figure 4.3: Plot of the P(cooperation) over all possible Q-values with η=1.5
Figure 4.4: Plot of the P(cooperation) over all possible Q-values with η=3
Figure 4.5: Plot of the P(cooperation) over all possible Q-values with η=20
4.8 Empirical model testing
Due to some characteristics of the Q-learning models, we had concerns that our methodology of model fitting was susceptible to over-fitting, meaning that it would be possible to get models with low Negative Log Likelihoods despite high errors in the estimated parameters. As mentioned in section 4.7, some preliminary tests showed that the optimization algorithm would sometimes attribute high values to the η parameter in order to over-fit specific action sequences.
This happened because, with only thirty rounds of subject interaction, behaviour might by pure chance appear deterministic even if the subject's decision policy was stochastic.

Another characteristic of our framework that could cause problems was the lack of independence between the parameters of our models. For example, a subject that starts by cooperating in the first two interactions but defects back after receiving nothing but defections for the entire experiment might display this behaviour because he had an unreasonable expectation that his opponent would not defect back (i.e. had a high Q0(C)) but quickly learned that this initial expectation was wrong (i.e. had a high value of α). It could also be the case that the aforementioned subject had a low α (i.e. slow learning speed) but only a slight expectation that his opponent would not defect against him (i.e. a moderately high Q0(C)), an expectation that was quickly corrected after two interactions despite the subject's low learning speed. This lack of independence between model parameters could also cause high parameter error, since there could be instances where models with wildly different parameters produce similar action sequences and, thus, have similar likelihoods of fitting these action sequences.
Due to these concerns, a series of empirical tests was performed. These tests followed the steps below:
• Creation of artificial test subjects.
• Generation of Action and Reward sequences for all artificial subjects.
• Estimation of the subjects’ parameters through Negative Log Likelihood minimization.
• Computing of the average, per subject and per parameter, percentage error between real parameters
and estimated parameters.
• Computing of the average Negative Log Likelihood.
4.8.1 Artificial test subjects
Several artificial subjects were created. These were represented by sets of parameters that defined their behaviour (i.e. α, η, Q0(c) and Q0(d)). For each parameter, several points were set, evenly spaced throughout its domain. Every combination of these parameters then generated a single artificial subject. For example, for a model with one α, one η and two Q-values, 144 subjects were created.
• α, α+ and α− ∈ [0.1, 0.3, 0.5, 0.9]
• η ∈ [0, 1, 1.5, 3]
• Q0(c) ∈ [0, 1, 2]
• Q0(d) ∈ [1, 1.5, 3]
4.8.2 Generation of Action and Reward sequences
After creating the various subjects' models, they were run through a function that simulated the experiment: each artificial subject played for 30 rounds against the same algorithm (implemented by us) that the real subjects faced in Rilling's work. The function would then output the action and reward sequences of the game just played.
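The simulation step can be sketched as follows. Two loud assumptions in this illustration: the opponent is modeled as tit-for-tat and the payoff values are made up for the example; the actual opponent algorithm and monetary payoffs are those of Rilling's task, which are not restated here.

```python
import math
import random

# Illustrative payoffs for the focal player, indexed by
# (my_action, opponent_action) with 0 = cooperate, 1 = defect.
PAYOFF = {(0, 0): 2.0, (0, 1): 0.0, (1, 0): 3.0, (1, 1): 1.0}

def simulate(params, n_rounds=30, seed=0):
    """Generate action and reward sequences for one artificial subject
    defined by (alpha, eta, q0_cooperate, q0_defect), playing against a
    tit-for-tat opponent."""
    alpha, eta, q0_c, q0_d = params
    q = [q0_c, q0_d]
    rng = random.Random(seed)
    opponent_action = 0  # tit-for-tat opens by cooperating
    actions, rewards = [], []
    for _ in range(n_rounds):
        # two-action Boltzmann policy reduces to a sigmoid
        p_coop = 1.0 / (1.0 + math.exp(-eta * (q[0] - q[1])))
        action = 0 if rng.random() < p_coop else 1
        reward = PAYOFF[(action, opponent_action)]
        q[action] += alpha * (reward - q[action])
        opponent_action = action  # tit-for-tat copies the subject's move
        actions.append(action)
        rewards.append(reward)
    return actions, rewards
```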
4.8.3 Estimation of subjects’ parameters
With the reward and action sequences, we were now able to infer the parameters of each model in order to test the parameter estimation function.
4.8.4 Percentage Error
The percentage error metric for each model is defined as in 4.7:

• Let $\Theta^x = (\theta^x_1, \theta^x_2, \ldots, \theta^x_n)$ be the array that contains all the experimental (estimated) parameters of the model.

• Let $\Theta^r = (\theta^r_1, \theta^r_2, \ldots, \theta^r_n)$ be the array that contains all the real parameters of the model.

• Let $\Delta = (\delta_1, \delta_2, \ldots, \delta_n)$ be the array that contains the δ values for $\Theta^r$, where $\delta_i$ is the difference between the maximum and minimum value of $\theta^r_i$'s domain (i.e. since $\alpha \in [0, 1]$, then $\delta_\alpha = 1$).

\[ E_\% = \frac{e_\%}{n} \tag{4.7} \]

\[ e_\% = \sum_{i=1}^{n} \frac{\left| \theta^x_i - \theta^r_i \right|}{\delta_i} \tag{4.8} \]
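Equations 4.7 and 4.8 translate directly into code; a small illustrative Python sketch:

```python
def percentage_error(estimated, real, domain_widths):
    """Mean normalized parameter error (eqs. 4.7 and 4.8): the absolute
    error of each parameter is scaled by the width of its domain and the
    result is averaged over the n parameters."""
    n = len(real)
    e = sum(abs(x - r) / d for x, r, d in zip(estimated, real, domain_widths))
    return e / n
```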
4.8.5 Test results
After running the test for all four models, the metrics showed the following results (table 4.1):
Model             | Mean RPE's Correlation | E%              | Mean Negative Log Likelihood
1 α               | 0.9055 ± 0.0061        | 0.3546 ± 0.0084 | 15.2974 ± 0.2446
1 α, optimal Q0   | 0.8920 ± 0.0032        | 0.3472 ± 0.0113 | 17.9174 ± 0.1860
2 α               | 0.8842 ± 0.0020        | 0.3476 ± 0.0018 | 15.4777 ± 0.0631
2 α, optimal Q0   | 0.8733 ± 0.0019        | 0.3329 ± 0.0039 | 18.2682 ± 0.0804

Table 4.1: Empirical test results for experiments with 30 rounds
Model             | Mean RPE's Correlation | E%              | Mean Negative Log Likelihood
1 α               | 0.9383 ± 0.0049        | 0.2872 ± 0.0102 | 55.7899 ± 0.6781
1 α, optimal Q0   | 0.9399 ± 0.0041        | 0.2968 ± 0.0051 | 59.9047 ± 0.3147
2 α               | 0.9194 ± 0.0014        | 0.3044 ± 0.0057 | 56.5325 ± 0.1624
2 α, optimal Q0   | 0.9162 ± 0.0016        | 0.2987 ± 0.0016 | 61.4934 ± 0.2329

Table 4.2: Empirical test results for 100 rounds
Interpreting these metrics, we can conclude that E% is around 34% for all models, while models with lower E% are not always associated with lower Negative Log Likelihoods. This means that the Negative Log Likelihood minimization function is not able to reliably infer the parameters of the model.
While this would be a problem if we depended solely on the accuracy of the parameters, we can still test our hypothesis provided that our model produces a sequence of RPE's that is strongly linearly correlated with the real sequence of RPE's. As can be seen in table 4.1, all models show high PCC's between the experimental and real RPE sequences (around 89%), which tells us that any correlation calculated between the BOLD response and the experimental RPE sequence will be very similar to the correlation between the BOLD response and the real RPE sequence for any individual.
Given this scenario, our top priority when choosing a model becomes choosing the one that provides the best experimental-real RPE correlation. According to the tendency illustrated in figure 4.6, we chose to model the behaviour of the participants with the one α model.
Figure 4.6: Experimental-real RPE correlation for various models
Consulting table 4.2, which shows the same empirical test as table 4.1 but with sequences of 100 plays instead of 30, we can see an increase in mean RPE correlation and a decrease in E%. This shows that increasing the number of times each participant plays in each round would provide data that would allow for better models.
4.9 Testing the Q-learning implementation
After defining the Q-learning model we ran a short test of our implementation to see if it showed
the expected learning potential. Fig. 4.7 shows a graph that details the learning process of a subject
regarding the Q-values for cooperation (Q(cooperation)) with the following parameters:
• α ≈ 0.3153
• η ≈ 1.9443
• Q0(cooperation) = 2
• Q0(defection) = 1
Figure 4.7: Graph showing the evolution of a subject’s Q-values for cooperation.
As can be seen, the Q-value maintains its initial value of 2 during the initial period, from iteration 0 to 5. After that, the rewards become more unstable and the Q-values orbit the average of the two possible rewards.
Fig. 4.8 shows a graph that details the learning process of a subject regarding the Q-values for
defection (Q(defection)) with the following parameters:
• α ≈ 0.4092
• η ≈ 2.6141
• Q0(cooperation) = 2
• Q0(defection) ≈ 1.5968
Figure 4.8: Graph showing the evolution of a subject’s Q-values for defection.
Again, the Q-values quickly converge to a stable reward and oscillate whenever the reward changes.
Fig. 4.9 shows the hypothetical Q-value trajectory that the subject from Fig. 4.7 would experience if its α value were artificially increased to 0.7.
Figure 4.9: Graph showing the evolution of a hypothetical subject’s Q-values for cooperation.
As can be seen, the Q-values chase the reward value much more aggressively since the α value was increased, showing that the RL algorithm is modeling the data as expected.
4.10 Chi-square test to test confounding effects of Round Order
After consulting the research team at IMM, we studied the possibility of removing from the ANOVA the variable RoundOrderClass, which identifies whether a subject played against a computer or a human in his first game, since this measure was not expected to provide useful insights into the data. It was important, however, to determine whether or not the variable had a confounding effect on other independent variables. For that, we performed a Chi-squared test to ensure it was safe to remove the RoundOrderClass variable.
Our Chi-squared test was performed under the null hypothesis “RoundOrderClass is independent of the remaining independent variables”.
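For a 2×2 cross-tabulation the Pearson chi-square statistic is easy to compute by hand, as the sketch below shows (illustrative Python; the counts used in the demonstration are hypothetical placeholders, not the real subject counts of figure 4.10, and the actual test was run in a statistics package):

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table given as
    [[a, b], [c, d]]; under independence it follows a chi-square
    distribution with 1 degree of freedom."""
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical counts of subjects cross-tabulated by Sex and RoundOrderClass.
stat = chi_square_2x2([[12, 11], [13, 12]])
independent = stat < 3.841  # 0.05 critical value, 1 degree of freedom
```

A statistic below the critical value means the null hypothesis of independence cannot be rejected.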
Figure 4.10: Table showing the distribution of subjects across dependent variables.
As seen in figure 4.10, the percentage of subjects that share the same RoundOrderClass is very similar across Sex and Drug groups, which is a strong indicator that it might be independent of these two variables.
Figure 4.11: Figure showing the results of the Chi-squared test.
Figure 4.11 shows that the significance levels for both values of the variable RoundOrderClass are well above 0.05, which doesn't allow us to reject the null hypothesis; in other words, we found no evidence that RoundOrderClass depends on any of the considered independent variables. This allows us to remove this variable from our design.
4.11 ANOVA design to test effects of Sex, Drug and Opponent
With the conclusions taken from our tests in sections 4.8 and 4.10, the between-subject factors (variables that differentiate subjects) considered in our ANOVA will be Sex and Drug, and the within-subject factor (a variable that measures changes in each subject over time) will be Opponent Type. The dependent variable will be a correlation between a measure taken from the Q-learning model (RPE or received Rewards) and the BOLD response elicited in the brain. It is important to note that the independent variable Opponent Type is considered a within-subject factor since all subjects performed an experiment while playing against each of the available opponent types (Human and Computer). Six different ANOVA's will be conducted, analyzing the Pearson's and the Spearman correlation between the BOLD response and another factor that will either be the RPE, positive-only RPE's or the Reward received at each time point during the experiment.
5 Results
Contents
5.1 Effects on the Pearson’s Correlation between RPE and the BOLD response 48
5.2 Effects on the Pearson’s Correlation between Reward and the BOLD re-
sponse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3 Effects on the Spearman Correlation between RPE and the BOLD response 51
5.4 Effects on the Spearman Correlation between Reward and the BOLD
response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.5 Effects on the Pearson’s Correlation between Positive RPE’s and the
BOLD response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.6 Effects on the Spearman Correlation between Positive RPE’s and the
BOLD response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.7 Results overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
In this section we elaborate on the results of the statistical tests performed on the data resulting from the correlation analysis between the BOLD response of the brain and several other measures obtained from our models. These statistical tests will determine whether our independent variables (subject Sex, administered Drug and Opponent Type) have any effect on the correlation factor (PCC or Spearman correlation) under study.
5.1 Effects on the Pearson’s Correlation between RPE and the
BOLD response
Firstly, we tested our main hypothesis, that there could be a correlation between the RPE and BOLD
response. We did this by analyzing the effects of the independent variables (i.e. subject Sex, administered
Drug and Opponent Type) on the dependent variable, the latter being the aforementioned correlation.
Figure 5.1: Table detailing interactions between the within subject factor and the between subjectfactors for RPE/BOLD Pearson’s Correlation.
As can be seen in figure 5.1 under the column "Sig." (significance), the within-subject factor Opponent Type and its interactions with the between-subject independent variables do not produce differences in the dependent variable with significance levels under 0.05. This means that we cannot reject the null hypothesis that the mean per-subject Pearson's correlation between the RPE and the BOLD response doesn't change as a function of these independent variable interactions.
Figure 5.2: Table detailing interactions between the between subject factors for RPE/BOLD Pear-son’s Correlation.
The table in figure 5.2 has the same layout as the one in figure 5.1 but instead describes the effects of the between-subject independent variables, both by themselves and when interacting with one another. Again, under the column "Sig." we can see that most variables and variable interactions don't produce statistically significant changes in the mean correlation between RPE and BOLD response, except for the independent variable Sex, which shows a p-value of 0.03. In figure 5.3 we can see this effect, as the mean PCC for female subjects is approximately 0.0042 while for male subjects it is -0.0331.
Figure 5.3: Main effect of the independent variable Sex on the RPE/BOLD Pearson’s Correlation
This shows that we can reject the null hypothesis that "Sex does not impact the Pearson's Correlation between RPE and BOLD response": males show a slight negative correlation while females show essentially no correlation.
5.2 Effects on the Pearson’s Correlation between Reward and
the BOLD response
Our second test analyzed the effects of the independent variables on the PCC between the Rewards received and the BOLD response felt by each subject each time a reward was received. This tested our second hypothesis, that there could be a correlation between received rewards and brain activation. This approach completely disregards the Q-learning model and focuses only on the input received by the subjects.
Figure 5.4: Table detailing interactions between the within subject factor and the between subjectfactors for Reward/BOLD Pearson’s Correlation.
As figure 5.4 shows under the column "Sig.", the within-subject factor Opponent Type and its interactions with the between-subject independent variables do not produce differences in the dependent variable with significance levels under 0.05. Again, we cannot reject the null hypothesis that the mean value of the dependent variable doesn't change as a function of these independent variable interactions.
Figure 5.5: Table detailing interactions between the between subject factors for Reward/BOLDPearson’s Correlation.
Looking at the table in figure 5.5, more specifically at the column "Sig.", we can see that no variables or variable interactions produce statistically significant changes in the mean correlation between rewards received and BOLD response. We maintain our null hypothesis that "Sex, administered Drug and Opponent Type have no effect on the Pearson's Correlation between received Rewards and the BOLD response".
5.3 Effects on the Spearman Correlation between RPE and the
BOLD response
For our third test we analyzed the effects of the independent variables on the Spearman correlation
between the RPE and the BOLD response felt by each subject during each experiment. This is our third
hypothesis and, while being similar to our first, tests the existence of a Spearman correlation to try and
find non-linear relationships between the RPE and brain activation.
Figure 5.6: Table detailing interactions between the within subject factor and the between subjectfactors for RPE/BOLD Spearman Correlation.
As can be seen in figure 5.6 under the column "Sig.", the within-subject factor Opponent Type and its interactions with the between-subject independent variables do not produce differences in the dependent variable with significance levels under 0.05. Again, we cannot reject the null hypothesis that the mean value of the dependent variable doesn't change as a function of these independent variable interactions.
Figure 5.7: Table detailing interactions between the between subject factors for RPE/BOLDSpearman Correlation.
As with the PCC between RPE's and BOLD responses (analyzed in section 5.1), we can see in figure 5.7 that the independent variable Sex is the only one that produces a statistically significant change in the mean Spearman correlation, allowing us to reject the null hypothesis with a p-value of 0.031. This shows that we can reject the hypothesis that "Sex does not impact the Spearman correlation between RPE and BOLD response".
Figure 5.8: Main effect of the independent variable Sex on RPE/BOLD Spearman Correlation
Figure 5.8 shows a decrease in Spearman correlation with Sex: female subjects have a correlation of 0.0018 while male subjects show a correlation of -0.033.
5.4 Effects on the Spearman Correlation between Reward and
the BOLD response
Our fourth test analyzed the effects of the independent variables on the Spearman correlation between the Rewards received and the BOLD response felt by each subject each time a reward was received. Again, as with our second hypothesis, this disregards the Q-learning model and solely looks for non-linear relationships between Rewards received and the BOLD response.
Figure 5.9: Table detailing interactions between the within subject factor and the between subjectfactors for Reward/BOLD Spearman Correlation.
Figure 5.9 shows us, under the column "Sig.", that the interaction between the within-subject factor Opponent Type and the between-subject factor Sex produces a statistically significant change in the mean Spearman correlation. This allows us to reject the null hypothesis that the mean value of the dependent variable doesn't change as a function of the interaction between Opponent Type and subject Sex.
Figure 5.10: Table detailing interactions between the between subject factors for Reward/BOLDSpearman Correlation.
Figure 5.10 shows that no between factor interaction produces a statistically significant change in
mean Spearman correlation.
Figure 5.11: Effect of the (Sex x Opponent Type) interaction on Reward/BOLD Spearman Cor-relation
Figure 5.11 details the interaction between Sex and Opponent Type. The statistically significant
difference occurs between the correlation in female subjects playing against a CPU opponent (where
they show an average Spearman correlation of 0.023) and in male subjects also playing against a CPU
opponent (where they show an average Spearman correlation of -0.045). Thus, we reject the null
hypothesis that "The interaction between subject Sex and Opponent Type does not impact the Spearman
Correlation between received Rewards and BOLD response".
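An interaction like this one can also be understood as a difference of differences: each subject contributes a CPU-minus-Human difference score, and the interaction asks whether that difference itself differs between the sexes. The sketch below illustrates this with hypothetical values; the thesis itself tested the interaction through the repeated-measures ANOVA shown in the figures.

```python
# Illustrative sketch only (hypothetical data): a (Sex x Opponent Type)
# interaction on a within-subject measure can be screened by comparing
# each subject's CPU-minus-Human difference score between Sex groups.
from scipy import stats

# Hypothetical per-subject Reward/BOLD Spearman correlations
female_cpu = [0.03, 0.02, 0.01, 0.04, 0.02]
female_human = [0.00, 0.01, -0.01, 0.02, 0.00]
male_cpu = [-0.05, -0.04, -0.06, -0.03, -0.05]
male_human = [0.00, -0.01, 0.01, 0.00, -0.01]

# Difference scores collapse the within-subject factor into one value
female_diff = [c - h for c, h in zip(female_cpu, female_human)]
male_diff = [c - h for c, h in zip(male_cpu, male_human)]

# A between-group t-test on the difference scores targets the interaction
t_stat, p_value = stats.ttest_ind(female_diff, male_diff)
print(f"interaction t = {t_stat:.2f}, p = {p_value:.4f}")
```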
5.5 Effects on the Pearson's Correlation between Positive RPEs
and the BOLD response
For our fifth test we analyzed the effects of the independent variables on the Pearson's correlation
between the Positive RPEs and the BOLD response measured in each subject each time a reward was received.
With this fifth hypothesis, we try to determine whether our brains respond differently when we are
positively surprised (positive RPEs).
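The quantity tested here can be sketched in a few lines: keep only the trials with a positive reward prediction error, then compute the Pearson correlation between those RPEs and the matching BOLD responses. All values below are hypothetical, for illustration only.

```python
# Hedged sketch of the dependent variable in this test (hypothetical data):
# restrict to positively surprising trials (RPE > 0), then correlate
# RPE magnitude with the BOLD response on those trials.
from scipy import stats

rpes = [1.2, -0.5, 0.8, -1.0, 0.3, 2.0, -0.2, 1.5]   # hypothetical per-trial RPEs
bold = [0.4, 0.1, 0.3, -0.2, 0.2, 0.7, 0.0, 0.5]     # matching BOLD responses

# Keep only positively surprising trials
pos = [(e, b) for e, b in zip(rpes, bold) if e > 0]
pos_rpes = [e for e, _ in pos]
pos_bold = [b for _, b in pos]

# Pearson's coefficient captures the linear relation on the retained trials
r, p = stats.pearsonr(pos_rpes, pos_bold)
print(f"Pearson r over {len(pos)} positive-RPE trials: {r:.3f}")
```

The sixth test (section 5.6) is the same construction with Spearman's coefficient in place of Pearson's, relaxing linearity to monotonicity.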
Figure 5.12: Table detailing interactions between the within subject factor and the between subject factors for Positive RPE/BOLD Pearson's Correlation.
Figure 5.12 shows, under the column "Sig.", that the within-subject factor Opponent Type produces a
statistically significant change in mean Pearson's correlation. This allows us to reject the null hypothesis
that the mean value of the dependent variable does not change as a function of Opponent Type.
Figure 5.13: Table detailing interactions between the between subject factors for Positive RPE/BOLD Pearson's Correlation.
Figure 5.13 shows that no between-subject factor or factor interaction produces a statistically
significant change in mean Pearson's correlation.
Figure 5.14: Main Effect of the independent variable Opponent Type on Positive RPE/BOLD Pearson's Correlation.
Figure 5.14 shows the effect of the independent variable Opponent Type on the dependent variable.
The statistically significant difference occurs between subjects playing against a Human opponent
(where they show an average PCC of 0.042) and subjects playing against a CPU opponent (where they show
an average PCC of -0.036). We therefore reject the null hypothesis that "The subject's Opponent Type
does not impact the Pearson's Correlation between Positive RPEs and BOLD response".
5.6 Effects on the Spearman Correlation between Positive RPEs
and the BOLD response
Finally, our last test analyzed the effects of the independent variables on the Spearman Correlation
between the Positive RPEs and the BOLD response measured in each subject each time a reward was received.
As with our fifth hypothesis, we try to determine whether our brains respond differently when we are
positively surprised (positive RPEs), this time looking for a monotonic, not necessarily linear,
relationship between Positive RPEs and BOLD.
Figure 5.15: Table detailing interactions between the within subject factor and the between subject factors for Positive RPE/BOLD Spearman Correlation.
As can be seen in figure 5.15 under the column "Sig.", neither the within-subject factor Opponent Type
nor its interactions with the between-subject independent variables produce differences in the dependent
variable at significance levels under 0.05. We cannot reject the null hypothesis that the mean value of
the dependent variable does not change as a function of these independent variables and their interactions.
Figure 5.16: Table detailing interactions between the between subject factors for Positive RPE/BOLD Spearman Correlation.
Looking at the table in figure 5.16, specifically at the column "Sig.", we can see that no variables
or variable interactions produce statistically significant changes in the mean Spearman correlation
between Positive RPEs and the BOLD response. We therefore maintain our null hypothesis that "Sex,
administered Drug and Opponent Type have no effect on the Spearman Correlation between Positive RPEs
and the BOLD response".
5.7 Results overview
Figure 5.17 presents a short overview of the obtained results.
Figure 5.17: Results overview table.
6 Discussion
Contents
6.1 Main effect of Subject Sex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 Main effect of Opponent Type . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3 Effect of the (Opponent Type x Subject Sex) interaction . . . . . . . . . . 59
6.4 Main effect of Drug . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
In this chapter we discuss the obtained results and the conclusions we draw from them.
6.1 Main effect of Subject Sex
Whether we analyze the effects of the independent variables on the Pearson's or the Spearman correlation
of the RPE with the BOLD response, there are differences between female and male subjects. The female
subjects show a near-zero correlation between these two factors, while male subjects show a negative
correlation (both Pearson's and Spearman). This difference could indicate that, in males, activation of
the ROI goes down when rewards surprise the subject in a positive way (positive RPE) and goes up when
they surprise the subject in a negative way (negative RPE). Previous work has shown increased reward
centre activation in men compared to women, particularly when receiving monetary rewards [2] [18].
While differences in activation are therefore expected, it is surprising to see correlations in opposite
directions. No difference between sexes was registered when analyzing only the positive RPEs. Our results
seem to indicate that, in men, the striatum increases in activity particularly when rewards fall below
expectations. This is surprising and should be studied further.
6.2 Main effect of Opponent Type
When analyzing the Pearson's Correlation between the Positive Reward Prediction Errors and the
activation of the ROI, there was a difference between subjects facing a Computer and subjects facing a
putative Human opponent. When facing a Human, subjects show a positive correlation between positive
RPEs and the BOLD response, while, when facing a computer, this correlation turns negative. This means
that activation of the ROI seems to be facilitated by the experience of positive RPEs when facing humans,
while the same experience against a computer produces the opposite effect. This is surprising, as
previous research seems to indicate the existence of neural structures in the striatum that process the
gain of social and non-social rewards in the same way [19].
6.3 Effect of the (Opponent Type x Subject Sex) interaction
When analyzing the Spearman Correlation between the Rewards received (regardless of the subject's
internal state) and the activation of the ROI, there is a difference between female and male subjects,
particularly when they are playing against a computer opponent. Again, the female subjects show a
near-zero correlation, while the male subjects show a negative correlation that could indicate a lowered
activation of the ROI when positive rewards are received. The same does not appear to occur when
the subjects think they are facing a human opponent. In that situation, neither male nor female subjects'
correlations differ significantly from zero. This indicates both that the difference detailed in section
6.2 mainly applies to male subjects and that the changes in correlation for men and women when changing
opponent type go in opposite directions, further accentuating the differences discussed in section 6.1.
6.4 Main effect of Drug
No main effects of any drug were measured in our analysis. Although Oxytocin has been shown to
affect the activation of the reward centre during both social and non-social learning [20], other studies
show that for non-social learning, Oxytocin seems to have no effect on the activation of the Nucleus
Accumbens (which is a part of the Striatum) [21]. This could explain the lack of apparent effect of the
drug in our analysis.
Bibliography
[1] J. K. Rilling, A. C. DeMarco, P. D. Hackett, R. Thompson, B. Ditzen, R. Patel, and G. Pagnoni,
“Effects of intranasal oxytocin and vasopressin on cooperative behavior and associated brain
activity in men,” Psychoneuroendocrinology, vol. 37, no. 4, pp. 447–461, 2012. [Online]. Available:
http://dx.doi.org/10.1016/j.psyneuen.2011.07.013
[2] J. K. Rilling, A. C. DeMarco, P. D. Hackett, X. Chen, P. Gautam, S. Stair, E. Haroon,
R. Thompson, B. Ditzen, R. Patel, and G. Pagnoni, “Sex differences in the neural
and behavioral response to intranasal oxytocin and vasopressin during human social
interaction,” Psychoneuroendocrinology, vol. 39, no. 1, pp. 237–248, 2014. [Online]. Available:
http://dx.doi.org/10.1016/j.psyneuen.2013.09.022
[3] Y. Niv, “Reinforcement learning in the brain,” Journal of Mathematical Psychology, vol. 53, no. 3,
pp. 139–154, 2009. [Online]. Available: https://www.princeton.edu/~yael/Publications/Niv2009.pdf
[4] D. Ferreira, M. Lopes, J. Rilling, M. Antunes, and D. Prata, “The impact of oxytocin and vasopressin
intake on Prisoner’s Dilemma strategy: a computational modelling approach,” 2017.
[5] National Institute of Standards and Technology, “Two-Sample t-Test for Equal Means,” 2013.
[Online]. Available: http://www.itl.nist.gov/div898/handbook/eda/section3/eda353.htm
[6] National Institute of Standards and Technology, "Critical Values of the Student's t Distribution," sec. 1.3.6.7.2,
2013. [Online]. Available: http://www.itl.nist.gov/div898/handbook/eda/section3/eda3672.htm
[7] E. Ostertagova and O. Ostertag, “Methodology and Application of Oneway ANOVA,” American
Journal of Mechanical Engineering, vol. 1, no. 7, pp. 256–261, 2013. [Online]. Available:
http://pubs.sciepub.com/ajme/1/7/21/index.html
[8] “N-Way ANOVA - MATLAB & Simulink.” [Online]. Available: https://www.mathworks.com/help/
stats/n-way-anova.html
[9] “Constrained Nonlinear Optimization Algorithms - MATLAB & Simulink.” [Online]. Available:
http://www.mathworks.com/help/optim/ug/constrained-nonlinear-optimization-algorithms.html
[10] R. J. Vanderbei, Linear Programming: Foundations and Extensions, 1998, vol. 49, no. 1. [Online].
Available: http://link.springer.com/10.1057/palgrave.jors.2600987
[11] M. Jenkinson, C. F. Beckmann, T. E. Behrens, M. W. Woolrich, and S. M. Smith, "FSL,"
NeuroImage, vol. 62, no. 2, pp. 782–790, Aug 2012. [Online]. Available:
http://linkinghub.elsevier.com/retrieve/pii/S1053811911010603
[12] N. D. Daw, Y. Niv, and P. Dayan, “Actions, policies, values and the basal gan-
glia,” Recent breakthroughs in basal ganglia research, no. February, pp. 91–106, 2005.
[Online]. Available: https://www.semanticscholar.org/paper/Actions-%2C-Policies-%2C-Values-%
2C-and-the-Basal-Ganglia-Daw-Niv/c9ee2d772062e7d0886ba5fc308a59a00862163e
[13] R. Clark-Elford, P. J. Nathan, B. Auyeung, V. Voon, A. Sule, U. Muller, R. Dudas, B. J. Sahakian,
K. L. Phan, and S. Baron-Cohen, “The effects of oxytocin on social reward learning in humans,”
International Journal of Neuropsychopharmacology, vol. 17, no. 2, pp. 199–209, 2014.
[14] B. B. Doll, K. G. Bath, N. D. Daw, and M. J. Frank, “Variability in Dopamine Genes Dissociates
Model-Based and Model-Free Reinforcement Learning,” Journal of Neuroscience, vol. 36, no. 4,
pp. 1211–1222, 2016. [Online]. Available: http://www.jneurosci.org/cgi/doi/10.1523/JNEUROSCI.
1901-15.2016
[15] J. P. O’Doherty, A. Hampton, and H. Kim, “Model-based fMRI and its application to reward learning
and decision making,” Annals of the New York Academy of Sciences, vol. 1104, pp. 35–53, 2007.
[16] S. M. McClure, G. S. Berns, and P. R. Montague, “Temporal prediction errors in a passive learning
task activate human striatum,” Neuron, vol. 38, no. 2, pp. 339–346, 2003. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0896627303001545
[17] M. J. Frank, A. A. Moustafa, H. M. Haughey, T. Curran, and K. E. Hutchison, “Genetic
triple dissociation reveals multiple roles for dopamine in reinforcement learning,” Proceedings of
the National Academy of Sciences, vol. 104, no. 41, pp. 16311–16316, 2007. [Online]. Available:
http://www.pnas.org/content/104/41/16311
[18] G. Alarcon, A. Cservenka, and B. J. Nagel, “Adolescent neural response to reward is related to
participant sex and task motivation,” Brain and Cognition, vol. 111, pp. 51–62, 2017.
[19] S. J. Wake and K. Izuma, “A common neural code for social and monetary rewards in the human
striatum,” Social Cognitive and Affective Neuroscience, vol. 12, no. 10, pp. 1558–1564, 2017.
[20] J. Hu, S. Qi, B. Becker, L. Luo, S. Gao, Q. Gong, R. Hurlemann, and K. M. Kendrick, “Oxytocin
selectively facilitates learning with social feedback and increases activity and functional connectivity
in emotional memory and reward processing regions,” Human Brain Mapping, vol. 36, no. 6, pp.
2132–2146, 2015.
[21] B. J. Mickey, J. Heffernan, C. Heisel, M. Pecina, D. T. Hsu, J. K. Zubieta, and
T. M. Love, “Oxytocin modulates hemodynamic responses to monetary incentives in
humans,” Psychopharmacology, vol. 233, no. 23-24, pp. 3905–3919, 2016. [Online]. Available:
http://dx.doi.org/10.1007/s00213-016-4423-6