Reinforcement Learning models of
neuropeptide-modulated human brain function
Luís Eduardo Moutinho Guerra
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisors: Prof. Manuel Fernando Cabido Peres Lopes
Co-supervisor: Dr. Diana Prata
Examination Committee
Chairperson: Prof. Luís Manuel Antunes Veiga
Supervisor: Prof. Manuel Fernando Cabido Peres Lopes
Member of the Committee: Prof. Pedro Tiago Gonçalves Monteiro
October 2018
Acknowledgments
Deciding to take my Master’s in this field was only due to my Bachelor’s teachers’ love for their craft.
Aspiring to branch out of my field due to my girlfriend’s advice ended up being very fulfilling.
Verifying my sanity was a task relegated to my mother and sister.
I couldn’t have done it without the support from all my friends at IST who soldiered alongside me.
Dampening my mood was only stopped by my hometown friends who bravely endured the dnd drought.
On-line friends made their support count and for that I must reward them. Pois.
Partaking in this journey really tested my limits and I couldn’t have made it without you.
E muito.
Abstract
The way alterations in the chemistry of the human brain affect social interactions is still not fully understood, and deepening our knowledge in this field could allow us to create novel medical therapeutics for a variety of diseases. Various Reinforcement Learning algorithms have been used to model learning processes in both animals and humans. This thesis studies the relation between the activation of the Reward Centers of the human brain and specific parameters of a Reinforcement Learning algorithm known as Q-learning, used here as a model for the learning process of an individual playing an iterated Prisoner's Dilemma-style social game for monetary rewards. This relation is tested and compared between subject groups administered, by means of intranasal spray, either a placebo, Oxytocin, or Vasopressin. Subjects are adults of both genders, aged 20 to 40 years, grouped by gender during the experiments.
Keywords
Q-learning; Prisoner’s Dilemma; fMRI; Reinforcement Learning;
Resumo
The way alterations in brain chemistry affect our social interactions is not yet fully understood, and deepening our knowledge in this area may allow us to create new therapeutics for various diseases. Several Reinforcement Learning algorithms have been used to model learning processes in both animals and humans. This thesis focuses on the study of the relation between the activation of the Reward Centers of the human brain and specific parameters of a Reinforcement Learning algorithm, known as Q-learning, used as a model for the learning process of an individual playing an iterated social game similar to the famous Prisoner's Dilemma. This relation is tested and compared between groups of participants who were administered, by means of intranasal spray, doses of a Placebo, Oxytocin, or Vasopressin. Participants are adults of both sexes, aged between twenty and forty years, who were grouped by sex during the experiments.
Keywords
Q-learning; Prisoner's Dilemma; fMRI; Reinforcement Learning;
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 4
2.1 Prisoner’s Dilemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Exploration Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.1 ε–greedy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.4.2 Boltzmann policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Two Sample T-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 One-way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.7 N-way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.8 Interior Point Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.9 FSL, FEAT and FeatQuery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.9.1 FEAT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.9.2 FeatQuery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.10 Pearson’s Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.11 Spearman Rank Order Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3 Related Work 20
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Base Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 Effects of intranasal Oxytocin and Vasopressin on cooperative behavior and asso-
ciated brain activity in men . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 Sex differences in the neural and behavioral response to intranasal oxytocin and
vasopressin during human social interaction . . . . . . . . . . . . . . . . . . . . . . 25
3.3 The validity of modeling brain processes with Reinforcement Learning (RL) . . . . . . . . 25
3.3.1 Reinforcement learning in the brain . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.2 Actions, Policies, Values, and the Basal Ganglia . . . . . . . . . . . . . . . . . . . 27
3.3.3 Model-based fMRI and its application to Reward Learning and Decision making . 28
3.4 The importance of the striatum when dealing with reward prediction error . . . . . . . . . 28
3.4.1 Temporal prediction errors in a passive learning task activate human striatum . . 28
4 Methods 30
4.1 Data Processing (Extraction, Transformation, Loading (ETL)) . . . . . . . . . . . . . . . 31
4.1.1 Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.2 Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.3 Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Q-learning parameters estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Reward Prediction Error (RPE) estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.4 Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.5 Blood-oxygen-level dependent (BOLD)/RPE correlation and respective Analysis of vari-
ance (ANOVA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.6 Defining a Q-learning Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.7 Boundaries for the η parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.8 Empirical model testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.8.1 Artificial test subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.8.2 Generation of Action and Reward sequences . . . . . . . . . . . . . . . . . . . . . . 40
4.8.3 Estimation of subjects’ parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.8.4 Percentage Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.8.5 Test results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.9 Testing the Q-learning implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.10 Chi-square test to test confounding effects of Round Order . . . . . . . . . . . . . . . . . 44
4.11 ANOVA design to test effects of Sex, Drug and Opponent . . . . . . . . . . . . . . . . . . 46
5 Results 47
5.1 Effects on the Pearson’s Correlation between RPE and the BOLD response . . . . . . . . 48
5.2 Effects on the Pearson’s Correlation between Reward and the BOLD response . . . . . . . 50
5.3 Effects on the Spearman Correlation between RPE and the BOLD response . . . . . . . . 51
5.4 Effects on the Spearman Correlation between Reward and the BOLD response . . . . . . 52
5.5 Effects on the Pearson’s Correlation between Positive RPE’s and the BOLD response . . 54
5.6 Effects on the Spearman Correlation between Positive RPE’s and the BOLD response . . 56
5.7 Results overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6 Discussions 58
6.1 Main effect of Subject Sex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 Main effect of Opponent Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3 Effect of the (Opponent Type x Subject Sex) interaction . . . . . . . . . . . . . . . . . . . 59
6.4 Main effect of Drug . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
List of Figures
2.1 Prisoner’s Dilemma (PD) punishment distribution matrix . . . . . . . . . . . . . . . . . . 5
2.2 Mask that identifies a ROI around the left amygdala, one of the constituents of the reward
system of the subject’s brain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 FEAT’s Miscellaneous menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 FEAT’s Data menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 FEAT’s Stats menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 FEAT’s Post-stats menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.7 FEAT’s Full model setup menu, events tab . . . . . . . . . . . . . . . . . . . . . . . 15
2.8 FEAT’s Full model setup menu, contrasts tab . . . . . . . . . . . . . . . . . . . . . 16
2.9 The FeatQuery menu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.10 Two associated variables (A and B) and their ranked versions . . . . . . . . . . . 18
2.11 Plot of variables A and B (unranked) and their respective trendline compared
to a plot of their ranked counterparts. . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1 Overview table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Time-line for one round of PD regardless of the nature of the opponent . . . . 22
3.3 PD payoff matrix for the game performed in Rilling’s study [1] . . . . . . . . . . 23
3.4 A monkey’s neuro-conditioning to a sound (CS) followed by being fed juice
(US) in instants a) b) and c) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.5 Activation of the striatum when the subject receives an unexpected reward. . 29
4.1 ETL description scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Percentage of the search space with P (a) > 0.95 for each η . . . . . . . . . . . . . 36
4.3 Plot of the P(cooperation) over all possible Q-values with η=1.5 . . . . . . . . . . . . 37
4.4 Plot of the P(cooperation) over all possible Q-values with η=3 . . . . . . . . . . . . . 38
4.5 Plot of the P(cooperation) over all possible Q-values with η=20 . . . . . . . . . . . . 38
4.6 Experimental-real RPE correlation for various models . . . . . . . . . . . . . . . . 41
4.7 Graph showing the evolution of a subject’s Q-values for cooperation. . . . . . . 42
4.8 Graph showing the evolution of a subject’s Q-values for defection. . . . . . . . . 43
4.9 Graph showing the evolution of a hypothetical subject’s Q-values for cooperation. 44
4.10 Table showing the distribution of subjects across dependent variables. . . . . . 45
4.11 Figure showing the results of the Chi-squared test. . . . . . . . . . . . . . . . . . . 46
5.1 Table detailing interactions between the within subject factor and the between
subject factors for RPE/BOLD Pearson’s Correlation. . . . . . . . . . . . . . . . 48
5.2 Table detailing interactions between the between subject factors for RPE/BOLD
Pearson’s Correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3 Main effect of the independent variable Sex on the RPE/BOLD Pearson’s
Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.4 Table detailing interactions between the within subject factor and the between
subject factors for Reward/BOLD Pearson’s Correlation. . . . . . . . . . . . . . . 50
5.5 Table detailing interactions between the between subject factors for Reward/BOLD
Pearson’s Correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.6 Table detailing interactions between the within subject factor and the between
subject factors for RPE/BOLD Spearman Correlation. . . . . . . . . . . . . . . . 51
5.7 Table detailing interactions between the between subject factors for RPE/BOLD
Spearman Correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.8 Main effect of the independent variable Sex on RPE/BOLD Spearman Corre-
lation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.9 Table detailing interactions between the within subject factor and the between
subject factors for Reward/BOLD Spearman Correlation. . . . . . . . . . . . . . 52
5.10 Table detailing interactions between the between subject factors for Reward/BOLD
Spearman Correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.11 Effect of the (Sex x Opponent Type) interaction on Reward/BOLD Spearman
Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.12 Table detailing interactions between the within subject factor and the between
subject factors for Positive RPE/BOLD Pearson’s Correlation. . . . . . . . . . . 54
5.13 Table detailing interactions between the between subject factors for Positive
RPE/BOLD Pearson’s Correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.14 Main Effect of the independent variable Opponent Type on Positive RPE/BOLD
Pearson’s Correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.15 Table detailing interactions between the within subject factor and the between
subject factors for Positive RPE/BOLD Spearman Correlation. . . . . . . . . . 56
5.16 Table detailing interactions between the between subject factors for Positive
RPE/BOLD Spearman Correlation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.17 Results overview table. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
List of Tables
2.1 Pearson’s Correlation Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.1 Empirical test results for experiments with 30 rounds . . . . . . . . . . . . . . . . . . . . . 40
4.2 Empirical test results for 100 rounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Acronyms
RL Reinforcement Learning
IMM Instituto de Medicina Molecular
PD Prisoner’s Dilemma
ROI Region of Interest
PCC Pearson’s Correlation Coefficient
RPE Reward Prediction Error
OT Oxytocin
AVP Vasopressin
fMRI Functional Magnetic Resonance Imaging
BOLD Blood-oxygen-level dependent
ETL Extraction, Transformation, Loading
ANOVA Analysis of variance
1 Introduction
Contents
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1 Motivation
Our goal is to model the learning and decision processes that occur in the human brain during a Prisoner's Dilemma style game, using a Reinforcement Learning (RL) algorithm known as Q-learning. These models then let us correlate the intensity of the Blood-oxygen-level dependent (BOLD) response registered in the reward centers of each individual's brain with various measures taken from the RL models.
1.2 Introduction
As a social species, we constantly interact with one another to achieve our goals. Such interactions allow us to coordinate efforts toward goals we could not achieve alone. This puts us, however, under the constant burden of resisting the temptation to engage in anti-social behavior that could bring us great momentary gains at the expense of our peers, who would, from that moment on, be less inclined to interact positively with us. Pro-social behavior brings trust and long-term stability, while anti-social behavior breeds distrust and creates problems that are often only noticeable in the distant future. It is therefore vital for the long-term success of any society that cooperation be seen as the default behavior. To that end, we collectively praise individuals who interact in cooperative ways (charity donors, social volunteers, etc.) while looking down on those who seek to further their ambitions at the expense of others (criminals in general).
In a society with such a social setup, it is very important for each individual to be capable of judging others' actions and acting accordingly. Failing to do so might lead that individual to overestimate negative actions directed at him, and react in an over-aggressive manner, or to underestimate positive actions, making him look unappreciative. While these errors in judgment affect everyone, regardless of mental health, patients who suffer from certain conditions can be unable to understand the intentions of the people they interact with to a point that it negatively impacts their lives. It is known that the amount of stimulation certain areas of the brain receive strongly influences the way an individual responds in situations where that part of the brain is used. It is also known that certain drugs can be effective at influencing the way certain regions of the brain interact with certain chemicals, and this is the basis for most work in the field of neurology. As such, a lot of work has been put into determining which substances could affect the parts of the brain responsible for our social behavior. Studies by James Rilling [1] [2] found links between increased activity in various regions of the brain that are part of our reward systems and the administration of various drugs that made the test subjects act in more or less cooperative ways. This aligns with the idea that our actions during social interaction are at least partially dictated by the rewards we perceive we are getting from them, which means that pathologies deriving from a diminished ability to correctly perceive received rewards, or to realistically estimate potential rewards, could be corrected by applying these substances in a specific way.
As shown by Yael Niv [3], RL models are good approximations of the decision-making processes that occur in our brain, as they emulate the reward-based and model-free nature of how we discover our environment in certain situations. This means that any drug-induced changes to an individual's decision process should manifest in the RL models estimated by observing the evolution of their actions in different situations.
This work was conducted in parallel with another study, performed by a research team at Instituto de Medicina Molecular (IMM), that seeks to relate the administration of certain substances to human individuals' performance in a social game. The team is working on a paper, currently being written [4], that relates RL models of human behavior during a Prisoner's Dilemma styled game to the types of strategies individuals tend to follow when playing.
We analyzed data resulting from prior brain examinations and tried to determine whether certain regions of the brain, constituents of the reward center, play a relevant part in the way we perceive and act on the current state of a social interaction.
We analyzed data from both of Rilling's studies [1] [2]. This data relates to an experiment conducted by the author's team in which human test subjects were examined with Functional Magnetic Resonance Imaging (fMRI) while they played repeated rounds of a Prisoner's Dilemma style game against computers and against computers that were perceived as humans. This involved:
• Extracting all the data produced in both of Rilling's studies [1] [2].
• Transforming the data so that the association between each action sequence and each reward-system activation sequence becomes evident for each subject.
• Processing the fMRI data so that individual reward center stimuli can be matched with specific events or actions along the experiment time frame.
• Using parameter fitting to fit the parameters of a Q-learning model to the action sequences performed by the test subjects.
• Studying the correlation between the amount of stimulation found in the reward centers of the brain and measures resulting from the Q-learning models.
2 Background
Contents
2.1 Prisoner’s Dilemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Exploration Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Two Sample T-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6 One-way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.7 N-way ANOVA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.8 Interior Point Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.9 FSL, FEAT and FeatQuery . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.10 Pearson’s Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.11 Spearman Rank Order Correlation . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1 Prisoner’s Dilemma
The PD is a thought experiment that places two prisoners in separate interrogation rooms. Each prisoner is told that if he incriminates his partner in the crime they are both accused of committing together, he walks free, provided the partner does not incriminate him back; in that case, the partner serves three years in jail. Conversely, if the prisoner remains silent while his partner incriminates him, he serves the three years and the partner walks free. If both incriminate one another, they serve two years each, but if both remain silent they serve only one year each.
Figure 2.1: PD punishment distribution matrix
The punishment distribution (Figure 2.1) across all possible outcomes is what makes this problem so interesting: it creates a Nash Equilibrium (a state in which the two participants in a game follow strategies from which neither is incentivised to deviate) around the outcome of mutual defection, since no matter what the partner decides to do, the prisoner contemplating the choice always stands to gain an immediate advantage by defecting. This suggests that if two rational players were to face each other in a PD-style game, they would always defect. However, in the real world we observe many instances of rational individuals identifying cooperation as the most advantageous move, since they find they can cultivate trust in others. The PD and other dilemmas like it have inspired new fields of study that seek to explain and model real-world behaviors, as well as to find strategies that encourage people to make more socially responsible choices.
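The best-response argument above can be written out in a few lines of Python. This is only an illustrative sketch: the jail terms come from the dilemma as described, while the encoding of the choices and the dictionary layout are our own.

```python
# Years in jail for (my_choice, partner_choice); lower is better.
# Values taken from the dilemma described above.
SILENT, DEFECT = 0, 1
years = {
    (SILENT, SILENT): 1,  # both stay silent
    (SILENT, DEFECT): 3,  # I stay silent, partner defects
    (DEFECT, SILENT): 0,  # I defect, partner stays silent
    (DEFECT, DEFECT): 2,  # both defect
}

def best_response(partner_choice):
    """Return the choice that minimises my jail time given the partner's choice."""
    return min((SILENT, DEFECT), key=lambda my: years[(my, partner_choice)])

# Defecting is the best response no matter what the partner does,
# so mutual defection is the Nash Equilibrium.
assert best_response(SILENT) == DEFECT
assert best_response(DEFECT) == DEFECT
```

Since defecting minimises jail time against either partner choice, neither prisoner benefits by unilaterally deviating from mutual defection.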
2.2 Reinforcement Learning
Reinforcement learning is a machine learning technique used to build agents that learn as they interact with their environment, becoming progressively better at performing their task. The learning process is iterative: in each iteration, the agent chooses an action based on the information gathered from previous interactions with the section of the environment it is currently in, so the agent learns the intricacies of the environment in a "piece-by-piece" fashion. These environments, or state spaces, are suited to being modeled as Markov Decision Processes (MDPs), since most RL algorithms expect discrete state spaces and agent-environment interactions that follow the Markov Property.
Upon choosing an action for the current state, the agent observes "feedback" from the environment regarding the quality of the action taken. This feedback may or may not align with the agent's expectations. If the agent was expecting a different outcome, it changes its expectations so that, in future interactions, it is better able to choose the most advantageous action.
After a sufficient number of interactions, the agent will ideally have visited every possible state in the state space often enough to have tried all available actions a sufficient number of times. This should give it the ability to decide which action to take no matter which state it finds itself in.
2.3 Q-Learning
Q-learning is an off-policy RL algorithm that learns the optimal policy for navigating any finite state space (usually stored in the form of a matrix), provided it has the opportunity to reach each state a sufficient number of times. The update of the Q-values is given by equation 2.1: the Q-value for state x and action a is obtained by adding to its current value a learning rate α multiplied by the sum of the recently acquired reward, the estimate of the reward that would be acquired by following a greedy policy from the next state (weighted by the discount factor γ), and the negative of the Q-value's current value.

Q(x,a) = Q(x,a) + α(r(x,a) + γ max_b Q(y,b) − Q(x,a)),   α, γ ∈ ]0, 1]   (2.1)

σ = r(x,a) + γ max_b Q(y,b) − Q(x,a)   (2.2)

The expression in equation 2.2 is often referred to as the RPE (Reward Prediction Error), as it represents the error between the reward received and the one the agent expected to receive. In summary, Q-learning works by iteratively adding the RPE observed for each state-action pair to the previously estimated Q-value, weighting the RPE and the maximum future action reward by an α, γ ∈ ]0, 1] pair.
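A minimal sketch of the update rule in equations 2.1 and 2.2, assuming a single-state game with cooperate/defect actions; the variable names and data layout are illustrative, not taken from the thesis's actual implementation:

```python
def q_update(Q, x, a, r, y, alpha, gamma):
    """One Q-learning step: Q(x,a) += alpha * RPE, with
    RPE = r + gamma * max_b Q(y,b) - Q(x,a)  (equations 2.1 and 2.2)."""
    rpe = r + gamma * max(Q[y].values()) - Q[x][a]
    Q[x][a] += alpha * rpe
    return rpe

# Single-state game 's' with actions 'C' (cooperate) and 'D' (defect).
Q = {'s': {'C': 0.0, 'D': 0.0}}
rpe = q_update(Q, 's', 'C', r=2.0, y='s', alpha=0.5, gamma=0.9)
# With all Q-values at zero, the whole reward is "surprising": RPE = 2.0,
# and Q('s','C') moves halfway toward it (alpha = 0.5), becoming 1.0.
```

Repeated calls shrink the RPE as the Q-value converges toward the expected return, which is exactly the iterative error-correction described above.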
2.4 Exploration Methods
To guarantee that the Q-learning algorithm visits a sufficiently diverse set of states, the policy it follows must not be purely greedy. Many methods, some simpler than others, exist to guarantee such diversity.
2.4.1 ε–greedy
In the ε–greedy policy, the agent executes the action indicated by a greedy policy with probability 1 − ε. With probability ε, the agent performs an exploration step, selecting one of the non-optimal actions at random with a uniform probability distribution over them. This ensures that the algorithm balances exploring the world for new knowledge against exploiting the already accumulated knowledge to generate value.
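As a sketch, the ε–greedy rule as described above (exploring uniformly among the non-optimal actions) could look like the following; the dictionary-of-Q-values interface is our own choice:

```python
import random

def epsilon_greedy(Q_state, eps):
    """Pick the greedy action with probability 1 - eps; otherwise explore by
    choosing uniformly among the remaining (non-optimal) actions."""
    greedy = max(Q_state, key=Q_state.get)
    if random.random() < eps:
        others = [a for a in Q_state if a != greedy]
        return random.choice(others) if others else greedy
    return greedy

# eps = 0 always exploits; eps = 1 always explores.
assert epsilon_greedy({'C': 1.0, 'D': 0.2}, eps=0.0) == 'C'
assert epsilon_greedy({'C': 1.0, 'D': 0.2}, eps=1.0) == 'D'
```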
2.4.2 Boltzmann policy
The Boltzmann policy functions similarly to ε–greedy in that it deviates from choosing the optimal action with a certain probability. Its sophistication lies in the fact that the probability of choosing any specific action is proportional to the value the RL algorithm currently assigns to it (in this case a Q-value).

P(a|x) = e^(ηQ(x,a)) / Σ_{b∈A} e^(ηQ(x,b)),   η ∈ [0, +∞[   (2.3)

As can be seen in 2.3, the probability that an agent following the Boltzmann policy chooses action a in state x is given by a fraction whose numerator e^(ηQ(x,a)) grows with the Q-value of (a | x) and whose denominator Σ_{b∈A} e^(ηQ(x,b)) sums e^(ηQ(x,b)) over all actions available in x. In this way, the probability of taking a certain action increases as its Q-value increases, making the balance between exploration and exploitation change dynamically as exploration progresses. The value of η controls how greedily the policy behaves: the higher the value of η, the greedier the policy.
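Equation 2.3 can be sketched directly as a softmax over Q-values. The two-action example below (our own toy values) illustrates how η interpolates between uniform exploration (η = 0) and near-greedy behavior (large η):

```python
import math

def boltzmann_probs(Q_state, eta):
    """P(a|x) = exp(eta*Q(x,a)) / sum_b exp(eta*Q(x,b))  (equation 2.3)."""
    weights = {a: math.exp(eta * q) for a, q in Q_state.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

Q_state = {'C': 1.0, 'D': 0.0}
p_uniform = boltzmann_probs(Q_state, eta=0.0)   # eta = 0: ignores Q-values
p_greedy = boltzmann_probs(Q_state, eta=20.0)   # large eta: near-greedy
assert abs(p_uniform['C'] - 0.5) < 1e-9
assert p_greedy['C'] > 0.999
```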
2.5 Two Sample T-Test
In most branches of scientific work it is necessary to test the validity of experimental results. For this purpose, statistical testing tools are used; one of the best known is the Two Sample T-test [5]. The Two Sample T-test determines whether the means of the two populations from which two samples were extracted (one from each population) actually differ, or whether the difference observed in the samples is caused by statistical noise. As in other hypothesis testing techniques, a null hypothesis H0 must be defined such that rejecting (or failing to reject) it provides relevant information about the statistical validity of the results.
In cases where a comparison between populations must be made (e.g. a population that took a placebo against a population that took some experimental medication), a Two Sample T-test serves us well by indicating whether the two populations are significantly different from each other (meaning the medication had an effect). For this, H0 must state that both populations have equal means, as in 2.4.
H0 : µ1 = µ2 (2.4)
The Two Sample T-test defines a test statistic, given in 2.5, that is used to determine whether H0 should or should not be rejected, where x̄ and ȳ are the means of the samples taken from populations 1 and 2 respectively, s_x and s_y are the corresponding sample standard deviations, and n and m are the respective sample sizes.

T = (x̄ − ȳ) / √(s_x²/n + s_y²/m)   (2.5)
After determining the value of T, one can determine whether H0 should be rejected by following 2.6, where t(1 − α/2, v) corresponds to the critical value of a t-distribution [6] at significance level α (usually equal to 0.05) with v degrees of freedom, where v is given by 2.7.

|T| > t(1 − α/2, v)   (2.6)

v = (s_x²/n + s_y²/m)² / [ (s_x²/n)² / (n − 1) + (s_y²/m)² / (m − 1) ]   (2.7)
If 2.6 turns out to be true, then H0 is rejected, meaning the two distributions are likely to have different means and the results are statistically significant. If 2.6 turns out to be false, then H0 is not rejected, meaning the results are not statistically significant.
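As an illustration, the statistic in 2.5 and the degrees of freedom in 2.7 can be computed with Python's standard library; the sample data below are arbitrary toy values, not data from the study.

```python
from statistics import mean, variance

def welch_t(x, y):
    """Two Sample T statistic and degrees of freedom v (equations 2.5 and 2.7);
    variance() is the sample variance, i.e. s^2."""
    n, m = len(x), len(y)
    vx, vy = variance(x) / n, variance(y) / m   # s_x^2/n and s_y^2/m
    T = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    v = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return T, v

T, v = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
# H0 is rejected when |T| exceeds the critical value t(1 - alpha/2, v).
```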
2.6 One-way ANOVA
When comparing independent samples from three or more populations, One-way ANOVA [7], or One-way Analysis Of Variance, provides the capability of performing statistical tests on the validity of experimental results. As with the Two Sample T-test, a null hypothesis H0 must be defined, stating that the means of all populations are equal, as in 2.8.
H0 : µ1 = µ2 = ... = µk (2.8)
The One-way ANOVA defines its test statistic 2.9 as a function of the values of MSC and MSE.

F = MSC / MSE   (2.9)
MSE, the Mean of the Squared Errors between samples of the same sample group, is computed
as in 2.10, where SSE corresponds to the Sum of Square Errors between all samples of each
sample group, N corresponds to the total number of samples across all sample groups and k corresponds to
the number of sample groups. Similarly, MSC, the Mean of the Squared Errors between
the means of all the sample groups, is computed as in 2.11, where SSC, the Sum of Square Errors
between the means of all the sample groups, must first be determined.
MSE = SSE / (N − k)    (2.10)

MSC = SSC / (k − 1)    (2.11)
SSE can be computed as in 2.12, where x_ij corresponds to the value of the j-th sample of the i-th
sample group and x̄_i corresponds to the mean of the i-th sample group. Similarly, SSC can be computed as
in 2.13, where x̄ corresponds to the mean value of all the samples across all the sample groups, as in 2.14.
SSE = Σ_{i=1..k} Σ_{j=1..n_i} (x_ij − x̄_i)²    (2.12)

SSC = Σ_{i=1..k} Σ_{j=1..n_i} (x̄_i − x̄)²    (2.13)

x̄ = (1/N) Σ_{i=1..k} Σ_{j=1..n_i} x_ij    (2.14)
After computing F the statistical test can be performed as in 2.15, where F(1−α, k−1, N−k) corresponds
to the critical value of an F-distribution at significance level α (usually equal to 0.05) with k−1
and N−k degrees of freedom.
F > F(1−α, k−1, N−k)    (2.15)
If 2.15 turns out to be true then H0 is rejected, meaning that at least one of the populations is likely
(with confidence 1−α) to be significantly different from the others.
2.7 N-way ANOVA
When analyzing a problem where one dependent variable may or may not be influenced by a group
of independent variables and their interactions, the N-way ANOVA [8] is one of the most used tools to
study these possibilities. This scenario generates numerous null hypotheses, since not only is the effect
of each independent variable by itself on the dependent variable studied, but the interactions among
the independent variables might also have an effect on the dependent variable. For a set of
independent variables α, β, γ the null hypothesis for the effect of each variable by itself
is defined as in 2.16, while the interactions between them are defined as in 2.17 (for an interaction
between independent variables α and β) or as in 2.18 (for an interaction between all three independent
variables).
H0 : α1 = α2 = ... = αi (2.16)
H0 : α1β1 = α1β2 = ... = αiβj (2.17)
H0 : α1β1γ1 = α1β1γ2 = ... = αiβjγk (2.18)
The N-way ANOVA is a specific case of a General Linear Model and can be defined (for the three
variables described) as in 2.19, where µ represents the overall mean of the dependent variable across all
groups, αi, βj and γk represent the effect of group i, j or k of the respective independent variable
on the overall mean, interaction terms such as (βjγk) represent the effects of interactions between
the independent variables involved, and εijk represents the error associated with groups i, j and k when
present in a sample at the same time.
yijkr = µ+ αi + βj + γk + (αiβj) + (αiγk) + (βjγk) + (αiβjγk) + εijk (2.19)
All the independent-variable parameters in 2.19 are subject to a constraint that forces the sum
of the parameters across all groups to be 0 (e.g. equation 2.20 shows this constraint for the independent
variable α).
Σ_{i=1..I} α_i = 0    (2.20)
The estimation of the independent-variable group parameters is usually done through an Iteratively
Re-weighted Least Squares method that iterates through the provided samples and finds the parameters
that best fit the General Linear Model to the data.
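For a balanced design with no missing cells, the main-effect parameters under the sum-to-zero constraint 2.20 reduce to the deviation of each level's marginal mean from the overall mean; in that special case the iterative fit and this closed form coincide. A minimal sketch with made-up data for two factors:

```python
import numpy as np

# Made-up balanced 2x3 design: y[i, j, r] holds replicate r for factor-A
# level i and factor-B level j (2 A-levels, 3 B-levels, 2 replicates).
rng = np.random.default_rng(0)
y = rng.normal(size=(2, 3, 2)) + np.array([1.0, -1.0])[:, None, None]

mu = y.mean()                         # overall mean of the dependent variable
alpha = y.mean(axis=(1, 2)) - mu      # main effect of each A level
beta = y.mean(axis=(0, 2)) - mu       # main effect of each B level

# The sum-to-zero constraint (eq. 2.20) holds by construction.
assert np.isclose(alpha.sum(), 0.0) and np.isclose(beta.sum(), 0.0)
```

For unbalanced designs these marginal-mean estimates no longer equal the least-squares fit, which is when the iterative method the thesis mentions becomes necessary.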
2.8 Interior Point Algorithm
The Interior Point algorithm [9] is a local optimization algorithm that minimizes a given function
while obeying certain constraints on its domain, through the addition of barrier functions to the function
being minimized. In the context of this study all the constraints applied to the parameters of the function
are constant, as in 2.21. According to barrier function theory [10], when dealing with a function
f(x) subject to the constraint x > b (2.21), we can dismiss the inequality constraint by adding µ c(x) to f(x)
(2.22). The restriction is maintained so long as c(x) = +∞ whenever x < b and µ is a free parameter that,
as it approaches zero, allows proposed solutions of the minimization of 2.21 to approach b.
f(x), x > b (2.21)
f(x) + µ c(x), c(x) = +∞ if x < b (2.22)
One commonly used barrier function is −log(x) since it tends to infinity when x tends to zero 2.23.
f(x) − µ log(x−b) (2.23)
And so, by minimizing 2.23 iteratively while decreasing µ, we find a local minimum of 2.21, depending
on the initial value for x (x0). When dealing with functions with multiple local minima the Interior Point
algorithm may fail to find the global minimum. One way to circumvent this is to run the method multiple
times with different x0 and keep the best solution, thus increasing the probability that the global minimum
is found.
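The barrier iteration 2.23 can be sketched for a one-dimensional objective; the objective f(x) = (x − 2)² and the bound b = 1 are invented for illustration, and the bounded 1-D minimizer stands in for the full interior-point machinery:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative objective (made up): f(x) = (x - 2)^2, constrained to x > b = 1.
f = lambda x: (x - 2.0) ** 2
b = 1.0

# Log-barrier iteration (eq. 2.23): minimize f(x) - mu*log(x - b) while
# shrinking mu toward zero.
x_opt = None
for mu in [1.0, 1e-2, 1e-4, 1e-6]:
    barrier = lambda x, mu=mu: f(x) - mu * np.log(x - b)
    # 1-D bounded minimization strictly inside the feasible region x > b.
    res = minimize_scalar(barrier, bounds=(b + 1e-9, 10.0), method="bounded")
    x_opt = res.x

# The unconstrained minimum x = 2 already satisfies x > 1, so the iterates
# converge to it. For non-convex f one would restart from several x0 values
# and keep the best solution, as described above.
assert abs(x_opt - 2.0) < 1e-3
```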
2.9 FSL, FEAT and FeatQuery
FSL [11] is a library of tools that can be used to analyze Functional Magnetic Resonance Imaging
(fMRI) and other neuro-imaging exams. In particular, two tools are relevant to this project:
FEAT and FeatQuery.
FEAT is responsible for analyzing fMRI data that has been preprocessed from its initial raw state into
an analyzable one using BET (another FSL tool). For that, the user must identify each event that
they wish to have analyzed. When using FEAT, one should consider an event as a group of occurrences
that will be averaged before being analyzed. This means that it is valid to define an event containing
all the occurrences of a certain type during an experiment, in order to understand what
happens in the brain during that type of event (on average), as well as to define each event as a single
occurrence, to get data on that specific occurrence.
After analyzing the data with FEAT, the user can resort to FeatQuery to read the files produced by
FEAT. For that it is important for the user to define what is called a Region of Interest (ROI). A ROI
is defined by a mask that delimits a region of the brain that the user wishes to have data on.
Figure 2.2: Mask that identifies a ROI around the left amygdala, one of the constituents of the reward system of the subject’s brain.
After applying the mask and defining the ROI, FeatQuery will take the data of each voxel inside the
ROI and average it. Note that if the user defined events in FEAT composed of various occurrences, each
voxel will hold the average stimulation present in that region of the brain across all occurrences.
After the whole procedure the user should have the amount of stimulus a certain area of the brain was
subjected to during an event.
2.9.1 FEAT
FEAT is the first tool we will use in the BOLD response data extraction. It is responsible for
extracting the parameters of the function that describes the BOLD values through time, taking into
account that certain time points are tagged with certain events (for example, the beginning or end of a
round).
Figure 2.3: FEAT’s Miscellaneous menu
Figure 2.4: FEAT’s Data menu
In the Miscellaneous menu, as seen in figure 2.3, the ”Brain/Background threshold” field
allows the regulation of the threshold for background noise in the fMRI. Only stimulation above a
certain value will be considered relevant for the analysis, while values below it will be treated as noise.
Using the ”Z threshold” option we can define the threshold that separates relevant BOLD responses from
weaker ones that may still be registered. These weaker BOLD responses might result from the lingering
effects of a past activation that still registers in the brain.
In FEAT’s Data menu (figure 2.4) we selected FEAT directories as the format of our input
(since that was the format in which data from previous processing phases was delivered to us) and chose
which files to input. Then we set the total number of volumes (brain images) along with the time each of
those volumes took to collect. Finally, we chose the cutoff value for a high-pass filter, to reduce noise in
the extraction.
Figure 2.5: FEAT’s Stats menu
Figure 2.5 shows FEAT’s Stats menu. This menu contains options to, for example, compensate for
motion during scans (motion correction), but since Rilling’s team did not use this functionality, neither
did we.
Figure 2.6: FEAT’s Post-stats menu
Figure 2.7: FEAT’s Full model setup menu, events tab
In figure 2.6 we can see FEAT’s Post-stats menu. In this menu we instructed FSL to look for clusters
of voxels (in the Thresholding option) that show average activation values above the value defined in the
Z threshold field, while the P threshold defines the p-value used to decide whether the Z threshold was
passed or not.
By selecting the option Full Model Setup in FEAT’s Stats menu we reach the menu shown in figure
2.7. In this menu we selected the number of events each experiment contained. In our case we considered
60 events, 30 of them being the beginning of each of the 30 rounds, while the other 30 corresponded to
the moment rewards were received in each round.
We gave each event a name and chose a function to model the hemodynamic response felt in
the brain during its occurrence. We chose the sinusoidal function as it is a staple when modelling this
type of brain activity.
Figure 2.8: FEAT’s Full model setup menu, contrasts tab
In the Contrasts tab of the Full Model Setup menu (fig. 2.8) we defined the matrix that relates the
events to the baseline. We could also have related events to one another, but that did not make sense due
to the design of the experiment and the singular nature of each of our events. No F-tests were performed
since our events are singular instances and not averages of many occurrences.
After concluding all the steps necessary to process one subject’s data into a FEAT folder, we used a
FEAT functionality to export the whole configuration to a script file, so that it could later be
adapted to other subjects via a script.
2.9.2 FeatQuery
Figure 2.9 shows the FeatQuery menu. We used FeatQuery to apply the mask that defined our ROI.
First, we selected the folder containing the data processed by FEAT relative to one subject, as subjects
were processed one at a time. Then we chose the file containing the mask of that specific subject.
Figure 2.9: The FeatQuery menu
Due to the original mask being in standard space (a standardized set of coordinates used in
neuroscience) it had to be transformed to the space each subject’s measurements were performed in
(highres space). This transformation was conducted with the command line script detailed below:
flirt -in [Standard Space Mask] -ref [Output of Highres File] -applyxfm -init [Standard
to Highres transformation file] -datatype float -out [Transformed Mask Location]
The transformed mask obtained from this command was the one fed to FeatQuery.
2.10 Pearson’s Correlation Coefficient
Determining the correlation factor between two characteristics of a patient’s brain (like the stimulation
data from an fMRI and the data from a Q-learning model) will be very important during the course of
this work. When performing correlation analysis, Pearson’s Correlation Coefficient (PCC) proves to be
a very powerful tool in quickly determining the linear correlation between two features or variables.
ρ(X,Y) = (1/(N−1)) Σ_{i=1..N} ((X_i − µ_X)/σ_X)((Y_i − µ_Y)/σ_Y)    (2.24)
The PCC can be calculated as seen in 2.24, where µ_X is the average of all the values of X and σ_X
their standard deviation; its value varies between -1 and 1. Most problems require the analysis of
the PCC between various features, and a matrix format is adopted to display the correlation coefficients
for quick analysis. An example can be seen in Table 2.1.
        A        B        C
A    ρ(A,A)   ρ(A,B)   ρ(A,C)
B    ρ(B,A)   ρ(B,B)   ρ(B,C)
C    ρ(C,A)   ρ(C,B)   ρ(C,C)
Table 2.1: Pearson’s Correlation Matrix
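Equation 2.24 and the matrix layout of Table 2.1 can be sketched as follows (the three feature vectors are invented for illustration); `np.corrcoef` produces the same matrix:

```python
import numpy as np

# Three illustrative feature vectors (made-up values).
A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
B = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
C = np.array([5.0, 3.0, 4.0, 1.0, 2.0])

def pcc(x, y):
    """Pearson's correlation coefficient as in eq. 2.24 (sample std, N-1)."""
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return (zx * zy).sum() / (n - 1)

# Correlation matrix as in Table 2.1; np.corrcoef gives identical entries.
matrix = np.corrcoef(np.vstack([A, B, C]))
assert np.isclose(pcc(A, B), matrix[0, 1])
```

The diagonal entries ρ(A,A), ρ(B,B), ρ(C,C) are always 1, since every variable is perfectly correlated with itself.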
2.11 Spearman Rank Order Correlation
As an alternative to Pearson’s correlation we considered studying the Spearman correlation between
our parameters, as it is able to capture types of relationships between variables that the PCC cannot.
Computing the Spearman Rank Order Correlation Coefficient is very similar to the process required
to determine the PCC. There is only one extra step, which consists in ranking the vectors
containing the two variables being correlated.
Figure 2.10: Two associated variables (A and B) and their ranked versions
Figure 2.10 shows two variables A and B and their respective ranked variables. They are ranked in
integers from lowest to highest. In figure 2.11 we can see the unranked variables A and B plotted along
with the line that would define their Pearson’s correlation compared to a similar plot of their ranked
counterparts.
Figure 2.11: Plot of variables A and B (unranked) and their respective trendline compared to a plot of their ranked counterparts.
We can infer by visual inspection that, were the PCC to be determined for the unranked variables,
its value would not be very high. On the other hand, the linear correlation between the ranked variables
is perfect. By ranking the variables we expose a relationship between them that was not captured by the
PCC. This operation is equivalent to fitting the unranked variables to any monotonic function. The
Spearman correlation has the benefit of being less affected by outliers than Pearson’s correlation,
while being more prone to identifying spurious relationships due to its flexibility.
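The rank-then-correlate procedure can be sketched on an invented monotonic but non-linear pair of variables, where Spearman reaches 1 while Pearson stays below it:

```python
import numpy as np
from scipy import stats

# Monotonic but non-linear relationship (made-up values): B = A cubed.
A = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
B = A ** 3

# Spearman = Pearson applied to the ranked variables.
rho_spearman = np.corrcoef(stats.rankdata(A), stats.rankdata(B))[0, 1]
rho_pearson = np.corrcoef(A, B)[0, 1]

# Ranking exposes the perfect monotonic relationship the PCC understates,
# and matches SciPy's built-in Spearman computation.
assert np.isclose(rho_spearman, 1.0)
assert rho_pearson < rho_spearman
assert np.isclose(rho_spearman, stats.spearmanr(A, B)[0])
```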
3 Related Work
Contents
3.1 Overview
3.2 Base Studies
3.3 The validity of modeling brain processes with RL
3.4 The importance of the striatum when dealing with reward prediction error
3.1 Overview
Figure 3.1 presents an overview of papers studying changes in behaviour and brain functioning as their
subjects execute reward oriented tasks, along with their methodologies and results, with ↑ indicating increases and
↓ indicating decreases in whichever metric is referenced in front of it. Some of these studies warrant
further inspection in subsequent sub-chapters.
Figure 3.1: Overview table
3.2 Base Studies
The studies described below served as the main basis for this work. Both were performed on humans,
with the same methodology. The data analyzed in this work originated from the studies below.
3.2.1 Effects of intranasal Oxytocin and Vasopressin on cooperative behavior
and associated brain activity in men
Oxytocin (OT) and Vasopressin (AVP) are substances known to affect social behavior in humans. In
”Effects of intranasal Oxytocin and Vasopressin on cooperative behavior and associated brain activity
in men” [1] the researchers investigated the effects of these substances in human male subjects through
a series of matches of an iterated version of the PD game, played against computers and against computers
that the subjects perceived as human beings.
In this variant of the PD game, in each round, the player that goes first (Player 1) chooses to either
cooperate or defect. After he makes his choice, a variable amount of time passes and his opponent (Player
2) sees the choice made by Player 1 and makes his own. After both players choose their actions,
they see the outcome of the round, after which another round starts.
Figure 3.2: Time-line for one round of PD regardless of the nature of the opponent
When both players defect (DD) both will be rewarded with 1$. When both players cooperate (CC)
both will receive 2$ instead. When one player cooperates but the other defects (CD or DC), meaning
when one player “cheats” the other, the defecting player will get 3$ while the cooperating player gets
nothing.
Figure 3.3: PD payoff matrix for the game performed in Rilling’s study [1]
As is usual in Prisoner’s Dilemma-like games, there exists a Nash Equilibrium, a state where no
participant is motivated to switch strategies, in the situation of mutual defection (DD), as either
player loses 1$ by unilaterally deviating from his strategy. At the same time, mutual cooperation (CC)
is unstable, since each player stands to gain 1$ more by defecting if he assumes the other
player will continue to cooperate.
In the adapted PD game the players play thirty rounds in a row, and each subject plays two games
against each opponent type (human or computer), one as Player 1 and the other as Player 2. In reality, all
games were played against a computer algorithm, regardless of whether players believed they were playing
a human or a computer. This algorithm behaved in a way that human players would not easily suspect
its nature when passing as human. When playing as Player 2, the algorithm would defect back on a
defecting Player 1 90% of the time, but would only cooperate with a cooperating Player 1 67% of the time.
In this way the algorithm acts in a self-preserving way when facing a defector, rarely allowing itself to
be exploited, while using its position of power to sometimes (33% of the time) defect on a cooperating
Player 1 in order to increase its gains. When playing as Player 1 the algorithm plays a probabilistic
tit-for-two-tats. It starts by always cooperating in the first round, since defecting would result in a virtually
assured mutual defection that could set a negative trend for the whole game. If cooperated back (CC) the
algorithm will always continue cooperating. When defected back (CD) the algorithm will try to cut losses
by defecting 90% of the time in the next round. This would likely lead to a mutual defection scenario (DD)
in subsequent rounds, in which case the algorithm would try to resume cooperation by cooperating 33% of
the time. If at any time Player 2 decided to return to cooperation the algorithm would reciprocate 100%
of the time.
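The opponent policy described above can be sketched as follows; the probabilities are those reported in the study, while the class name, state encoding and RNG handling are our own illustrative choices:

```python
import random

class RillingOpponent:
    """Sketch of the computer opponent described above ('C' = cooperate,
    'D' = defect). The probabilities are taken from the text; everything
    else about this implementation is an assumption."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)

    def as_player2(self, p1_action):
        # Self-preserving response: defect back on a defector 90% of the
        # time, cooperate back with a cooperator only 67% of the time.
        if p1_action == 'D':
            return 'D' if self.rng.random() < 0.90 else 'C'
        return 'C' if self.rng.random() < 0.67 else 'D'

    def as_player1(self, last_outcome=None):
        # last_outcome is (own_action, opponent_action) from the previous
        # round, or None on the first round.
        if last_outcome is None:
            return 'C'          # always open with cooperation
        own, opp = last_outcome
        if own == 'C' and opp == 'C':
            return 'C'          # sustain mutual cooperation
        if own == 'C' and opp == 'D':
            # Cut losses after being defected on: defect 90% of the time.
            return 'D' if self.rng.random() < 0.90 else 'C'
        if opp == 'C':
            return 'C'          # always reciprocate a returning cooperator
        # Mutual defection: attempt to resume cooperation 33% of the time.
        return 'C' if self.rng.random() < 0.33 else 'D'
```

Simulating a game then reduces to calling `as_player1` (or `as_player2`) once per round with the previous round's outcome.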
The human subjects were monitored through fMRI during the whole game. The researchers also
registered the time stamps of specific events, like round starts, Player 1 discovering that he was defected
on by his opponent, etc., allowing them to extract brain information associated with relevant occurrences.
After analyzing the data, the study made some interesting findings:
• Players as Player 1 across all drug groups were 9% more likely to cooperate after a CC outcome
when facing perceived humans (89%) than when facing a computer (80%). (p = 0.0002)
• Players as Player 1 across all drug groups were 9% less likely to cooperate after a CD outcome when
facing perceived humans (24%) than when facing a computer (33%). (p = 0.005)
• Players as Player 1 across all drug groups were 13% less likely to cooperate after a DD outcome
when facing perceived humans (38%) than when facing a computer (51%). (p = 0.0002)
• There were differences in behavior between players, playing as Player 1, that took OT and ones
that took AVP but when comparing each drug group with the placebo takers the differences were
not statistically significant.
• Players as Player 2 across all drug groups were 14% more likely to cooperate after a Cooperative
play when facing perceived humans (88%) than when facing a computer (74%). (p = 3 x 10^-7)
• Players as Player 2 that were administered AVP were 10% more likely to cooperate after a Co-
operative play when facing perceived humans (96%) than subjects treated with OT (86%). (p =
0.01)
• Players as Player 2 that were administered AVP were 21% more likely to cooperate after a Co-
operative play when facing a computer (88%) than subjects treated with placebos (67%). (p =
0.007)
This seems to indicate that all subjects in general were:
• Less likely to betray cooperative humans than computers.
• More likely to lose trust in their opponent when facing humans that defected them.
• More likely to try to escape the DD Nash Equilibrium when facing computers.
• Less likely to betray players that trusted them when those players were human.
Also, players that were administered AVP were:
• More likely to reward Cooperative actions with cooperation of their own than OT administered
subjects when facing humans.
• More likely to reward Cooperative actions with cooperation of their own than placebo takers when
facing computers.
3.2.2 Sex differences in the neural and behavioral response to intranasal oxy-
tocin and vasopressin during human social interaction
The first study by Rilling’s team on the effects of OT and AVP in the brain [1] only contained male
subjects, due to certain hormones showing widely variable levels across the menstrual cycle of women.
One of those hormones, estradiol, is known to affect OT receptors. Because of this, in their new study
‘Sex differences in the neural and behavioural response to intranasal oxytocin and vasopressin during
human social interaction’ [2], estradiol levels were measured by taking blood samples from all subjects
and were compensated for in the models built.
The rest of the process was the same as in their previous study [1]. The gathered
data was aggregated and studied. While playing as Player 1 the results seem to suggest that:
• Women administered OT seem to be less likely to maintain cooperation when playing computers.
• Women administered either OT or AVP seem to be more likely to maintain defection when playing
computers.
• Both OT and AVP seem to make women differentiate less between computers and humans when
deciding to cooperate or not after mutual defection.
• Comparisons between male and female behavior as player 1 did not yield statistical significance.
While playing as player 2 the results seem to suggest that:
• Women administered OT seem to be less prone to defect on cooperative human players.
• Women administered AVP seem to be more likely to cooperate with individuals after they have
defected them.
• Men administered AVP seem to be more cooperative towards players that were cooperative in the
past both computer and human.
• Men administered OT seem to be more defective when playing computers that were defective in
the past.
3.3 The validity of modeling brain processes with RL
Previous studies [3, 12–14] have modelled, or hinted at the possibility of modelling, reward based learning
tasks with RL. The papers described below present some of the existing scientific work that
can be taken as a basis to validate the use of RL models to represent brain processes in situations
where individuals learn optimal behaviours by receiving rewards associated with their choices.
3.3.1 Reinforcement learning in the brain
In ‘Reinforcement learning in the brain’ [3], Niv presents evidence of the way RL models the function of
dopaminergic neurons in the brains of mammals and humans. The author starts by referencing previous
experiments with animals which showed that when a sound was played before a monkey was fed, the sound
would trigger an increase in activity in the monkey’s reward centers. However, this increase in stimulation
would diminish with time as the sound was played more and more. The author suggests that
this might indicate that, since the animal was getting used to being fed after the sound, its reward
centers were not firing up as much as they were before the habituation had set in.
Figure 3.4: A monkey’s neuro-conditioning to a sound (CS) followed by being fed juice (US) in instants a), b) and c)
Figure 3.4 shows a measure of a monkey’s reward center stimulation over time, with two events.
The first, CS, corresponds to the playing of a sound, while the second, US, corresponds to the feeding
of a tasty juice. In (a), it can be observed that as the trials advance the monkey’s brain starts having
its activity peaks before the juice is fed to it, and these peaks also become less intense. In (b) it can be
seen that the monkey is now used to the reward of the juice, but when the sound is introduced its brain
still shows activity. In (c) the sound is played but no juice is given, so when failing to receive the juice the
monkey experiences a negative stimulus. The author emphasizes the similarities between Temporal
Difference Learning, a type of RL model, and the behavior observed both in the monkey and in its brain.
The author then elaborates on the way the necessity of using non-invasive brain activity measuring
techniques impacts tests with humans. According to the author, Functional Magnetic Resonance
Imaging (fMRI) is the most widely used technique, since it is non-invasive, but it presents problems due to
the amount of noise it produces and the amount of data users must filter. Despite its flaws, it
remains a very popular technique for measuring the BOLD response in areas of the brain.
The paper points to other works in the neurosciences that identify various neural controllers in the
brain which, the scientific community hypothesizes, are used both for their specific tasks and in conjunction
with each other to perform more complex decisions. This can prove problematic when performing
experiments in this area, as one specific task might use neural controllers that can easily be modeled by
RL algorithms, while for other tasks of the same complexity this might be impossible. The author
also points out that many of these neural controllers work in the absence of dopamine. This further
supports the idea that neural controllers are varied and that, for now, it is unrealistic to try to find a model
that can explain all the decision processes made by an animal in different situations. It also indicates
that exploring substances other than dopamine might yield important results.
The author concludes by reiterating the parallels between the RPE measured in many RL algorithms
and the brain stimulation measured in the moments where animals and humans encounter discrepancies
between their expected and received rewards, and by pointing out that RL models have had, so
far, unprecedented success in modeling neural controllers, which, due to their relative simplicity, can be
very useful in furthering our knowledge of the inner workings of the brain.
3.3.2 Actions, Policies, Values, and the Basal Ganglia
In ‘Actions, Policies, Values and the Basal Ganglia’ [12] the authors propose a model for the decision
processes that occur when an individual decides which of the actions available to him is optimal.
The authors argue that there are two main types of behavior: habitual and goal-directed behavior.
They label habitual behavior as devaluation-insensitive and goal-directed behavior as devaluation-sensitive.
Devaluation-insensitive behavior is behavior that does not adapt to changes in the environment
(i.e. subjects may continue eating food despite being satiated), while devaluation-sensitive behavior
responds to these changes by realigning the subject’s priorities when selecting a new action. The authors
argue that both model-based and model-free RL fit the devaluation-insensitive category, as these models
must relearn their values once the environment has changed, while devaluation-sensitive systems
immediately detect changes in the environment and adapt accordingly.
The authors point out that the Basal Ganglia is an area of the brain closely related to the decision
processes of habitual behavior. As evidence they point to various studies on lesions in this area and the
effects they had on the habitual behaviors of the patients. They note that when injured, only the habitual
behavior control system seemed to be affected, which suggests that, despite many situations demanding
joint action from both habitual and goal-oriented systems (and despite there being no physical evidence
of arbitration between these systems), they can still act independently of each other.
The authors conclude by reiterating the duality of their proposed model, in which one controller
decides on habitual behavior while another decides on goal-oriented behavior. Since RL has been shown
to have more success modelling habitual behavior, this serves as further evidence that only certain parts
of the decision process of humans and animals can be explained by RL.
3.3.3 Model-based fMRI and its application to Reward Learning and Deci-
sion making
In ’Model-based fMRI and its application to Reward Learning and Decision making’ [15] we find
a detailed analysis of the technique of model-based fMRI, its advantages over more traditional fMRI
applications and some work performed in this area. The analysis describes the technique as the study
of correlations between data from fMRI analyses that look into the changes in the activity of regions of
interest in the brain, and data collected from computer models that describe whatever task the subject
was performing at the time of the fMRI. If a correlation between these two factors is
found, one can then ascertain that the brain areas focused on during the fMRI analysis are relevant
to the subject’s performance of the task that has been computer modeled. This is a tried and proven
framework for this type of work, and was therefore chosen to test our hypothesis.
3.4 The importance of the striatum when dealing with reward
prediction error
The paper ‘Temporal prediction errors in a passive learning task activate human striatum’ [16] de-
scribes a study that concluded that increased brain activity in the striatum region of the brain is associated
with unexpected rewards.
3.4.1 Temporal prediction errors in a passive learning task activate human
striatum
In ‘Temporal prediction errors in a passive learning task activate human striatum’ [16] the authors
document an experiment where humans were fed either water or juice, at predictable intervals in one
phase and at unpredictable intervals in another.
Figure 3.5 shows the striatum of a subject registering a positive BOLD response when faced with an
unexpected reward. In an RL model this situation would equate to a positive reward prediction error
which, when associated with a positive BOLD response, could be indicative of a correlation between these
two occurrences.
Figure 3.5: Activation of the striatum when the subject receives an unexpected reward.
4 Methods
Contents
4.1 Data Processing (ETL)
4.2 Q-learning parameters estimation
4.3 RPE estimation
4.4 Correlation Analysis
4.5 BOLD/RPE correlation and respective ANOVA
4.6 Defining a Q-learning Model
4.7 Boundaries for the η parameter
4.8 Empirical model testing
4.9 Testing the Q-learning implementation
4.10 Chi-square test to test confounding effects of Round Order
4.11 ANOVA design to test effects of Sex, Drug and Opponent
This section details the tasks conducted to study the postulated hypothesis. Section 4.1 describes
the ETL process used for data processing in this thesis. Section 4.2 describes the process by which
the parameters for the Q-learning models were estimated, while section 4.3 details the steps necessary
to estimate the RPE for each interaction. Sections 4.4 and 4.5 explain the correlation analysis that
was performed and how it was examined. Sections 4.6, 4.7, 4.8 and 4.9 describe the definition and
testing process of the Q-learning model used (and its parameters). Finally, section 4.10 details a
Chi-Square test that was performed and the way it affected the design of the ANOVA tests.
4.1 Data Processing (ETL)
To study the postulated hypothesis, it was necessary to process the fMRI data produced by Rilling’s
work [1] [2]. For that we used an ETL-like process.
Figure 4.1: ETL description scheme.
4.1.1 Extraction
The initial, raw data relevant to this project consisted of numerous .tar files. After a successful
extraction, all files were ordered by subject number, identifying the type of opponent played against
(human/computer) and whether the round was the subject’s first or second. After
this, the files were ready to be processed by FEAT.
4.1.2 Transformation
The transformation is the lengthiest part of the ETL process, as several operations are conducted during this step. First, checks were performed to verify that the event structure of each file was consistent with the rest. Violating this structure, for example by having events ordered differently, would
cause problems in later stages of the transformation process. Since in the original study the research team analyzed events that are outside the scope of this investigation (such as the reaction of a subject's brain to seeing that it is his turn to play), these had to be filtered out. For a later stage of the transformation process, it was also necessary to analyze the event files for relevant events and build the outcome sequence of each experiment (e.g. subject 002 in round 1 first observed a mutual cooperation outcome, followed by a mutual defection outcome, etc.), which determines the amount of reward a subject received in each round.
Rilling's research team grouped all occurrences of the same type (i.e. user notices mutual defection), effectively averaging the BOLD responses for each event. Since each occurrence of this type (an instance of a subject noticing the outcome of a round) produces a different RPE, by communicating to the subject a certain reward corresponding to his previous action, these events had to be broken down into individual occurrences.
Next, we needed to automate the application of FEAT over the data. To this end, FEAT allows the creation and reading of scripts through the command line that obey a specific text file format. This required creating a script that generates a FEAT script for each experiment. Each of those FEAT scripts analyzes all the relevant events from the experiment and maps the stimulation that the subject's brain was experiencing at each moment.
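Generating a design script per experiment amounts to filling a text template. The sketch below is illustrative: the `set fmri(...)` keys shown are only a small subset of a real FEAT design (.fsf) file, and the values are placeholders rather than the actual design used in this work.

```python
# Minimal .fsf-style template; a real FEAT design file contains many more
# fields (EV definitions, registration options, etc.).
FSF_TEMPLATE = """set fmri(outputdir) "{output_dir}"
set feat_files(1) "{input_file}"
set fmri(evs_orig) {n_events}
"""

def make_feat_script(output_dir, input_file, n_events):
    """Produce the text of one FEAT design script for one experiment."""
    return FSF_TEMPLATE.format(
        output_dir=output_dir, input_file=input_file, n_events=n_events
    )
```

One such script would be written per experiment and fed to FEAT from the command line.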
After the processing of the data by FEAT is concluded, we run a FeatQuery command that rotates and reshapes the mask to fit the brain structure of each specific subject. This mask is used in the next processing step to filter out all brain activity originating from areas of the brain that are not relevant to our hypothesis.
Finally, another script was created that takes all the data processed by FEAT and calls the necessary FeatQuery commands to extract the amount of stimulus each subject felt in their reward centers during each event. This information is presented in an HTML report from which the relevant information is extracted during the loading phase.
4.1.3 Loading
The loading phase starts by extracting, from the HTML reports and the action sequences produced for each subject during the extraction phase, the pairs of stimulus felt and occurrence that caused said stimulus. This information is then aggregated into a single .csv file (for each subject) along with the remaining, previously available information (subject number, round number, player order and opponent type). All the .csv files produced this way are then concatenated into a single .csv with the information from all participants.
With regard to extracting the BOLD response from the relevant areas of the brain during the experiment, we can say that at this point the data is fully processed. It is, however, still necessary
to identify each subject's gender and the drug they were administered. For this we crossed the subject information present in the aforementioned .csv file with two Excel files created by the team at IMM that contained each subject's gender, drug, round and action sequence (but not identity).
Some validations occurred at this stage. For example, the action sequence of a subject in one file must match the action sequence of the subject with the same number in the other. This held for all but a few subjects. The ones that showed a mismatch between the action sequences in the two files were analyzed, and most of them had an action sequence that matched that of a subject with a different number in the other file (e.g. subj002 from the fMRI data had the same action sequence as subj282 from the Excel file). After contacting a former IMM researcher who had also worked with Rilling's team, we were able to determine that some of these mistakes were caused by manual transcription of data and to correct them. Other subjects' data had to be removed due to inconsistencies that couldn't be corrected.
After validating and aggregating all the relevant data for all subjects and encoding the data numerically so that it can be more easily imported into MATLAB, the data loading process is concluded.
4.2 Q-learning parameters estimation
After importing the data to MATLAB we were ready to estimate the parameters for the Q-learning
algorithm that best define each subject. To that end we created a function that receives a set of Q-
learning parameters (section 4.6 goes further into these) and a sequence of actions and respective rewards,
outputting the negative log likelihood of that sequence being produced by those parameters (as in 4.1, where n is the number of rounds, 30 in the case of this study).

\[ -\log L = -\sum_{t=1}^{n} \log P(a_t) \tag{4.1} \]
Then, using a nonlinear solver (an Interior-Point algorithm enhanced by a MultiStart procedure), we find the Q-learning parameters that minimize the negative log likelihood. These are the parameters most likely to resemble the ones that produced the provided action sequence. This, again, assumes that the Q-learning algorithm is a good model for the decision process of subjects in this situation, which seems to be supported by other studies [3].
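The likelihood computation of 4.1 can be sketched as below. This is an illustrative Python version (the thesis implementation was written in MATLAB), assuming the one-α model of section 4.6 with actions encoded as 0 (cooperate) and 1 (defect):

```python
import math

def negative_log_likelihood(params, actions, rewards):
    """Negative log likelihood (eq. 4.1) of an action sequence under a
    one-alpha Q-learning model with a Boltzmann policy.
    `params` is (alpha, eta, q0_cooperate, q0_defect)."""
    alpha, eta, q0_c, q0_d = params
    q = [q0_c, q0_d]
    nll = 0.0
    for action, reward in zip(actions, rewards):
        # Boltzmann probability (eq. 4.4) of the action actually taken
        z = math.exp(eta * q[0]) + math.exp(eta * q[1])
        p = math.exp(eta * q[action]) / z
        nll -= math.log(p)
        # single-state Q-value update (eq. 4.5)
        q[action] += alpha * (reward - q[action])
    return nll
```

In the pipeline this value is minimized over the parameter space; in MATLAB we used an Interior-Point solver with MultiStart, while in Python one could hand this function to a general nonlinear optimizer.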
4.3 RPE estimation
After estimating the parameters for the Q-learning model of a subject, we are able to estimate the RPE sequence by running the algorithm as if it were learning for the first time, now with the estimated parameters. By doing this, the initial Q-values get updated at each step, simulating the updates that occurred in the subject's brain during the experiment when receiving that specific reward sequence.
Later in the process, this sequence of RPE’s will be correlated with the sequence of BOLD responses in
the brain to test the postulated hypothesis.
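As a sketch, the replay can be written as below (illustrative Python; it assumes a one-α model with parameters (α, η, Q0(c), Q0(d)) and actions encoded as 0 for cooperation and 1 for defection):

```python
def rpe_sequence(params, actions, rewards):
    """Replay a fitted Q-learning model over a subject's observed actions
    and rewards, collecting the reward prediction error of each round."""
    alpha, _eta, q0_c, q0_d = params
    q = [q0_c, q0_d]
    rpes = []
    for action, reward in zip(actions, rewards):
        rpe = reward - q[action]   # RPE as in eq. 4.6
        rpes.append(rpe)
        q[action] += alpha * rpe   # Q-value update as in eq. 4.5
    return rpes
```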
4.4 Correlation Analysis
Obtaining a positive correlation between the BOLD response and the RPE (or another Q-learning related metric) would indicate that there might be a link between brain activity in the reward centers and Q-learning models derived from the parameter fitting process described in the previous sections. A strong negative correlation would be unexpected but could also indicate the possible existence of a relation between brain activity and Q-learning models with parameters fitted to represent said activity.
If no correlation is found the experiment will not verify the postulated hypothesis. Studies show that
for certain tasks, regions of the brain like the left putamen might respond to positive RPE’s while not
responding to negative RPE’s [16]. Due to this, studying the correlation between only positive RPE’s and
the BOLD response was considered worthwhile. There is also the possibility of studying the correlation
between the rewards received and brain activity, since this would circumvent the estimated Q-learning
models and provide us with a different insight.
4.5 BOLD/RPE correlation and respective ANOVA
After completing all the previous steps, we are now able to calculate the correlation between the array of BOLD responses of each subject and its array of RPE's. For this we tested both Pearson's correlation and the Spearman correlation, since we don't necessarily expect a specifically linear correlation to be present, and another, more flexible relation may exist.
As shown in 4.2, the PCC is obtained by correlating the array R of RPE's experienced by the subject with the array F containing the BOLD responses of the reward centers recorded at the corresponding indices. There will be NN' samples if N represents the number of subjects and N' represents the number of rounds played by each subject.

\[ \rho(R, F) = \frac{1}{NN' - 1} \sum_{i=1}^{NN'} \left( \frac{R_i - \mu_R}{\sigma_R} \right) \left( \frac{F_i - \mu_F}{\sigma_F} \right) \tag{4.2} \]
Considering the definition of Pearson's correlation in 4.2, we can define our Spearman correlation as in 4.3, provided we define $R_R$ and $F_R$ as the ranked forms of arrays R and F, respectively, where the values of these arrays are ranked from lowest to highest (i.e. all values in both arrays are replaced by their ordinal values with respect to the array they belong to).
\[ r_s = \rho(R_R, F_R) \tag{4.3} \]
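A minimal Python sketch of definitions 4.2 and 4.3 follows (illustrative; the thesis computations were done in MATLAB, and ties in the ranking are broken by position here, which is a simplification of the usual midrank convention):

```python
import math

def pearson(r, f):
    """Sample Pearson correlation of two equal-length sequences (eq. 4.2)."""
    n = len(r)
    mu_r, mu_f = sum(r) / n, sum(f) / n
    sd_r = math.sqrt(sum((x - mu_r) ** 2 for x in r) / (n - 1))
    sd_f = math.sqrt(sum((y - mu_f) ** 2 for y in f) / (n - 1))
    cov = sum((x - mu_r) * (y - mu_f) for x, y in zip(r, f))
    return cov / ((n - 1) * sd_r * sd_f)

def ranks(values):
    """Ordinal ranks, lowest value ranked 1 (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(r, f):
    """Spearman correlation as the Pearson correlation of ranks (eq. 4.3)."""
    return pearson(ranks(r), ranks(f))
```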
After obtaining the correlation factor for each participant, we can now run an ANOVA in order to determine the effect of various independent variables on our correlation. The independent variables considered were the subject's sex, administered drug and opponent type.
4.6 Defining a Q-learning Model
Different types of Q-learning models were considered for the modelling of the choice process of the
participants. Due to the nature of the task we considered the Boltzmann Policy (as in 4.4) as an adequate
policy to represent the way humans choose between exploration of new possibilities and exploitation of
current knowledge.
\[ P(a) = \frac{e^{\eta Q(a)}}{\sum_{b \in A} e^{\eta Q(b)}}, \quad \eta \in [0, +\infty[ \tag{4.4} \]
Due to the nature of the PD setting, we modeled the problem as having only one state, since the environment does not change as the rounds progress. As a result, the estimate of the optimal next-action Q-value of the future state ($\gamma \max_b Q(y, b)$) is removed from the model's equation, which takes the simpler form noted in 4.5, while the RPE, the measure that we hypothesize may be related to the BOLD response in the striatum and that has been identified as a strong modeler of brain activity responsible for reward estimation [12], is defined as in 4.6.

\[ Q(a) = Q(a) + \alpha \, (r(a) - Q(a)), \quad \alpha \in \left]0, 1\right] \tag{4.5} \]

\[ \sigma = r(a) - Q(a) \tag{4.6} \]
As supported by other work [17], processes modelled successfully by Q-learning algorithms can sometimes fit behaviour patterns better if they allow for different α values for interactions with the environment that bring positive or negative RPE values (α+ and α− respectively). Due to this, the possibility of incorporating a second α value was taken into consideration.

Since we expected that the model fitting process might over-fit its parameters to the observed action sequences of the participants, we also considered assuming that each individual would start with initial Q-values equal to their optimal values (i.e. with perfect knowledge of the inner workings of the algorithm that controlled their opponent), in order to reduce model complexity.
The considered models are as follows:
• One α: Standard model with one α, one η to control the amount of exploratory behaviour and two
initial Q-values for both cooperation and defection.
• Two α: Model with one α+ for positive RPE’s, one α− for negative RPE’s, one η to control the
amount of exploratory behaviour and two initial Q-values for both cooperation and defection.
• One α, optimal initial Q-values: Same parameters as the one α model, except the initial Q-values
are assumed to be optimal.
• Two α, optimal initial Q-values: Same parameters as the two α model, except the initial Q-values
are assumed to be optimal.
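The difference between the one-α and two-α variants reduces to which learning rate is applied at each update, as the following illustrative Python sketch shows (Q-values stored as a two-element list indexed by action; setting both rates equal recovers the one-α model):

```python
def q_update(q, action, reward, alpha_pos, alpha_neg):
    """Update the Q-value of `action` in place, using alpha_pos for
    positive RPE's and alpha_neg for negative RPE's; returns the RPE."""
    rpe = reward - q[action]
    alpha = alpha_pos if rpe >= 0 else alpha_neg
    q[action] += alpha * rpe
    return rpe
```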
4.7 Boundaries for the η parameter
Some initial tests showed that the Negative Log Likelihood minimization algorithm would sometimes abuse the η parameter by setting very large values for it in order to increase determinism, thus creating models that overfit the available action sequences. Due to this, we analyzed the outcome of the likelihood function for different values and tried to determine a measure of “reasonable determinism” for it. We considered that having P(a) > 0.95 for any [Q(c), Q(d)] would constitute a close-to-deterministic situation, as the probability of choosing one action over the other is very high.
Then, comparing the percentage of deterministic outcomes that one value of η produces against another over the relevant domain of Q-values allowed us to compare the “determinism” of two policies with different η values.
Figure 4.2: Percentage of the search space with P (a) > 0.95 for each η
Figure 4.2 shows the percentage of the area of the likelihood function with a likelihood over 0.95 in the relevant domain (Q(c) ∈ [0, 2], Q(d) ∈ [1, 3]) for all η ∈ [0, 20].
We wanted to allow the algorithm to attribute a certain degree of determinism to a participant without allowing very deterministic policies in situations where both Q-values are close to one another. Taking this into account, we decided to search for an η value that created an area of deterministic decisions of around 50%. The surface area of deterministic outcomes for η = 3 is approximately 50.07%, so we defined the maximum value of the η parameter as 3.
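This determinism measure can be reproduced with a grid search over the Q-value domain, as in the Python sketch below (illustrative; the grid resolution is arbitrary and the exact percentage depends on it):

```python
import math

def deterministic_fraction(eta, p_threshold=0.95, steps=200):
    """Fraction of a grid over Q(c) in [0, 2] and Q(d) in [1, 3] where the
    two-action Boltzmann policy picks one action with probability above
    `p_threshold`."""
    count = 0
    for i in range(steps):
        for j in range(steps):
            qc = 2.0 * i / (steps - 1)
            qd = 1.0 + 2.0 * j / (steps - 1)
            # the two-action Boltzmann policy reduces to a sigmoid
            p_coop = 1.0 / (1.0 + math.exp(-eta * (qc - qd)))
            if max(p_coop, 1.0 - p_coop) > p_threshold:
                count += 1
    return count / steps ** 2
```

With η = 3 the fraction comes out near one half, in line with the ≈50% target discussed above.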
Figures 4.3 through 4.5 show the plots of P(cooperation) in the relevant domain for η ∈ {1.5, 3, 20}.
Figure 4.3: Plot of the P(cooperation) over all possible Q-values with η=1.5
Figure 4.4: Plot of the P(cooperation) over all possible Q-values with η=3
Figure 4.5: Plot of the P(cooperation) over all possible Q-values with η=20
4.8 Empirical model testing
Due to some characteristics of the Q-learning models, we had concerns that our methodology of model fitting was susceptible to over-fitting, meaning that it would be possible to get models with low Negative Log Likelihoods despite high errors in the estimated parameters. As mentioned in section 4.7, some preliminary tests showed that the optimization algorithm would sometimes attribute high values to the η parameter in order to over-fit specific action sequences.
This happened because, with only thirty rounds of subject interaction, behaviour might by pure chance appear deterministic even if the subject's decision policy was stochastic.

Another characteristic of our framework that could cause problems was the lack of independence between the parameters of our models. For example, a subject that starts by cooperating in the first two interactions but defects back after receiving nothing but defections for the entire experiment might display this behaviour because he had an unreasonable expectation that his opponent would not defect back (i.e. had a high Q0(C)) but quickly learned that this initial expectation was wrong (i.e. had a high value of α). It could also be the case that the aforementioned subject had a low α (i.e. slow learning speed) but only a slight expectation that his opponent would not defect against him (i.e. a moderately high Q0(C)), an expectation that was quickly corrected after two interactions despite the subject's low learning speed. This lack of independence between model parameters could also cause high parameter error, since there could be instances where models with wildly different parameters produce similar action sequences and, thus, have similar likelihoods of fitting these action sequences.
Due to these concerns, a series of empirical tests was performed. These tests followed the steps below:
• Creation of artificial test subjects.
• Generation of Action and Reward sequences for all artificial subjects.
• Estimation of the subjects’ parameters through Negative Log Likelihood minimization.
• Computing of the average, per subject and per parameter, percentage error between real parameters
and estimated parameters.
• Computing of the average Negative Log Likelihood.
4.8.1 Artificial test subjects
Several artificial subjects were created. These were represented by sets of parameters that defined their behaviour (i.e. α, η, Q0(c) and Q0(d)). For each parameter, several points were set, evenly spaced throughout its domain. Every combination of these parameters then generated a single artificial subject. For example, for a model with one α, one η and two Q-values, 144 subjects were created.
• α, α+ and α− ∈ [0.1, 0.3, 0.5, 0.9]
• η ∈ [0, 1, 1.5, 3]
• Q0(c) ∈ [0, 1, 2]
• Q0(d) ∈ [1, 1.5, 3]
4.8.2 Generation of Action and Reward sequences
After creating the various subjects' models, they were run through a function that simulated the experiment: each artificial subject played for 30 rounds against the same algorithm (implemented by us) that the real subjects faced in Rilling's work. The function would then output the action and reward sequences of the game just played.
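The simulation step can be sketched as follows. Two loud assumptions in this illustration: the opponent is modeled as tit-for-tat and the payoff values are made up for the example; the actual opponent algorithm and monetary payoffs are those of Rilling's task, which are not restated here.

```python
import math
import random

# Illustrative payoffs for the focal player, indexed by
# (my_action, opponent_action) with 0 = cooperate, 1 = defect.
PAYOFF = {(0, 0): 2.0, (0, 1): 0.0, (1, 0): 3.0, (1, 1): 1.0}

def simulate(params, n_rounds=30, seed=0):
    """Generate action and reward sequences for one artificial subject
    defined by (alpha, eta, q0_cooperate, q0_defect), playing against a
    tit-for-tat opponent."""
    alpha, eta, q0_c, q0_d = params
    q = [q0_c, q0_d]
    rng = random.Random(seed)
    opponent_action = 0  # tit-for-tat opens by cooperating
    actions, rewards = [], []
    for _ in range(n_rounds):
        # two-action Boltzmann policy reduces to a sigmoid
        p_coop = 1.0 / (1.0 + math.exp(-eta * (q[0] - q[1])))
        action = 0 if rng.random() < p_coop else 1
        reward = PAYOFF[(action, opponent_action)]
        q[action] += alpha * (reward - q[action])
        opponent_action = action  # tit-for-tat copies the subject's move
        actions.append(action)
        rewards.append(reward)
    return actions, rewards
```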
4.8.3 Estimation of subjects’ parameters
With the reward and action sequences, we were now able to infer the parameters of each model in order to test the parameter estimation function.
4.8.4 Percentage Error
The percentage error metric for each model is defined as in 4.7:

• Let $\Theta^x = (\theta^x_1, \theta^x_2, \ldots, \theta^x_n)$ be the array that contains all the experimental (estimated) parameters of the model.

• Let $\Theta^r = (\theta^r_1, \theta^r_2, \ldots, \theta^r_n)$ be the array that contains all the real parameters of the model.

• Let $\Delta = (\delta_1, \delta_2, \ldots, \delta_n)$ be the array that contains the δ values for $\Theta^r$, where $\delta_i$ is the difference between the maximum and minimum value of $\theta^r_i$'s domain (i.e. since $\alpha \in [0, 1]$, then $\delta_\alpha = 1$).

\[ E_\% = \frac{e_\%}{n} \tag{4.7} \]

\[ e_\% = \sum_{i=1}^{n} \frac{\left| \theta^x_i - \theta^r_i \right|}{\delta_i} \tag{4.8} \]
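Equations 4.7 and 4.8 translate directly into code; a small illustrative Python sketch:

```python
def percentage_error(estimated, real, domain_widths):
    """Mean normalized parameter error (eqs. 4.7 and 4.8): the absolute
    error of each parameter is scaled by the width of its domain and the
    result is averaged over the n parameters."""
    n = len(real)
    e = sum(abs(x - r) / d for x, r, d in zip(estimated, real, domain_widths))
    return e / n
```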
4.8.5 Test results
After running the test for all four models, the metrics showed the following results (table 4.1):
Model             | Mean RPE's Correlation | E%              | Mean Negative Log Likelihood
1 α               | 0.9055 ± 0.0061        | 0.3546 ± 0.0084 | 15.2974 ± 0.2446
1 α, optimal Q0   | 0.8920 ± 0.0032        | 0.3472 ± 0.0113 | 17.9174 ± 0.1860
2 α               | 0.8842 ± 0.0020        | 0.3476 ± 0.0018 | 15.4777 ± 0.0631
2 α, optimal Q0   | 0.8733 ± 0.0019        | 0.3329 ± 0.0039 | 18.2682 ± 0.0804

Table 4.1: Empirical test results for experiments with 30 rounds
Model             | Mean RPE's Correlation | E%              | Mean Negative Log Likelihood
1 α               | 0.9383 ± 0.0049        | 0.2872 ± 0.0102 | 55.7899 ± 0.6781
1 α, optimal Q0   | 0.9399 ± 0.0041        | 0.2968 ± 0.0051 | 59.9047 ± 0.3147
2 α               | 0.9194 ± 0.0014        | 0.3044 ± 0.0057 | 56.5325 ± 0.1624
2 α, optimal Q0   | 0.9162 ± 0.0016        | 0.2987 ± 0.0016 | 61.4934 ± 0.2329

Table 4.2: Empirical test results for 100 rounds
Interpreting these metrics, we can conclude that E% is around 34% for all models, while models with lower E% are not always associated with lower Negative Log Likelihoods. This means that the Negative Log Likelihood minimization function is not able to reliably infer the parameters of the model.
While this would be a problem if we depended solely on the accuracy of the parameters, we can still test our hypothesis provided that our model produces a sequence of RPE's that is strongly linearly correlated with the real sequence of RPE's. As can be seen in table 4.1, all models show high PCC's between the experimental and real RPE sequences (around 89%), which tells us that any correlation calculated between the BOLD response and the experimental RPE sequence will be very similar to the correlation between the BOLD response and the real RPE sequence for any individual.
Given this scenario, our top priority when choosing a model becomes choosing the one that provides the best experimental-real RPE correlation. According to the tendency illustrated in figure 4.6, we chose to model the behaviour of the participants with the one α model.
Figure 4.6: Experimental-real RPE correlation for various models
Consulting table 4.2, which shows the same empirical test as table 4.1 but with sequences of 100 plays instead of 30, we can see an increase in mean RPE correlation and a decrease in E%. This shows that increasing the number of times each participant plays in each round would provide data that would allow for better models.
4.9 Testing the Q-learning implementation
After defining the Q-learning model we ran a short test of our implementation to see if it showed
the expected learning potential. Fig. 4.7 shows a graph that details the learning process of a subject
regarding the Q-values for cooperation (Q(cooperation)) with the following parameters:
• α ≈ 0.3153
• η ≈ 1.9443
• Q0(cooperation) = 2
• Q0(defection) = 1
Figure 4.7: Graph showing the evolution of a subject’s Q-values for cooperation.
As can be seen, the Q-value maintains its initial value of 2 during the initial period, from iteration 0 to 5. After that, the rewards become more unstable and the Q-values orbit the average of the two possible rewards.
Fig. 4.8 shows a graph that details the learning process of a subject regarding the Q-values for
defection (Q(defection)) with the following parameters:
• α ≈ 0.4092
• η ≈ 2.6141
• Q0(cooperation) = 2
• Q0(defection) ≈ 1.5968
Figure 4.8: Graph showing the evolution of a subject’s Q-values for defection.
Again, the Q-values quickly converge to a stable reward and oscillate whenever the reward changes.
Fig. 4.9 shows the hypothetical Q-value trajectory that the subject from Fig. 4.7 would experience if its α value were artificially increased to 0.7.
Figure 4.9: Graph showing the evolution of a hypothetical subject’s Q-values for cooperation.
As can be seen, the Q-values chase the reward value much more aggressively since the α value was increased, showing that the RL algorithm is modeling the data as expected.
4.10 Chi-square test to test confounding effects of Round Order
After consulting the research team at IMM, we studied the possibility of removing from the ANOVA the variable RoundOrderClass, which identifies whether a subject played against a computer or a human in his first game, since this measure was not expected to provide useful insights into the data. It was important, however, to determine whether or not the variable had a confounding effect on other independent variables. For that, we performed a Chi-squared test to ensure it was safe to remove the RoundOrderClass variable.
Our Chi-squared test was performed under the null hypothesis “RoundOrderClass is independent of the remaining independent variables”.
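For a 2×2 cross-tabulation the Pearson chi-square statistic is easy to compute by hand, as the sketch below shows (illustrative Python; the counts used in the demonstration are hypothetical placeholders, not the real subject counts of figure 4.10, and the actual test was run in a statistics package):

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 contingency table given as
    [[a, b], [c, d]]; under independence it follows a chi-square
    distribution with 1 degree of freedom."""
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical counts of subjects cross-tabulated by Sex and RoundOrderClass.
stat = chi_square_2x2([[12, 11], [13, 12]])
independent = stat < 3.841  # 0.05 critical value, 1 degree of freedom
```

A statistic below the critical value means the null hypothesis of independence cannot be rejected.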
Figure 4.10: Table showing the distribution of subjects across dependent variables.
As seen in figure 4.10, the percentage of subjects that share the same RoundOrderClass is very similar across Sex and Drug groups, which is a strong indicator that it might be independent of these two variables.
Figure 4.11: Figure showing the results of the Chi-squared test.
Figure 4.11 shows that the significance levels for both values of the variable RoundOrderClass are well above 0.05, which doesn't allow us to reject the null hypothesis; in other words, we found no evidence that RoundOrderClass depends on any of the considered independent variables. This allows us to remove this variable from our design.
4.11 ANOVA design to test effects of Sex, Drug and Opponent
With the conclusions taken from our tests in sections 4.8 and 4.10, the between-subject factors (variables that differentiate subjects) considered in our ANOVA will be Sex and Drug, and the within-subject factor (a variable that measures changes in each subject over time) will be Opponent Type. The dependent variable will be a correlation between a measure taken from the Q-learning model (RPE or received Rewards) and the BOLD response elicited in the brain. It is important to note that the independent variable Opponent Type is considered a within-subject factor since all subjects performed an experiment while playing against each of the available opponent types (Human and Computer). Six different ANOVA's will be conducted, analyzing the Pearson's and the Spearman correlation between the BOLD response and another factor that will either be the RPE, positive-only RPE's or the Reward received at each time point during the experiment.
5 Results
Contents
5.1 Effects on the Pearson’s Correlation between RPE and the BOLD response 48
5.2 Effects on the Pearson’s Correlation between Reward and the BOLD re-
sponse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3 Effects on the Spearman Correlation between RPE and the BOLD response 51
5.4 Effects on the Spearman Correlation between Reward and the BOLD
response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.5 Effects on the Pearson’s Correlation between Positive RPE’s and the
BOLD response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.6 Effects on the Spearman Correlation between Positive RPE’s and the
BOLD response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.7 Results overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
In this section we elaborate on the results of the statistical tests performed on the data resulting from the correlation analysis between the BOLD response of the brain and several other measures obtained from our models. These statistical tests will determine whether our independent variables (subject Sex, administered Drug and Opponent Type) have any effect on the correlation factor (PCC or Spearman correlation) under study.
5.1 Effects on the Pearson’s Correlation between RPE and the
BOLD response
Firstly, we tested our main hypothesis, that there could be a correlation between the RPE and BOLD
response. We did this by analyzing the effects of the independent variables (i.e. subject Sex, administered
Drug and Opponent Type) on the dependent variable, the latter being the aforementioned correlation.
Figure 5.1: Table detailing interactions between the within subject factor and the between subjectfactors for RPE/BOLD Pearson’s Correlation.
As can be seen in figure 5.1 under the column "Sig." (significance), the within-subject factor Opponent Type and its interactions with the between-subject independent variables do not produce differences in the dependent variable with significance levels under 0.05. This means that we cannot reject the null hypothesis that the mean per-subject Pearson's correlation between the RPE and the BOLD response doesn't change as a function of these independent variable interactions.
Figure 5.2: Table detailing interactions between the between subject factors for RPE/BOLD Pear-son’s Correlation.
The table in figure 5.2 has the same layout as the one in figure 5.1 but instead describes the effects of the between-subject independent variables, both by themselves and when interacting with one another. Again, under the column "Sig." we can see that most variables and variable interactions don't produce statistically significant changes in the mean correlation between RPE and BOLD response, except for the independent variable Sex, which shows a p-value of 0.03. In figure 5.3 we can see this effect, as the mean PCC for female subjects is approximately 0.0042 while for male subjects it is -0.0331.
Figure 5.3: Main effect of the independent variable Sex on the RPE/BOLD Pearson’s Correlation
This shows that we can reject the null hypothesis that "Sex does not impact the Pearson's Correlation between RPE and BOLD response": males show a slight negative correlation while females show essentially no correlation.
5.2 Effects on the Pearson’s Correlation between Reward and
the BOLD response
Our second test analyzed the effects of the independent variables on the PCC between the Rewards received and the BOLD response felt by each subject each time a reward was received. This tested our second hypothesis, that there could be a correlation between received rewards and brain activation. This approach completely disregards the Q-learning model and focuses only on the input received by the subjects.
Figure 5.4: Table detailing interactions between the within subject factor and the between subjectfactors for Reward/BOLD Pearson’s Correlation.
As figure 5.4 shows under the column "Sig.", the within-subject factor Opponent Type and its interactions with the between-subject independent variables do not produce differences in the dependent variable with significance levels under 0.05. Again, we cannot reject the null hypothesis that the mean value of the dependent variable doesn't change as a function of these independent variable interactions.
Figure 5.5: Table detailing interactions between the between subject factors for Reward/BOLDPearson’s Correlation.
Looking at the table in figure 5.5, more specifically at the column "Sig.", we can see that no variables or variable interactions produce statistically significant changes in the mean correlation between rewards received and BOLD response. We maintain our null hypothesis that "Sex, administered Drug and Opponent Type have no effect on the Pearson's Correlation between received Rewards and the BOLD response".
5.3 Effects on the Spearman Correlation between RPE and the
BOLD response
For our third test we analyzed the effects of the independent variables on the Spearman correlation
between the RPE and the BOLD response felt by each subject during each experiment. This is our third
hypothesis and, while being similar to our first, tests the existence of a Spearman correlation to try and
find non-linear relationships between the RPE and brain activation.
Figure 5.6: Table detailing interactions between the within subject factor and the between subjectfactors for RPE/BOLD Spearman Correlation.
As can be seen in figure 5.6 under the column "Sig.", the within-subject factor Opponent Type and its interactions with the between-subject independent variables do not produce differences in the dependent variable with significance levels under 0.05. Again, we cannot reject the null hypothesis that the mean value of the dependent variable doesn't change as a function of these independent variable interactions.
Figure 5.7: Table detailing interactions between the between subject factors for RPE/BOLDSpearman Correlation.
As with the PCC between RPE's and BOLD responses (analyzed in section 5.1), we can see in figure 5.7 that the independent variable Sex is the only one that produces a statistically significant change in the mean Spearman correlation, allowing us to reject the null hypothesis with a p-value of 0.031. This shows that we can reject the hypothesis that "Sex does not impact the Spearman correlation between RPE and BOLD response".
Figure 5.8: Main effect of the independent variable Sex on RPE/BOLD Spearman Correlation
Figure 5.8 shows a decrease in Spearman correlation with Sex: female subjects have a correlation of 0.0018 while male subjects show a correlation of -0.033.
5.4 Effects on the Spearman Correlation between Reward and
the BOLD response
Our fourth test analyzed the effects of the independent variables on the Spearman correlation between the Rewards received and the BOLD response felt by each subject each time a reward was received. Again, as with our second hypothesis, this disregards the Q-learning model and solely looks for non-linear relationships between Rewards received and the BOLD response.
Figure 5.9: Table detailing interactions between the within subject factor and the between subjectfactors for Reward/BOLD Spearman Correlation.
Figure 5.9 shows us, under the column "Sig.", that the interaction between the within-subject factor Opponent Type and the between-subject factor Sex produces a statistically significant change in the mean Spearman correlation. This allows us to reject the null hypothesis that the mean value of the dependent variable doesn't change as a function of the interaction between Opponent Type and subject Sex.
Figure 5.10: Table detailing interactions between the between subject factors for Reward/BOLDSpearman Correlation.
Figure 5.10 shows that no between factor interaction produces a statistically significant change in
mean Spearman correlation.
Figure 5.11: Effect of the (Sex x Opponent Type) interaction on Reward/BOLD Spearman Cor-relation
Figure 5.11 details the interaction between Sex and Opponent Type. The statistically significant
difference occurs between the correlation in female subjects playing against a CPU opponent (where
they show an average Spearman correlation of 0.023) and in male subjects also playing against a CPU
opponent (where they show an average Spearman correlation of -0.045). Thus, we reject the null
hypothesis that "The interaction between subject Sex and Opponent Type does not impact the Spearman
Correlation between received Rewards and BOLD response".
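An interaction like this one can also be understood as a difference of differences: each subject contributes a CPU-minus-Human difference score, and the interaction asks whether that difference itself differs between the sexes. The sketch below illustrates this with hypothetical values; the thesis itself tested the interaction through the repeated-measures ANOVA shown in the figures.

```python
# Illustrative sketch only (hypothetical data): a (Sex x Opponent Type)
# interaction on a within-subject measure can be screened by comparing
# each subject's CPU-minus-Human difference score between Sex groups.
from scipy import stats

# Hypothetical per-subject Reward/BOLD Spearman correlations
female_cpu = [0.03, 0.02, 0.01, 0.04, 0.02]
female_human = [0.00, 0.01, -0.01, 0.02, 0.00]
male_cpu = [-0.05, -0.04, -0.06, -0.03, -0.05]
male_human = [0.00, -0.01, 0.01, 0.00, -0.01]

# Difference scores collapse the within-subject factor into one value
female_diff = [c - h for c, h in zip(female_cpu, female_human)]
male_diff = [c - h for c, h in zip(male_cpu, male_human)]

# A between-group t-test on the difference scores targets the interaction
t_stat, p_value = stats.ttest_ind(female_diff, male_diff)
print(f"interaction t = {t_stat:.2f}, p = {p_value:.4f}")
```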
5.5 Effects on the Pearson's Correlation between Positive RPEs
and the BOLD response
For our fifth test we analyzed the effects of the independent variables on the Pearson's correlation
between the Positive RPEs and the BOLD response measured in each subject each time a reward was received.
With this fifth hypothesis, we try to determine whether our brains respond differently when we are
positively surprised (positive RPEs).
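The quantity tested here can be sketched in a few lines: keep only the trials with a positive reward prediction error, then compute the Pearson correlation between those RPEs and the matching BOLD responses. All values below are hypothetical, for illustration only.

```python
# Hedged sketch of the dependent variable in this test (hypothetical data):
# restrict to positively surprising trials (RPE > 0), then correlate
# RPE magnitude with the BOLD response on those trials.
from scipy import stats

rpes = [1.2, -0.5, 0.8, -1.0, 0.3, 2.0, -0.2, 1.5]   # hypothetical per-trial RPEs
bold = [0.4, 0.1, 0.3, -0.2, 0.2, 0.7, 0.0, 0.5]     # matching BOLD responses

# Keep only positively surprising trials
pos = [(e, b) for e, b in zip(rpes, bold) if e > 0]
pos_rpes = [e for e, _ in pos]
pos_bold = [b for _, b in pos]

# Pearson's coefficient captures the linear relation on the retained trials
r, p = stats.pearsonr(pos_rpes, pos_bold)
print(f"Pearson r over {len(pos)} positive-RPE trials: {r:.3f}")
```

The sixth test (section 5.6) is the same construction with Spearman's coefficient in place of Pearson's, relaxing linearity to monotonicity.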
Figure 5.12: Table detailing interactions between the within subject factor and the between subject factors for Positive RPE/BOLD Pearson's Correlation.
Figure 5.12 shows, under the column "Sig.", that the within-subject factor Opponent Type produces a
statistically significant change in mean Pearson's correlation. This allows us to reject the null hypothesis
that the mean value of the dependent variable does not change as a function of Opponent Type.
Figure 5.13: Table detailing interactions between the between subject factors for Positive RPE/BOLD Pearson's Correlation.
Figure 5.13 shows that no between-subject factor or factor interaction produces a statistically
significant change in mean Pearson's correlation.
Figure 5.14: Main Effect of the independent variable Opponent Type on Positive RPE/BOLD Pearson's Correlation.
Figure 5.14 shows the effect of the independent variable Opponent Type on the dependent variable.
The statistically significant difference occurs between subjects playing against a Human opponent
(where they show an average PCC of 0.042) and subjects playing against a CPU opponent (where they show
an average PCC of -0.036). We therefore reject the null hypothesis that "The subject's Opponent Type
does not impact the Pearson's Correlation between Positive RPEs and BOLD response".
5.6 Effects on the Spearman Correlation between Positive RPEs
and the BOLD response
Finally, our last test analyzed the effects of the independent variables on the Spearman Correlation
between the Positive RPEs and the BOLD response measured in each subject each time a reward was received.
As with our fifth hypothesis, we try to determine whether our brains respond differently when we are
positively surprised (positive RPEs), this time looking for a monotonic, not necessarily linear,
relationship between Positive RPEs and BOLD.
Figure 5.15: Table detailing interactions between the within subject factor and the between subject factors for Positive RPE/BOLD Spearman Correlation.
As can be seen in figure 5.15 under the column "Sig.", neither the within-subject factor Opponent Type
nor its interactions with the between-subject independent variables produce differences in the dependent
variable at significance levels under 0.05. We cannot reject the null hypothesis that the mean value of
the dependent variable does not change as a function of these independent variables and their interactions.
Figure 5.16: Table detailing interactions between the between subject factors for Positive RPE/BOLD Spearman Correlation.
Looking at the table in figure 5.16, specifically at the column "Sig.", we can see that no variables
or variable interactions produce statistically significant changes in the mean Spearman correlation
between Positive RPEs and the BOLD response. We therefore maintain our null hypothesis that "Sex,
administered Drug and Opponent Type have no effect on the Spearman Correlation between Positive RPEs
and the BOLD response".
5.7 Results overview
Figure 5.17 presents a short overview of the obtained results.
Figure 5.17: Results overview table.
6 Discussion
Contents
6.1 Main effect of Subject Sex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.2 Main effect of Opponent Type . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6.3 Effect of the (Opponent Type x Subject Sex) interaction . . . . . . . . . . 59
6.4 Main effect of Drug . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
In this chapter we discuss the obtained results and the conclusions we draw from them.
6.1 Main effect of Subject Sex
Whether we analyze the effects of the independent variables on the Pearson's or the Spearman correlation
of the RPE with the BOLD response, there are differences between female and male subjects. The female
subjects show a near-zero correlation between these two factors, while male subjects show a negative
correlation (both Pearson's and Spearman). This difference could indicate that, in males, activation of
the ROI goes down when rewards surprise the subject in a positive way (positive RPE) and goes up when
they surprise the subject in a negative way (negative RPE). Previous work has shown increased reward
centre activation in men compared to women, particularly when receiving monetary rewards [2] [18].
While differences in activation are therefore expected, it is surprising to see correlations in opposite
directions. No difference between sexes was registered when analyzing only the positive RPEs. Our results
seem to indicate that, in men, the striatum increases in activity particularly when rewards fall below
expectations. This is surprising and should be studied further.
6.2 Main effect of Opponent Type
When analyzing the Pearson's Correlation between the Positive Reward Prediction Errors and the
activation of the ROI, there was a difference between subjects facing a Computer and subjects facing a
putative Human opponent. When facing a Human, subjects show a positive correlation between positive
RPEs and the BOLD response, while, when facing a computer, this correlation turns negative. This means
that activation of the ROI seems to be facilitated by the experience of positive RPEs when facing humans,
while the same experience against a computer produces the opposite effect. This is surprising, as
previous research seems to indicate the existence of neural structures in the striatum that process the
gain of social and non-social rewards in the same way [19].
6.3 Effect of the (Opponent Type x Subject Sex) interaction
When analyzing the Spearman Correlation between the Rewards received (regardless of the subject's
internal state) and the activation of the ROI, there is a difference between female and male subjects,
particularly when they are playing against a computer opponent. Again, the female subjects show a
near-zero correlation, while the male subjects show a negative correlation that could indicate a lowered
activation of the ROI when positive rewards are received. The same does not appear to occur when
the subjects think they are facing a human opponent. In that situation, neither male nor female subjects'
correlations differ significantly from zero. This indicates both that the difference detailed in section
6.2 mainly applies to male subjects and that the changes in correlation for men and women when changing
opponent type go in opposite directions, further accentuating the differences discussed in section 6.1.
6.4 Main effect of Drug
No main effects of any drug were measured in our analysis. Although Oxytocin has been shown to
affect the activation of the reward centre during both social and non-social learning [20], other studies
show that for non-social learning, Oxytocin seems to have no effect on the activation of the Nucleus
Accumbens (which is a part of the Striatum) [21]. This could explain the lack of apparent effect of the
drug in our analysis.
Bibliography
[1] J. K. Rilling, A. C. DeMarco, P. D. Hackett, R. Thompson, B. Ditzen, R. Patel, and G. Pagnoni,
“Effects of intranasal oxytocin and vasopressin on cooperative behavior and associated brain
activity in men,” Psychoneuroendocrinology, vol. 37, no. 4, pp. 447–461, 2012. [Online]. Available:
http://dx.doi.org/10.1016/j.psyneuen.2011.07.013
[2] J. K. Rilling, A. C. DeMarco, P. D. Hackett, X. Chen, P. Gautam, S. Stair, E. Haroon,
R. Thompson, B. Ditzen, R. Patel, and G. Pagnoni, “Sex differences in the neural
and behavioral response to intranasal oxytocin and vasopressin during human social
interaction,” Psychoneuroendocrinology, vol. 39, no. 1, pp. 237–248, 2014. [Online]. Available:
http://dx.doi.org/10.1016/j.psyneuen.2013.09.022
[3] Y. Niv, “Reinforcement learning in the brain,” Journal of Mathematical Psychology, vol. 53, no. 3,
pp. 139–154, 2009. [Online]. Available: https://www.princeton.edu/~yael/Publications/Niv2009.pdf
[4] D. Ferreira, M. Lopes, J. Rilling, M. Antunes, and D. Prata, “The impact of oxytocin and vasopressin
intake on Prisoner’s Dilemma strategy: a computational modelling approach,” 2017.
[5] National Institute of Standards and Technology, “Two-Sample t-Test for Equal Means,” 2013.
[Online]. Available: http://www.itl.nist.gov/div898/handbook/eda/section3/eda353.htm
[6] National Institute of Standards and Technology, "Critical Values of the Student's t Distribution," sec. 1.3.6.7.2,
2013. [Online]. Available: http://www.itl.nist.gov/div898/handbook/eda/section3/eda3672.htm
[7] E. Ostertagova and O. Ostertag, “Methodology and Application of Oneway ANOVA,” American
Journal of Mechanical Engineering, vol. 1, no. 7, pp. 256–261, 2013. [Online]. Available:
http://pubs.sciepub.com/ajme/1/7/21/index.html
[8] “N-Way ANOVA - MATLAB & Simulink.” [Online]. Available: https://www.mathworks.com/help/
stats/n-way-anova.html
[9] “Constrained Nonlinear Optimization Algorithms - MATLAB & Simulink.” [Online]. Available:
http://www.mathworks.com/help/optim/ug/constrained-nonlinear-optimization-algorithms.html
[10] R. J. Vanderbei, Linear Programming: Foundations and Extensions, 1998, vol. 49, no. 1. [Online].
Available: http://link.springer.com/10.1057/palgrave.jors.2600987
[11] M. Jenkinson, C. F. Beckmann, T. E. Behrens, M. W. Woolrich, and S. M. Smith, "FSL,"
NeuroImage, vol. 62, no. 2, pp. 782–790, Aug 2012. [Online]. Available:
http://linkinghub.elsevier.com/retrieve/pii/S1053811911010603
[12] N. D. Daw, Y. Niv, and P. Dayan, “Actions, policies, values and the basal gan-
glia,” Recent breakthroughs in basal ganglia research, no. February, pp. 91–106, 2005.
[Online]. Available: https://www.semanticscholar.org/paper/Actions-%2C-Policies-%2C-Values-%
2C-and-the-Basal-Ganglia-Daw-Niv/c9ee2d772062e7d0886ba5fc308a59a00862163e
[13] R. Clark-Elford, P. J. Nathan, B. Auyeung, V. Voon, A. Sule, U. Muller, R. Dudas, B. J. Sahakian,
K. L. Phan, and S. Baron-Cohen, “The effects of oxytocin on social reward learning in humans,”
International Journal of Neuropsychopharmacology, vol. 17, no. 2, pp. 199–209, 2014.
[14] B. B. Doll, K. G. Bath, N. D. Daw, and M. J. Frank, “Variability in Dopamine Genes Dissociates
Model-Based and Model-Free Reinforcement Learning,” Journal of Neuroscience, vol. 36, no. 4,
pp. 1211–1222, 2016. [Online]. Available: http://www.jneurosci.org/cgi/doi/10.1523/JNEUROSCI.
1901-15.2016
[15] J. P. O’Doherty, A. Hampton, and H. Kim, “Model-based fMRI and its application to reward learning
and decision making,” Annals of the New York Academy of Sciences, vol. 1104, pp. 35–53, 2007.
[16] S. M. McClure, G. S. Berns, and P. R. Montague, “Temporal prediction errors in a passive learning
task activate human striatum,” Neuron, vol. 38, no. 2, pp. 339–346, 2003. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S0896627303001545
[17] M. J. Frank, A. A. Moustafa, H. M. Haughey, T. Curran, and K. E. Hutchison, “Genetic
triple dissociation reveals multiple roles for dopamine in reinforcement learning,” Proceedings of
the National Academy of Sciences, vol. 104, no. 41, pp. 16311–16316, 2007. [Online]. Available:
http://www.pnas.org/content/104/41/16311
[18] G. Alarcon, A. Cservenka, and B. J. Nagel, “Adolescent neural response to reward is related to
participant sex and task motivation,” Brain and Cognition, vol. 111, pp. 51–62, 2017.
[19] S. J. Wake and K. Izuma, “A common neural code for social and monetary rewards in the human
striatum,” Social Cognitive and Affective Neuroscience, vol. 12, no. 10, pp. 1558–1564, 2017.
[20] J. Hu, S. Qi, B. Becker, L. Luo, S. Gao, Q. Gong, R. Hurlemann, and K. M. Kendrick, “Oxytocin
selectively facilitates learning with social feedback and increases activity and functional connectivity
in emotional memory and reward processing regions,” Human Brain Mapping, vol. 36, no. 6, pp.
2132–2146, 2015.
[21] B. J. Mickey, J. Heffernan, C. Heisel, M. Pecina, D. T. Hsu, J. K. Zubieta, and
T. M. Love, “Oxytocin modulates hemodynamic responses to monetary incentives in
humans,” Psychopharmacology, vol. 233, no. 23-24, pp. 3905–3919, 2016. [Online]. Available:
http://dx.doi.org/10.1007/s00213-016-4423-6