
Applicability of Interactive Genetic Algorithms to

Multi-agent Systems: Experiments on Games Used

in Smart Grid Simulations.

by

Yomna Mahmoud Ibrahim Hassan

A Thesis Presented to the

Masdar Institute of Science and Technology

in Partial Fulfillment of the Requirements for the Degree of

Master of Science

in

Computing and Information Science

© 2011 Masdar Institute of Science and Technology

All rights reserved


AUTHOR’S DECLARATION

I understand that copyright in my thesis is transferred to Masdar Institute of Science

and Technology.

ACCEPTANCE DECLARATION

This thesis has been accepted and approved by Masdar Institute of Science and

Technology on August 01, 2011.

EXAMINATION COMMITTEE MEMBERS

Jacob Crandall, Advisor, Masdar Institute of Science and Technology

Davor Svetinovic, Masdar Institute of Science and Technology

Iyad Rahwan, Masdar Institute of Science and Technology


Abstract

A common goal of many organizations over the next decades is to enhance the efficiency of electrical power grids. This entails: (1) modifying the power grid structure to utilize the available resources in the best way possible, and (2) introducing new energy sources that are able to benefit from the surrounding circumstances. The trend toward the use of renewable energy sources requires the development of power systems that are able to accommodate variability and intermittency in electricity generation. Therefore, these power grids, usually called "smart grids," must be dynamic enough to adapt smoothly to changes in the environment and human preferences.

In a smart grid, each decision maker can be represented as an intelligent agent that consumes or produces electricity. Each agent interacts with other agents and the surrounding environment. The goal of these agents may vary between maintaining the stability of electricity in the grid on the generation side and increasing users' satisfaction with the electricity service on the consumers' side (which is our focus). This is done through interaction between different agents to schedule and divide the tasks of consumption and generation among each other, depending on the need and the type of each agent.

In this thesis, we investigate the use of interactive genetic algorithms to derive intelligent behavior that enables an agent on the consumer's side to consume the proper amount of electricity to satisfy human preferences. This behavior must take into account the existence of other agents within the system, which increases the dynamicity of the system. In order to evaluate the effectiveness of the suggested algorithms within a multi-agent setting, we test our algorithms in repeated matrix games, both against other copies of themselves and against other known multi-agent learning algorithms. We run different variations of the genetic algorithm, with and without human input, in order to determine the factors that affect the performance of the algorithm within a dynamic multi-agent system. Our results show reasonable potential for using genetic algorithms in such circumstances, particularly when they utilize effective human input.


This research was supported by the Government of Abu Dhabi to help fulfill the vision of the late President Sheikh Zayed Bin Sultan Al Nahyan for sustainable development and empowerment of the UAE and humankind.


Acknowledgments

I would first like to thank Masdar for giving us the opportunity to conduct our research, and for providing an environment suited to carrying it out successfully.

I would also like to express my gratitude towards my committee members, starting with my advisor, Professor Jacob Crandall, for his support, encouragement and continuous enthusiasm about our research over the past two years. Furthermore, I would like to thank Dr. Davor Svetinovic and Dr. Iyad Rahwan for their feedback.

Last, but never least, I would like to thank my family and friends for their continuous support and their belief in me.

Yomna Mahmoud Ibrahim Hassan,

Masdar City, August 1, 2011.


Contents

1 Introduction 1

1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Motivation and Relevance to the Masdar Initiative . . . . . . . . . 2

1.3 Thesis statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Literature Review 7

2.1 Electrical Power grids . . . . . . . . . . . . . . . . . . . . . . . . 8

2.1.1 Smart grids . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 Multi-agent systems . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3 Matrix games . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3.1 Types of matrix games . . . . . . . . . . . . . . . . . . . 12

2.3.2 Solution concepts . . . . . . . . . . . . . . . . . . . . . . 13

2.3.3 Repeated matrix games . . . . . . . . . . . . . . . . . . . 15

2.3.4 Stochastic games . . . . . . . . . . . . . . . . . . . . . . 17

2.4 Learning in repeated matrix games . . . . . . . . . . . . . . . . . 17


2.4.1 Belief-based learning . . . . . . . . . . . . . . . . . . . . 18

2.4.2 No-regret learning . . . . . . . . . . . . . . . . . . . . . 18

2.4.3 Reinforcement learning . . . . . . . . . . . . . . . . . . . 18

2.5 Evolutionary algorithms . . . . . . . . . . . . . . . . . . . . . . 19

2.5.1 Genetic algorithms . . . . . . . . . . . . . . . . . . . . . 19

2.5.2 Genetic algorithm structure . . . . . . . . . . . . . . . . . 20

2.6 Genetic algorithms in repeated matrix games . . . . . . . . . . . 27

2.6.1 Genetic algorithms in distributed systems . . . . . . . . . 29

2.6.2 Genetic algorithms in dynamic systems . . . . . . . . . . 29

2.7 Interactive learning . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.7.1 Interactive learning in repeated matrix games . . . . . . . 32

2.7.2 Interactive genetic algorithms . . . . . . . . . . . . . . . 32

2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3 Experimental Setup 36

3.1 Games’ structure . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.1.1 Prisoner’s dilemma . . . . . . . . . . . . . . . . . . . . . 37

3.1.2 Chicken . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.1.3 Shapley’s game . . . . . . . . . . . . . . . . . . . . . . . 38

3.1.4 Cooperative games . . . . . . . . . . . . . . . . . . . . . 38

3.2 Knowledge and Information . . . . . . . . . . . . . . . . . . . . 39

3.3 Opponents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.3.1 GIGA-WoLF . . . . . . . . . . . . . . . . . . . . . . . . 40

3.3.2 Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.4 Evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.5 Performance of GIGA-WoLF and Q-learning . . . . . . . . . . . 43

3.5.1 Prisoner’s dilemma . . . . . . . . . . . . . . . . . . . . . 43

3.5.2 Chicken . . . . . . . . . . . . . . . . . . . . . . . . . . . 43


3.5.3 Shapley’s game . . . . . . . . . . . . . . . . . . . . . . . 44

3.5.4 Cooperative games . . . . . . . . . . . . . . . . . . . . . 44

3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4 Learning using Genetic Algorithms 47

4.1 Algorithm structure . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.1.1 Basic genetic algorithm . . . . . . . . . . . . . . . . . . . 51

4.1.2 Genetic algorithm with history propagation . . . . . . . . 52

4.1.3 Genetic algorithm with stopping condition . . . . . . . . 52

4.1.4 Genetic algorithm with dynamic parameters’ setting . . . 52

4.1.5 Genetic algorithm with dynamic parameters’ setting and

stopping condition . . . . . . . . . . . . . . . . . . . . . 53

4.2 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.2.1 Genetic algorithms vs. GIGA-WoLF . . . . . . . . . . . . 55

4.2.2 Genetic algorithms Vs. Q-learning . . . . . . . . . . . . . 56

4.2.3 Genetic algorithms in self play . . . . . . . . . . . . . . . 59

4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

5 Interactive genetic algorithms 63

5.1 Human input framework . . . . . . . . . . . . . . . . . . . . . . 64

5.1.1 Evaluate the population . . . . . . . . . . . . . . . . . . . 64

5.1.2 Select set of histories . . . . . . . . . . . . . . . . . . . . 66

5.1.3 Generate statistics for selected histories . . . . . . . . . . 67

5.1.4 Generating a new population from human input . . . . . . 68

5.2 Interactive genetic algorithms: Six variations . . . . . . . . . . . 69

5.2.1 Effect of input quality on the performance of GA . . . . . 69

5.3 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.3.1 Interactive genetic algorithms . . . . . . . . . . . . . . . 73


5.3.2 Modifications on interactive genetic algorithms . . . . . . 78

5.3.3 The effect of human input quality on interactive genetic

algorithms . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

6 IGA in N-player matrix games 90

6.1 N-player prisoner’s dilemma . . . . . . . . . . . . . . . . . . . . 90

6.2 Strategy representation . . . . . . . . . . . . . . . . . . . . . . . 92

6.3 Human input . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

6.4 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . 94

6.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

7 Conclusions and Future work 99

7.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

7.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103


List of Tables

2.1 Payoff matrix for the Prisoner’s dilemma. . . . . . . . . . . . . . 12

2.2 Payoff matrix for the Prisoner’s dilemma. . . . . . . . . . . . . . 17

3.1 Payoff matrix for the Prisoner’s dilemma. . . . . . . . . . . . . . 37

3.2 Payoff matrix for chicken game. . . . . . . . . . . . . . . . . . . 38

3.3 Payoff matrix of Shapley’s game. . . . . . . . . . . . . . . . . . . 38

3.4 Payoff matrix of a fully cooperative matrix game. . . . . . . . . . 39

4.1 Variables used within the algorithms . . . . . . . . . . . . . . . . 49

4.2 Payoff matrix for the Prisoner’s dilemma. . . . . . . . . . . . . . 56

4.3 Payoff of a fully cooperative matrix game. . . . . . . . . . . . . . 56

4.4 Payoff matrix for chicken game. . . . . . . . . . . . . . . . . . . 56

4.5 Payoff matrix of Shapley’s game. . . . . . . . . . . . . . . . . . . 57

5.1 Properties of the different variations of IGA algorithms. . . . . . . 70

5.2 Acceptable and unacceptable human inputs for the selected 2-agent

matrix games. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70


List of Figures

2.1 A traditional Electrical Grid [8]. . . . . . . . . . . . . . . . . . . 9

2.2 A smart electrical grid [8]. . . . . . . . . . . . . . . . . . . . . . 10

2.3 Payoff space for the prisoner’s dilemma game [25]. . . . . . . . . 16

2.4 Roulette wheel selection mechanism [30]. . . . . . . . . . . . . . 22

2.5 Crossover in genetic algorithms. . . . . . . . . . . . . . . . . . . 23

2.6 Mutation in genetic algorithms. . . . . . . . . . . . . . . . . . . . 24

2.7 Basic structure of genetic algorithms. . . . . . . . . . . . . . . . 25

2.8 The interactive artificial learning process [27]. . . . . . . . . . . . 31

3.1 Payoffs of GIGA-WoLF and Q-learning within selected games. . . 46

4.1 Chromosome structure . . . . . . . . . . . . . . . . . . . . . . . 50

4.2 Effect of variations on GA on final payoffs against GIGA-WoLF

(all games). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.3 Effect of variations on GA on final payoffs against GIGA-WoLF in

prisoner’s dilemma. . . . . . . . . . . . . . . . . . . . . . . . . . 58


4.4 Effect of variations on GA on final payoffs against GIGA-WoLF in

cooperation game. . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.5 Effect of variations on GA on final payoffs against GIGA-WoLF in

Chicken game. . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.6 Effect of variations on GA on final payoffs against GIGA-WoLF in

Shapley’s game. . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.7 Effect of variations on GA on final payoffs against Q-learning (all

games). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.8 Effect of variations on GA on final payoffs against Q-learning in

prisoner’s dilemma. . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.9 Effect of variations on GA on final payoffs against Q-learning in a

Cooperation game. . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.10 Effect of variations on GA on final payoffs against Q-learning in

Chicken game. . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.11 Effect of variations on GA on final payoffs against Q-learning in

Shapley’s. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.12 Sample of the chromosomes generated vs. Q-learning in prisoner’s

dilemma. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.13 Effect of history propagation on GA against Q-learning in Prisoner's

dilemma. Values shown are the average payoff per generation. 61

4.14 Effect of variations on GA against Q-learning in cooperation game

(Average payoff per generation). . . . . . . . . . . . . . . . . . . 61

4.15 Effect of variations on GA on final payoffs in self play (all games) 61

4.16 Effect of variations on GA on final payoffs in prisoner’s dilemma . 62

4.17 Effect of variations on GA on final payoffs in a Cooperation game 62

4.18 Effect of variations on GA on final payoffs in Chicken game . . . 62

4.19 Effect of variations on GA on final payoffs in Shapley’s game . . . 62


5.1 Human input framework. . . . . . . . . . . . . . . . . . . . . . . 65

5.2 Designed graphical user interface. . . . . . . . . . . . . . . . . . 66

5.3 Evaluation metrics example . . . . . . . . . . . . . . . . . . . . 72

5.4 Effect of human input on basic IGA against GIGA-WoLF (all games). 73

5.5 Effect of basic human input on the performance (final payoff) of

GA vs. GIGA-WoLF in prisoner’s dilemma. . . . . . . . . . . . . 74

5.6 Effect of basic human input on the performance (final payoff) of

GA vs. GIGA-WoLF in a cooperation game. . . . . . . . . . . . . 74

5.7 Effect of basic human input on the performance (final payoff) of

GA vs. GIGA-WoLF in chicken game. . . . . . . . . . . . . . . . 74

5.8 Effect of basic human input on the performance (final payoff) of

GA vs. GIGA-WoLF in Shapley’s game. . . . . . . . . . . . . . . 74

5.9 Effect of basic human input on GA against GIGA-WoLF in Shap-

ley’s (Average payoff per generation). . . . . . . . . . . . . . . . 75

5.10 Effect of basic human input on GA on final payoffs against Q-

learning (all games). . . . . . . . . . . . . . . . . . . . . . . . . 75

5.11 A sample of the chromosomes generated from basic IGA vs. Q-

learning in prisoner’s dilemma. . . . . . . . . . . . . . . . . . . . 76

5.12 Effect of basic human input on the performance (final payoff) of

GA vs. Q-learning in prisoner’s dilemma. . . . . . . . . . . . . . 76

5.13 Effect of basic human input on the performance (final payoff) of

GA vs. Q-learning in a cooperation game. . . . . . . . . . . . . . 76

5.14 Effect of basic human input on the performance (final payoff) of

GA vs. Q-learning in chicken game. . . . . . . . . . . . . . . . . 76

5.15 Effect of basic human input on the performance (final payoff) of

GA vs. Q-learning in shapley’s game. . . . . . . . . . . . . . . . 76


5.16 Effect of basic human input on GA on final payoffs in self-play (all

games). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.17 Effect of basic human input on GA in self-play in chicken game

(Average payoff per generation). . . . . . . . . . . . . . . . . . . 77

5.18 Effect of basic human input on the performance (final payoff) of

GA in self-play in prisoner’s dilemma. . . . . . . . . . . . . . . . 78

5.19 Effect of basic human input on the performance (final payoff) of

GA in self-play in a cooperation game. . . . . . . . . . . . . . . . 78

5.20 Effect of basic human input on the performance (final payoff) of

GA in self-play in chicken game. . . . . . . . . . . . . . . . . . . 78

5.21 Effect of basic human input on the performance (final payoff) of

GA in self-play in shaply’s game. . . . . . . . . . . . . . . . . . . 78

5.22 Effect of variations on human input on GA against GIGA-WoLF

(all games). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.23 Effect of variations on IGA against GIGA-WoLF in prisoner's dilemma

(Average payoff per generation). . . . . . . . . . . . . . . . . . . 79

5.24 Effect of variations on IGA on final payoffs vs. GIGA-WoLF in

prisoner’s dilemma. . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.25 Effect of variations on IGA on final payoffs vs. GIGA-WoLF in a

cooperation game. . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.26 Effect of variations on IGA on final payoffs vs. GIGA-WoLF in

Chicken game. . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.27 Effect of variations on IGA on final payoffs vs. GIGA-WoLF in

Shapley’s game. . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.28 Effect of variations on IGA on final payoffs vs. Q-learning (all

games). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81


5.29 Effect of variations on IGA on final payoffs vs. Q-learning in pris-

oner’s dilemma. . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.30 Effect of variations on IGA on final payoffs vs. Q-learning in a

cooperation game. . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.31 Effect of variations on IGA on final payoffs vs. Q-learning in

chicken game. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.32 Effect of variations on IGA on final payoffs vs. Q-learning in Shap-

ley’s game. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.33 Effect of variations on IGA on final payoffs in self-play (all games). 83

5.34 Effect of variations on IGA on final payoffs in self-play in pris-

oner’s dilemma. . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.35 Effect of variations on IGA on final payoffs in self-play in a coop-

eration game. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.36 Effect of variations on IGA on final payoffs in self-play in chicken

game. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.37 Effect of variations on IGA on final payoffs in self-play in Shap-

ley's game. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.38 Human input quality and its effect on final payoffs of IGA against

GIGA-WoLF (all games). . . . . . . . . . . . . . . . . . . . . . . 85

5.39 Human input quality and its effect on final payoffs of IGA vs.

GIGA-WoLF in prisoner’s dilemma. . . . . . . . . . . . . . . . . 85

5.40 Human input quality and its effect on final payoffs of IGA vs.

GIGA-WoLF in a cooperation game. . . . . . . . . . . . . . . . . 85

5.41 Human input quality and its effect on final payoffs of IGA vs.

GIGA-WoLF in chicken game. . . . . . . . . . . . . . . . . . . . 85

5.42 Human input quality and its effect on final payoffs of IGA vs.

GIGA-WoLF in shapley’s game. . . . . . . . . . . . . . . . . . . 85


5.43 Human input quality and its effect on IGA vs. GIGA-WoLF in

prisoner's dilemma (Average payoff per generation). . . . . . . . . 86

5.44 Effect of human input quality and its effect on IGA against GIGA-

WoLF in cooperation (Average payoff per generation). . . . . . . 86

5.45 Human input quality and its effect on final payoffs of IGA vs. Q-

learning (all games). . . . . . . . . . . . . . . . . . . . . . . . . 86

5.46 Human input quality and its effect on final payoffs of IGA vs. Q-

learning in prisoner's dilemma (per generation). . . . . . . . . . . . 87

5.47 Human input quality and its effect on final payoffs of IGA vs. Q-

learning in prisoner’s dilemma. . . . . . . . . . . . . . . . . . . . 87

5.48 Human input quality and its effect on final payoffs of IGA vs. Q-

learning in a cooperation game. . . . . . . . . . . . . . . . . . . . 87

5.49 Human input quality and its effect on final payoffs of IGA vs. Q-

learning in chicken game. . . . . . . . . . . . . . . . . . . . . . . 87

5.50 Human input quality and its effect on final payoffs of IGA vs. Q-

learning in shapley’s game. . . . . . . . . . . . . . . . . . . . . . 87

5.51 Human input quality and its effect on final payoffs of IGA in self-

play (all games). . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5.52 Human input quality and its effect on final payoffs of IGA in self-

play in prisoner’s dilemma. . . . . . . . . . . . . . . . . . . . . . 88

5.53 Human input quality and its effect on final payoffs of IGA in self-

play in a cooperation game. . . . . . . . . . . . . . . . . . . . . . 88

5.54 Human input quality and its effect on final payoffs of IGA in self-

play in chicken game. . . . . . . . . . . . . . . . . . . . . . . . . 88

5.55 Human input quality and its effect on final payoffs of IGA in self-

play in shapley’s game. . . . . . . . . . . . . . . . . . . . . . . . 88

6.1 Payoff matrix of the 3-player prisoner’s dilemma. . . . . . . . . . 91


6.2 Relationship between the fraction of cooperators and the utility re-

ceived by a game participant. . . . . . . . . . . . . . . . . . . . . 92

6.3 Final payoffs of selected opponents in 3-player prisoner’s dilemma. 93

6.4 Effect of human input on the performance of GA in the 3-player

prisoner’s dilemma in self-play. . . . . . . . . . . . . . . . . . . . 95

6.5 Effect of human input on the performance of GA in the 3-player

prisoner’s dilemma with 1-player as GIGA-WoLF and 1-player in

self-play. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

6.6 Effect of human input on the performance of GA in 3-player pris-

oner’s dilemma with 2-players as GIGA-WoLF. . . . . . . . . . . 96

6.7 Effect of human input on the performance of GA in 3-player pris-

oner’s dilemma with 1-player as Q-learning and 1-player in self-play. 96

6.8 Effect of human input on the performance of GA in 3-player pris-

oner’s dilemma with 2-players as Q-learning. . . . . . . . . . . . 97

6.9 Effect of human input on the performance of GA in 3-player pris-

oner’s dilemma with 1-player as Q-learning and 1-player as GIGA-

WoLF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97


List of Algorithms

3.1 GIGA-WoLF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.2 Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.1 Basic genetic algorithm . . . . . . . . . . . . . . . . . . . . . . . 51

4.2 Genetic algorithm with history propagation . . . . . . . . . . . . 52

4.3 Genetic algorithm with stopping condition . . . . . . . . . . . . . 53

4.4 Genetic algorithm with dynamic parameters’ setting . . . . . . . . 54

4.5 Genetic algorithm with dynamic parameters’ setting and stopping

condition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55


CHAPTER 1

Introduction

1.1 Problem Definition

In multi-agent systems (MAS), intelligent agents interact with each other seeking to maximize their own welfare. In many instances, these agents need to learn over time in order to become more successful. One of the main issues within MAS is the ability of each agent in the system to learn effectively and co-exist with heterogeneous agents within the system. MAS is considered one of the most prominent fields of research because its structure describes many real-life problems.

Extensive research has been performed in an effort to design a learning algorithm applicable to MAS [10, 53, 91]. However, proposed solutions typically suffer from at least one of the following problems:

1. Inability to adapt as the number of dynamic situations within the system increases.

2. Settlement into myopic, non-evolving solutions.

3. Requirement of extensive learning time in order to reach an acceptable solution.

Power systems are widely used as an example of MAS [51, 106]. In such systems, each consumer can be considered as an agent. Each agent in this multi-agent system must learn an intelligent behavior in order to maximize its own personal gain. In this case, the gain from the consumer's perspective is the ability to satisfy the user's consumption and preferences.

1.2 Motivation and Relevance to the Masdar Initiative

Electricity consumption has increased drastically in the past decade as a result of an enormous increase in population and technology. In the UAE, for instance, there has been a sudden increase in the usage of high-tech appliances and the ability to add more electrical devices than ever before [70]. This increase in consumption, while still relying on the old electrical grid as a means of distributing electricity, leads to high losses in electricity. This is partially due to the variation in consumption among different sectors, where each sector consumes electricity based on different schedules. Another recent development in modern power systems is the entry of renewable energy sources on a large scale. The use of these energy sources, which generate an intermittent and less predictable supply, is expected to continue to increase over the next few decades to reduce the consumption of less environmentally friendly energy resources. To effectively handle these forms of electricity generation and increased electricity usage, a more intelligent distribution structure must be implemented. This intelligent solution is referred to in the literature as the "smart grid" [16].

Electricity dispatch, in which electricity supply must be matched with electricity demand, is a common problem in electrical grids. Research and industrial work has been dedicated to designing systems that are capable of taking information from generators and consumers to determine how to effectively distribute electricity. These types of systems are called "electricity management systems" [105]. Traditional management systems typically use a centralized structure, and rely on a market operator to manage electricity distribution [99]. This central agent determines the distribution of electricity either by applying a constant scheduling mechanism or by running an off-line prediction mechanism to model and predict supply and demand [59]. However, management systems relying on a single central operator do not match the structure of how an electricity market operates. Normally, electricity markets rely on different entities (including the generators, distributors and consumers) to make a decision. As such, electricity markets are better modeled by distributed interactions between consumers and distributors. Furthermore, traditional management systems have other drawbacks, such as:

1. The extensive computational power required at the decision center, which results in slow response times and an inability to keep up with real-time updates [37].

2. The system's inability to respond adequately when an event not covered by the system occurs. This is the result of the static structure of the scheduling mechanism running in the system, and its inability to adapt to regulatory changes in the environment in a real-time manner [37].

3. The algorithm running at the center of the system must be completely redesigned if the configuration has to be changed (addition or removal of an element or an agent) [37].

4. The centralized model focuses solely on the overall response of the power grid, making it difficult to model real-time interactions between different entities [59].

However, the realization of a distributed power system requires the development of additional capabilities to manage electricity transmission efficiently without relying on a central agent. By moving to a multi-agent perspective, where we rely on different decision makers, we must determine how charges can be distributed amongst generation companies, how to take transmission constraints into account, and which regulation mechanisms are suitable for such a system [99]. Additionally, informational and motivational aspects of the agents, such as beliefs, desires, intentions and commitments, must be investigated [13]. Furthermore, with variations in workloads and energy sources, it is hard to define a single policy that will perform well in all cases [32]. This leads us to a requirement for making the agents more intelligent and adaptable. To achieve this, agents need to evolve their own behavior with respect to scheduling their consumption according to these changes, without the need to redesign their decision-making structure. This can potentially be achieved by applying game theory and machine learning concepts.

Previous work has investigated the use of evolutionary algorithms (EAs) within the area of electricity management [67]. These algorithms have been shown to be successful under specific conditions [81, 20, 83]. Unfortunately, under other conditions, EAs tend to perform poorly. One such situation occurs when there is a very large solution space defined by two or more interacting subspaces. One solution to this challenge is to run EAs on a multi-agent, distributed structure of the system, using the interaction between agents as a means of decreasing the search space by dividing it among different agents, and making each agent benefit from the others' experience [102]. This introduces the concept of a "multi-objective fitness function," where EAs work on multiple fitness functions, one for each agent. The ideal solution, in most cases, does not exist because of the contradictory nature of the objective functions, so compromises have to be made.

A second challenge of using EAs for distributed power systems, one that exists in most existing learning algorithms, is the inability to adapt quickly to changes in the environment and user preferences. EAs are expected to work best if the market is not volatile.

Research has been done to try to overcome these problems in evolutionary algorithms, mainly in single-agent systems. One suggested solution was to utilize human input through "interactive evolutionary algorithms" [33, 6]. In these algorithms, human input is used to decrease the amount of time needed to reach a stable and efficient policy for the system, to ensure that the policy follows human preferences, and to make it robust enough to cope with variations in the system. To the best of our knowledge, the idea of using interactive evolutionary algorithms in multi-agent systems has not been studied to date.

Therefore, our objective is to find a learning algorithm that can be used by individual entities in power systems to effectively acquire and manage energy resources to satisfy user preferences. This learning algorithm should be able to adapt to the behavior of other learning entities, changes in the environment and changes in user preferences.

1.3 Thesis statement

In this research, we study the performance of genetic algorithms (GAs) as a learning methodology for an agent within a multi-agent system. We discuss the effect of integrating human input into GAs (known as interactive genetic algorithms) within a multi-agent system. By conducting different experiments, we try to identify how human input can be integrated into GAs, and test the applicability of interactive genetic algorithms (IGAs) in repeated matrix games. The matrix games we use in our experiments cover different possibilities and variations of the agents' payoffs in order to examine the effect of this variation on the algorithms' performance. We run different variations of our algorithms against different learning opponents, including themselves, Q-learning [100] and GIGA-WoLF [10].

1.4 Thesis Overview

In this thesis, we give a detailed analysis of related work on the aforementioned problem. We begin with a literature review of related topics in chapter 2, covering multi-agent systems, genetic algorithms, and interactive genetic algorithms. In chapter 3, we move on to an overview of the problem and our experimental setup. We show the different variations of GAs implemented in our experiments in chapter 4, and we study the effect of each of these variations on the final performance of the system.

The next part of the thesis discusses and evaluates a potential framework for integrating human input into GAs. In chapter 5, we test the suggested framework within repeated matrix games and evaluate its performance against the previously selected learning opponents. In order to evaluate the scalability of our algorithms, in chapter 6, we apply the GA with and without human input in a 3-player environment. We show how GAs perform in such an environment and whether there are certain features that do not propagate to large-scale environments. This study will help us gain an understanding of how our algorithms will perform within more complex systems.

Finally, we give a detailed discussion of the results (Chapter ??). This discussion helps in deriving the conclusions presented in chapter 7. We then suggest potential future research to be based on this thesis.


CHAPTER 2

Literature Review

In this chapter, we present an overview of the different fields related to our research. We also discuss related work and explain how it informs our experiments. We start by giving an overview of electrical power grids in section 2.1. Then, we connect it to multi-agent systems, and explain the structure of a standard MAS problem. We then move along to the specific branch of MAS problems that we study in this thesis, which is repeated matrix games.

After giving the required background about the problem structure, we give background about the solution methodology we are using. We give an overview of evolutionary algorithms in general (which include genetic algorithms), after which we explain the history and structure of genetic algorithms. We then examine the work done in the field of learning in multi-agent systems, and relate the work done within genetic algorithms to learning in multi-agent systems. This relation materializes through different topics, including genetic algorithms within dynamic systems, genetic algorithms for developing strategies in matrix games and, finally, our main topic, interactive genetic algorithms.

2.1 Electrical Power grids

An electrical power grid is a network of interconnected entities, in which electricity generators are connected to consumers through a set of transmission lines and distributors (Figure 2.1). Existing electrical grids mainly rely on classic control systems. The structure of these systems is based on the following: the generator generates a certain amount of electricity (depending on its capacity), which is then distributed to the consumers through the distributors. This structure faces many difficulties, mainly when it tries to deal with variable types of generators, distributors and consumers.

In real-life systems, this variety is to be expected. In the generators' case, renewable energy sources have become extensively used for power generation, especially within the past decade. This leads to intermittency in the supply that is usually hard to model [48]. In the consumers' case, although various research efforts target modeling electricity demand and consumption patterns [43, 24], demand is not always predictable. This unpredictability leads to many constraints regarding distribution (also called a demand-supply problem), and requires additional research into approaches that can efficiently enhance the generation, distribution and consumption cycles through more intelligent means.

2.1.1 Smart grids

In order to enhance the efficiency of electricity generation, distribution and consumption within the power grid, a logical solution is the merger of intelligent automation systems into the electrical power grid to form the "smart grid" [37].

Figure 2.1: A traditional Electrical Grid [8].

A smart grid (see Figure 2.2) delivers electricity from suppliers to consumers using two-way digital communications to control appliances at consumers' homes. This could save energy, reduce costs and increase reliability and transparency if the risks inherent in executing massive information technology projects are avoided. Smart grids are being promoted by many governments as a way of addressing energy independence, global warming and emergency resilience issues [37].

In our research, we consider the management of electricity from the consumer side, where an "agent" represents a consumer that tries to satisfy its needs in the presence of external circumstances.

Figure 2.2: A smart electrical grid [8].

2.2 Multi-agent systems

A multi-agent system (MAS) is a system composed of multiple interacting intelligent agents. Many real-world problems, including electricity grids' demand-response problems, can be easily visualized as a MAS [78, 106]. In power systems, each entity (distributors, consumers, and generators) is represented as an agent. Each agent may have an individual goal to reach while interacting with other entities (agents).

Various research has been done in the field to solve the electrical supply-demand problem using multi-agent simulations [97, 80]. As we are testing a new technique in this thesis, we wanted to base our work on research trends that have been used before in the field of simulating electrical power grids. One of these trends is considering our MAS as a simple matrix game (more details in the following section). This decision was based on the fact that various electrical grid, electricity scheduling and electricity market simulators use matrix games [61, 73, 62, 5]. The supplier's goal in these games is to supply power to the consumers at the best price possible, while maintaining stability in the grid. On the other hand, the consumer agents must satisfy their needs while minimizing their costs (depending on the consumers' preferences). All of these goals should be satisfied keeping in mind the existence of other agents and external influences. In order to achieve its goal, each agent follows a "strategy," which is either fixed or adaptable with time.

As a result of this extensive usage as a representation of electricity market problems, we choose to evaluate our algorithms within repeated matrix games. In the following section we give more details about matrix games and their structure.

2.3 Matrix games

Matrix games are a subset of what are called "stochastic games." In these games, each player takes an action, which produces a reward. In a matrix game (also called a "normal form game"), the payoffs (rewards) over the players' joint action space are defined in the form of a matrix. This action space represents the set of possible actions that each player can perform within the game. Depending on the rewards they get, the players decide the "strategy" they are going to follow, where the strategy represents the decision of which action to play over time [42].

For clarification, consider the game represented in Table 2.1. In this matrix game, we have two players, player 1 and player 2. Player 1 (row player) can play either action A or B (so its strategy space is {A, B}). Likewise, player 2 (column player) can play either action a or b. From this we can conclude that the set of possible joint actions is {(A, a), (A, b), (B, a), (B, b)}. Each cell within the matrix represents the reward for each player when that joint action occurs. In the example matrix, the payoff to the row player (player 1) is listed first, followed by the payoff to the column player (player 2). For example, if the row player plays B and the column player plays b (which is the joint action (B, b)), then the row player receives a payoff of 2, while the column player receives a payoff of -1.

           a        b
    A    -1, 2     3, 2
    B     0, 0     2, -1

Table 2.1: Payoff matrix for the Prisoner's dilemma.

A strategy for an agent (player) i is a distribution πi over its action set Ai. Equivalently, it can be defined as the probability with which the player plays each of its actions. A strategy can be either a pure strategy (where the probability of playing one of the actions is 1, while the probability of playing any of the other actions is 0), or a mixed strategy, where each action is played with a certain probability over time. The joint strategy played by the n agents is π = (π1, π2, ..., πn) and, thus, ri(π) is the expected payoff for agent i when the joint strategy π is played.

Depending on the actions taken, a player's situation changes over time. The situation of the player can be represented in what is called a "state" [11]. In this example, the situation in which each player receives a certain payoff given a certain joint-action pair represents the state.
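
To make these definitions concrete, here is a minimal sketch (ours, not code from the thesis; the payoff encoding and the function name expected_payoff are illustrative) showing how the example game above can be stored as two payoff arrays and how the expected payoff ri(π) of a joint mixed strategy can be computed:

    # Illustrative encoding of the 2x2 example game above (Table 2.1).
    # row_payoff[r][c] and col_payoff[r][c] are the payoffs of the row and
    # column player when the joint action (r, c) is played.
    row_payoff = [[-1, 3],
                  [ 0, 2]]
    col_payoff = [[ 2, 2],
                  [ 0, -1]]

    def expected_payoff(payoff, pi_row, pi_col):
        """Expected payoff under mixed strategies pi_row and pi_col."""
        return sum(pi_row[r] * pi_col[c] * payoff[r][c]
                   for r in range(len(payoff))
                   for c in range(len(payoff[0])))

    # A pure strategy is a degenerate mixed strategy; e.g. row plays B, column plays b:
    print(expected_payoff(row_payoff, [0, 1], [0, 1]))   # row player's payoff: 2
    print(expected_payoff(col_payoff, [0, 1], [0, 1]))   # column player's payoff: -1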

2.3.1 Types of matrix games

Matrix games can be divided into different types using different criteria. In this section, we discuss the main criteria that affected how we selected the games used for the experiments. Details about the exact games used in the experiments are mentioned in the next chapter.


Cooperative vs. non-cooperative games

A cooperative game represents a situation in which cooperation amongst the players is the most beneficial to them. Therefore, these games mainly require an efficient arrangement between players to reach the cooperative behavior. An example of these games is a coordination game [17].

On the other hand, non-cooperative games are not defined as games in which players do not cooperate, but as games in which any cooperation must be self-enforcing. Most realistic problems fall under non-cooperative games.

Symmetric vs. asymmetric games

A symmetric game is a game where the payoffs for playing a particular strategy depend only on the other strategies employed, not on who is playing them. If the order of the players can be changed without changing the payoffs to the strategies, then the game is symmetric. On the other hand, in asymmetric games, the action set for each player differs from the others'. For simplicity, we focus in this research on symmetric games.

2.3.2 Solution concepts

In matrix games, there are concepts through which we can identify strategies for the players which, if played, can lead to a state known as equilibrium. These concepts can be useful in evaluating the performance of the strategy played. We give an overview of some of these concepts, which are directly related to our research.

Best response

A best response is a strategy that produces the most favorable outcome for a player, given the other players' strategies [36]. Therefore, the strategy πi* is a best response for agent i if ri(πi*, π−i) ≥ ri(πi, π−i) for all possible πi.

Nash equilibrium

A Nash equilibrium (NE) is a set of strategies, one for each player, with the property that no player can unilaterally change his strategy and get a better payoff. The Nash equilibrium has had the most impact on the design and evaluation of multi-agent learning algorithms to date. The concept of a Nash equilibrium is based on the best response: when all agents play best responses to the strategies of the other agents, the result is an NE. Nash showed that every game has at least one NE [74]. However, there is no known algorithm for calculating NEs in polynomial time [77].

If we consider that all players are self-interested, each of them would tend to play the best response to the strategies of the other agents, if they know them, therefore resulting in an NE. Many games have an infinite number of NEs. In the case of repeated games, these NEs are called NEs of the repeated game (rNEs), which we discuss shortly. Therefore, the main goal of an intelligent learning agent is not just to play a best response to the surrounding agents, but also to influence other agents to play according to what is profitable to the agent as much as possible.
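
As a concrete, purely illustrative sketch of these two concepts (the function name and action indexing are ours), the following checks whether a pure joint action is a one-shot Nash equilibrium by testing unilateral deviations:

    # Test whether the pure joint action (r, c) is a one-shot Nash equilibrium:
    # neither player may gain from a unilateral deviation.
    def is_pure_nash(row_payoff, col_payoff, r, c):
        row_best = max(row_payoff[i][c] for i in range(len(row_payoff)))
        col_best = max(col_payoff[r][j] for j in range(len(col_payoff[0])))
        return row_payoff[r][c] >= row_best and col_payoff[r][c] >= col_best

    # Prisoner's dilemma of Table 2.2, actions 0 = Cooperate, 1 = Defect.
    pd_row = [[3, 0], [5, 1]]
    pd_col = [[3, 5], [0, 1]]
    print([(r, c) for r in range(2) for c in range(2)
           if is_pure_nash(pd_row, pd_col, r, c)])
    # -> [(1, 1)]: mutual defection is the only pure-strategy NE of this game.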

Maximin

Maximin is a decision rule used in different fields for minimizing the worst possible loss while maximizing the potential gain. Alternatively, it can be thought of as maximizing the minimum gain. The maximin theorem states [75]:

For every two-person, zero-sum game with finite strategies, there exists a value V and a mixed strategy for each player, such that (a) given player 2's strategy, the best payoff possible for player 1 is V, and (b) given player 1's strategy, the best payoff possible for player 2 is -V [75].

Equivalently, player 1's strategy guarantees him a payoff of V regardless of player 2's strategy, and similarly player 2 can guarantee himself a payoff of -V. The name minimax arises because each player minimizes the maximum payoff possible for the other; since the game is zero-sum, he also maximizes his own minimum payoff.
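
As a small illustration (a sketch of ours; computing the mixed-strategy value V of the theorem in general requires solving a linear program, which we omit), the pure-strategy security level of the row player can be computed by taking, for each row, the worst payoff the opponent can force, and then choosing the best such row:

    # Pure-strategy maximin (security level) for the row player of a zero-sum game,
    # given the row player's payoff matrix.
    def pure_maximin(row_payoff):
        return max(min(row) for row in row_payoff)

    zero_sum = [[1, -2],
                [0,  3]]
    print(pure_maximin(zero_sum))   # -> 0: the second row guarantees at least 0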

Pareto efficiency

Named after Vilfredo Pareto, Pareto efficiency (optimality) is a measure of efficiency. An outcome of a game is Pareto efficient if there is no other outcome that makes every player at least as well off and at least one player strictly better off. That is, if an outcome is Pareto optimal, any other outcome that makes some player better off must make at least one other player worse off.
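
The definition translates directly into a dominance check. The sketch below (ours, with illustrative names) marks an outcome as Pareto efficient when no other outcome weakly improves every player's payoff and strictly improves at least one:

    # An outcome is Pareto efficient if no other outcome makes every player at
    # least as well off and at least one player strictly better off.
    def pareto_efficient(outcome, outcomes):
        return not any(all(o[i] >= outcome[i] for i in range(len(outcome))) and
                       any(o[i] > outcome[i] for i in range(len(outcome)))
                       for o in outcomes)

    # Joint payoffs of the prisoner's dilemma of Table 2.2:
    joint = [(3, 3), (0, 5), (5, 0), (1, 1)]
    print([o for o in joint if pareto_efficient(o, joint)])
    # -> [(3, 3), (0, 5), (5, 0)]; mutual defection (1, 1) is not Pareto efficient.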

2.3.3 Repeated matrix games

A repeated matrix game is, as the name suggests, a matrix game that is played repeatedly. The joint action taken by the agents determines the payoff (reward) in each round (or stage) of the game, which can help the player decide, through learning, which action to take next. The task of a learning agent in a repeated matrix game is to learn to play a strategy πi such that its average payoff over time (denoted r̄i for agent i) is maximized. Let r̄i be given by:

\bar{r}_i = \frac{1}{T} \sum_{t=1}^{T} r_i(\pi_i^t, \pi_{-i}^t)    (2.1)

where πi^t is the strategy played by agent i at time t, π−i^t is the joint strategy played by all the agents except agent i, and 1 ≤ T ≤ ∞ is the number of episodes in the game. In our work, we consider simultaneous-action games, where in each round both agents play an action (without knowing the actions of the other agents within the same round).

Figure 2.3: Payoff space for the prisoner's dilemma game [25].

In order to evaluate the strategies played, and to see different equilibria, the concept of the one-shot Nash equilibrium does not fully represent equilibrium in repeated games. Therefore, in order to define the concept of NE within a repeated game, we discuss what is called the folk theorem [90].

Consider the case of the prisoner's dilemma, shown in Figure 2.3, which shows the joint payoffs of the two players. The x-axis shows the payoffs of the row player and the y-axis shows the payoffs of the column player. The combination of the shaded regions (light and dark) in the figure represents what is called the "convex hull," which is the set of all joint payoffs achievable within the game. As can be noticed, on average, a player guarantees itself a higher stable payoff by playing defect; neither of the players has an incentive to receive an average payoff (over time) less than 0. Therefore, the darkly shaded region in the figure shows the set of expected joint payoffs that the agents may possibly accept as average payoffs within each step of the game.

                 Cooperate    Defect
    Cooperate      3, 3        0, 5
    Defect         5, 0        1, 1

Table 2.2: Payoff matrix for the Prisoner's dilemma.

The folk theorem states that any joint payoffs in the convex hull can be sustained by an rNE, provided that the discount rates of the players are close to unity (i.e., players believe that play will continue with high probability after each episode). This theorem helps us understand the fact that, in repeated games, it is possible to have an infinite number of NEs.

2.3.4 Stochastic games

In real-life situations, the more detailed stochastic games (where different games are reached when transitioning from one action-state pair to another) can be considered more informative and suitable for modeling. However, research has been done [26] to examine whether learning algorithms that work within repeated matrix games can be extended to repeated stochastic games. This extension was found to give suitable results for the prisoner's dilemma and its stochastic version (for 2-agent games), which gives us the motivation needed to pursue the current experimentation within matrix games.

2.4 Learning in repeated matrix games

In this section, we give an overview of related work in multi-agent learning, especially algorithms that have been used within matrix games. Because of the large number of multi-agent learning algorithms found in the literature, we restrict our attention to those that have had the most impact on the multi-agent learning community, as well as to those that seem to be particularly connected to the work presented in this thesis. We divide the learning algorithms that we review into three different (although related) categories: belief-based learning, no-regret learning, and reinforcement learning [26].

2.4.1 Belief-based learning

Belief-based learning is based on the idea of constructing a model of the opponent's behavior. These models usually rely on previous interactions with the opponent. Using this model, we try to find the best response with respect to this model. One of the best known belief-based learning algorithms is fictitious play [15].
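
As a minimal sketch of the idea (ours, not an implementation from the cited literature), fictitious play keeps empirical counts of the opponent's past actions and best responds to the resulting belief:

    # Choose a best response to the empirical distribution of the opponent's actions.
    def fictitious_play_action(row_payoff, opponent_counts):
        total = sum(opponent_counts) or 1
        belief = [c / total for c in opponent_counts]        # empirical opponent strategy
        expected = [sum(belief[c] * row_payoff[a][c] for c in range(len(belief)))
                    for a in range(len(row_payoff))]
        return max(range(len(expected)), key=expected.__getitem__)

    pd_row = [[3, 0], [5, 1]]
    print(fictitious_play_action(pd_row, opponent_counts=[4, 6]))   # -> 1 (defect)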

2.4.2 No-regret learning

A no-regret algorithm compares its performance with the "best action" available within its action set. Regret, in this case, is defined as the difference between the rewards obtained by the agent and the rewards the agent might have obtained had it followed a certain combination of its history of actions. In the long run, a no-regret algorithm plays such that it has little or no regret for not having played any other strategy. GIGA-WoLF [10] is one example of a no-regret algorithm. We describe GIGA-WoLF in greater detail in the next chapter.

2.4.3 Reinforcement learning

Reinforcement learning (RL) methods involve learning what to do so as to maxi-

mize (future) payoffs. RL agents use trial and error to learn which strategies pro-

duce the highest payoffs. The main idea of reinforcement learning is that through

time, the learner tries to take an action that maximizes a certain reward. Reinforce-

ment learning is widely used within matrix game environments [91, 52]. There are

several known learning algorithms that can be identified as reinforcement learning,

including: Q-learning [100] and evolutionary algorithms [70].


2.5 Evolutionary algorithms

Evolutionary algorithms are a popular form of reinforcement learning. Evolution-

ary algorithms (EAs) are population-based metaheuristic optimization algorithms

that use biology-inspired mechanisms like mutation, crossover, natural selection,

and survival of the fittest in order to refine a set of solution candidates iteratively.

Each iteration of an EA involves a competitive selection designed to remove poor

solutions from the population. The solutions with high ”fitness” are recombined

with other solutions by swapping parts of a solution with another.

Solutions are also mutated by making a small change to a single element of

the solution. Recombination and mutation are used to generate new solutions that

are biased towards solutions that are most fit [58]. This process is repeated until

the solution population converges to the solution with a high fitness value. In gen-

eral, evolutionary algorithms are considered an effective optimization method [2].

The survival-of-the-fittest concept, together with the evolutionary process, promotes better adaptation of the population [58].

2.5.1 Genetic algorithms

A Genetic Algorithm (GA) [45] is a type of evolutionary algorithm. GAs are based

on a biological metaphor, in which learning is a competition among a population of

evolving candidate problem solutions. A fitness function evaluates each solution to

decide whether it will contribute to the next generation of solutions. Then, through

operations analogous to gene transfer in sexual reproduction, the algorithm creates

a new population of candidate solutions [65, 44].

The main feature of GAs is that they typically encode the problem within binary string individuals, although other encoding techniques also exist. Another feature of GAs is

their simplicity as a concept, and their parallel search nature, which makes it possi-

ble to easily modify GAs so they can be adapted to a distributed environment [18].


2.5.2 Genetic algorithm structure

In this section we give a description of how GAs work. In order to get a bet-

ter understanding of the algorithms, we define certain terminology that we use

throughout subsequent sections.

Fitness. Fitness is the value of the objective function for a certain solution.

The goal of the algorithm is either to minimize or maximize this value, depending

on the objective function.

Genome ("chromosome"). A genome, frequently called a chromosome, is the representation of a solution (a strategy, in the case of matrix games) that is to be taken at a certain point in time. The GA generates various chromosomes, each of which is assigned a certain fitness according to its performance. Using this fitness, the evolutionary functions are applied in order to create a new population (generation) of chromosomes.

Gene. Genes are the units that form a certain genome (chromosome). The

evolutionary functions such as mutation and crossover are mainly performed on

the genes within the chromosomes.

Solution space. The solution space defines the set of all possible chromo-

somes (solutions) within a certain system. Through the evolutionary functions, we

try to cover as much of the solution space as possible for proper evaluation, without trying all possible solutions one by one (as in brute-force search), in order to save time.

Having given these definitions, we now describe how GAs work. In

the following sections, we will discuss the main components that vary among dif-

ferent GAs according to the application. These components include: chromosome


representation, selection process (fitness calculation and representation), mutation

process, and crossover process.

Chromosome structure

As previously mentioned, one important feature of GAs is their focus on fixed-length character strings, although variable-length strings and other structures

have been used. Within matrix games, these character strings represent the binary

encoding of a certain strategy [3, 31]. But others have used non-binary encoding,

depending on their application [86].

It should be noted, however, that there are special cases in which we consider

the use of a binary encoding perfectly acceptable. In a prisoner’s dilemma, for ex-

ample, agents have to make decisions that are intrinsically binary, namely decisions

between cooperation and defection. The use of a binary encoding of strategies then

seems like a very natural choice that is unlikely to cause undesirable artifacts [4].

Fitness functions and selection

The fitness function is a representation of the quality of each solution (chromo-

some). This representation varies from one application to another. According to

the fitness value, we select the fittest chromosomes and then perform crossover

and mutation functions on these chromosomes to generate the new chromosomes.

Several selection techniques exist, including roulette wheel selection, rank-based selection, elitism, and tournament-based selection.

In the following sections, we discuss each method. We exclude tournament-based selection, as it requires experimenting with and testing solutions during the selection process itself, which is not suitable for our application since it is an offline training technique for the algorithm.


Figure 2.4: Roulette wheel selection mechanism [30].

Roulette wheel selection. Parents are selected according to their fitness. The

better the chromosomes, the more chances they have to be selected. In order to

get a better understanding, imagine a roulette wheel where all chromosomes in the

population are distributed on a wheel. Each chromosome gains its share of the

wheel size according to its fitness (Figure 2.4). A marble is thrown to select the

chromosome on this wheel. The fittest chromosome has a higher opportunity of

being selected.
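To make this mechanism concrete, the following minimal Python sketch (illustrative only; the function and variable names are ours, not the thesis implementation) selects one parent with probability proportional to fitness:

import random

def roulette_wheel_select(population, fitnesses):
    # Each chromosome occupies a slice of the wheel proportional to its
    # (non-negative) fitness; the "marble" is a uniform draw on the wheel.
    total = sum(fitnesses)
    spin = random.uniform(0, total)
    running = 0.0
    for chromosome, fitness in zip(population, fitnesses):
        running += fitness
        if spin <= running:
            return chromosome
    return population[-1]  # guard against floating-point round-off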

Rank-based selection. Roulette wheel selection has problems when the fitnesses vary widely. For example, if the best chromosome's fitness accounts for 90 percent of the whole roulette wheel, then the other chromosomes will have very few chances to

be selected. Rank selection first ranks the population and then every chromosome

receives fitness from this ranking. The worst will have a fitness of 1, the second

worst will have a fitness of 2, and this continues till we reach the best chromo-

some. The best chromosome will have a fitness of N (where N is the number of chromosomes in the population).

Figure 2.5: Crossover in genetic algorithms.
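Under the same assumptions, rank-based selection can be sketched by replacing the raw fitness values with ranks (worst = 1, best = N) and then reusing the roulette_wheel_select helper sketched earlier:

def rank_based_select(population, fitnesses):
    # Order chromosomes from worst to best and use their ranks as fitness,
    # so a single dominant chromosome cannot monopolize the wheel.
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    ranks = [0] * len(population)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return roulette_wheel_select(population, ranks)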

Elitism. The idea of elitism has already been introduced. When creating a new population by crossover and mutation alone, there is a large chance of losing the best chromosome. Elitist selection therefore starts by copying a certain percentage of the best chromosomes into the new population. The rest of the population is then created by applying mutation and crossover to the selected elite. Elitism can rapidly increase the performance of a GA because it prevents the algorithm from forgetting the best solution found so far.

Crossover and Mutation

Crossover and mutation are the main functions of any genetic algorithm after se-

lection. They are the functions responsible for the creation of new chromosomes

out of the existing chromosomes.

In the crossover phase, all of the selected chromosomes are paired up, and

with a probability called “crossover probability,” they are mixed together so that a

certain part of one of the parents is replaced by a part of the same length from the

other parent chromosome (Figure 2.5). The crossover is accomplished by randomly choosing a site along the length of the chromosome, and exchanging the genes of the two chromosomes for each gene past this crossover site.

Figure 2.6: Mutation in genetic algorithms.

After the crossover, each of the genes of the chromosomes (except for the elite

chromosome) is mutated to any one of the codes with a probability defined as

the “mutation probability” (Figure 2.6). With the crossover and mutations com-

pleted, the chromosomes are once again evaluated for another round of selection

and reproduction. Setting the parameters concerned with crossover and mutation

is mainly dependent on the application at hand and the chromosome structure [45].
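As an illustration of these two operators on binary string chromosomes, the following minimal Python sketch can be used (parameter values and names are ours; the crossover site here is chosen at random rather than fixed at the middle):

import random

def single_point_crossover(parent_a, parent_b, crossover_prob=0.5):
    # With probability Pc, swap the tails of the two parents past a random site.
    if random.random() < crossover_prob and len(parent_a) > 1:
        site = random.randint(1, len(parent_a) - 1)
        return (parent_a[:site] + parent_b[site:],
                parent_b[:site] + parent_a[site:])
    return parent_a, parent_b  # no crossover: children are copies of the parents

def mutate(chromosome, mutation_prob=0.01):
    # Flip each gene (bit) independently with probability Pm.
    return ''.join(('1' if gene == '0' else '0')
                   if random.random() < mutation_prob else gene
                   for gene in chromosome)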

Algorithm summary

Genetic algorithms are based on the fundamental algorithm structure as shown in

Figure 2.7. First, an initial population of N individuals, which evolves at each gen-

eration, is created. Generally, we can say that a generation of solutions is obtained

from the previous generation through the following procedure: solutions are ran-

domly selected from the current population. Pairs of selected individuals are then

submitted to the crossover operation with a given crossover probability Pc. Each

descendant is then submitted to a mutation operation with a mutation probability

Pm, which is usually very small. The chromosome’s ability to solve the problem

is determined by its fitness function; the final step in the generation process is the

substitution of individuals of the current population with low performance by the new descendants. The algorithm stops after a predefined number, Gen, of generations has been created. An alternative stopping mechanism is a limit on computing time [101].

Figure 2.7: Basic structure of genetic algorithms.

Advantages and Applications of genetic algorithms

GAs represent one of the most renowned optimization search techniques, especially in the presence of large, non-linear search spaces. GAs have been used to solve different single-agent problems [70]. As computational requirements increase, it has become more practical to distribute different parts of the GA across different agents, where all agents in this case share the same goal. In summary, GAs are most efficient and appropriate in situations such as the following:


• The search space is large, complex, or not easily understood

• There is no programmatic method that can be used to narrow the search space

• Traditional optimization methods, such as dynamic programming, are not

sufficient

Genetic algorithms may be utilized in solving a wide range of problems across

multiple fields such as science, business, engineering, and medicine. The following

provides a few examples:

• Optimization: production scheduling, call routing for call centers, routing

for transportation, determining electrical circuit layouts

• Machine learning: designing neural networks, designing and controlling

robots

• Business applications: utilized in financial trading, credit evaluation, budget

allocation, fraud detection

Genetic algorithms are important in machine learning for various reasons:

1. They can work on discrete spaces, where generally gradient methods cannot

be applied.

2. They can be used to search parameters for other machine learning models

such as fuzzy sets and neural networks.

3. They can be used in situations where the only information we have is a

measurement of performance, and here it competes with temporal difference

techniques, such as Q-learning [86].

4. They converge to a near optimal solution after exploring only a small fraction

of the search space [98, 49].


5. They can be easily hybridized and customized depending on the application.

6. They may also be advantageous in situations where one only needs to find a near-optimal solution [87].

While the great advantage of GAs is that they find a solution through evolution, this is also their biggest disadvantage. Evolution is inductive in nature; it does not evolve towards a good solution, but rather evolves away from bad circumstances [84]. This can cause a species to evolve into an evolutionary dead end. This disadvantage can be clearly seen in more dynamic systems, where stabilizing within a dead end can stop the algorithm from keeping up with a dynamic learning process.

2.6 Genetic algorithms in repeated matrix games

GAs have also been used within matrix games. Mainly, they have been used in

this context as either a method for computing the Nash equilibrium of a certain

game [22, 2, 54], or for generating the “optimal” chromosomes (strategies) to be

played within a certain game [4]. These problems have been solved either by a

single genetic algorithm running through the game parameters to reach an optimal

solution [34], or by using what is called a “co-evolutionary” technique, which ba-

sically involves the usage of two genetic algorithms (both having the same goal),

in order to reach the optimal solution in a more trusted and efficient manner [52].

Note here that co-evolution is considered an "offline" learning technique, as it requires testing all the current chromosomes within the population against all other chromosomes. This is not the case when playing online, where a chromosome is tested only against the current associate it was set up to play against (not the whole population), which gives it less opportunity for exploration against different kinds of opponents.


GAs have been used before in the formation of strategies in dynamic Prisoner's

Dilemma games. For example, Axelrod has used genetic algorithms in order to find

the most desirable strategies to be used in the prisoner’s dilemma [4]. Axelrod’s

stimulus-response players were modeled as strings, where each character within

the string corresponds to a possible state (one possible history) and decodes to the

player's action in the next period. The more steps of memory taken into consideration, the longer the string representing the chromosome will be. This is

as a result of the increase in the possible number of states. In addition, moving to a

game with more than two possible moves will lengthen the string. Increasing the number of players will also increase the number of states. Formally, the number of states is given by a^{m×p}, where there are a actions and p players, and each player keeps m periods of time in its history [44]. For instance, in a two-player, two-action prisoner's dilemma with a three-step memory, this gives 2^{3×2} = 64 states, and hence a 64-gene chromosome.

Another example was the usage of GA within a simple formulation of a buyer-

seller dilemma [92]. The GA implements a mixed strategy for the seller as an

individual member of the population. Each population member is therefore a vec-

tor of probabilities for each action that all add up to 1.0. Within this experiment,

they discussed the performance of the GA in contrast to other RL techniques. The difference in performance between the GA agents and RL agents arises primarily because GA agents are population based. Since RL agents deal with only one strategy, they are faster at adapting it in response to the feedback received from the environment. The GA population, in contrast, takes longer as a whole to respond to the feedback received. For the same reason, the

GA agents are expected to exhibit less variance, and hence better convergence

properties. This was a good start in using genetic algorithms as a learning technique rather than purely as an optimization method. However, more work was still needed on playing against simple learning agents, for example without a full state representation and without more specific domain knowledge. The authors of this work also raised the


question of how human input can contribute as potential future work [92].

In order to get a better understanding of how GAs may perform in more complicated situations, the following sections discuss the performance of GAs in related settings, such as distributed and dynamic systems.

2.6.1 Genetic algorithms in distributed systems

In distributed systems, the primary form of GA that has been used is the co-evolutionary algorithm [56, 52, 107]. In this case, each GA agent represents one possible solution, and together with the other GA agents it tries to verify which solution is likely to be optimal. In another setting, where each GA agent evolves its own set of solutions, all the agents are centralized around the same objective function (all the agents cooperate and communicate in order to reach the same goal) [47, 69]. As we can see, in all of these situations the GA agent is not completely independent of the other agents' objectives.

2.6.2 Genetic algorithms in dynamic systems

The goal of a GA within a dynamic system changes from finding an “optimal”

answer to tracking a certain goal (and enhancing the overall performance). Most

real-world artificial systems and societies change due to a number of external factors in the environment, agents learning new knowledge, or changes in the make-up of the population. When the environment changes over time, result-

ing in modifications of the fitness function from one cycle to another, we say that

we are in the presence of a dynamic environment [93]. Several researchers have

addressed related issues in previous work. Examples include evolutionary models

and co-evolutionary models where the population is changing over time, and stud-

ies in the viscosity of populations [68, 103]. Unlike in the classical GA, the goal of such a system is to maximize the average result rather than to determine a single optimal solution; the performance of different solutions is tracked over time instead of aiming for a fixed optimal target.

Brank [12] surveys the strategies for making evolutionary algorithms, which

include GAs, suitable for dynamic problems. The author grouped the different

techniques into three categories:

• React to changes, where explicit actions are taken as soon as a change in the environment has been detected

• Maintain diversity throughout the run, where convergence is avoided all the

time and it is hoped that a spread-out population can adapt to modifications

more easily [55]

• Maintain an additional memory through generations (memory-based approaches),

where the evolutionary algorithm is supplied with memory to be able to re-

call useful information from past generations

Many methods have been presented to make genetic algorithms applicable in dy-

namic environments. First, researchers have modeled change in the environment

by introducing noise into the system, whereby agents' actions are mis-implemented or mis-interpreted by other agents [29]. Another idea has been to localize the search within a certain part of the search space. This can be done either through intelligent initialization of the population [85], or, as done within the "memetic algorithm" [79], by evaluating close and similar neighbors of the chromosomes on trial in addition to the chromosomes already tested. This is where we get part of our motivation for interactive learning: it motivated our approach of generating populations based on feedback from users, since evaluating close neighbors of a solution helps evaluate the solution itself.

The aforementioned methods either do not consider the existence of other het-

erogeneous learning entities in the system, or learn only under certain identified


constraints [1]. However, experimental results are promising and show interesting

properties of the adaptive behavior of GA techniques.

2.7 Interactive learning

Another factor to consider to potentially enhance the performance of genetic al-

gorithms is to gather human input in real time to teach the algorithm. Within

any learning algorithm, this can be done by merging the learning algorithm with

human-machine interaction, resulting in what is called in the literature “interac-

tive artificial learning.” Using human input as a part of the learning procedure can

provide a more concrete reward mechanism, which can increase the convergence

speed [96, 27]. These learning methods occur in either the “act,” “observe” or “up-

date” step of an interactive artificial learning mechanism [27]. Experiments have

been performed to evaluate potential effects of human input on the learning curve

in multi-agent environments. Results show a significant improvement in learning,

depending on the quality of the human input [27, 28].

Figure 2.8: The interactive artificial learning process [27].


2.7.1 Interactive learning in repeated matrix games

Within the repeated-games environment, experiments have been performed in order to analyze the effect of human input on the performance of the learning algorithms [27]. The algorithms used in these experiments use "learning by demonstration" (LbD). Results showed that LbD does help learning agents to learn non-myopic equilibria in repeated stochastic games when human demonstrations

are well-informed. On the other hand, when human demonstrations are less in-

formed, these agents do not always learn behavior that produces (more successful)

non-myopic equilibria. However, it appears that well-formed variations of LbD al-

gorithms that distinguish between informed and uninformed demonstrations could

learn non-myopic equilibria.

When humans play iterated prisoners’ dilemma games, their performance de-

pends on many factors [32, 41, 59]. It can be concluded that a similar trend applies to LbD algorithms, and that LbD algorithms could potentially provide information about the game and one's associates that creates a context for better demonstrations.

2.7.2 Interactive genetic algorithms

An interactive genetic algorithm (IGA) is defined as a genetic algorithm that uses

human evaluation. These algorithms belong to a more general category of interac-

tive evolutionary computation. The main applications of these techniques are in domains where it is hard or impossible to design a computational fitness function, such as evolving images, music, and various artistic designs and forms to fit a user's aesthetic preferences.

In an interactive genetic algorithm (IGA), the algorithm interacts with the human in an attempt to quickly learn effective behavior and to better consider human

preferences. Previous work on IGA in distributed tasks has shown that human


input can allow genetic algorithms to learn more effectively [33, 38]. However,

such successes required heavy user interaction, which causes human fatigue [33].

Previous work in interactive evolutionary learning in single-agent systems has

analyzed methods for decreasing the amount of necessary human interaction in

interactive genetic learning. These methods either apply bootstrapping techniques,

which rely on estimations of the reward in between iterations instead of a direct

reward from the user [60, 66], or they divide the set of policies to be evaluated into

clusters, where the user only evaluates the center of the cluster and not all policies

[82].

Another suggestion for reducing human fatigue, applicable only in multi-agent systems, is to use input given to other agents (and potentially other agents' experiences) as one's own experience [39].

Interaction between a human and the algorithm may occur in different stages

of GA and in different ways. The most common way is to make a human part of

the fitness evaluation for the population. This can be done by either ranking avail-

able solutions [94], or directly assigning the fitness function value to the available

policies in the population. Other work, which targets reducing human fatigue as

mentioned above, had the human evaluate only selected representatives of the pop-

ulation [82, 88, 60]. Human input has also been investigated in the mutation stage, where the human first selects the best policy from his or her point of view and suggests a mutation operation to enhance its performance [33].
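As a rough sketch of the most common form of interaction described above, the loop below asks a user to score a few representative chromosomes and uses those scores as fitness values; the prompt, the 0-10 scale, and the naive choice of representatives are illustrative assumptions rather than the procedure used in the cited works:

def interactive_fitness(population, num_representatives=3):
    # Ask the human to evaluate only a few representatives to limit fatigue.
    representatives = population[:num_representatives]  # stand-in for cluster centers
    scores = []
    for chromosome in representatives:
        answer = input("Rate chromosome %s from 0 (bad) to 10 (good): " % chromosome)
        scores.append(float(answer))
    # Remaining chromosomes naively inherit the first representative's score.
    default = scores[0] if scores else 0.0
    return scores + [default] * (len(population) - len(representatives))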

Babbar-Sebens et al. [28] identified a problem with IGAs: how an algorithm can cope with temporal changes in human input. This situation not only

leads the genetic algorithm to prematurely converge, but it can also reduce diversity

in the genetic algorithm population. This occurs when solutions that initially have

poor human rankings are not able to survive the GA’s selection process. Loss of

these solutions early in the search process could be detrimental to the performance of the genetic algorithm if these solutions have the potential to perform better when preferences change later. That is why they suggested the use of a case-based memory for the population of policies. This memory acts as a continuous record of the population and its fitnesses, giving a more continuous and non-myopic view of the performance of the chromosomes across generations (instead of basing the evaluations on the biases of a single generation) [6].

IGAs have also been used to define robot behavior in known environments. A

child (representing the human factor) trains the genetic algorithm through feed-

back on the evolved population. This training happens by selecting the top three

preferred routes to be taken by the robot [66]. IGAs are usually used in more visual problems, where the user can easily rank or evaluate chromosomes, in domains such as music and design [40, 14, 89]. They have also been used in resource allocation problems (which are usually more static) [94, 7].

The last related example was the interaction of a human with a GA within a

board game, where the GA plays against another GA or against a human. The GA

here works on a limited set of (easy to reach) solutions that describe a behavior for

the whole game (not move by move). The human in this case sets the parameters in

the beginning of the game, including the number of generations and the mutation

rate [21].

Interactive genetic algorithms in multi-agent systems

Research on interactive genetic algorithms in multi-agent systems has mainly focused on dividing the IGA functions (including human interaction, mutation, and crossover) among separate agents [57, 51]. In this case, the IGA is not fully independent of the other modules; all modules interact with each other to reach a common objective.


2.8 Summary

Past work has derived intelligent behavior in repeated games using various methodologies. However, existing solutions have various problems, including:

• They require too much input from the user.

• They force certain constraints on users which decrease their comfort level.

• They do not learn fast enough for real time systems, and are not able to deal

with environments or goals that change over time.

• They cannot be used in distributed systems.

• They do not consider how human input can be incorporated in the system.

From this, we conclude that there is a need for a solution that addresses these drawbacks.


CHAPTER 3

Experimental Setup

In this chapter, we will present the experimental setup used to test our hypothesis.

Since our goal is to test the performance of interactive genetic algorithms (IGAs)

within a multi-agent system setting, we designed an experiment that will allow us

to do that. In order to test the efficiency of the algorithm, we run it against itself and

other learning algorithms in a variety of matrix games. The two renowned learning

algorithms we use are GIGA-WoLF [10] and Q-learning [100].

3.1 Games’ structure

In this section, we give an overview of the matrix games we use to evaluate our

algorithms. The expectations from the players within the games differ from one

game to another. Therefore, we expect a different response from each learning

algorithm.


             Cooperate   Defect
 Cooperate     3, 3        0, 5
 Defect        5, 0        1, 1

Table 3.1: Payoff matrix for the Prisoner's dilemma.

3.1.1 Prisoner’s dilemma

The prisoner’s dilemma is perhaps the most studied social dilemma [4], [3], as it

appears to model many real life situations. In the prisoner’s dilemma (Table 3.1),

defection is each agent’s dominant action. However, both agents can increase their

payoffs simultaneously by influencing the other agent to cooperate. To do so, an

agent must (usually) be willing to cooperate (at least to some degree) in the long-

run. An n-agent, m-action version of this game has also been studied [95].

As we can see from the matrix in Table 3.1, if we consider the reward of each

player for each joint-action pair, the following rule should apply within a prisoner’s

dilemma matrix: r_dc ≥ r_cc ≥ r_dd ≥ r_cd. Here, i represents my action, j represents the opponent's action, and r_ij represents my reward at the joint-action pair (i, j). For Table 3.1, this ordering is 5 ≥ 3 ≥ 1 ≥ 0. However, it is more desirable than mutual defection for both players to choose the first actions (C, C) and obtain r_cc.

3.1.2 Chicken

The game of Chicken is a game of conflicting interests. Chicken models the Cuban

Missile Crisis [19], among other real-life situations. The game has two one-shot

NEs, in each of which one player swerves while the other goes straight. However, in the case of a repeated game, agents may

be unwilling to receive a payoff of 2 continuously when much more profitable

solutions are available. Thus, in such cases, compromises can be reached, such as

the Nash bargaining solution (Swerve, Swerve) (Table 3.2). Therefore, the game is

similar to the prisoner’s dilemma game (Table 3.1) in that an “agreeable” mutual

solution is available. This solution, however, is unstable since both players are

Page 57: Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Experiments on Games Used in Smart Grid Simulations

CHAPTER 3. EXPERIMENTAL SETUP 38

Swerve StraightSwerve 6,6 4,7Straight 7,4 2,2

Table 3.2: Payoff matrix for chicken game.

a b cA 0,0 0,1 1,0B 1,0 0,0 0,1C 0,1 1,0 0,0

Table 3.3: Payoff matrix of Shapley’s game.

individually tempted to stray from it.

3.1.3 Shapley’s game

Shapley's game [36] is a 3-action game; it is a variation of the rock-paper-scissors game. Shapley's game has often been used to show that various learning

algorithms do not converge. The game has a unique one-shot NE in which all

agents play randomly. The NE of this game gives a payoff of 1/3 to each agent

(in a 2-agent formulation). However, the players can reach a compromise in which

both receive an average payoff of 1/2. This situation can be reached if both players

alternate between receiving a payoff of 1 and receiving a payoff of 0. The payoff

matrix for this game is shown in Table 3.3.

3.1.4 Cooperative games

As mentioned in the previous chapter, cooperative games are the exact opposite of competitive games (which are a subset of non-cooperative games). In these games, all

agents share common goals, some of which may be more profitable than others.

Table 3.4 shows the payoff matrix of a fully cooperative game.


       a       b
 A    4, 4    0, 0
 B    0, 0    2, 2

Table 3.4: Payoff matrix of a fully cooperative matrix game.
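To make the setting concrete, the payoff matrices above can be stored as arrays indexed by the joint action; the following minimal numpy sketch (our own names, with values copied from Tables 3.1-3.4) returns the pair of rewards for one step of a repeated matrix game:

import numpy as np

# payoffs[row_action, col_action] = (row player's payoff, column player's payoff)
PRISONERS_DILEMMA = np.array([[(3, 3), (0, 5)],
                              [(5, 0), (1, 1)]])
CHICKEN = np.array([[(6, 6), (4, 7)],
                    [(7, 4), (2, 2)]])
SHAPLEY = np.array([[(0, 0), (0, 1), (1, 0)],
                    [(1, 0), (0, 0), (0, 1)],
                    [(0, 1), (1, 0), (0, 0)]])
COOPERATIVE = np.array([[(4, 4), (0, 0)],
                        [(0, 0), (2, 2)]])

def play_step(payoffs, row_action, col_action):
    # One step of the repeated game: both players receive their payoffs.
    reward_row, reward_col = payoffs[row_action, col_action]
    return reward_row, reward_col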

3.2 Knowledge and Information

In matrix games, the more information any learning algorithm has about the game

and its associates, the more efficiently it can learn. Some of the information is

usually hidden within the learning process, and the algorithm has to deal with only

the information available. The following list shows the possible variations in the

level of knowledge of the agent. This will help us have a better understanding of

how our algorithm and its opponents view the surrounding world.

• The agent’s own action. The agent has to basically know its actions in order

to know how to act in the first place.

• The agent’s own payoffs. The agent may know the reward it has taken at a

certain point of time, or how the actions are rewarded over time.

• Associates’ actions. The agent can either know directly which action was

taken by an associate, or be able to predict it over time.

• Associates’ payoffs. The agent can know what outcomes can be used to

motivate or threaten the other associates to be able to act accordingly.

• Associates’ internal structure. The agent may have a knowledge of how the

opponent reacts to certain situations. This knowledge usually is gained by

attempts to model the associates over time.

In our experiment, we assume that the algorithm has a complete knowledge

about its own payoffs and actions as well as the opponent’s history of actions (from

previous plays). No knowledge of the opponent's internal structure is assumed.


Algorithm 3.1 GIGA-WoLF
  x_t is the strategy according to which I play my action
  z_t is the "baseline" strategy
  loop
    x̂_{t+1} ← x_t + η_t · r_t
    z_{t+1} ← z_t + η_t · r_t / 3
    δ_{t+1} ← min(1, ‖z_{t+1} − z_t‖ / ‖z_{t+1} − x̂_{t+1}‖)
    x_{t+1} ← x̂_{t+1} + δ_{t+1} · (z_{t+1} − x̂_{t+1})
  end loop

3.3 Opponents

The following section overviews the learning algorithms that we selected as oppo-

nents in our experiments.

3.3.1 GIGA-WoLF

GIGA-WoLF [10] (Generalized Infinitesimal Gradient Ascent - Win or Learn Fast) is a gradient ascent algorithm. It is also a model-free algorithm like Q-learning. The

idea of the algorithm is that it compares its strategy to a baseline strategy. It learns

quickly if the strategy is performing worse than the baseline strategy. On the other

hand, if the strategy is performing better than the baseline strategy, it learns at a

slower rate. Algorithm 3.1 shows the basic update structure of the algorithm.

As we can see, this algorithm consists of two main components. The first

component is the “GIGA” component. The idea of “GIGA” is that after each play

the agent updates its strategy in the direction of the gradient of its value function.

The "WoLF" component was introduced later [11]. The idea is to use two different strategy update steps, one of which uses a faster learning rate than the other: the agent learns fast when it is "losing" and slowly when it is "winning." To distinguish between these situations, the player keeps track of two policies; each policy assigns the probabilities of taking each action in its corresponding situation.
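A minimal numpy sketch of this update rule is given below; the simplex-projection helper and the handling of a zero denominator are our additions, and the step follows the published GIGA-WoLF update rather than any code from this thesis:

import numpy as np

def project_to_simplex(v):
    # Euclidean projection of a vector onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def giga_wolf_update(x, z, reward_vector, eta):
    # x: current strategy, z: baseline strategy, reward_vector: payoff of each action.
    x_hat = project_to_simplex(x + eta * reward_vector)
    z_new = project_to_simplex(z + eta * reward_vector / 3.0)
    denom = np.linalg.norm(z_new - x_hat)
    delta = 1.0 if denom == 0.0 else min(1.0, np.linalg.norm(z_new - z) / denom)
    x_new = x_hat + delta * (z_new - x_hat)
    return x_new, z_new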

GIGA-WoLF is a no-regret algorithm. No-regret learning converges to NEs in dominance-solvable, constant-sum, and 2-action general-sum games, but does not necessarily converge in Shapley's game [50].

Algorithm 3.2 Q-learning
  for each state-action pair (s, a) do
    Q(s, a) ← 0
  end for
  loop
    Depending on exploration rate ε, select an action a and execute it
    Receive immediate reward r
    Observe the new state s′
    Update the table entry for Q(s, a) as follows:
      Q(s, a) ← (1 − α) · Q(s, a) + α · (r + γ · max_{a′} Q(s′, a′))
    s ← s′
  end loop

3.3.2 Q-learning

Q-learning [100] is a reinforcement learning technique that is widely used in ar-

tificial intelligence research [35], [72], [9]. It can also be viewed as a dynamic programming technique, in which the agent iteratively tries to learn its "to-go" payoff (called

a Q-value) over time. Its main idea can be summarized as follows: an agent tries

an action at a particular state, and evaluates its consequences in terms of the imme-

diate reward or penalty it receives and its estimate of the value of the next state. By

trying all actions in all states repeatedly, it learns which are best overall, judged by

the long-term discounted reward. Algorithm 3.2 shows the main structure of the

algorithm.

For all states and action pairs, Q(s, a) converges to the true value under the

optimal policy when (i) the environment has the Markov property, (ii) the agent

visits all states and takes all actions infinitely often, and (iii) the learning rate α

is decreased properly. However, if the agent always chooses the actions greedily

during learning, Q-values may converge to a local optimum because the agent may

not visit all states sufficiently. To avoid this, the agent usually uses a stochastic


method (such as ε-greedy) to choose actions. The ε-greedy method chooses an ac-

tion that has the maximum Q-value with probability (1-ε) or a random action with

probability ε [72].
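A minimal Python sketch of one such update for a repeated matrix game, with the state taken to be the previous joint action (as in the settings listed below); the opponent interface and the fixed ε used here are illustrative assumptions:

import random
from collections import defaultdict

def q_learning_step(Q, state, my_payoffs, opponent_action_fn,
                    alpha=0.1, gamma=0.95, epsilon=0.1, n_actions=2):
    # ε-greedy selection over the Q-values of the current state.
    if random.random() < epsilon:
        action = random.randrange(n_actions)
    else:
        action = max(range(n_actions), key=lambda a: Q[(state, a)])
    opponent_action = opponent_action_fn(state)
    reward = my_payoffs[action][opponent_action]      # my payoff at the joint action
    next_state = (action, opponent_action)            # state = previous joint action
    best_next = max(Q[(next_state, a)] for a in range(n_actions))
    Q[(state, action)] = (1 - alpha) * Q[(state, action)] + \
                         alpha * (reward + gamma * best_next)
    return next_state

# Usage sketch: Q = defaultdict(float); state = (0, 0);
# my_payoffs = [[3, 0], [5, 1]] (Prisoner's dilemma, row player's payoffs).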

The Q-learning algorithm we are using in our experiments has the following

settings:

• Discount factor γ = 0.95.

• State is represented by the previous joint action of the agent and its asso-

ciates.

• Exploration rate ε = 1.0/(10.0+(t/1000.0)), where t represents the number

of rounds played.

3.4 Evaluation criteria

In order to evaluate the performance of our algorithms, we focus on two main points:

1. The average fitness of population per generation: This will help us to evalu-

ate the performance of the algorithm regarding convergence and the ability

of the algorithm to learn over time.

2. The final payoff achieved: By knowing and studying the final payoff, we

know the final performance of the algorithm in comparison to other algo-

rithms.

These evaluation criteria are averaged over 10 runs of each algorithm against

the selected opponent in order to eliminate the effect of randomness. We show the

variations within the final payoffs over all the conducted runs in order to verify

the stability of the performance of the algorithm within a specific game against a

specific opponent.


3.5 Performance of GIGA-WoLF and Q-learning

In order to get a better understanding of the behavior of the algorithms in the se-

lected games, we compared the performance of the algorithms against each other

and in self-play with available reports from previous work [10], [104], [71]. Fig-

ure 3.1 shows our results. These results represent, for each learner, the final average payoff averaged over 10 runs within each game. Each run consists of

running both learning algorithms against each other for 100,000 steps. In addition,

the figure shows the standard deviation of the payoffs from the average (to show if

it converges consistently to the same rewards or not).

3.5.1 Prisoner’s dilemma

In the prisoner’s dilemma, we expected the agents to learn mutual defection when

GIGA-WoLF plays against Q-learning. This is because GIGA-WoLF is a no-regret algorithm, which learns to defect within a few iterations. The Q-learner, on the other hand, learns more slowly than GIGA-WoLF, so it takes a larger number of iterations for the Q-learner to learn to defect against GIGA-WoLF. The Q-learner we used uses a high value for γ, as previous work showed

that it increases the probability of cooperation if the other learners are willing to

cooperate [71].

3.5.2 Chicken

When GIGA-WoLF and Q-learning interact in Chicken, they did not stabilize to a fixed outcome across the 10 simulations we ran. GIGA-WoLF acts according to the state-action pairs that the Q-learner explores at the beginning of a simulation. Thus, if the Q-learner starts out attempting to swerve, GIGA-WoLF goes straight. In most cases, however, the Q-learner tries to go straight early on, so GIGA-WoLF takes the safe option and swerves for the rest of the game.

Neither Q-learning nor GIGA-WoLF is able to reach the compromise outcome (Swerve, Swerve) in self-play. One agent always "bullies" the other into swerving while it goes straight itself.

3.5.3 Shapley’s game

Previous work [9], [50] shows that GIGA-WoLF’s policy does not converge in

Shapley’s game. This fact is apparent in both self-play and against Q-learning. As

a result, players receive an average payoff near the NE value in this game. The

best performance for Q-learning is in self-play, as it is often able to learn over time to reach the solution of alternating between "winning" and "losing." But even then, in some cases it is still unable to reach this satisfactory solution.

3.5.4 Cooperative games

Within cooperative games, GIGA-WoLF will find it easy to maintain one of the

actions over time, giving it the ability to reach cooperation quickly in self-play.

This property helps the Q-learner to easily discover the state-action pair that has

the maximum Q-value (as both of the agents in this case have the same goal). As

a result, our Q-learners learn mutual cooperation against GIGA-WoLF. On the other hand, Q-learning is not able to maintain the highest possible cooperative payoff in self-play. The reason is that, although each agent tries to stabilize at one of the actions (to reach its steady state), the exploration mechanism within the algorithm sometimes makes it hard for both agents to maintain a certain action pair.


3.6 Summary

From the results presented within this chapter, we find that, although both Q-

learning and GIGA-WoLF perform well under certain situations, there are situa-

tions in which the algorithms do not learn effectively. Furthermore, these algo-

rithms sometimes take a long time to converge. This motivates us to develop new algorithms that are able to adapt within such dynamic systems. In

the following chapter, we start discussing the structure of the suggested algorithm

and potential variations that could enhance it.


[Figure 3.1 consists of four bar plots, one per game (Prisoner's dilemma, Cooperation game, Chicken game, and Shapley's game), showing the average payoff of GIGA-WoLF and Q-learning over 10 runs of 100,000 steps each, against each opponent ("Vs. GIGA-WoLF" and "Vs. Q-learning").]

Figure 3.1: Payoffs of GIGA-WoLF and Q-learning within selected games.


CHAPTER 4

Learning using Genetic Algorithms

In this chapter, we discuss the performance of a basic genetic algorithm (GA) in repeated matrix games. In addition, we present several suggested mod-

ifications to this basic algorithm and show how they may affect the performance of

the GA. We first describe the basic GA structure, and the modifications we apply to

it. We then demonstrate the performance of these algorithms against GIGA-WoLF,

Q-Learning and in self-play.

We initially define a set of parameters that are used within our algorithms; we tried to maximize the number of steps taken in order to get a better understanding of the learning trends. These parameters include the total number of steps (Ns) and the number of generations (NG), both of which determine the time range through which the agent can learn by playing against other agents. We set NG = 100 gen-

erations and Ns = 100,000 (for easier manipulation of the calculations required).

Once we know NG and Ns, we get a trade-off between the number of chromo-

somes within a population (Nc), and the number of steps that each chromosome


plays against the opponent (Nsc). Equation 4.1 shows the resulting trade-off.

NG = Ns / (Nsc × Nc)    (4.1)

Therefore, by setting the total number of steps (Ns) to 100,000, and by fixing the

number of generations to 100 (in order to have an acceptable number of generations

through which we compare our results), we get that Nsc×Nc=1000.

Through initial experimentation, we set Nsc to different values including 50, 100, and 200, which caused Nc to be 20, 10, and 5 respectively. This analysis shows that decreasing the number of chromosomes within a population reduces randomization, which can cause the population to lean towards a local optimum in certain situations. At the same time, increasing the number of members in the population allows more exploration, but evaluating the population consumes more time. That is why we settled for a population of 20 chromosomes, which appears to be a reasonable compromise in our settings.

In the following section, we will describe the main structure of the algorithm

to get a better understanding of how we incorporate these modifications, as well as

comprehend the results and analysis.

4.1 Algorithm structure

We analyze a GA typically used in similar problems [3], with slight modifications

to the selection function. Before we start, we introduce some of the variables

used within our work. Table 4.1 shows the common parameters used within the

algorithms.

As mentioned within the literature review, GA starts with the initialization of a

new population of chromosomes (Pop). Each chromosome C represents the strat-

egy followed by the player in response to the current history (his) of both the player


and its opponent. An example of the structure of the chromosome within the pris-

oner’s dilemma game can be seen in Figure 4.1.

In this structure, each bit in the chromosome represents the action to be taken for a particular joint-action history (specified by the bit's position). In our case, we

used the last three actions taken by both the agent and its opponent, as has been

done in some past work [3]. In order to determine which action to take according to

a set of history steps (his), we convert the number his from the base Na to decimal

base in order to identify the bit location of the action to be taken. This conversion is

made as follows:

Ap = Σ_{i=1}^{3} (Na)^i × his_i    (4.2)

Symbol   Meaning
Pm       Mutation rate
Pc       Crossover rate
Pe       Elitism rate
C        Chromosome (strategy)
f        Fitness
P        Parent
Ch       Child
Ap       Position of the gene that determines the action to be taken
Bp       Best chromosome in the previous generation
Avp      Average fitness of chromosomes in the current generation
Bavp     Best average payoff over generations
Nc       Number of chromosomes per generation
NG       Number of generations
Na       Number of actions available to each player
Ns       Total number of steps
Nsc      Number of steps per chromosome
Pop      Current population
his      Current history of actions
Ovf      Overall fitness of a chromosome
g        Gene within a chromosome (bit)
Entrop   Entropy of a gene

Table 4.1: Variables used within the algorithms


Figure 4.1: Chromosome structure

Using this equation, we can identify the position of the action to be taken in

response to a certain history within the chromosome. For example, suppose my history is CCCCDC, meaning that my moves in the past three stages were Cooperate except for a Defect in the last stage, while the opponent cooperated all the time. This history can be encoded as the binary number 000010, which we convert

to decimal base. This means that the bit g at position Ap=2 within the chromosome

shows the action to be taken given this history. Take note that the binary encoded

history can be set in other bases depending on the number of actions available. For

example in a 3-action game, we will be working with history encoded in the ternary

numeric base.
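This lookup can be sketched in a few lines of Python (helper names are ours); it reads the encoded history as a number in base Na, consistent with the CCCCDC example above:

def history_to_position(encoded_history, n_actions=2):
    # encoded_history, e.g. "000010" for CCCCDC (C -> 0, D -> 1), is read
    # as a number in base n_actions to obtain the gene position Ap.
    position = 0
    for symbol in encoded_history:
        position = position * n_actions + int(symbol)
    return position

def action_for_history(chromosome, encoded_history, n_actions=2):
    # The gene at position Ap holds the action to play for this history.
    return chromosome[history_to_position(encoded_history, n_actions)]

# history_to_position("000010") == 2, so the gene at index 2 gives the response.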

After the initialization of the population (in this experiment we have 20 chro-

mosomes within the population), we start running these random chromosomes

against the opponent player to "evaluate" them. Each chromosome plays for 50 steps against the opponent, and its per-step reward is averaged to compute its fitness.

Following the evaluation, we sort the population according to fitness f, with the highest fitness at the top; here our selection process starts. We keep the elite

(top) chromosomes, whose number is defined by the elitism rate Pe for the follow-

ing generation, and apply mutation and crossover on the best two chromosomes.

The same sequence is repeated again with the new population Ch until a stopping


condition is satisfied or the game ends.

In our algorithm, we used the traditional default parameter values (mutation, crossover, and elitism rates) that have commonly been used in previous research [3], [64].

These values are set as: Pm = 0.01, Pc = 0.5, and Pe = 0.3, with the crossover point

at the middle of the chromosome.

In the following sections, we present the different variations of the genetic

algorithms applied to the basic GA. We also discuss our findings, which lead us to

the idea of integrating human input into the genetic algorithm.

4.1.1 Basic genetic algorithm

The basic structure of the genetic algorithm that has been previously explained is

detailed in Algorithm 4.1. We can see that the algorithm starts with going through

each chromosome of the population. It computes the fitness function by playing

against the opponent. After passing through all the chromosomes, we sort them in

the population using the fitness values. We then apply the elitism, crossover and

mutation functions to generate the new population. We repeat the process until we

reach the designated number of generations.

Algorithm 4.1 Basic genetic algorithm
  for i = 1 to Gen do
    Run population Pop against opponent
    for each C in population Pop do
      Set f
    end for
    Sort the population depending on value f
    Run elitism using Pe, crossover using Pc and mutate using Pm
  end for
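A condensed Python sketch of this loop under our settings (a population of 20 chromosomes, 50 evaluation steps each, elitism rate 0.3); the opponent interface and the crossover/mutation helpers sketched in Chapter 2 are assumed, so this is an illustration rather than the exact thesis implementation:

def evaluate(chromosome, opponent, steps_per_chromosome=50):
    # Fitness = average per-step payoff of the chromosome against the opponent.
    total = 0.0
    for _ in range(steps_per_chromosome):
        total += opponent.play_against(chromosome)  # assumed opponent interface
    return total / steps_per_chromosome

def basic_ga(population, opponent, generations=100, elitism_rate=0.3, pm=0.01):
    for _ in range(generations):
        fitnesses = [evaluate(c, opponent) for c in population]
        ranked = [c for _, c in sorted(zip(fitnesses, population), reverse=True)]
        n_elite = int(elitism_rate * len(ranked))
        new_population = ranked[:n_elite]            # keep the elite unchanged
        while len(new_population) < len(population):
            # Crossover and mutation are applied to the best two chromosomes.
            child_a, child_b = single_point_crossover(ranked[0], ranked[1])
            new_population.append(mutate(child_a, pm))
            if len(new_population) < len(population):
                new_population.append(mutate(child_b, pm))
        population = new_population
    return population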


Algorithm 4.2 Genetic algorithm with history propagation
  for i = 1 to Gen do
    Run population Pop against opponent
    for each C in population Pop do
      Set f
      Ovf ← (f + Ovf)/2
    end for
    Sort the population depending on value Ovf
    Run elitism using Pe, crossover using Pc and mutate using Pm
  end for

4.1.2 Genetic algorithm with history propagation

Normally, GA sorts the chromosomes within its population based on the fitness

acquired within the current generation. Over time, this may lead the algorithm

to reach a short-sighted solution (as it falls into a local optimum that it cannot

escape). To avoid this, the fitness of the chromosomes used over multiple genera-

tions is propagated from one generation to the next using what is called the overall

fitness (Ovf). This variation of the algorithm is outlined in Algorithm 4.2.

4.1.3 Genetic algorithm with stopping condition

The previously explained GA can be unstable when its only stopping condition is the fixed number of generations. To achieve more stability, we tested another GA with an additional stopping condition: if the fitness remains stable for a certain number of generations (in our experiments, we set it to 3), the GA ceases to evolve. The condition is described in Algorithm 4.3.

Algorithm 4.3 Genetic algorithm with stopping condition
  for i = 1 to Gen do
    Run population Pop against opponent
    for each C in population Pop do
      Set f
      Ovf ← f + Ovf
    end for
    Sort the population depending on value Ovf
    if Ovf of Bp = Ovf of P then
      Break from the loop
    end if
    Run elitism using Pe, crossover using Pc and mutate using Pm
  end for

4.1.4 Genetic algorithm with dynamic parameters' setting

One commonly used method in the literature is to dynamically set different parameters within the GA, specifically the mutation and elitism rates, based on the performance of the chromosomes. If the fitness is increasing from one generation to another, we decrease the mutation rate in an attempt to stabilize the algorithm

at its current fitness. If the average fitness of the population is deteriorating, we

increase the mutation rate in order to try to generate a more fit population (see

Algorithm 4.4).
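The rule can be sketched as follows; interpreting the 20% and 40% thresholds of Algorithm 4.4 as relative drops from the best average payoff seen so far is our assumption, as is the exact normalisation.

    def adjust_mutation_rate(avg_payoff, best_avg_payoff, current_rate):
        # Improving population: calm the search down and record the new best average.
        if avg_payoff >= best_avg_payoff:
            return 0.01, avg_payoff
        # Deteriorating population: mutate more aggressively the larger the drop.
        drop = (best_avg_payoff - avg_payoff) / abs(best_avg_payoff) if best_avg_payoff else 0.0
        if drop >= 0.40:
            return 0.40, best_avg_payoff
        if drop >= 0.20:
            return 0.20, best_avg_payoff
        return current_rate, best_avg_payoff   # small dip: keep the current rate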

4.1.5 Genetic algorithm with dynamic parameters' setting and stopping condition

This variation combines a good exploration methodology with a stopping condi-

tion. The structure of this combined variation is shown in Algorithm 4.5.



Algorithm 4.4 Genetic algorithm with dynamic parameters' setting
for i = 1 to Gen do
    Run population Pop against opponent
    for each C in population Pop do
        Set f
    end for
    Sort the population depending on value Ovf
    if Avp ≥ Bavp then
        Decrease mutation rate to 0.01
        Bavp ← Avp
    else
        if Avp ≤ Bavp then
            if Bavp − Avp ≥ 20% then
                if Bavp − Avp ≥ 40% then
                    Increase the mutation rate to 0.4
                else
                    Increase the mutation rate to 0.2
                end if
            else
                Keep mutation rate the same
            end if
            i ← i + k
        end if
    end if
    Run elitism using Pe, crossover using Pc and mutate using Pm
end for

4.2 Results and analysis

In this section, we present the results from the conducted experiments. As previ-

ously mentioned in chapter 3, we run our experiments on four different games:

1. Prisoner’s dilemma (Table 4.2).

2. Cooperation game (Table 4.3).

3. Chicken game (Table 4.4).

4. Shapley’s game (Table 4.5).



Algorithm 4.5 Genetic algorithm with dynamic parameters' setting and stopping condition
for i = 1 to Gen do
    Run population Pop against opponent
    for each C in population Pop do
        Set f
    end for
    Sort the population depending on value Ovf
    if Ovf of Bp = Ovf of P then
        Break from the loop
    end if
    if Avp ≥ Bavp then
        Decrease mutation rate to 0.01
        Bavp ← Avp
    else
        if Avp ≤ Bavp then
            if Bavp − Avp ≥ 20% then
                if Bavp − Avp ≥ 40% then
                    Increase the mutation rate to 0.4
                else
                    Increase the mutation rate to 0.2
                end if
            else
                Keep mutation rate the same
            end if
            i ← i + k
        end if
    end if
    Run elitism using Pe, crossover using Pc and mutate using Pm
end for

In presenting these results, we compare the performance of the algorithmic

variations discussed in the previous section.

4.2.1 Genetic algorithms vs. GIGA-WoLF

Our first set of results shows the performance of the different variations of GA against the opponent GIGA-WoLF. Overall, we can see from Figure 4.2 that the dynamic GA with mutation excels in comparison to the other variations of GA, with a smaller variation in the final payoffs, which indicates stability. Figures 4.3, 4.4,



            Cooperate   Defect
Cooperate     3,3         0,5
Defect        5,0         1,1

Table 4.2: Payoff matrix for the Prisoner's dilemma.

      a     b
A    4,4   0,0
B    0,0   2,2

Table 4.3: Payoff of a fully cooperative matrix game.

and 4.5 show that the same properties hold if we consider each game separately. We see that the stopping condition on its own, although it provides stability, is not able to overcome local optima in some situations (as no extensive exploration of the strategy space occurs). Thus, dynamic mutation in addition to the stopping condition enhances the exploration most effectively.

From Figure 4.6, we can see a slight improvement in the performance in Shapley's game when we add the propagation of history, where both versions of the GA (with and without propagation) achieve a payoff that is near the Nash value (0.33). However, in contradiction with the general trend of results, even though the modifications slightly enhanced the final payoff achieved, none of the algorithms converges. This is shown by the wide margin within the final payoffs. In this case, the pure dynamic GA (without the stopping condition) performs better.

4.2.2 Genetic algorithms vs. Q-learning

Figure 4.7 shows the average final payoffs of the various GAs against Q-learning across all games. The figure shows that the algorithms performing the best are the

           Swerve   Straight
Swerve      6,6      4,7
Straight    7,4      2,2

Table 4.4: Payoff matrix for the Chicken game.



      a     b     c
A    0,0   0,1   1,0
B    1,0   0,0   0,1
C    0,1   1,0   0,0

Table 4.5: Payoff matrix of Shapley's game.

Figure 4.2: Effect of variations on GA on final payoffs against GIGA-WoLF (all games).

basic GA (without propagation) and the dynamic GA with mutation. In Figure 4.8, we can see that some variations (such as history propagation) were able to reach higher payoffs in certain circumstances.

However, we analyzed a few samples of the resulting populations' chromosomes of both the basic GA and the dynamic GA with mutation in the prisoner's dilemma (Figure 4.12), as well as the margin of variation in the final payoffs. We saw that while the basic GA results in a random population (not conforming to a certain pattern of strategies), the dynamic GA was able, within certain iterations, to conform to a cooperation policy with Q-learning. We can also see from Figure 4.13, which shows the average payoff awarded to the algorithm over time, that the GA with propagation of history shows a better learning trend in the end.

In contrast to what we expected, history propagation slightly decreased the performance of the algorithm (wider margin of variation in Figure 4.9). In Chicken (Figure 4.10), we can see that GA is unable to adapt to a bullying strategy in this



Figure 4.3: Effect of variations on GA on final payoffs against GIGA-WoLF in prisoner's dilemma.

Figure 4.4: Effect of variations on GA on final payoffs against GIGA-WoLF in a cooperation game.

Figure 4.5: Effect of variations on GA on final payoffs against GIGA-WoLF in the Chicken game.

Figure 4.6: Effect of variations on GA on final payoffs against GIGA-WoLF in Shapley's game.

case. The best performances can actually be seen in the case of the basic GA and the dynamic GA. This is a very unexpected result, given that dynamicity confuses slow-learning algorithms such as Q-learning, making it more difficult for them to adapt.

In Shapley’s game (Figure 4.11), the GA with the highest performance was the

one with the propagation of history. What was expected and viewed from the re-

sults is the fact that algorithms with stopping conditions (not continuously learning

and mutating with a higher rate) are ones that performed the worst. This confirms

the need of an adaptable mutation rate in order to enhance the performance of the

algorithm. However, by implementing the dynamic mutation technique, we can

see in figures an improvement in the performance (Figure 4.14). It was noticed

that using the stopping condition on its own (without implementing the dynamic



Figure 4.7: Effect of variations on GA on final payoffs against Q-learning (all games).

mutation) led to stability at an unacceptable local minimum.

4.2.3 Genetic algorithms in self play

Figure 4.15 shows that the dynamic GA with a stopping condition has a better overall performance than its counterparts, both in terms of the average final payoff and the margin of the final payoff.

However, this is not universally true in specific games (Figures 4.16-4.19); the different variations do not affect the performance of the algorithms substantially in the prisoner's dilemma. Analyzing the set of policies achieved, we find that in almost all self-play cases, the algorithms learn to alternate between defection and cooperation.

None of the variations were able to achieve continuous cooperation in the cooperation game (Figure 4.17). The best performer was the dynamic GA with stopping condition. On the other hand, none of the GA variations were able to stabilize at a satisfactory payoff except for the dynamic GA with stopping condition in both Chicken and Shapley's game (Figures 4.18 and 4.19). The variation with only a stopping condition gave an unacceptable performance. This may be due to the fact that it converged prematurely.



Figure 4.8: Effect of variations on GA on final payoffs against Q-learning in prisoner's dilemma.

Figure 4.9: Effect of variations on GA on final payoffs against Q-learning in a cooperation game.

Figure 4.10: Effect of variations on GA on final payoffs against Q-learning in the Chicken game.

Figure 4.11: Effect of variations on GA on final payoffs against Q-learning in Shapley's game.

Figure 4.12: Sample of the chromosomes generated vs. Q-learning in prisoner's dilemma.



Figure 4.13: Effect of history propagation on GA against Q-learning in prisoner's dilemma. Values shown are the average payoff per generation.

Figure 4.14: Effect of variations on GA against Q-learning in the cooperation game (average payoff per generation).

Figure 4.15: Effect of variations on GA on final payoffs in self-play (all games).



4.3 Conclusions

From the results, we can deduce that our final variation (GA with history propagation and stopping condition) performed the best on average. We also concluded that a dynamic mutation rate (varying from time to time) is very effective for enhancing the performance of the GA. In addition, we can see a variation in the performance of the GA and its variations from one iteration to another (shown by the wide margin within the final payoffs in some cases). This shows us that, as we are using a random initial population, the population at the beginning of the game affects the performance of the GA drastically.

Figure 4.16: Effect of variations on GA on final payoffs in self-play in prisoner's dilemma.

Figure 4.17: Effect of variations on GA on final payoffs in self-play in a cooperation game.

Figure 4.18: Effect of variations on GA on final payoffs in self-play in the Chicken game.

Figure 4.19: Effect of variations on GA on final payoffs in self-play in Shapley's game.


CHAPTER 5

Interactive genetic algorithms

In this chapter, we propose a method for using human input to improve the performance of a GA in repeated matrix games. Furthermore, we show the results of running the resulting algorithm against the same set of opponents within the same set of games as in the previous experiments.

As we studied and analyzed in the previous chapter, slight changes within the population structure, either initially or throughout the learning process, can have a huge effect on the performance of this learning algorithm. We suggest the usage of human input as a changing factor in these environments. So-called interactive genetic algorithms are widely used in various applications, usually when it is easy for the user to visualize and evaluate the performance. In this chapter, we describe and evaluate different variations of the interactive GA in repeated matrix games.

As mentioned in the literature review, interactive genetic algorithms (IGA) are the focus of much current research, especially when systems include more




qualitative criteria of judgement that require potential human interference. Currently, the main focus of interactive genetic algorithms is on more visual applications, where the user can express his or her "qualitative" satisfaction with the performance of the algorithm. Such evaluations can be used as a fitness function.

Motivated by this idea, we designed a framework through which a human can receive and give feedback about the performance of the GA. This feedback should provide, through minimal interactions, the necessary information to enhance the algorithm's performance and adapt it quickly to any changes in the surrounding environment.

We deduced from the results of the previous chapter two characteristics that affect the performance of GAs within repeated matrix games: population initialization and variable mutation rates.

Our framework focuses primarily on using human input to improve the population. However, we also ran some experiments to show the possible benefit of having the human vary the algorithm's mutation rate dynamically.

5.1 Human input framework

The human input framework is outlined in Figure 5.1. This framework should enable the user to both give and receive feedback from the algorithm throughout the game. We also show a graphical user interface for interaction purposes in Figure 5.2. In the following sections, we describe the detailed structure of the framework by discussing its main steps in turn.

5.1.1 Evaluate the population

Our main concern is how to allow the user to effectively change the chromosome population. Initially, the population is generated randomly. After the initial



Figure 5.1: Human input framework. Its four main steps are: (1) evaluate the population, (2) select a set of histories, (3) generate statistics for the selected histories, and (4) generate a new population using human input.



Figure 5.2: Designed graphical user interface.

population plays against the opponent, it is evaluated, sorted, mutated, and crossed over (the same sequence as in the normal GA).

5.1.2 Select set of histories

Since it is considered extensive work for the user to evaluate and alter each aspect of the strategy, we select a particular part of the strategy for the user to evaluate. To do this, we go through all the chromosomes and select the bits that have the highest entropy (Entrop) within the chromosomes. We chose the entropy property because we want to select the bits within the population that have the most variation, and therefore represent the indecisiveness of the population. While calculating the entropy, we give more weight to higher fitness, as follows:



Entropy = −p log(p)    (5.1)

where p represents the probability of occurrence of a certain action Pa following a certain history pattern, multiplied by a factor of the fitness f of the chromosome within which this action occurred, divided by the total fitness of the whole population FitnessTotal. The following equation shows how the probability p is computed:

p = Pa × (f / FitnessTotal)²    (5.2)
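The Python sketch below combines Equations 5.1 and 5.2 to score each bit position; how the per-chromosome fitness weights are aggregated across the population is not fully specified in the text, so the summation used here is an assumption made for illustration.

    import math

    def weighted_bit_entropy(population, fitness, bit):
        # For each possible action at position `bit`, p combines the action's
        # frequency Pa with a squared fitness weight (Eq. 5.2); the bit's score
        # sums -p * log(p) over the actions (Eq. 5.1).
        total_fitness = sum(fitness) or 1.0
        score = 0.0
        for action in (0, 1):
            holders = [i for i, chrom in enumerate(population) if chrom[bit] == action]
            if not holders:
                continue
            p_a = len(holders) / len(population)
            weight = sum((fitness[i] / total_fitness) ** 2 for i in holders)
            p = p_a * weight
            if p > 0:
                score += -p * math.log(p)
        return score

    def select_uncertain_bits(population, fitness, r):
        # Return the r bit positions (history patterns) with the highest entropy.
        n_bits = len(population[0])
        scores = [(weighted_bit_entropy(population, fitness, b), b) for b in range(n_bits)]
        scores.sort(reverse=True)
        return [b for _, b in scores[:r]]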

5.1.3 Generate statistics for selected histories

After calculating the entropy and deciding which bits (history patterns) correspond to uncertain behavior, we calculate, for each selected bit g, the average, minimum, and maximum fitness of the chromosomes that contain a certain value for the action at this bit.

The human receives feedback from the algorithm as a set of statistics about the performance of the population in the previous generation (a small sketch of how these statistics could be assembled follows the list). These statistics include:

1. Selected history patterns: the history patterns with the highest entropy Entrop (calculated based on the actions taken) over the population.

2. Percentage of the population that performs a certain action in response to the

selected history patterns.

3. Average fitness of chromosomes that perform a certain action in response to

the selected history patterns.

4. Average fitness of the previous population.
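As indicated above, the per-bit statistics could be assembled as in the following sketch (illustrative only; the bit-list chromosome encoding and the 0/1 action coding are assumptions).

    def history_statistics(population, fitness, bit):
        # For one selected history pattern (bit): for each action, the share of
        # the population taking it and the average, minimum and maximum fitness
        # of the chromosomes that take it.
        stats = {}
        for action in (0, 1):
            values = [fitness[i] for i, c in enumerate(population) if c[bit] == action]
            if values:
                stats[action] = {
                    "share": len(values) / len(population),
                    "avg_fitness": sum(values) / len(values),
                    "min_fitness": min(values),
                    "max_fitness": max(values),
                }
        return stats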



5.1.4 Generating a new population from human input

In response to the statistics presented, the human suggests a strategy for the par-

ticular history presented to him. The GA uses this information to generate its

population for the subsequent generation. This is done by matching the suggested

strategies to the strategies expressed in the chromosomes of the previous popula-

tion. For each chromosome, if there is a match between the strategy used by the

chromosome for at least one of the selected history patterns and the strategy sug-

gested by the human input for that history pattern, we move the chromosome as it

is to the following population.

For chromosomes in which no matches occur, we can either replace the strategies used at the selected history patterns with the strategies suggested by the human in this situation, or replace them with randomly generated strategies. In order to choose whether we follow the human's suggestion or randomly replace the existing action, we depend on what we define as the variance v. The calculation of the variance gives us an idea about the inner statistics of a certain population, and whether there are major gaps in fitness between the higher- and lower-fitness chromosomes.

In case we have a (relatively) low variation in fitness between the best and the worst chromosomes, we follow the human input, as there is a good probability that the human suggestions enhance the overall performance of the population. Otherwise, we replace the non-matching actions randomly in order to give space for more exploration and not be completely reliant on the human input (as there is no guarantee about its quality). The variance is defined using the following equation:

v = ∑chromosomes (f − mean)    (5.3)

Page 88: Applicability of Interactive Genetic Algorithms to Multi-agent Systems: Experiments on Games Used in Smart Grid Simulations

CHAPTER 5. INTERACTIVE GENETIC ALGORITHMS 69

where f is the fitness of the chromosome and mean is the mean of the fitnesses of all the chromosomes within the population. If the variance is not implemented, we use the human-suggested strategy as the replacement by default (instead of random replacement). The purpose of the variance is to try to enhance the exploration of the algorithm outside the suggestions of the human input.
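Putting the matching rule and the variance check together, a hedged sketch looks as follows. Equation 5.3 is written as a plain sum of deviations, which is zero by construction, so the sketch uses the conventional squared form as an assumption; the threshold on v is also an assumption, since the text only asks for the variation to be "relatively" low.

    import random

    def new_population_from_human_input(population, fitness, selected_bits,
                                        human_actions, v_threshold):
        # Chromosomes that agree with the human suggestion at at least one
        # selected bit are kept as they are; in the others the selected bits are
        # overwritten, with the suggestion (low variance) or randomly otherwise.
        mean = sum(fitness) / len(fitness)
        v = sum((f - mean) ** 2 for f in fitness)   # squared form used here (assumption)
        follow_human = v < v_threshold

        new_population = []
        for chrom in population:
            if any(chrom[b] == human_actions[b] for b in selected_bits):
                new_population.append(list(chrom))
                continue
            child = list(chrom)
            for b in selected_bits:
                child[b] = human_actions[b] if follow_human else random.randint(0, 1)
            new_population.append(child)
        return new_population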

After generating the new population, we repeat the process by evaluating this population against the opponent. The process of feedback and input can be repeated at a certain defined rate. Here, we chose to repeat it every four generations so that we could examine the effect at a constant rate.

5.2 Interactive genetic algorithms: Six variations

We evaluated six variations of our IGA algorithm. These variations differ in the

implementation of certain factors including:

1. Using different (larger) mutation rates in the dynamic mutation (we multi-

plied the rates normally used by 2).

2. Using variance that depends only on the current fitness of the chromosomes.

3. Using variance that depends on propagating fitness of the chromosomes.

4. Using a stopping condition that depends only on the current fitness of the chromosomes (the default is that it depends on the propagating fitness of the chromosomes).

Table 5.1 summarizes the algorithms we implemented, including which of the vari-

ations exist in each algorithm.

5.2.1 Effect of input quality on the performance of GA

In order to intensify our study of the effect of the human input on the GA performance, we considered how the quality of the human input may affect the performance



Variation    Larger mutation    Variance    Variance W/ prop.    Stop. cond. W/o prop.
Basic IGA
IGA2
IGA3    √
IGA4    √
IGA5    √  √
IGA6    √  √

Table 5.1: Properties of the different variations of IGA algorithms.

Game                 Acceptable human input       Unacceptable human input
Prisoner's Dilemma   TIT-FOR-TAT                  Inverse TIT-FOR-TAT
Cooperation Game     Always play A                Always play B
Chicken Game         Always straight and TFT      Always swerve
Shapley's Game       Random                       Pure strategy

Table 5.2: Acceptable and unacceptable human inputs for the selected 2-agent matrix games.

of the GA. For each game, we identified what we call "Acceptable" and "Unacceptable" human input. This helps us to better understand how the quality of human input changes the performance of the GA. In Table 5.2, we summarize the possible qualities of human input for each of the games used in our simulations. Note that in our selections of what is considered "acceptable" and "unacceptable" input, we faced the dilemma of having different criteria of quality that vary depending on the opponent faced. Therefore, the performance of a certain type of human input is situational (depending on the opponent faced). We tried to identify well-performing human input through analyzing the populations generated by running the algorithms from the previous chapter. We will see that, in some situations, human input did not perform as expected. More details are given in the results section.

In order to get a better understanding of the table, we now explain the set of strategies mentioned. In the case of the prisoner's dilemma, previous research has shown that TFT is a robust strategy [4]. TFT behaves as follows: start with



cooperation; subsequently, copy the opponent's action from the previous time step. For the "Unacceptable" input in the prisoner's dilemma, we inverted the TIT-FOR-TAT strategy (defect if the opponent cooperates and cooperate if the opponent defects). This human input should be misleading enough to be considered an "Unacceptable" human input.
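For concreteness, these two prisoner's dilemma inputs can be written as simple reactive strategies; the 'C'/'D' encoding and the first move of the inverse strategy are assumptions made for this illustration.

    def tit_for_tat(opponent_history):
        # Cooperate on the first move, then copy the opponent's previous action.
        return 'C' if not opponent_history else opponent_history[-1]

    def inverse_tit_for_tat(opponent_history):
        # The "unacceptable" input: play the opposite of the opponent's previous
        # action (defect against cooperation, cooperate against defection).
        if not opponent_history:
            return 'D'   # first move chosen arbitrarily for this sketch
        return 'D' if opponent_history[-1] == 'C' else 'C'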

In the cooperation game, we assume that an acceptable strategy would be to always play A, while always playing B would be less desirable. Moving on to the chicken game, we found in the previous chapter that almost all of our opponents (except in self-play) can be easily manipulated to swerve if our algorithm persists in going straight; therefore, we chose going straight as one of our acceptable inputs. We also added a TFT strategy as an acceptable human input in order to see if mutual swerving can be achieved, and also to show how the performance of each human input depends on the opponent faced. The unacceptable human input was defined as always swerving.

Finally, we discuss the human input quality in the case of Shapley's game. In this case, we assumed that a random strategy might be suitable to be considered acceptable, as a result of the analysis of the populations generated in the previous chapter, because it maximizes the minimum payoff. However, it does not lead to a mutually beneficial solution. We chose the unacceptable strategy to be a pure strategy, which we chose to be taking the first possible action (playing action A).

5.3 Results and analysis

We now present the results from our experiments using the same settings as in the

experiments reported in the previous chapter. We first show the effect of basic

human input on the performance of GA. We then discuss the effect of different



variations of the human input on the performance of IGA. Finally, we show the impact of the human input quality on the performance of IGA.

In the results, we rely on the following evaluation criteria to compare the performance of the algorithms:

• Average final payoff: This represents the average of the final payoffs awarded to the algorithm over the 10 conducted iterations.

• Consistency: This represents how consistent the final payoffs are over several iterations. It can be viewed in the figures as the range between the maximum and the minimum payoff value reached over the 10 iterations, excluding the outliers. An example is shown in Figure 5.3, and a small computational sketch of these metrics follows the figure.

• Convergence: This term is used when a figure shows the average payoff of populations over time. A better convergence means that the average payoff stabilizes at a good value over time.

Figure 5.3: Evaluation metrics example.
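As a small computational sketch (our own illustration), the first two metrics could be computed from the final payoffs of the 10 iterations as follows; the exact outlier-exclusion rule used in the box plots is not specified here, so it is omitted from the sketch.

    def evaluation_metrics(final_payoffs):
        # Average final payoff over the conducted iterations, and the consistency
        # range (maximum minus minimum final payoff; outliers not handled here).
        average = sum(final_payoffs) / len(final_payoffs)
        consistency_range = max(final_payoffs) - min(final_payoffs)
        return average, consistency_range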



Figure 5.4: Effect of human input on basic IGA against GIGA-WoLF (all games).

5.3.1 Interactive genetic algorithms

We first discuss the effect of implementing the basic IGA. In these results, we show the change in performance when we use an acceptable human input on the dynamic GA with mutation (the final algorithm reached in the last chapter). As done previously, we show the results for the algorithm against each opponent within each of the selected games. Note that in the case of Chicken, we display the results for two types of human input (TFT and bullying). The human input used within the following sections is the one with the highest overall performance for each algorithm separately.

Interactive genetic algorithms vs. GIGA-WoLF

In the previous chapter, we showed that, against GIGA-WoLF, the dynamic GA with stopping condition was able to stabilize within a few generations in some of the games. On the other hand, it was sometimes unable to explore the whole solution space, leading to a more "local" solution. Overall, we can see from Figure 5.4 that human input has not contributed to a major enhancement against GIGA-WoLF except in the cooperation game (Figure 5.6) and Chicken (Figure 5.7). The selected input in that case was bullying (showing stability).



Figure 5.5: Effect of basic human input on the performance (final payoff) of GA vs. GIGA-WoLF in prisoner's dilemma.

Figure 5.6: Effect of basic human input on the performance (final payoff) of GA vs. GIGA-WoLF in a cooperation game.

Figure 5.7: Effect of basic human input on the performance (final payoff) of GA vs. GIGA-WoLF in the Chicken game.

Figure 5.8: Effect of basic human input on the performance (final payoff) of GA vs. GIGA-WoLF in Shapley's game.

Interactive genetic algorithms vs. Q-learning

In the previous chapter, we observed that the average performance against Q-

learning in some of the games (especially prisoner’s dilemma) needed vast im-

provements. In addition, we noticed that most of the populations generated were

not converging to an acceptable set of chromosomes. In this section, we will show

the difference in performance with and without human input.

From the previous results and from Figure 5.10, we observe that, on average, acceptable human input has a positive effect on the GA. Although IGA is not consistent, the maximum and minimum payoffs achieved still have higher values for IGA than for GA.



Figure 5.9: Effect of basic human input on GA against GIGA-WoLF in Shapley's game (average payoff per generation).

Figure 5.10: Effect of basic human input on GA on final payoffs against Q-learning (all games).

We can also see a drastic improvement in Figure 5.12 between the performance of GA with and without the human input. This can be explained by the human input motivating a stronger reputation from the GA side; accordingly, the Q-learner is able to adapt more quickly, and in some situations is able to reach cooperation, as shown in the sample population in Figure 5.11. In Chicken (Figure 5.14), we can see that, as a result of having the Q-learner settle into swerving, a human input that encourages bullying is able to take the straight action much more easily. That is why in this situation we chose the bullying input as our acceptable input.

Figure 5.13 shows the performance in the cooperation game. Using human



Figure 5.11: A sample of the chromosomes generated from basic IGA vs. Q-learning in prisoner's dilemma.

input actually decreases the performance, which is shown through the low consistency in the final payoffs. This shows an instability in the output population of the GA. This can be attributed to the fact that the human input in some cases has a slower effect than what is needed to motivate Q-learning to learn the right action, and instead leads to confusion in taking the decision.

Figure 5.12: Effect of basic human input on the performance (final payoff) of GA vs. Q-learning in prisoner's dilemma.

Figure 5.13: Effect of basic human input on the performance (final payoff) of GA vs. Q-learning in a cooperation game.

Figure 5.14: Effect of basic human input on the performance (final payoff) of GA vs. Q-learning in the Chicken game.

Figure 5.15: Effect of basic human input on the performance (final payoff) of GA vs. Q-learning in Shapley's game.



Figure 5.16: Effect of basic human input on GA on final payoffs in self-play (all games).

Figure 5.17: Effect of basic human input on GA in self-play in the Chicken game (average payoff per generation).

Interactive genetic algorithms in self-play

In self-play, the results with the dynamic GA with stopping condition were not satisfactory in cooperation games. Additionally, the population generated while playing the prisoner's dilemma did not converge as desired. However, across all the games (Figure 5.16), human input produced a higher (though not statistically different) performance.

In prisoner’s dilemma (Figure 5.18), we can see that although the human input

is more consistent, the final payoffs from the IGA has a slightly worse performance

on average. This was not expected, especially in the case of self-play, as both



Figure 5.18: Effect of basic human input on the performance (final payoff) of GA in self-play in prisoner's dilemma.

Figure 5.19: Effect of basic human input on the performance (final payoff) of GA in self-play in a cooperation game.

Figure 5.20: Effect of basic human input on the performance (final payoff) of GA in self-play in the Chicken game.

Figure 5.21: Effect of basic human input on the performance (final payoff) of GA in self-play in Shapley's game.

algorithms are directed by the same TFT human input. Even so, the algorithms are able to reach cooperation in some iterations. We can also see a slightly worse performance using our chosen human input in Shapley's game (Figure 5.21).

Figure 5.19 shows how the average of the final payoffs awarded to the algorithm is better and more consistent. The algorithm reaches near-perfect coordination between the agents. Based on Figures 5.20 and 5.17, we chose our acceptable input to also be the one that motivates always bullying.

5.3.2 Modifications on interactive genetic algorithms

In this section, we display the results concerning the different variations implemented on the GA that we explained earlier. We show how these variations may affect



Figure 5.22: Effect of variations on human input on GA against GIGA-WoLF (all games).

Figure 5.23: Effect of variations on IGA against GIGA-WoLF in prisoner's dilemma (average payoff per generation).

the human interaction potential. This gives us an idea about different parameters

that the human can adjust according to his knowledge of the environment in order

to enhance the performance of the algorithm.

Against GIGA-WoLF

We can see in Figure 5.22 that, overall, the basic IGA has the highest performance, followed by IGA6, which is more consistent. From the results displayed in Figure 5.23, we can see that applying different mutation rates (IGA2) improves the performance immensely in the prisoner's dilemma. We can also see that using a



Figure 5.24: Effect of variations on IGA on final payoffs vs. GIGA-WoLF in prisoner's dilemma.

Figure 5.25: Effect of variations on IGA on final payoffs vs. GIGA-WoLF in a cooperation game.

Figure 5.26: Effect of variations on IGA on final payoffs vs. GIGA-WoLF in the Chicken game.

Figure 5.27: Effect of variations on IGA on final payoffs vs. GIGA-WoLF in Shapley's game.

propagating variance (IGA4) negatively impacts the performance of the algorithm. This is expected, as the propagating variance eliminates an important factor for speeding up the learning progress, namely how we merge the consideration of the history with the current payoff within the generation.

Figure 5.24 shows us that IGA3 (propagating history without a propagating variance) has the highest performance; it also shows a good convergence trend. Both are expected for the same reason mentioned earlier concerning the variance.

Against Q-learning

Here, we compare the various forms of IGA against Q-learning. Overall, the basic IGA had a better average performance than the other variations (Figure 5.28). IGA6 has



Figure 5.28: Effect of variations on IGA on final payoffs vs. Q-learning (all games).

the least variance in comparison to the other modifications.

The basic IGA excels in comparison to the other modifications (as shown in Figure 5.29) in the prisoner's dilemma, and the same holds in Shapley's game (Figure 5.32). We can see that the original set of mutation rates chosen for the algorithm is more suitable in this case. In addition, adding the variance factor, which makes the algorithm take the within-generation variance into consideration, did not improve the performance in terms of the average final payoffs. However, under certain circumstances, it was able to decrease the margin of the final payoffs achieved, which means more stability in terms of convergence over different sets of iterations. In Chicken (Figure 5.31), the performance agrees with this trend.

Figure 5.30 shows that IGA2 performs better than the basic IGA in a cooperation game. We can also see that the basic IGA, IGA3, and IGA4 perform better than the others. Therefore, algorithms with propagation of history tend to perform better.

Changing the set of mutation rates also improved the performance in Shapley's game (Figure 5.27). It also reduces the variance in the cooperation game (Figure 5.25). It can be noticed that IGA2 has an outlier that learns to coordinate with the opponent, but at the action having the lower payoff. This shows that the suggested set of mutation rates is not suitable for all games and all opponents.



Figure 5.29: Effect of variations on IGA on final payoffs vs. Q-learning in prisoner's dilemma.

Figure 5.30: Effect of variations on IGA on final payoffs vs. Q-learning in a cooperation game.

Figure 5.31: Effect of variations on IGA on final payoffs vs. Q-learning in the Chicken game.

Figure 5.32: Effect of variations on IGA on final payoffs vs. Q-learning in Shapley's game.

Self-play

Overall, the basic IGA seems to behave slightly better than its counterparts (Figure 5.33). As in the case when playing against Q-learning in the cooperation game, IGA2 performed the best in the cooperation game (Figure 5.35).

In prisoner's dilemma, Figure 5.34 shows that IGA6 is more consistent and has the highest average payoff. This is the first time that IGA6 was considered one of the best performers within prisoner's dilemma. This can be attributed to the fact that IGA6 does not consider history propagation, which agrees with our results from the previous chapter, where we showed that history propagation in self-play affects the performance negatively.



Figure 5.33: Effect of variations on IGA on final payoffs in self-play (all games).

5.3.3 The effect of human input quality on interactive genetic algorithms

Against GIGA-WoLF

We can see that, over all games, the performance with acceptable human input is much better than with unacceptable human input (Figure 5.38). An improvement in convergence in various games is seen in Figures 5.43 and 5.44.

Recall that our model of an acceptable human input for the game Chicken was to motivate bullying. Figure 5.41 shows that both input types produce similar average payoffs, but the unacceptable input is more consistent. An explanation for this would be that experimenting with "swerving" against GIGA-WoLF in the beginning made the algorithm realize that it is potentially not a convenient solution if the other algorithm is "bullying". The algorithm with acceptable input was even able to motivate GIGA-WoLF to swerve with it (mutual swerving) at one point in time.

Against Q-learning

On average, it is preferred to use acceptable human input (Figure 5.45). An unacceptable input gives an inconsistent reputation and confuses Q-learning. Therefore,



Figure 5.34: Effect of variations on IGA on final payoffs in self-play in prisoner's dilemma.

Figure 5.35: Effect of variations on IGA on final payoffs in self-play in a cooperation game.

Figure 5.36: Effect of variations on IGA on final payoffs in self-play in the Chicken game.

Figure 5.37: Effect of variations on IGA on final payoffs in self-play in Shapley's game.

the algorithm is unable to achieve mutual cooperation, or even to exploit Q-learning. This agrees with the results within the individual games (Figures 5.46-5.50).

Self-play

As was the case in the previous results, the overall performance in self-play is much better with acceptable input (Figure 5.51). Figure 5.54 shows yet another unexplainable result in Chicken: although the algorithm is encouraged to bully using the acceptable human input, it is able to reach the swerving solution with more consistency.



Figure 5.38: Human input quality and its effect on final payoffs of IGA against GIGA-WoLF (all games).

Figure 5.39: Human input quality and its effect on final payoffs of IGA vs. GIGA-WoLF in prisoner's dilemma.

Figure 5.40: Human input quality and its effect on final payoffs of IGA vs. GIGA-WoLF in a cooperation game.

Figure 5.41: Human input quality and its effect on final payoffs of IGA vs. GIGA-WoLF in the Chicken game.

Figure 5.42: Human input quality and its effect on final payoffs of IGA vs. GIGA-WoLF in Shapley's game.



Figure 5.43: Human input quality and its effect on IGA vs. GIGA-WoLF in prisoner's dilemma (average payoff per generation).

Figure 5.44: Human input quality and its effect on IGA against GIGA-WoLF in the cooperation game (average payoff per generation).

Figure 5.45: Human input quality and its effect on final payoffs of IGA vs. Q-learning (all games).



Figure 5.46: Human input quality and its effect on final payoffs of IGA vs. Q-learning in prisoner's dilemma (per generation).

Figure 5.47: Human input quality and its effect on final payoffs of IGA vs. Q-learning in prisoner's dilemma.

Figure 5.48: Human input quality and its effect on final payoffs of IGA vs. Q-learning in a cooperation game.

Figure 5.49: Human input quality and its effect on final payoffs of IGA vs. Q-learning in the Chicken game.

Figure 5.50: Human input quality and its effect on final payoffs of IGA vs. Q-learning in Shapley's game.



Figure 5.51: Human input quality and its effect on final payoffs of IGA in self-play (all games).

Figure 5.52: Human input quality and its effect on final payoffs of IGA in self-play in prisoner's dilemma.

Figure 5.53: Human input quality and its effect on final payoffs of IGA in self-play in a cooperation game.

Figure 5.54: Human input quality and its effect on final payoffs of IGA in self-play in the Chicken game.

Figure 5.55: Human input quality and its effect on final payoffs of IGA in self-play in Shapley's game.



5.4 Conclusions

In this chapter, we described a framework for using human input to enhance the performance of GAs. Through this framework, we tested different variations of how human input can be handled, as well as experimented with different quality levels of simulated human input. These experiments were intended to show how variation in human input affects the performance of the GA. From these results, we reached several conclusions, which can be summarized as follows:

• Human input, even though it may not enhance the final payoff in all situations, shows consistency over several iterations.

• Variations of IGA performed differently within different games and against

different opponents. Fitness that propagates through generations (propaga-

tion of history) is not always the best option. Further investigation about

balancing how previous and current payoffs affects the evaluation of the

chromosomes can be taken into consideration as future work.

• Human input quality varies across games and opponents. In our simulations, our knowledge of the games and the opponents contributed to the effectiveness of the human input. This suggests that an expert human will enhance the performance of an IGA more than a novice.
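As an illustration of how such a balance might be expressed, the following sketch uses an exponentially weighted fitness update. The function name and the weighting parameter alpha are our own assumptions introduced purely for illustration; they are not settings used in our experiments.

# Illustrative only: a hypothetical exponentially weighted fitness update.
# The parameter alpha is an assumption for illustration, not a setting used
# in the experiments reported in this thesis.
def propagated_fitness(previous_fitness, current_payoff, alpha=0.5):
    # alpha close to 1 favors fitness inherited from previous generations;
    # alpha close to 0 favors the payoff earned in the current generation.
    return alpha * previous_fitness + (1.0 - alpha) * current_payoff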

In the next chapter, we discuss the use of the basic IGA on a larger-scale version of the prisoner's dilemma in order to examine how well IGA performance carries over to that setting.


CHAPTER 6

IGA in N-player matrix games

To better explore the effectiveness of our methodology in more complex systems, we repeated our experiments in a three-player prisoner's dilemma game. In doing so, we compare the performance of the GA with dynamic mutation and a stopping condition against the basic IGA. We also analyze the effect of the quality of human input on the IGA in this game. We first describe the N-player prisoner's dilemma; note that our focus is on the iterated N-player prisoner's dilemma (NIPD) and not the one-shot game.

6.1 N-player prisoner’s dilemma

N-player dilemmas provide a formal representation of many real-world social dilemmas [23], including power markets [63]. The n-player prisoner's dilemma, also known as the Tragedy of the Commons (Hardin, 1968), is usually described as follows: a piece of land (the commons) is freely available for farmers to use for grazing cattle. For any individual farmer, it is advantageous to use this shared resource rather than their own land. However, if all farmers adopt the same reasoning, the commons will be over-used and will soon be of no use to any of the participants, resulting in an outcome that is undesirable for all farmers. The payoff matrix for a three-player variation of this game is shown in Figure 6.1.

Figure 6.1: Payoff matrix of the 3-player prisoner's dilemma.

In this game, there is a linear relationship between the fraction of cooperators and the utility received by a game participant (Figure 6.2). Importantly, the payoff received for a defection is higher than that received for a cooperation: the utility for defection dominates the payoff for cooperation in all cases. A small sketch of a payoff function with these properties is given below.
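Purely as an illustration of this structure, the following sketch defines a linear payoff of the kind shown in Figure 6.1. The constants are assumptions chosen only to satisfy the properties just described (payoffs grow linearly with the number of cooperators, defection strictly dominates cooperation, and full cooperation still beats full defection); they are not the exact values used in our experiments.

# Hypothetical linear payoffs for an n-player prisoner's dilemma.
# k is the number of cooperators among the other n-1 players. The constants
# are illustrative assumptions: defection always pays 1 more than cooperation
# for the same k, and full cooperation (payoff 2) beats full defection (payoff 1).
def cooperator_payoff(k, n=3):
    return 2.0 * k / (n - 1)

def defector_payoff(k, n=3):
    return cooperator_payoff(k, n) + 1.0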

Yao and Darwen [6] were among the first to use an evolutionary approach for the NIPD. In their initial work, they showed that as the number of players in an NIPD game increases, it becomes more difficult for cooperation to emerge. This is mainly because a player is unable to clearly distinguish who the defectors are and who the cooperators are. In the iterated prisoner's dilemma (IPD), a player can easily reciprocate against a defector in a one-to-one situation, and therefore discourage defection. However, in the NIPD, retaliation against a defector means punishment for everyone else in the game, including the cooperators.


Figure 6.2: Relationship between the fraction of cooperators and the utility received by a game participant.

6.2 Strategy representation

We have decided to adopt the representation of strategies developed by Yao and Darwen [6], as it is simpler and easier to implement than other techniques. Under this representation, a history of l rounds for an agent is represented as the combination of the following:

• l bits to represent the agent's l previous actions.

• l * log2(N) bits to represent the number of cooperators in each of the previous l rounds among the agent's N−1 group members, where N is the group size.

In our representation of the chromosome, '1' indicates a defection and '0' a cooperation. In our implementation, we have limited the number of previous actions in memory to three (i.e., l = 3). In a society of N = 4 agents, the history for an agent would therefore be 3 + 3 * log2(4) = 9 bits long under the above representation scheme. For example, an agent in our NIPD model could have the history 110 11 01 10. The first three bits from the left are the agent's previous three actions; they show that the agent defected in the last two rounds and cooperated in the round before that. The two-bit groups after the first three bits give the number of cooperators among the agent's group members in each of the last three rounds; this history indicates that there were 3, 1, and 2 cooperators in the agent's group in the last three rounds. A minimal sketch of this encoding is shown below.

To see whether our experiments agree with the performance reported in previous work, we studied the performance of GIGA-WoLF and Q-learning in the 3-player prisoner's dilemma (Figure 6.3). The results agree with those presented in previous work, especially with respect to the performance of Q-learning [9], [46], [76].

Figure 6.3: Final payoffs of selected opponents in the 3-player prisoner's dilemma.

6.3 Human input

In this game, we define acceptable human input as follows: if all agents cooperated in the previous round, then cooperate; otherwise, defect. The unacceptable input, on the other hand, defects if the other two players cooperated in the previous round; otherwise, it cooperates. This definition of acceptable and unacceptable input was chosen as a first hypothesis, and it is not successful under all conditions. Both rules are transcribed in the sketch below.
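The two simulated input rules can be transcribed directly; the sketch below is a straightforward reading of the definitions above, in which we take "all agents" to mean every player in the previous round. The function names and argument conventions are our own.

# Direct transcription of the two simulated human-input rules
# ('C' = cooperate, 'D' = defect); names and conventions are our own.
def acceptable_input(last_round_actions):
    # Cooperate only if every agent cooperated in the previous round.
    return 'C' if all(a == 'C' for a in last_round_actions) else 'D'

def unacceptable_input(others_last_actions):
    # Defect exactly when the other two players both cooperated;
    # otherwise cooperate.
    return 'D' if all(a == 'C' for a in others_last_actions) else 'C'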

6.4 Results and analysis

Figure 6.4 shows considerable potential for cooperation among the different variations of the GA in self-play. We can see that human input that encourages cooperation can lead directly to cooperation. In the case of an input that encourages defection, the algorithms may still be able to learn fast enough to reach cooperation, but in certain instances mutual defection prevails (as the algorithms follow the suggestions given by the human input).

Figure 6.5 shows the results in the situation where we have two GA agents in the presence of GIGA-WoLF as the third agent. We observe no major differences in terms of the final payoffs.

In the three variations of the GA, the algorithms were not able to converge to cooperation (except within certain iterations), which is why the final payoffs lean towards the defection case. This may be because the algorithms use a more complex structure to represent the world and therefore need a longer period of time to learn to cooperate (a hypothesis we did not verify over longer runs); instead, they converge to what may seem easier (defection). Acceptable human input, which encourages cooperation, can lead to successful cooperation, but at the same time it may encourage other agents to retaliate (which happens most of the time).

Figure 6.6 conforms with what we expected for the case in which the IGA interacts with two GIGA-WoLF agents. The unacceptable human input performed the worst, because GIGA-WoLF learns to defect as fast as possible; the unacceptable input suggests cooperation in this case, which leads to a slower convergence to defection on the GA side.


[Bar chart of average final payoff: no human input vs. acceptable vs. unacceptable human input.]

Figure 6.4: Effect of human input on the performance of GA in the 3-player prisoner's dilemma in self-play.

[Bar chart of average final payoff: no human input vs. acceptable vs. unacceptable human input.]

Figure 6.5: Effect of human input on the performance of GA in the 3-player prisoner's dilemma with 1 player as GIGA-WoLF and 1 player in self-play.

In the presence of a Q-learning agent (Figure 6.7), we notice more consistency in the final payoffs achieved. It is also noticeable that the GA, in its different variations, is able to converge easily in the presence of Q-learning. The GA is sometimes even able to exploit the Q-learning agent, which mainly occurs in the presence of human input that encourages defection under certain situations (the unacceptable human input). We can also see a major improvement in the performance of the GA in Figure 6.8, which shows the results in the presence of two Q-learning agents. This shows that the GA with acceptable human input establishes a reputation strong enough to lead the Q-learning agents to cooperate.


[Bar chart of average final payoff: no human input vs. acceptable vs. unacceptable human input.]

Figure 6.6: Effect of human input on the performance of GA in the 3-player prisoner's dilemma with 2 players as GIGA-WoLF.

[Bar chart of average final payoff: no human input vs. acceptable vs. unacceptable human input.]

Figure 6.7: Effect of human input on the performance of GA in the 3-player prisoner's dilemma with 1 player as Q-learning and 1 player in self-play.

A situation that is quite interesting is when we have three heterogeneous agents (one each representing the GA, Q-learning, and GIGA-WoLF). Cooperation is not the ideal strategy when GIGA-WoLF is present in the game (Figure 6.9). If our algorithm is directed to defect (the unacceptable human input), GIGA-WoLF and the GA can exploit Q-learning (in certain situations), gaining a higher payoff than they would by cooperating.


[Bar chart of average final payoff: no human input vs. acceptable vs. unacceptable human input.]

Figure 6.8: Effect of human input on the performance of GA in the 3-player prisoner's dilemma with 2 players as Q-learning.

[Bar chart of average final payoff: no human input vs. acceptable vs. unacceptable human input.]

Figure 6.9: Effect of human input on the performance of GA in the 3-player prisoner's dilemma with 1 player as Q-learning and 1 player as GIGA-WoLF.

6.5 Conclusions

A trend that was noticed, and that applies to almost all of the results in this chapter, can be summarized as follows:

• A more complex human input simulator (or, in reality, a more experienced human) is required for more complex environments (Figures 6.3-6.9).

• The GA can adapt in larger environments (even without human interaction) better than the other tested opponents.

• The existence of a wider variety of opponents may lead to degradation in performance, especially in the presence of unsuitable human input.


CHAPTER 7

Conclusions and Future work

7.1 Conclusions

In this thesis, we discussed and presented results regarding the use of interactive genetic algorithms as a learning mechanism in repeated games. We conducted several sets of experiments in order to assess the applicability of genetic algorithms in multi-agent systems (more specifically, in repeated matrix games) and how human input affects their performance. Our evaluation was mainly based on the analysis of the final payoffs received, the learning rate (average payoff per generation), and the population generated through the learning process. Our experiments were conducted on four different types of two-player matrix games and on a 3-player version of the prisoner's dilemma, in the presence of several opponents: GIGA-WoLF, Q-learning, and an identical algorithm (self-play).

We evaluated a number of variations of GAs, including:


• Modifying the existing fitness function to be based on propagating history from previous generations.

• Using a stopping condition in which the algorithm stops evolving when the fitness of the highest-fitness chromosome stops changing.

• Updating the mutation rate over time based on the performance of the population within a generation (dynamic mutation).

• Merging the stopping condition with dynamic mutation (a minimal sketch of this combination is given after this list).
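The sketch below illustrates, under our own simplifying assumptions, how the last two variations can be combined: the mutation rate is lowered when the best fitness improves and raised when it stagnates, and evolution halts once the best fitness has been unchanged for a fixed number of generations. The update rule, the constants, and the interfaces of evaluate, select, crossover, and mutate are illustrative assumptions, not the exact settings used in our experiments.

# Illustrative sketch of dynamic mutation combined with a stopping condition.
# All constants and helper interfaces are assumptions for illustration.
def evolve(population, evaluate, select, crossover, mutate,
           max_generations=200, patience=10,
           mutation_rate=0.05, min_rate=0.01, max_rate=0.3):
    best_fitness = None
    stagnant = 0
    for generation in range(max_generations):
        fitnesses = [evaluate(chromosome) for chromosome in population]
        current_best = max(fitnesses)

        if best_fitness is None or current_best > best_fitness:
            # Progress: remember the new best and reduce exploration.
            best_fitness = current_best
            stagnant = 0
            mutation_rate = max(min_rate, mutation_rate * 0.75)
        else:
            # Stagnation: increase the mutation rate to explore more.
            stagnant += 1
            mutation_rate = min(max_rate, mutation_rate * 1.5)

        # Stopping condition: best fitness unchanged for `patience` generations.
        if stagnant >= patience:
            break

        # Produce the next generation (select is assumed to return as many
        # parents as there are chromosomes in the population).
        parents = select(population, fitnesses)
        population = [mutate(crossover(parents[i], parents[-1 - i]), mutation_rate)
                      for i in range(len(parents))]
    return population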

Our results showed that the use of history propagation decreases the randomness in the population generated, and therefore decreases the variation in the average payoff the algorithm achieves per generation. This is particularly visible in self-play, where, although the algorithm does not always achieve better payoffs, the average payoff per generation and the chromosome population are more consistent.

We also studied the effect of combining the stopping condition with dynamic mutation. Our results show that, with the exception of Chicken, the use of the stopping condition can drive the algorithm to converge prematurely to a local optimum. Dynamic mutation, however, helps the algorithm maintain a certain amount of exploration, though in certain cases, such as against GIGA-WoLF, dynamic mutation without a stopping condition can lead to instability in the learning trend of the algorithm.

After studying the different variations that had previously been suggested to enhance the performance of the genetic algorithm in dynamic environments, we analyzed how human input can benefit and improve the performance of genetic algorithms in repeated matrix games.

We first implemented a framework that allows the human to give feedback to the dynamic GA with a stopping condition (which we chose as our baseline). Through this framework, we aimed to enhance the performance of the algorithm while requiring minimal human interaction.

Our results showed that human input provided to the GA using our framework led to an improvement in the GA's performance in a number of games, including the prisoner's dilemma and the cooperation game, given effective human input. Furthermore, the simulated human input we used improved the GA's performance against opponents with a slower learning rate (such as Q-learning), as it motivates those opponents by conveying a better reputation.

In other games, such as Chicken, we had to experiment to determine what constitutes effective human input. This turned out to be the inverse of what we expected: we expected that a bullying input would enhance performance, but, in contrast to our expectations, a "swerve" input is what enhances it. A possible reason is that this input helps to "deceive" the opponent into swerving and then bullies it toward the end, although no major enhancement in performance can be observed in this case.

In addition to the previous experiments, we tested different techniques for handling the human input and studied how they affect the performance of the GA against its opponents. These handling techniques can be summarized as follows:

• The basic handling technique, where we identify the bits with the highest entropy within the population and then request human feedback regarding the history related to these bits (an illustrative sketch of this entropy computation is given after this list).

• The basic handling technique with modified mutation rates. This study shows the effect of varying the mutation rate and how it can affect the results. The results show that specific changes in the mutation rates used can enhance the performance in certain situations. This leads us to conclude that a human user could potentially be asked to specify the mutation rate (according to the human's knowledge of the environment) to enhance the performance of the IGA.

• Incorporating what we call variance in order to take into consideration the performance of each chromosome with respect to the overall population. This helps us to counteract actions that lead to a decrease in fitness in chromosomes with lower fitness values within the population.
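As a concrete illustration of the first technique, the sketch below computes the per-bit Shannon entropy of a binary-encoded population and returns the positions with the highest entropy, that is, the positions on which the population disagrees most and where human feedback is therefore expected to be most informative. The function names and the number of bits queried are our own assumptions.

from math import log2

# Illustrative sketch of entropy-based selection of the chromosome bits to
# present to the human; names and the number of queried bits are assumptions.
def bit_entropy(population, position):
    # Shannon entropy of one bit position across a population of
    # equal-length binary strings.
    ones = sum(chromosome[position] == '1' for chromosome in population)
    p = ones / len(population)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def bits_to_query(population, num_bits=2):
    # Return the positions where the population disagrees most, i.e. where
    # human feedback is expected to be most informative.
    length = len(population[0])
    return sorted(range(length),
                  key=lambda position: bit_entropy(population, position),
                  reverse=True)[:num_bits]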

The results showed large differences in the performance of these variants, with some performing better in certain situations but not in others. Overall, the basic IGA had the highest performance, which led us to use the basic handling technique in our three-player experiment.

Finally, we studied the performance of GAs in a three-player environment. This study was conducted to determine how well our framework (and GAs in general) performs on a larger scale. These experiments, which were performed in the 3-player prisoner's dilemma, showed that the GA's performance depends on the opponents and on the quality of the human input. We also showed that a GA with dynamic mutation and a stopping condition performs well in this game.

From the previous discussions, we can conclude the following:

• GAs are an effective learning technique in static systems.

• GAs, when improved by implementing a dynamic mutation technique (where the mutation rate changes depending on previous performance), can adapt well in dynamic systems, except in the presence of other intelligent agents with a slower learning rate, such as Q-learning.

• Effective human input improves the performance of GAs in multi-agent systems (and in dynamic systems in general). Unacceptable input, on the other hand, decreases the performance, though the IGA can still learn acceptable strategies. In general, the more effective the human input is, the better the performance of the algorithm.

• The studied forms of GAs and IGAs maintain their effectiveness in environments with higher numbers of players.

These results confirm the initial hypothesis that we derived from our literature review, reinforcing the idea that GAs can be used as a dynamic learning technique. They also motivate us to experiment with more complex means of handling the human input, as well as to study GAs in more complex multi-agent systems.

7.2 Future work

During the course of this research, we discovered a number of areas that deserve further investigation. This research area is vast and may cover other potential ideas; however, we highlight the areas of potential future work related to the topics discussed:

1. Test the effectiveness of interactive genetic algorithms in more complex environments against a wider range of opponents.

2. Experiment with other techniques used for improving the adaptability of genetic algorithms within dynamic systems, including:

(a) Maintaining multiple interactive populations through the generations and keeping the population that exhibits the best performance. This population is then compared with every new population generated in any of the following generations in order to choose the strategies with the best performance from one of these populations [55].

(b) Varying the size of the population depending on the rate of exploration we want the algorithm to exhibit over time [82].

3. Develop more complex techniques to overcome the effect of unacceptable human input without degrading the performance of the algorithm.

4. Explore the effectiveness of human input with various interaction rates, in order to find the most effective rate of interaction.

5. Study the use of actual human input with GAs and discuss its effectiveness using participants from various backgrounds.


Bibliography

[1] A. Alharbi, W. Rand, and R. Riolo. Understanding the semantics of the ge-

netic algorithm in dynamic environments. In Applications of Evolutionary

Computing, Lecture Notes in Computer Science. Springer Berlin / Heidel-

berg.

[2] F. Alkemade, H. La Poutre, and H. Amman. On social learning and robust

evolutionary algorithm design in economic games. The 2005 IEEE Congress

on Evolutionary Computation, 2005.

[3] R. Axelrod. The Complexity of Cooperation: Agent-Based Models of Com-

petition and Collaboration. Princeton University Press, 1st printing edition,

1997. ISBN 0691015678.

[4] R. Axelrod. The Evolution of Cooperation: Revised Edition. Basic Books,

revised edition, 2006. ISBN 0465005640.

[5] C. Babayigit, P. Rocha, and T.K. Das. A two-tier matrix game approach for

obtaining joint bidding strategies in ftr and energy markets. IEEE Transac-

tions on Power Systems, 25, 2010.


[6] M. Babbar-Sebens and B. Minsker. A case-based micro interactive genetic algorithm (CBMIGA) for interactive learning and search: Methodology and application to groundwater monitoring design. Environmental Modelling and Software, 25, 2010.

[7] M. Baioletti, A. Milani, V. Poggioni, and S. Suriani. Interactive dynamic

production by genetic algorithms. Proceedings of the 25th conference on

Proceedings of the 25th IASTED International Multi-Conference: Artificial

Intelligence and Applications, 2007.

[8] S. Bapple. Consumer energy report, 2010. URL http://www.consumerenergyreport.com.

[9] A. Bonarini, A. Lazaric, E.M. De Cote, and M. Restelli. Improving co-

operation among self-interested reinforcement learning agents. In ECML

workshop on Reinforcement learning in non-stationary environments, Porto,

Portugal. Citeseer, 2005.

[10] M. Bowling. Convergence and no-regret in multiagent learning. In Advances

in neural information processing systems 17: proceedings of the 2004 con-

ference, page 209. The MIT Press, 2005. ISBN 0262195348.

[11] M. Bowling and M. Veloso. Scalable learning in stochastic games. In AAAI

Workshop on Game Theoretic and Decision Theoretic Agents, 2002.

[12] J. Branke. Evolutionary approaches to dynamic optimization problems-

updated survey. 2001.

[13] F.M.T. Brazier, B.M. Dunin-keplicz, and N.R. Jennings. Desire: Modelling

multi-agent systems in a compositional formal framework. International

Journal of Cooperative Information Systems, 1997.


[14] A. Brintrup, J. Ramsden, and A. Tiwari. An interactive genetic algorithm-

based framework for handling qualitative criteria in design optimization.

Comput. Ind., 58, April 2007.

[15] G.W. Brown. Iterative solution of games by fictitious play. Activity analysis

of production and allocation, 13(1):374–376, 1951.

[16] R.E. Brown. Impact of smart grid on distribution system design. Power and

Energy Society General Meeting - Conversion and Delivery of Electrical

Energy in the 21st Century, IEEE, 2008.

[17] C. Camerer and Russell Sage Foundation. Behavioral game theory: Ex-

periments in strategic interaction, volume 9. Princeton University Press

Princeton, NJ, 2003.

[18] E. Cantu-Paz. A summary of research on parallel genetic algorithms. 1995.

[19] J.L. Casti. Five Golden Rules: Great Theories of 20th-Century Mathematics

and Why They Matter. John Wiley and Sons, 1996.

[20] T.D.H. Cau. A co-evolutionary approach to modeling the behavior of partici-

pants in competitive electricity markets. Power Engineering Society Summer

Meeting, 2, 2002.

[21] J. Chaves-González, N. Otero-Mateo, M. Vega-Rodríguez, J. Sánchez-Pérez, and J. Gómez-Pulido. Game implementation: An interesting strategy to teach genetic algorithms. Dept. Informática, Univ. Extremadura, Escuela Politécnica, Campus Universitario s/n, 10071 Cáceres, Spain. B. Fernández-Manjón et al. (eds.), Computers and Education: E-learning, From Theory to Practice, Springer, 2007.

[22] S.H. Chen, A.J. Jakeman, and J.P. Norton. Artificial intelligence techniques:


An introduction to their use for modelling environmental systems. Math.

Comput. Simul., 78, 2008.

[23] R. Chiong and M. Kirley. Co-evolutionary learning in the N-player iterated

prisoner's dilemma with a structured environment. Artificial Life: Borrowing

from Biology, pages 32–42, 2009.

[24] S. Choi. Short-term power demand forecasting using information technol-

ogy based data mining method. In Computational Science and Its Applica-

tions - ICCSA 2006, Lecture Notes in Computer Science. Springer Berlin /

Heidelberg, 2006.

[25] J. Crandall. Learning successful strategies in repeated general-sum games,

2005.

[26] J. Crandall. Just add pepper: Advancing from repeated matrix games to repeated stochastic games. In progress, 2011.

[27] J.W. Crandall, M.A. Goodrich, and L. Lin. Encoding intelligent agents for

uncertain, unknown, and dynamic tasks: From programming to interactive

artificial learning. AAAI Spring Symposium: Agents that Learn from Human

Teachers, 2009.

[28] J.W. Crandall, M. H. Altakrori, and Y. M. Hassan. Learning by demonstra-

tion in repeated stochastic games. In proceedings of AAMAS, 2010.

[29] D. Curran, C. O’Riordan, and H. Sorensen. Evolutionary and lifetime learn-

ing in varying nk fitness landscape changing environments: an analysis of

both fitness and diversity. In Proceedings of the 22nd national conference

on Artificial intelligence - Volume 1. AAAI Press, 2007.

[30] J. Dalton. Genetic algorithms, January 2007. Lecture notes, Newcastle University.


[31] H. Dawid. On the convergence of genetic learning in a double auction mar-

ket. Journal of Economic Dynamics and Control, 23:1545–1569, 1999.

[32] G. Dhiman and T.S. Rosing. Dynamic power management using machine

learning. International Conference on Computer Aided Design, 2006.

[33] G. Dozier, B. Carnahan, C. Seals, L. Kuntz, and S. Fu. An interactive dis-

tributed evolutionary algorithm (idea) for design. IEEE international con-

ference on Systems, Man and Cybernatics, 1, 2005.

[34] P. Fader and J. Hauser. Implicit coalitions in generalized prisoner's dilemma. Journal of Conflict Resolution, 2001.

[35] C. Fang et al. On adaptive emergence of trust behavior in the game of stag hunt. Group Decision and Negotiation, 2002.

[36] D. Fudenberg and D. K. Levine. The Theory of Learning in Games (Eco-

nomic Learning and Social Evolution). The MIT Press, 1998. ISBN

0262061945.

[37] Center for Smart Energy, Global Environment Fund. The emerging smart grid: investment and entrepreneurial potential in the electric power grid of the future. 2005.

[38] A. Gomes, C.H. Antunes, and A.G. Martins. A multiple objective approach

to direct load control using an interactive evolutionary algorithm. IEEE

transactions on power systems, 22, 2007.

[39] D. Gong and J. Yuan. Interactive genetic algorithms for optimization of

problems with multiple modes and implicit performance indices. Proceed-

ings of the Sixth International Conference on Intelligent Systems Design and

Applications, 2, 2006.


[40] Guo-sheng Hao, Dun-Wei Gong, and Yong-Qing Huang. Interactive genetic

algorithms based on estimation of user’s most satisfactory individuals. In

Proceedings of the Sixth International Conference on Intelligent Systems

Design and Applications - Volume 03, 2006.

[41] Y.M. Hassan, S. Ahmed, and J.W. Crandall. Interactive artificial learning in

multi-agent systems. Workshop on Learning for Human-Robot Interaction

Modeling, RSS, 2010.

[42] T. Hedden and J. Zhang. What do you think i think you think, strategic

reasoning in matrix games. Cognition, 85, 2002.

[43] M. Hekkenberg, R.M.J. Benders, H.C. Moll, and A.J.M. Schoot. Indications

for a changing electricity demand pattern: The temperature dependence of

electricity demand in the netherlands. Energy Policy, 37, 2009.

[44] J. Holland and J. Miller. Artificial adaptive agents in economic theory. Amer-

ican Economic Review, 81, 1991.

[45] J. H. Holland. Genetic algorithms, Scientific American, 1992. URL http://www.fortunecity.com/emachines/e11/86/algo.html.

[46] E. Howley and J. Duggan. The Evolution of Agent Strategies and Sociability

in a Commons Dilemma. 2009.

[47] C.F. Huang, S. Bieniawski, D.H. Wolpert, and C.E.M. Strauss. A compar-

ative study of probability collectives based multi-agent systems and genetic

algorithms. In Proceedings of the 2005 conference on Genetic and evolu-

tionary computation, pages 751–752. ACM, 2005. ISBN 1595930108.

[48] G. Hug-Glanzmann. Coordination of intermittent generation with storage,

demand control and conventional energy sources. In Bulk Power System

Dynamics and Control (iREP) - VIII (iREP), 2010 iREP Symposium, 2010.


[49] I. Ige. Genetic algorithms. 2000.

[50] A. Jafari, A. Greenwald, D. Gondek, and G. Ercal. On no-regret learn-

ing, fictitious play, and nash equilibrium. In ICML ’01 Proceedings of the

Eighteenth International Conference on Machine Learning, pages 226–233,

2001.

[51] Y. Jia-hai, Y. Shun-kun, and H. Zhao-guang. A multi-agent trading platform

for electricity contract market. In Power Engineering Conference, 2005.

IPEC 2005. The 7th International, pages 1024–1029. IEEE, 2005. ISBN

9810557027.

[52] L. Kaelbling, M. Littman, and A. Moore. Reinforcement learning: A survey.

Journal of Artificial Intelligence Research, 4:237–285, 1996.

[53] M.A.S. Kamal and J. Muratab. Coordination in multiagent reinforcement

learning systems by virtual reinforcement signals. International Journal of

Knowledge-based and Intelligent Engineering Systems, 11, 2007.

[54] A.A. Keller. Genetic search algorithms to fuzzy multiobjective games: a

mathematica implementation. In Proceedings of the 10th WSEAS interna-

tional conference on Applied computer science, pages 351–359. World Sci-

entific and Engineering Academy and Society (WSEAS), 2010.

[55] L. Klinkmeijer. A serial population genetic algorithm for dynamic optimiza-

tion problems.

[56] M. Knudson and K. Tumer. Coevolution of heterogeneous multi-robot

teams. Proceedings of the 12th annual conference on Genetic and evolu-

tionary computation, 2010.


[57] A. Kosorukoff. Human based genetic algorithm. In Systems, Man, and Cy-

bernetics, 2001 IEEE International Conference on, volume 5, pages 3464–

3469. IEEE, 2001. ISBN 0780370872.

[58] Sandia National Laboratories. Genetic algorithms, evolutionary program-

ming and genetic programming. 1997.

[59] J. Lagorse, D. Paire, and A. Miraoui. A multi-agent system for energy management of distributed power sources. Renewable Energy, 35, 2009.

[60] J.Y. Lee and S.B. Cho. Sparse fitness evaluation for reducing user burden in

interactive genetic algorithm. IEEE International Fuzzy Systems Conference

Proceedings, 1999.

[61] K. Lee and R. Baldick. Solving three-player games by the matrix approach

with application to an electric power market. Power Systems, IEEE Trans-

actions on, 18, 2003.

[62] K. Lee and R. Baldick. Tuning of discretization in bimatrix game approach

to power system market analysis. Power Systems, IEEE Transactions on,

18, 2003.

[63] K.H. Lee and R. Baldick. Solving three-player games by the matrix ap-

proach with application to an electric power market. Power Systems, IEEE

Transactions on, 18(4):1573–1580, 2003. ISSN 0885-8950.

[64] A.J. Lockett, C.L. Chen, and R. Miikkulainen. Evolving explicit opponent

models in game playing. GECCO, 2007.

[65] G.F. Luger. Artificial Intelligence: Structures and Strategies for Com-

plex Problem Solving. Addison Wesley, 5th edition edition, 2004. ISBN

0321263189.


[66] H.H. Lund, O. Miglino, L. Pagliarini, A. Billard, and A. Ijspeert. Evolu-

tionary robotics: a childrens game. Evolutionary Computation Proceedings,

1998. IEEE World Congress on Computational Intelligence, 1998.

[67] Y. Ma, E.F. Bompard, R. Napoli, and J. Chuanwen. Modeling the strategic

bidding of the producers in competitive electricity markets with the watkins

q (lambda) reinforcement learning. International Journal of Emerging Elec-

tric Power Systems, 2006.

[68] J. A. R. Marshall and J. E. Rowe. Viscous populations and their support for

reciprocal cooperation. Artif. Life, 9, 2003.

[69] T. Masterton and D. Topiwala. A Multi-Agent Traffic Light Optimisation and Coordination: Using Genetic Algorithms and Multi-Agent Systems to Synchronise Intelligent Traffic Lights Across Entire Regions. White paper, Thales Group, 2008.

[70] D. E. Moriarty, A. C. Schultz, and J. J. Grefenstette. Evolutionary algorithms

for reinforcement learning. Journal of Artificial Intelligence Research, 11:

241–276, 1999.

[71] K. Moriyama. Utility based q-learning to maintain cooperation in prisoner’s

dilemma games. In Intelligent Agent Technology, 2007. IAT 2007. IEEE

WIC ACM International Conference on, pages 146 –152, 2007.

[72] K. Moriyama. Learning-rate adjusting q-learning for prisoner’s dilemma

games. In Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT

’08. IEEE/WIC/ACM International Conference on, volume 2, pages 322 –

325, 2008.

[73] V. Nanduri and T.K. Das. Game theoretic approach for generation capac-

ity expansion in restructured power markets. In Power and Energy Society


General Meeting - Conversion and Delivery of Electrical Energy in the 21st

Century, 2008 IEEE, 2008.

[74] J.F. Nash. Non-cooperative games. Annals of Mathematics, 1951.

[75] J. von Neumann. Theory of parlor games. 1928.

[76] C. O'Riordan, A. Cunningham, and H. Sorensen. Emergence of Cooperation in N-player games on small world networks. Artificial Life, 11:436, 2008.

[77] C.H. Papadimitriou. Algorithms, games, and the internet. In Proceedings of

the 33rd Annual ACM Symposium on the Theory of Computing, 2001.

[78] I. Praca, C. Ramos, Z. Vale, and M. Cordeiro. Mascem: a multiagent system

that simulates competitive electricity markets. Intelligent Systems, IEEE, 18,

2003.

[79] C. Prins. A simple and effective evolutionary algorithm for the vehicle rout-

ing problem. Computers & Operations Research, 31(12):1985–2002, 2004.

ISSN 0305-0548.

[80] P.P. Reddy and M.M. Veloso. Learning behaviors of multiple autonomous agents in smart grid markets. Association for the Advancement of Artificial Intelligence, 2011.

[81] S.S. Reddy, P. Praveen, and M.S. Kumari. Micro genetic algorithm based

optimal power dispatch in multinode electricity market. International Jour-

nal of Recent Trends in Engineering, 2009.

[82] J. Ren, D.W. Gong, X.Y. Sun, J. Yuan, and M. Li. Interactive genetic al-

gorithms with variational population size. Emerging intelligent computing

technology and applications, with aspects of artificial intelligence, 2009.


[83] C.W. Richter and G.B. Sheble. Genetic algorithm evolution of utility bid-

ding strategies for the competitive marketplace. IEEE transactions on power

systems, 13, 1998.

[84] S. Sankar. The overview of genetic algorithms and evaluation process for

the solution of optimization problems. 2007.

[85] Y. Semet and M. Schoenauer. On the benefits of inoculation, an example

in train scheduling. In Proceedings of the 8th annual conference on Ge-

netic and evolutionary computation, pages 1761–1768. ACM, 2006. ISBN

1595931864.

[86] J. Shapiro. Genetic algorithms in machine learning. Springer-Verlag New

York, Inc., 2001. ISBN 3-540-42490-3.

[87] R. R. Sharapov and A. V. Lapshin. Convergence of genetic algorithms.

Pattern Recognition and Image Analysis, 2006.

[88] M. Shibuya, H. Kita, and S. Kobayashi. Integration of multi objective inter-

active genetic algorithm and its application to animation design. Proc. Of

IEEE Systems, Man and Cybernetics, 1999.

[89] M. Shibuya, H. Kita, and S. Kobayashi. Integration of multi-objective and

interactive genetic algorithms and its application to animation design. IEEE

SMC ’99 Conference Proceedings. IEEE International Conference on Sys-

tems, Man, and Cybernetics, 1999.

[90] Y. Shoham and K. Leyton-Brown. Multiagent systems: Algorithmic, game-

theoretic, and logical foundations. Cambridge university press, 2009.

[91] Y. Shoham, R. Powers, and T. Grenager. Multi-agent reinforcement learn-

ing: a critical survey. Technical report, Stanford University, 2003.


[92] R. Sikora and V. Sachdev. Learning bidding strategies with autonomous

agents in environments with unstable equilibrium. Decision Support Sys-

tems, 46, 2008.

[93] A. Simoes and E. Costa. Using GAs to deal with dynamic environments:

A comparative study of several approaches based on promoting diversity.

In Proceedings of the Genetic and Evolutionary Computation Conference,

GECCO ’02, 2002.

[94] A. Singh, B. Minsker, and H. Takagi. Interactive genetic algorithms for

inverse groundwater modeling: issues with human fatigue and prediction

models. In Proceedings of the American Society of Civil Engineers (ASCE)

Environmental and Water Resources Institute (EWRI) World Water and En-

vironmental Resources Congress 2005 and Related Symposia, 2005.

[95] J.R. Stimpson. Satisficing and cooperation in multiagent social dilemmas. Master's thesis, Brigham Young University, 2002.

[96] A.L. Thomaz and C. Breazeal. Teachable robots: Understanding human

teaching behavior to build more effective robot learners. Artificial Intelli-

gence, 2008.

[97] P. Vytelingum, S. D. Ramchurn, T. D. Voice, A. Rogers, and N. R. Jen-

nings. Trading agents for the smart electricity grid. The Ninth International

Conference on Autonomous Agents and Multiagent Systems, 2010.

[98] L. Waltman, N. van Eck, R. Dekker, and U. Kaymak. Economic modeling

using evolutionary algorithms: the effect of a binary encoding of strategies.

Journal of Evolutionary Economics, 2010.

[99] J. Wang, F. Wen, R. Yang, Y. Ni, and F.F. Wu. Towards the development

of an appropriate regulation mechanism for maintenance scheduling of gen-


erating units in electricity market environment. Power Engineering Society

General Meeting, IEEE, 2004.

[100] C.J.C.H. Watkins and P. Dayan. Q-learning. Machine learning, 8(3):279–

292, 1992. ISSN 0885-6125.

[101] T. Weise. Global optimization algorithms theory and application , 2008.

[102] R.P. Wiegand. An Analysis of Cooperative Coevolutionary Algorithms. PhD

thesis, George Mason University, 2003.

[103] J. Wu and R. Axelrod. How to cope with noise in the iterated prisoner's

dilemma. Journal of Conflict Resolution, 1994.

[104] M. Wunder, M. Littman, and M. Babes. Classes of multiagent q-learning

dynamics with e-greedy exploration. International conference of machine

learning, 2010.

[105] A.W.L. Yao, S.C. Chi, and C.K. Chen. Development of an integrated grey fuzzy-based electricity management system for enterprises. Energy, 30, 2005.

[106] N. Yu and C.C. Liu. Multi-agent systems and electricity markets: State-

of-the-art and the future. In Power and Energy Society General Meeting

- Conversion and Delivery of Electrical Energy in the 21st Century, 2008

IEEE, 2008.

[107] W. Zhong, J. Liu, M. Xue, and L. Jiao. A multiagent genetic algorithm

for global numerical optimization. Systems, Man, and Cybernetics, Part B:

Cybernetics, IEEE Transactions on, 34(2):1128–1141, 2004. ISSN 1083-

4419.