TRANSCRIPT
Coevolutionary Reinforcement Learning for Othello and Small-Board Go
Marcin Szubert, Wojciech Jaskowski, Krzysztof Krawiec
Institute of Computing Science, Poznan University of Technology
April 12, 2010
Introduction Methods Experimental Results Summary and Conclusions
Outline
1 Introduction: Inspiration; Motivation and Objectives
2 Methods: Coevolution; Reinforcement Learning; Coevolutionary Reinforcement Learning
3 Experimental Results: CTDL for Othello; CTDL(λ) for Small-Board Go
4 Summary and Conclusions
Coevolutionary Reinforcement Learning for Othello and Small-Board Go 2 / 41 M. Szubert, W.Jaskowski, K.Krawiec
Inspiration — Samuel’s Checkers Player
“Some Studies in Machine Learning Using the Game of Checkers” – A. L. Samuel, IBM Journal of Research and Development, 1959
Games provide a convenient vehicle for the development of learning procedures as contrasted with a problem taken from life, since many of the complications of detail are removed.
Arthur Lee Samuel
[Figure: two-ply minimax game tree with alternating MAX and MIN levels; annotations mark a previously visited state and a computed polynomial value.]
Learning methods based on Shannon’s minimax procedure:
Rote Learning – handcrafted scoring polynomial
Learning by Generalization – polynomial modification
Inspiration — Different Views of Samuel’s Work
Samuel was one of the first to make effective use of heuristic search methods and of what we would now call temporal difference learning.
Richard Sutton & Andrew Barto
To elaborate the analogy with evolutionary computation, Samuel’s procedure can be called a coevolutionary algorithm with two populations of size 1, asynchronous population updates, and domain-specific, deterministic variation operators.
Anthony Bucci
Inspiration — Lucas & Runarsson
“Temporal Difference Learning Versus Co-Evolution for Acquiring Othello Position Evaluation” – S. Lucas and T. Runarsson, IEEE Symposium on Computational Intelligence and Games, 2006
Why is this work interesting?
Learning with little a priori knowledge
Comparison of a coevolutionary algorithm and a non-evolutionary approach
Complementary advantages of these methods revealed by experimental results
Motivation
Past observations
Temporal Difference Learning (TDL) is much faster
Coevolutionary Learning (CEL) can eventually produce better strategies if parameters are tuned properly
Is it possible to combine the advantages of TDL and CELin a single hybrid algorithm?
We propose the Coevolutionary Temporal Difference Learning (CTDL) method and evaluate it on the board games of Othello and small-board Go.
We also incorporate a simple Hall of Fame (HoF) archive.
Objectives
Objective: learn a game-playing strategy represented by the weights of a weighted piece counter (WPC).
 1.00 -0.25  0.10  0.05  0.05  0.10 -0.25  1.00
-0.25 -0.25  0.01  0.01  0.01  0.01 -0.25 -0.25
 0.10  0.01  0.05  0.02  0.02  0.05  0.01  0.10
 0.05  0.01  0.02  0.01  0.01  0.02  0.01  0.05
 0.05  0.01  0.02  0.01  0.01  0.02  0.01  0.05
 0.10  0.01  0.05  0.02  0.02  0.05  0.01  0.10
-0.25 -0.25  0.01  0.01  0.01  0.01 -0.25 -0.25
 1.00 -0.25  0.10  0.05  0.05  0.10 -0.25  1.00

f(b) = Σ_{i=1}^{8×8} w_i b_i
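The WPC evaluation above can be sketched in a few lines of Python (an illustrative sketch, not code from the presentation; the board encoding of +1 for the player’s pieces, −1 for the opponent’s and 0 for empty squares is an assumption):

```python
def wpc_evaluate(weights, board):
    """f(b) = sum_i w_i * b_i for an 8x8 board flattened to 64 cells."""
    assert len(weights) == len(board) == 64
    return sum(w * b for w, b in zip(weights, board))

# Example: one own piece on a corner (weight 1.00) and one opponent
# piece on the adjacent square (weight -0.25).
weights = [0.0] * 64
weights[0], weights[1] = 1.00, -0.25
board = [0] * 64
board[0], board[1] = 1, -1
print(wpc_evaluate(weights, board))  # 1.00*1 + (-0.25)*(-1) = 1.25
```

A greedy WPC player would simply apply this function to every board reachable by a legal move and pick the move with the highest value.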
The emphasis throughout all of these studies has been on learning techniques. The temptation to improve the machine’s game by giving it standard openings or other man-generated knowledge of playing techniques has been consistently resisted.
Arthur Lee Samuel
Coevolutionary Algorithm
Coevolution in nature
Reciprocally induced evolutionary change between two or more interacting species or populations.
The simplest variant of a one-population generational competitive coevolutionary algorithm:

Algorithm 1 Basic scheme of a generational evolutionary algorithm
1: P ← createRandomPopulation()
2: evaluatePopulation(P)
3: while ¬terminationCondition() do
4:   S ← selectParents(P)
5:   P ← recombineAndMutate(S)
6:   evaluatePopulation(P)
7: end while
8: return getFittestIndividual(P)
What mainly distinguishes coevolution from a standard EA?
context-sensitive evaluation phase
no objective fitness ⇒ no guarantee of progress
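The scheme above, specialized to the one-population competitive setting, can be sketched as follows (a hypothetical Python sketch; `play`, `mutate` and binary tournament selection are placeholder choices, not the presentation’s exact setup):

```python
import random

def round_robin_fitness(pop, play):
    """Context-sensitive evaluation: each individual plays every other
    once; fitness is the sum of game outcomes."""
    n = len(pop)
    fitness = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            outcome = play(pop[i], pop[j])  # +1 win for i, 0 draw, -1 loss
            fitness[i] += outcome
            fitness[j] -= outcome
    return fitness

def coevolve(pop_size, generations, play, mutate, random_individual):
    pop = [random_individual() for _ in range(pop_size)]
    best = pop[0]
    for _ in range(generations):
        fitness = round_robin_fitness(pop, play)
        best = pop[max(range(pop_size), key=fitness.__getitem__)]
        # Binary tournament selection + mutation -> next generation.
        def select():
            a, b = random.randrange(pop_size), random.randrange(pop_size)
            return pop[a] if fitness[a] >= fitness[b] else pop[b]
        pop = [mutate(select()) for _ in range(pop_size)]
    return best
```

Note that fitness here is defined only relative to the current population — the “no objective fitness” caveat above.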
Coevolutionary Fitness Assignment
Common interaction patterns:
[Fig. 2.1: round-robin tournament interaction scheme, (a) within one population, (b) between two populations.]
Round-robin tournament: each individual plays every available partner, which gives the most accurate evaluation but requires n(n − 1)/2 games within a single population of n members and nm games between two populations of sizes n and m – computationally expensive for large populations.
Single Elimination Tournament (SET): a more efficient interaction pattern for one-population coevolution.
How to aggregate interaction results into a single fitness value?
calculate the sum of all interaction outcomes
use competitive fitness sharing
problem of measurement
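The first two aggregation options can be sketched as follows (an illustrative Python sketch; the encoding `results[i][j] == 1` when i beat j is our assumption, and the sharing rule follows the usual competitive-fitness-sharing idea of weighting a win against j by one over the number of players that beat j):

```python
def fitness_sum(results):
    """results[i][j] = 1 if i beat j, else 0. Fitness = plain sum of wins."""
    return [sum(row) for row in results]

def fitness_sharing(results):
    """Competitive fitness sharing: a win against opponent j is worth
    1/N_j, where N_j is how many players beat j, so beating a
    rarely-beaten opponent counts for more."""
    n = len(results)
    beaten_by = [sum(results[i][j] for i in range(n)) for j in range(n)]
    return [sum(results[i][j] / beaten_by[j]
                for j in range(n) if results[i][j])
            for i in range(n)]

# Example: A beats B and C; B beats C.
results = [[0, 1, 1],
           [0, 0, 1],
           [0, 0, 0]]
print(fitness_sum(results))      # [2, 1, 0]
print(fitness_sharing(results))  # [1.5, 0.5, 0] - A's win over C (beaten
                                 # twice) is worth only 1/2
```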
Coevolutionary Archive
Maintaining historical players in the Hall of Fame (HoF) archive for breeding and evaluation purposes.
Evaluation phase flowchart:
1. Play a round-robin tournament between population members
2. Randomly select archival individuals to act as opponents
3. Select the best-of-generation individual and add it to the archive
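The three steps above can be sketched as one evaluation routine (a hypothetical Python sketch; the function names and fitness bookkeeping are ours, not the paper’s API):

```python
import random

def evaluate_with_hof(pop, archive, play, n_archival=5):
    """HoF evaluation phase: round-robin inside the population, extra
    games against randomly drawn archival opponents, then the
    best-of-generation individual joins the archive."""
    n = len(pop)
    fitness = [0.0] * n
    # 1. Round-robin tournament between population members.
    for i in range(n):
        for j in range(i + 1, n):
            outcome = play(pop[i], pop[j])  # +1 win for i, 0 draw, -1 loss
            fitness[i] += outcome
            fitness[j] -= outcome
    # 2. Randomly selected archival individuals act as opponents.
    opponents = random.sample(archive, min(n_archival, len(archive)))
    for i in range(n):
        for opp in opponents:
            fitness[i] += play(pop[i], opp)
    # 3. Best-of-generation individual is added to the archive.
    best = pop[max(range(n), key=fitness.__getitem__)]
    archive.append(best)
    return fitness
```

Games against the archive penalize strategies that beat the current population but lose to historically strong players.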
Successes of Reinforcement Learning
Reinforcement Learning ideas have been independently validated in many different application areas.
[Pie chart: RL application areas – Process Control 23%, Networking 21%, Resource Management 18%, Robotics 13%, Other 8%, Autonomic Computing 6%, Traffic 6%, Finance 4%. Survey by Csaba Szepesvari of 77 recent application papers, based on an IEEE.org search for the keywords “RL” and “application”. Examples: signal processing, natural language processing, web services, brain-computer interfaces, aircraft and engine control, bio/chemical reactors, sensor networks, routing, call admission control, network resource management, power systems, inventory control, supply chains, customer service, mobile robots, motion control, RoboCup, vision, stoplight control, trains, unmanned vehicles, load balancing, memory management, algorithm tuning, option pricing, asset management.]
The Reinforcement Learning Paradigm
Reinforcement Learning (RL)
A machine learning paradigm focused on solving problems in which an agent interacts with an environment by taking actions and receiving rewards at discrete time steps. The objective is to find a decision policy that maximizes cumulative reward.
[Diagram: agent–environment loop – 1. the environment presents state s_t; 2. the agent takes action a_t; 3. it receives reward r_t; 4. it learns from the tuple ⟨s_t, a_t, r_t, s_{t+1}⟩.]
In Othello:
agent =⇒ player
environment =⇒ game
state =⇒ board state
action =⇒ legal move
reward =⇒ game result
Key Ideas of Reinforcement Learning
The agent’s goal is to learn a policy π : S → A that maximizes the expected return R_t (a function of the future rewards r_{t+1}, r_{t+2}, ...)
cumulative discounted return: R_t = Σ_{k=0}^{∞} γ^k r_{t+k}
delayed rewards – temporal credit assignment problem
RL methods specify how the agent changes its policy as a result of experience.
trial and error search
exploration-exploitation trade-off
All efficient methods for solving sequential decision problems estimate a value function as an intermediate step: V^π(s) = E_π[R_t | s_t = s]
[Diagram: generalized policy iteration – alternating policy evaluation (V → V^π) and policy improvement (π → greedy(V)) converge to the optimal policy π* and value function V*.]
Prediction Learning Problem
Experience-outcome sequence : s1, s2, s3, ..., sT ; z
Sequence of predictions : V (s1),V (s2),V (s3), ...,V (sT ) of z
Supervised learning:
V(s_t) = V(s_t) + α[z − V(s_t)]
[Figure: driving-home example – predicted total travel time re-estimated at successive situations (leaving office, reaching car, exiting highway, secondary road, home street, arriving home); supervised learning shifts every prediction toward the actual outcome.]
Temporal difference learning:
V(s_t) = V(s_t) + α[V(s_{t+1}) − V(s_t)]
[Figure: the same example under temporal difference learning – each prediction is shifted toward the immediately following prediction.]
Prediction Learning Problem =⇒ Policy Evaluation
Sample experience following a policy π : s1, r1, s2, r2, ..., sT , rT
Sequence of estimates of V π(st) : V (s1),V (s2), ...,V (sT )
Monte-Carlo method:
V(s_t) = V(s_t) + α[R_t − V(s_t)], where R_t = Σ_{k=0}^{T−t} γ^k r_{t+k}

TD(0) method:
V(s_t) = V(s_t) + α[R_t^{(1)} − V(s_t)], where R_t^{(1)} = r_t + γV(s_{t+1})
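The two updates can be sketched on a tabular value function (an illustrative Python sketch with assumed state and reward encodings, not code from the presentation):

```python
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Monte-Carlo: after the episode ends, move each V(s_t) toward the
    full return R_t, accumulated backwards from the final step."""
    G = 0.0
    for s, r in reversed(episode):      # episode: [(s_t, r_t), ...]
        G = r + gamma * G               # R_t = r_t + gamma * R_{t+1}
        V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """TD(0): bootstrap immediately from the next state's estimate,
    using R_t^(1) = r_t + gamma * V(s_{t+1}) as the target."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```

The practical difference: `td0_update` can be applied online after every move, while `mc_update` must wait for the game result.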
Prediction Learning Problem =⇒ Policy Evaluation
Sample experience following a policy π: s_1, r_1, s_2, r_2, ..., s_T, r_T
Sequence of estimates of V^π(s_t): V(s_1), V(s_2), ..., V(s_T)

TD(λ) method:
R_t^{(n)} = Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n V(s_{t+n})
R_t^{(λ)} = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^{(n)}
[Figure: weighting of the n-step returns in the λ-return – each R_t^{(n)} receives weight (1 − λ)λ^{n−1}, decaying by λ; the actual final return receives all remaining weight; total area = 1.]

For comparison:
Monte-Carlo method: V(s_t) = V(s_t) + α[R_t − V(s_t)], R_t = Σ_{k=0}^{T−t} γ^k r_{t+k}
TD(0) method: V(s_t) = V(s_t) + α[R_t^{(1)} − V(s_t)], R_t^{(1)} = r_t + γV(s_{t+1})
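The forward-view λ-return can be computed directly from these definitions (an illustrative Python sketch for episodic tasks; after the episode ends, every n-step return equals the full return, so the remaining weight mass λ^(T−t−1) goes to the Monte-Carlo return):

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """R_t^(n) = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n V(s_{t+n})."""
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    return G + gamma ** n * values[t + n]

def lambda_return(rewards, values, t, lam, gamma=1.0):
    """Forward-view R_t^(lambda): (1 - lambda)-weighted mix of n-step
    returns, with the tail weight on the full Monte-Carlo return."""
    T = len(rewards)                       # episode ends after r_{T-1}
    G_mc = sum(gamma ** k * rewards[t + k] for k in range(T - t))
    mix = (1 - lam) * sum(
        lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
        for n in range(1, T - t))
    return mix + lam ** (T - t - 1) * G_mc
```

Sanity check: with λ = 0 this reduces to the TD(0) target R_t^{(1)}, and with λ = 1 to the Monte-Carlo return.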
Gradient-Descent Temporal Difference Learning
Tabular TD(λ) does not address the issue of generalization
The value function V_t is represented as a parameterized functional form with a modifiable parameter vector w_t
Weight update rule for gradient-descent TD(λ):
w_{t+1} = w_t + α[R_t^{(λ)} − V_t(s_t)] ∇_{w_t} V_t(s_t)
TDL applied to Othello/Go:
Position evaluation function f(b_t) used to compute the prediction P_t
Learning through modifying the WPC weight vector
Training data obtained by self-play
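For a linear evaluation function such as a WPC, ∇_w V(s) is simply the feature vector (the flattened board itself), which makes the backward view of gradient-descent TD(λ) with eligibility traces compact (an illustrative Python sketch, not the paper’s implementation):

```python
def td_lambda_step(w, e, x, r, x_next, alpha, gamma, lam):
    """One backward-view TD(lambda) step for a linear V(s) = w . x(s).
    w: weight vector, e: eligibility trace (both updated in place),
    x, x_next: feature vectors of current and next state, r: reward."""
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    delta = r + gamma * v_next - v            # TD error
    for i in range(len(w)):
        e[i] = gamma * lam * e[i] + x[i]      # decay trace, add gradient x
        w[i] += alpha * delta * e[i]
    return delta
```

With λ = 0 the trace is just the current feature vector and the step reduces to plain gradient-descent TD(0).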
Coevolutionary Temporal Difference Learning
Coevolutionary Temporal Difference Learning
A hybrid of coevolutionary search and reinforcement learning that works by interlacing one-population competitive coevolution with temporal difference learning.
The population of players is subject to alternating learning phases:
TDL phase – each population member plays k games with itself
CEL phase – a single round-robin tournament between population members (and, optionally, also archival individuals)
Each TDL–CEL cycle is followed by the standard stages of fitness assignment, selection and recombination.
CEL performs exploration of the solution space, while TDL is responsible for its exploitation by means of local search.
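The alternating phases can be sketched at a high level (a hypothetical Python sketch; `self_play_td`, `round_robin` and `breed` are placeholder hooks for the components described above, not the paper’s API):

```python
def ctdl(pop, cycles, k, self_play_td, round_robin, breed):
    """CTDL main loop: interlace TDL self-play with coevolution."""
    for _ in range(cycles):
        # TDL phase: each member refines its weights by k self-play
        # games, with TD updates applied along each game (exploitation).
        for individual in pop:
            for _ in range(k):
                self_play_td(individual)
        # CEL phase: a round-robin tournament yields fitness values.
        fitness = round_robin(pop)
        # Standard fitness assignment, selection and recombination
        # produce the next population (exploration).
        pop = breed(pop, fitness)
    return pop
```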
Performance vs. Random Othello Player
[Plot: probability of winning against a random Othello player (0.5–0.9) vs. games played (×100,000, 0–40), for CTDL + HoF, CTDL, TDL, CEL + HoF and CEL.]
Performance vs. Heuristic Player
[Plot: probability of winning against the heuristic player (0–0.5) vs. games played (×100,000, 0–40), for CTDL + HoF, CTDL, TDL, CEL + HoF and CEL.]
Relative Performance Progress Over Time
[Plot: points scored in tournaments (4,000–13,000) vs. games played (×100,000, 0–40), for CTDL + HoF, CTDL, TDL, CEL + HoF and CEL.]
Experiment with Negative Learning Rate
[Plot: probability of winning (0–0.5) vs. games played (×1,000, 0–4,500), comparing TDL + CEL + HoF with a standard and with a negative learning rate.]
Best Evolved Othello WPC
 1.02 -0.27  0.55 -0.10  0.08  0.47 -0.38  1.00
-0.13 -0.52 -0.18 -0.07 -0.18 -0.29 -0.68 -0.44
 0.55 -0.24  0.02 -0.01 -0.01  0.10 -0.13  0.77
-0.10 -0.10  0.01 -0.01  0.00 -0.01 -0.09 -0.05
 0.05 -0.17  0.02 -0.04 -0.03  0.03 -0.09 -0.05
 0.56 -0.25  0.05  0.02 -0.02  0.17 -0.35  0.42
-0.25 -0.71 -0.24 -0.23 -0.08 -0.29 -0.63 -0.24
 0.93 -0.44  0.55  0.22 -0.15  0.74 -0.57  0.97
TD(λ) Performance vs. Go Heuristic Player
[Plot: probability of winning against the Go heuristic player (0–0.7) vs. games played (×100,000, 0–20), for TD(λ) with λ = 0.0, 0.4, 0.8, 0.9, 0.95 and 1.0.]
Performance vs. Go Heuristic Player
[Plot: probability of winning against the Go heuristic player (0–0.7) vs. games played (×100,000, 0–20), for CTDL + HoF, CTDL, TDL, CEL + HoF and CEL.]
Performance vs. Average Liberty Player
[Plot: probability of winning against the Average Liberty Player (0–0.7) vs. games played (×100,000, 0–20), for CTDL + HoF, CTDL, TDL, CEL + HoF and CEL.]
Relative Performance Progress Over Time
[Plot: points scored in tournaments (4,000–12,000) vs. games played (×100,000, 0–20), for CTDL + HoF, CTDL, TDL, CEL + HoF and CEL.]
Summary
CTDL benefits from the mutually complementary characteristics of its constituent methods.
It retains an unsupervised character – useful when knowledge of the problem domain is unavailable or expensive to obtain.
CTDL admits an interesting biological interpretation as an analogy to the theory of Lamarckian evolution.
Further investigation of CTDL in the context of other challenging problems is needed.
Future Work
Employing a more complex learner architecture than a WPC
Using CTDL with two-population coevolution, with solutions and tests bred separately (learner–teacher paradigm)
Including more advanced archive methods such as LAPCA or IPCA
Changing the character of the TDL phase so that it influences only the evaluation process (which would model the Baldwin effect)
Thank You