TRANSCRIPT
Coevolutionary Reinforcement Learning for Othello and Small-Board Go
Marcin Szubert, Wojciech Jaskowski, Krzysztof Krawiec
Institute of Computing Science, Poznan University of Technology
April 12, 2010
Introduction Methods Experimental Results Summary and Conclusions
Outline
1 Introduction: Inspiration; Motivation and Objectives
2 Methods: Coevolution; Reinforcement Learning; Coevolutionary Reinforcement Learning
3 Experimental Results: CTDL for Othello; CTDL(λ) for Small-Board Go
4 Summary and Conclusions
Coevolutionary Reinforcement Learning for Othello and Small-Board Go 2 / 41 M. Szubert, W.Jaskowski, K.Krawiec
Inspiration — Samuel’s Checkers Player
“Some Studies in Machine Learning Using the Game of Checkers” – A. L. Samuel, IBM Journal of Research and Development, 1959
Games provide a convenient vehicle for the development of learning procedures as contrasted with a problem taken from life, since many of the complications of detail are removed.
Arthur Lee Samuel
[Figure: two-ply minimax game tree with alternating MAX and MIN levels; annotations mark a previously visited state and a computed polynomial value.]
Learning methods based on Shannon’s minimax procedure:
Rote Learning – handcrafted scoring polynomial
Learning by Generalization – polynomial modification
Inspiration — Different Views of Samuel’s Work
Samuel was one of the first to make effective use of heuristic search methods and of what we would now call temporal difference learning.
Richard Sutton & Andrew Barto
To elaborate the analogy with evolutionary computation, Samuel’s procedure can be called a coevolutionary algorithm with two populations of size 1, asynchronous population updates, and domain-specific, deterministic variation operators.
Anthony Bucci
Inspiration — Lucas & Runarsson
“Temporal Difference Learning Versus Co-Evolution for Acquiring Othello Position Evaluation” – S. Lucas and T. Runarsson, IEEE Symposium on Computational Intelligence and Games, 2006
Why is this work interesting?
Learning with little a priori knowledge
Comparison of a coevolutionary algorithm and a non-evolutionary approach
Complementary advantages of these methods revealed by experimental results
Motivation
Past observations
Temporal Difference Learning (TDL) is much faster
Coevolutionary Learning (CEL) can eventually produce better strategies if parameters are tuned properly
Is it possible to combine the advantages of TDL and CELin a single hybrid algorithm?
We propose the Coevolutionary Temporal Difference Learning (CTDL) method and evaluate it on the board games of Othello and small-board Go.
We also incorporate a simple Hall of Fame (HoF) archive.
Objectives
Objective: learn a game-playing strategy represented by the weights of a weighted piece counter (WPC).
 1.00 -0.25  0.10  0.05  0.05  0.10 -0.25  1.00
-0.25 -0.25  0.01  0.01  0.01  0.01 -0.25 -0.25
 0.10  0.01  0.05  0.02  0.02  0.05  0.01  0.10
 0.05  0.01  0.02  0.01  0.01  0.02  0.01  0.05
 0.05  0.01  0.02  0.01  0.01  0.02  0.01  0.05
 0.10  0.01  0.05  0.02  0.02  0.05  0.01  0.10
-0.25 -0.25  0.01  0.01  0.01  0.01 -0.25 -0.25
 1.00 -0.25  0.10  0.05  0.05  0.10 -0.25  1.00

f(b) = Σ_{i=1}^{8×8} w_i b_i
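The WPC evaluation above can be sketched in a few lines of Python (an illustrative sketch, not code from the presentation; the board encoding of +1 for the player’s pieces, −1 for the opponent’s and 0 for empty squares is an assumption):

```python
def wpc_evaluate(weights, board):
    """f(b) = sum_i w_i * b_i for an 8x8 board flattened to 64 cells."""
    assert len(weights) == len(board) == 64
    return sum(w * b for w, b in zip(weights, board))

# Example: one own piece on a corner (weight 1.00) and one opponent
# piece on the adjacent square (weight -0.25).
weights = [0.0] * 64
weights[0], weights[1] = 1.00, -0.25
board = [0] * 64
board[0], board[1] = 1, -1
print(wpc_evaluate(weights, board))  # 1.00*1 + (-0.25)*(-1) = 1.25
```

A greedy WPC player would simply apply this function to every board reachable by a legal move and pick the move with the highest value.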
The emphasis throughout all of these studies has been on learning techniques. The temptation to improve the machine’s game by giving it standard openings or other man-generated knowledge of playing techniques has been consistently resisted.
Arthur Lee Samuel
Coevolutionary Algorithm
Coevolution in nature
Reciprocally induced evolutionary change between two or more interacting species or populations.
The simplest variant of a one-population generational competitive coevolutionary algorithm:

Algorithm 1 Basic scheme of a generational evolutionary algorithm
1: P ← createRandomPopulation()
2: evaluatePopulation(P)
3: while ¬terminationCondition() do
4:   S ← selectParents(P)
5:   P ← recombineAndMutate(S)
6:   evaluatePopulation(P)
7: end while
8: return getFittestIndividual(P)
What mainly distinguishes coevolution from a standard EA?
context-sensitive evaluation phase
no objective fitness ⇒ no guarantee of progress
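The scheme above, specialized to the one-population competitive setting, can be sketched as follows (a hypothetical Python sketch; `play`, `mutate` and binary tournament selection are placeholder choices, not the presentation’s exact setup):

```python
import random

def round_robin_fitness(pop, play):
    """Context-sensitive evaluation: each individual plays every other
    once; fitness is the sum of game outcomes."""
    n = len(pop)
    fitness = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            outcome = play(pop[i], pop[j])  # +1 win for i, 0 draw, -1 loss
            fitness[i] += outcome
            fitness[j] -= outcome
    return fitness

def coevolve(pop_size, generations, play, mutate, random_individual):
    pop = [random_individual() for _ in range(pop_size)]
    best = pop[0]
    for _ in range(generations):
        fitness = round_robin_fitness(pop, play)
        best = pop[max(range(pop_size), key=fitness.__getitem__)]
        # Binary tournament selection + mutation -> next generation.
        def select():
            a, b = random.randrange(pop_size), random.randrange(pop_size)
            return pop[a] if fitness[a] >= fitness[b] else pop[b]
        pop = [mutate(select()) for _ in range(pop_size)]
    return best
```

Note that fitness here is defined only relative to the current population — the “no objective fitness” caveat above.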
Coevolutionary Fitness Assignment
Common interaction patterns:
[Fig. 2.1: round-robin tournament interaction scheme, (a) within one population, (b) between two populations.]
Round-robin tournament: each individual plays every available partner, which gives the most accurate evaluation but requires n(n − 1)/2 games within a single population of n members and nm games between two populations of sizes n and m – computationally expensive for large populations.
Single Elimination Tournament (SET): a more efficient interaction pattern for one-population coevolution.
How to aggregate interaction results into a single fitness value?
calculate the sum of all interaction outcomes
use competitive fitness sharing
problem of measurement
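The first two aggregation options can be sketched as follows (an illustrative Python sketch; the encoding `results[i][j] == 1` when i beat j is our assumption, and the sharing rule follows the usual competitive-fitness-sharing idea of weighting a win against j by one over the number of players that beat j):

```python
def fitness_sum(results):
    """results[i][j] = 1 if i beat j, else 0. Fitness = plain sum of wins."""
    return [sum(row) for row in results]

def fitness_sharing(results):
    """Competitive fitness sharing: a win against opponent j is worth
    1/N_j, where N_j is how many players beat j, so beating a
    rarely-beaten opponent counts for more."""
    n = len(results)
    beaten_by = [sum(results[i][j] for i in range(n)) for j in range(n)]
    return [sum(results[i][j] / beaten_by[j]
                for j in range(n) if results[i][j])
            for i in range(n)]

# Example: A beats B and C; B beats C.
results = [[0, 1, 1],
           [0, 0, 1],
           [0, 0, 0]]
print(fitness_sum(results))      # [2, 1, 0]
print(fitness_sharing(results))  # [1.5, 0.5, 0] - A's win over C (beaten
                                 # twice) is worth only 1/2
```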
Coevolutionary Archive
Maintaining historical players in the Hall of Fame (HoF) archive for breeding and evaluation purposes.
Evaluation phase flowchart:
1. Play a round-robin tournament between population members
2. Randomly select archival individuals to act as opponents
3. Select the best-of-generation individual and add it to the archive
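The three steps above can be sketched as one evaluation routine (a hypothetical Python sketch; the function names and fitness bookkeeping are ours, not the paper’s API):

```python
import random

def evaluate_with_hof(pop, archive, play, n_archival=5):
    """HoF evaluation phase: round-robin inside the population, extra
    games against randomly drawn archival opponents, then the
    best-of-generation individual joins the archive."""
    n = len(pop)
    fitness = [0.0] * n
    # 1. Round-robin tournament between population members.
    for i in range(n):
        for j in range(i + 1, n):
            outcome = play(pop[i], pop[j])  # +1 win for i, 0 draw, -1 loss
            fitness[i] += outcome
            fitness[j] -= outcome
    # 2. Randomly selected archival individuals act as opponents.
    opponents = random.sample(archive, min(n_archival, len(archive)))
    for i in range(n):
        for opp in opponents:
            fitness[i] += play(pop[i], opp)
    # 3. Best-of-generation individual is added to the archive.
    best = pop[max(range(n), key=fitness.__getitem__)]
    archive.append(best)
    return fitness
```

Games against the archive penalize strategies that beat the current population but lose to historically strong players.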
Successes of Reinforcement Learning
Reinforcement Learning ideas have been independently validated in many different application areas.
[Pie chart: RL application areas – Process Control 23%, Networking 21%, Resource Management 18%, Robotics 13%, Other 8%, Autonomic Computing 6%, Traffic 6%, Finance 4%. Survey by Csaba Szepesvari of 77 recent application papers, based on an IEEE.org search for the keywords “RL” and “application”. Examples: signal processing, natural language processing, web services, brain-computer interfaces, aircraft and engine control, bio/chemical reactors, sensor networks, routing, call admission control, network resource management, power systems, inventory control, supply chains, customer service, mobile robots, motion control, RoboCup, vision, stoplight control, trains, unmanned vehicles, load balancing, memory management, algorithm tuning, option pricing, asset management.]
The Reinforcement Learning Paradigm
Reinforcement Learning (RL)
A machine learning paradigm focused on solving problems in which an agent interacts with an environment by taking actions and receiving rewards at discrete time steps. The objective is to find a decision policy that maximizes cumulative reward.
[Diagram: agent–environment loop – 1. the environment presents state s_t; 2. the agent takes action a_t; 3. it receives reward r_t; 4. it learns from the tuple ⟨s_t, a_t, r_t, s_{t+1}⟩.]
In Othello:
agent =⇒ player
environment =⇒ game
state =⇒ board state
action =⇒ legal move
reward =⇒ game result
Key Ideas of Reinforcement Learning
The agent’s goal is to learn a policy π : S → A that maximizes the expected return R_t (a function of the future rewards r_{t+1}, r_{t+2}, ...)
cumulative discounted return: R_t = Σ_{k=0}^{∞} γ^k r_{t+k}
delayed rewards – temporal credit assignment problem
RL methods specify how the agent changes its policy as a result of experience.
trial and error search
exploration-exploitation trade-off
All efficient methods for solving sequential decision problems estimate a value function as an intermediate step: V^π(s) = E_π[R_t | s_t = s]
[Diagram: generalized policy iteration – alternating policy evaluation (V → V^π) and policy improvement (π → greedy(V)) converge to the optimal policy π* and value function V*.]
Prediction Learning Problem
Experience-outcome sequence : s1, s2, s3, ..., sT ; z
Sequence of predictions : V (s1),V (s2),V (s3), ...,V (sT ) of z
Supervised learning:
V(s_t) = V(s_t) + α[z − V(s_t)]
[Figure: driving-home example – predicted total travel time re-estimated at successive situations (leaving office, reaching car, exiting highway, secondary road, home street, arriving home); supervised learning shifts every prediction toward the actual outcome.]
Temporal difference learning:
V(s_t) = V(s_t) + α[V(s_{t+1}) − V(s_t)]
[Figure: the same example under temporal difference learning – each prediction is shifted toward the immediately following prediction.]
Prediction Learning Problem =⇒ Policy Evaluation
Sample experience following a policy π : s1, r1, s2, r2, ..., sT , rT
Sequence of estimates of V π(st) : V (s1),V (s2), ...,V (sT )
Monte-Carlo method:
V(s_t) = V(s_t) + α[R_t − V(s_t)], where R_t = Σ_{k=0}^{T−t} γ^k r_{t+k}

TD(0) method:
V(s_t) = V(s_t) + α[R_t^{(1)} − V(s_t)], where R_t^{(1)} = r_t + γV(s_{t+1})
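The two updates can be sketched on a tabular value function (an illustrative Python sketch with assumed state and reward encodings, not code from the presentation):

```python
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Monte-Carlo: after the episode ends, move each V(s_t) toward the
    full return R_t, accumulated backwards from the final step."""
    G = 0.0
    for s, r in reversed(episode):      # episode: [(s_t, r_t), ...]
        G = r + gamma * G               # R_t = r_t + gamma * R_{t+1}
        V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """TD(0): bootstrap immediately from the next state's estimate,
    using R_t^(1) = r_t + gamma * V(s_{t+1}) as the target."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```

The practical difference: `td0_update` can be applied online after every move, while `mc_update` must wait for the game result.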
Prediction Learning Problem =⇒ Policy Evaluation
Sample experience following a policy π: s_1, r_1, s_2, r_2, ..., s_T, r_T
Sequence of estimates of V^π(s_t): V(s_1), V(s_2), ..., V(s_T)

TD(λ) method:
R_t^{(n)} = Σ_{k=0}^{n−1} γ^k r_{t+k} + γ^n V(s_{t+n})
R_t^{(λ)} = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^{(n)}
[Figure: weighting of the n-step returns in the λ-return – each R_t^{(n)} receives weight (1 − λ)λ^{n−1}, decaying by λ; the actual final return receives all remaining weight; total area = 1.]

For comparison:
Monte-Carlo method: V(s_t) = V(s_t) + α[R_t − V(s_t)], R_t = Σ_{k=0}^{T−t} γ^k r_{t+k}
TD(0) method: V(s_t) = V(s_t) + α[R_t^{(1)} − V(s_t)], R_t^{(1)} = r_t + γV(s_{t+1})
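The forward-view λ-return can be computed directly from these definitions (an illustrative Python sketch for episodic tasks; after the episode ends, every n-step return equals the full return, so the remaining weight mass λ^(T−t−1) goes to the Monte-Carlo return):

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """R_t^(n) = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n V(s_{t+n})."""
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    return G + gamma ** n * values[t + n]

def lambda_return(rewards, values, t, lam, gamma=1.0):
    """Forward-view R_t^(lambda): (1 - lambda)-weighted mix of n-step
    returns, with the tail weight on the full Monte-Carlo return."""
    T = len(rewards)                       # episode ends after r_{T-1}
    G_mc = sum(gamma ** k * rewards[t + k] for k in range(T - t))
    mix = (1 - lam) * sum(
        lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
        for n in range(1, T - t))
    return mix + lam ** (T - t - 1) * G_mc
```

Sanity check: with λ = 0 this reduces to the TD(0) target R_t^{(1)}, and with λ = 1 to the Monte-Carlo return.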
Gradient-Descent Temporal Difference Learning
Tabular TD(λ) does not address the issue of generalization
The value function V_t is represented as a parameterized functional form with a modifiable parameter vector w_t
Weight update rule for gradient-descent TD(λ):
w_{t+1} = w_t + α[R_t^{(λ)} − V_t(s_t)] ∇_{w_t} V_t(s_t)
TDL applied to Othello/Go:
Position evaluation function f(b_t) used to compute the prediction P_t
Learning through modifying the WPC weight vector
Training data obtained by self-play
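For a linear evaluation function such as a WPC, ∇_w V(s) is simply the feature vector (the flattened board itself), which makes the backward view of gradient-descent TD(λ) with eligibility traces compact (an illustrative Python sketch, not the paper’s implementation):

```python
def td_lambda_step(w, e, x, r, x_next, alpha, gamma, lam):
    """One backward-view TD(lambda) step for a linear V(s) = w . x(s).
    w: weight vector, e: eligibility trace (both updated in place),
    x, x_next: feature vectors of current and next state, r: reward."""
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))
    delta = r + gamma * v_next - v            # TD error
    for i in range(len(w)):
        e[i] = gamma * lam * e[i] + x[i]      # decay trace, add gradient x
        w[i] += alpha * delta * e[i]
    return delta
```

With λ = 0 the trace is just the current feature vector and the step reduces to plain gradient-descent TD(0).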
Coevolutionary Temporal Difference Learning
Coevolutionary Temporal Difference Learning
A hybrid of coevolutionary search and reinforcement learning that works by interlacing one-population competitive coevolution with temporal difference learning.
The population of players is subject to alternating learning phases:
TDL phase – each population member plays k games with itself
CEL phase – a single round-robin tournament between population members (and, optionally, also archival individuals)
Each TDL–CEL cycle is followed by the standard stages of fitness assignment, selection and recombination.
CEL performs exploration of the solution space, while TDL is responsible for its exploitation by means of local search.
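The alternating phases can be sketched at a high level (a hypothetical Python sketch; `self_play_td`, `round_robin` and `breed` are placeholder hooks for the components described above, not the paper’s API):

```python
def ctdl(pop, cycles, k, self_play_td, round_robin, breed):
    """CTDL main loop: interlace TDL self-play with coevolution."""
    for _ in range(cycles):
        # TDL phase: each member refines its weights by k self-play
        # games, with TD updates applied along each game (exploitation).
        for individual in pop:
            for _ in range(k):
                self_play_td(individual)
        # CEL phase: a round-robin tournament yields fitness values.
        fitness = round_robin(pop)
        # Standard fitness assignment, selection and recombination
        # produce the next population (exploration).
        pop = breed(pop, fitness)
    return pop
```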
Performance vs. Random Othello Player
[Plot: probability of winning against a random Othello player (0.5–0.9) vs. games played (×100,000, 0–40), for CTDL + HoF, CTDL, TDL, CEL + HoF and CEL.]
Performance vs. Heuristic Player
[Plot: probability of winning against the heuristic player (0–0.5) vs. games played (×100,000, 0–40), for CTDL + HoF, CTDL, TDL, CEL + HoF and CEL.]
Relative Performance Progress Over Time
[Plot: points scored in tournaments (4,000–13,000) vs. games played (×100,000, 0–40), for CTDL + HoF, CTDL, TDL, CEL + HoF and CEL.]
Experiment with Negative Learning Rate
[Plot: probability of winning (0–0.5) vs. games played (×1,000, 0–4,500), comparing TDL + CEL + HoF with a standard and with a negative learning rate.]
Best Evolved Othello WPC
 1.02 -0.27  0.55 -0.10  0.08  0.47 -0.38  1.00
-0.13 -0.52 -0.18 -0.07 -0.18 -0.29 -0.68 -0.44
 0.55 -0.24  0.02 -0.01 -0.01  0.10 -0.13  0.77
-0.10 -0.10  0.01 -0.01  0.00 -0.01 -0.09 -0.05
 0.05 -0.17  0.02 -0.04 -0.03  0.03 -0.09 -0.05
 0.56 -0.25  0.05  0.02 -0.02  0.17 -0.35  0.42
-0.25 -0.71 -0.24 -0.23 -0.08 -0.29 -0.63 -0.24
 0.93 -0.44  0.55  0.22 -0.15  0.74 -0.57  0.97
TD(λ) Performance vs. Go Heuristic Player
[Plot: probability of winning against the Go heuristic player (0–0.7) vs. games played (×100,000, 0–20), for TD(λ) with λ = 0.0, 0.4, 0.8, 0.9, 0.95 and 1.0.]
Performance vs. Go Heuristic Player
[Plot: probability of winning against the Go heuristic player (0–0.7) vs. games played (×100,000, 0–20), for CTDL + HoF, CTDL, TDL, CEL + HoF and CEL.]
Performance vs. Average Liberty Player
[Plot: probability of winning against the Average Liberty Player (0–0.7) vs. games played (×100,000, 0–20), for CTDL + HoF, CTDL, TDL, CEL + HoF and CEL.]
Relative Performance Progress Over Time
[Plot: points scored in tournaments (4,000–12,000) vs. games played (×100,000, 0–20), for CTDL + HoF, CTDL, TDL, CEL + HoF and CEL.]
Summary
CTDL benefits from the mutually complementary characteristics of its constituent methods.
It retains an unsupervised character – useful when knowledge of the problem domain is unavailable or expensive to obtain.
CTDL admits an interesting biological interpretation as an analogy to the theory of Lamarckian evolution.
Further investigation of CTDL in the context of other challenging problems is needed.
Future Work
Employing a more complex learner architecture than a WPC
Using CTDL with two-population coevolution, with solutions and tests bred separately (learner–teacher paradigm)
Including more advanced archive methods such as LAPCA or IPCA
Changing the character of the TDL phase so that it influences only the evaluation process (which would model the Baldwin effect)
Thank You