Monte-Carlo Go Reinforcement Learning Experiments
Paper by Bruno Bouzy, Guillaume Chaslot, IEEE 2006
Presented by Markus Enzenberger. Go Seminar, University of Alberta.
November 16, 2006
Background
  Indigo project
Monte-Carlo Go
  Basic Monte-Carlo Go
  Monte-Carlo Go with specific knowledge
Reinforcement Learning and Monte-Carlo Go
  The randomness dimension
Experiments
  Experiment 1: One program vs. population
  Experiment 2: Relative difference at MC level
Conclusion
Indigo project
Basic Monte-Carlo Go
- General MC model (Abramson)
- All-moves-as-first heuristic (Brügmann)
- Standard deviation of random-game scores on 9×9: ≈ 35 points
- Desired precision: 1 point
- 1000 games ≡ 68% confidence
- 4000 games ≡ 95% confidence
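A back-of-the-envelope check of the last three bullets, assuming the usual standard-error argument with σ ≈ 35 for 9×9 random-game scores:

```latex
% Standard error of the mean score over N random games:
%   sigma_mean = sigma / sqrt(N),  with sigma ~ 35 points on 9x9
\[
\frac{35}{\sqrt{1000}} \approx 1.1 \text{ pts}
  \;\Rightarrow\; \pm 1 \text{ pt covers} \approx 68\% \ (1\,\sigma_{\text{mean}}),
\qquad
\frac{35}{\sqrt{4000}} \approx 0.55 \text{ pts}
  \;\Rightarrow\; \pm 1 \text{ pt covers} \approx 95\% \ (2\,\sigma_{\text{mean}}).
\]
```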
Progressive pruning
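The figure for this slide did not survive extraction. For orientation, here is a minimal sketch of the idea, assuming the usual formulation from Bouzy's earlier Monte-Carlo Go work: a move is pruned once its mean score plus a confidence margin falls below the best move's mean minus that margin. Names and thresholds below are illustrative, not the paper's exact values.

```python
import math

def progressive_pruning(outcomes, n_sigma=2.0, min_games=30):
    """Prune moves that are statistically inferior to the current best move.

    outcomes: dict mapping move -> list of scores from random games so far.
    A move survives while [mean - margin, mean + margin] still overlaps the
    best move's interval, with margin = n_sigma * (std dev of the mean).
    Illustrative sketch only; the thresholds here are assumptions.
    """
    bounds = {}
    for move, scores in outcomes.items():
        n = len(scores)
        if n < min_games:
            # Too few games to judge: keep the move alive unconditionally.
            bounds[move] = (float("inf"), float("-inf"))
            continue
        mean = sum(scores) / n
        var = sum((s - mean) ** 2 for s in scores) / n
        margin = n_sigma * math.sqrt(var / n)
        bounds[move] = (mean + margin, mean - margin)  # (upper, lower)

    best_lower = max(lower for _, lower in bounds.values())
    return [m for m, (upper, _) in bounds.items() if upper >= best_lower]

# Example: move "C3" scores clearly worse and gets pruned.
data = {"D5": [20, 25, 22] * 20, "C3": [-30, -28, -35] * 20}
print(progressive_pruning(data))  # -> ['D5']
```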
Advantages of MC
- Bouzy, Helmstetter: MC programs on par with Indigo2002
- MC takes advantage of faster computers
- Heavily knowledge-based programs are less robust
- Global view (avoids errors due to decomposition)
- Easy to develop
Monte-Carlo Go with specific knowledge
Indigo2003
Two ways of associating Go knowledge with MC:
- Pre-select moves for MC
- Insert knowledge into the random games ("pseudo-random" := non-uniform move probability)
Two Modules of Indigo2003
Pseudo-Random (PR) player
- Capture-escape urgency: fill the last liberty of a one-liberty string; urgency linear in string size
- Cut-connect urgency: 3×3 patterns (smaller: insufficient; larger: lower urgency)
- Total urgency is the sum of both urgencies
- Probability of a move is proportional to its total urgency (see the sketch below)
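A minimal sketch of the last two bullets: summing the two urgency sources and sampling a move with probability proportional to the total. The urgency values and the small base weight are illustrative assumptions; the paper's actual tables are not reproduced here.

```python
import random

def total_urgency(capture_urgency, pattern_urgency, legal_moves, base=1):
    """Total urgency per move = capture-escape urgency + cut-connect urgency.

    `base` gives every legal move a small non-zero weight so the policy
    stays stochastic (an assumption, not stated on the slide).
    """
    return {m: capture_urgency.get(m, 0) + pattern_urgency.get(m, 0) + base
            for m in legal_moves}

def sample_move(urgencies):
    """Pick a move with probability proportional to its total urgency."""
    moves = list(urgencies)
    return random.choices(moves, weights=[urgencies[m] for m in moves], k=1)[0]

# Example: "B2" escapes an atari on a six-stone string, "C3" matches a
# cut/connect 3x3 pattern; both urgency values are made up for illustration.
urg = total_urgency({"B2": 6}, {"C3": 4}, ["A1", "B2", "C3"])
print(sample_move(urg))  # "B2" is drawn most often, but never deterministically
```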
MC(Manual) vs. MC(Zero)
- Manual: urgencies of 3×3 patterns assigned by a human expert
- The patterns correspond to Go concepts such as cut and connect
- Non-zero and high values occur only sparsely
Reinforcement Learning and Monte-Carlo Go
- Q-learning for learning action values
- p1 being better than p2 (at the random level) does not necessarily mean that MC(p1) is better than MC(p2) (at the MC level); for example, a stronger but nearly deterministic random policy explores fewer variations, which can bias the MC evaluation
- Randomization vs. determinism
The randomness dimension
Experiment 1: One program vs. population
- Patterns have action values Q ∈ ]−1, +1[
- Urgency: U = ((1 + Q) / (1 − Q))^k
- Value update after playing a block of b games: Q_play ← Q_play + λ_b · Q_learn
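A small numeric sketch of the two formulas above. Reading λ_b as a scalar learning rate for the block is an assumption; the extracted slide is ambiguous about the notation.

```python
def urgency(q, k=3):
    """U = ((1 + Q) / (1 - Q))**k maps Q in ]-1, +1[ to a positive urgency."""
    assert -1 < q < 1
    return ((1 + q) / (1 - q)) ** k

def block_update(q_play, q_learn, lambda_b=0.1):
    """After a block of games: Q_play <- Q_play + lambda_b * Q_learn.

    lambda_b is treated here as a plain learning rate (an assumption).
    """
    return q_play + lambda_b * q_learn

# Q = 0 gives urgency 1; Q = 0.5 with k = 3 gives (1.5 / 0.5)**3 = 27,
# so even moderate Q-values quickly dominate the move distribution.
print(urgency(0.0), urgency(0.5))
```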
Experiment 1a: one unique learning program
- RLPR ≫ Zero
- RLPR < Manual
- MC(RLPR) ≪ MC(Manual)
- Highly dependent on initial conditions
- Q-values are learnt in the first block; afterwards, only these Q-values are increased
Experiment 1b: a population of learning programs
- Evolutionary computing
- Test against Zero and Manual
- Select and duplicate the best programs according to some scheme (a generic sketch follows the results below)

Starting population = Zero:
- RLPR = Zero + 180
- RLPR = Manual − 50
- MC(RLPR) ≪ MC(Manual)

Starting population = Manual:
- RLPR = Zero + 180
- RLPR = Manual + 50
- MC(RLPR) = MC(Manual) − 20
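The selection scheme itself is not specified on the slide. As a purely generic illustration (not the paper's method), here is truncation selection: keep the top fraction of the population by fitness and refill by duplicating survivors, optionally with mutation.

```python
import random

def truncation_step(population, fitness, keep_frac=0.5, mutate=None):
    """One generation of a generic select-and-duplicate scheme.

    population: list of programs (e.g., pattern-urgency tables).
    fitness: callable scoring a program, e.g., match results vs. Zero/Manual.
    Purely illustrative; the paper's actual scheme is not given on the slide.
    """
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: max(1, int(len(population) * keep_frac))]
    refill = [mutate(random.choice(survivors)) if mutate else random.choice(survivors)
              for _ in range(len(population) - len(survivors))]
    return survivors + refill

# Example with toy one-number "programs" and identity fitness:
print(truncation_step([3, 1, 4, 1, 5, 9], fitness=lambda p: p))
```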
- RLPR programs have a tendency toward determinism
- Using EC reduces the problem but does not remove it completely
Experiment 2: relative difference at MC level
- MC(RLPR) plays against itself again
- Update rule to avoid determinism, for a pair of patterns a, b with target urgency ratio u_a / u_b = exp(C (V_a − V_b)):
  δ = Q_a − Q_b − C (V_a − V_b)
  Q_a ← Q_a − α δ
  Q_b ← Q_b + α δ
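A small sketch of this update. Note the sign on the second line: the extracted text shows a minus for both patterns, but then the difference Q_a − Q_b would never move, so the plus sign is a reconstruction; the C and α values below are illustrative.

```python
def pairwise_update(q_a, q_b, v_a, v_b, c=1.0, alpha=0.1):
    """Nudge Q_a - Q_b toward the target C * (V_a - V_b).

    delta = Q_a - Q_b - C*(V_a - V_b)
    Q_a <- Q_a - alpha * delta
    Q_b <- Q_b + alpha * delta   (sign reconstructed, see lead-in)
    """
    delta = q_a - q_b - c * (v_a - v_b)
    return q_a - alpha * delta, q_b + alpha * delta

# Repeated updates drive the Q-difference toward C*(V_a - V_b) = 0.1,
# so the urgency ratio stays bounded and the policy stays stochastic.
qa, qb = 0.8, -0.8   # initially a near-deterministic preference for pattern a
for _ in range(100):
    qa, qb = pairwise_update(qa, qb, v_a=0.2, v_b=0.1)
print(round(qa - qb, 3))  # -> 0.1
```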
9×9:
- MC(RLPR) = MC(Manual) + 3

19×19 (learning on 9×9):
- MC(RLPR) = MC(Manual) − 30
Conclusion
- Determinism was identified as an obstacle
- Evolutionary computing was used to avoid determinism (but showed strong dependence on initial conditions)
- Relative differences were used to avoid determinism