term project: reinforcement learning applied to othellogtucker/finalpresentation.pdf · what is...

19
Term Project: Reinforcement learning applied to Othello GEORGE TUCKER Othello GEORGE TUCKER

Upload: others

Post on 20-Jun-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Term Project:Reinforcement learning applied to

Othello

G E O R G E T U C K E R

Othello

G E O R G E T U C K E R

Page 2: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Othello

Page 3: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

What is reinforcement learning?

After a sequence of actions get a rewardq gPositive or negative

Temporal credit assignment problemDetermine credit for the reward

Temporal Difference MethodsTemporal Difference MethodsTD-lambda

Q-learning (TD(0))

Page 4: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Contrast to Conventional Strategies

Most methods use an evaluation function

Use minimax/alpha-beta search

Hand-designed feature detectorsgEvaluation function is a weighted sum

So why TD learning?Does not need hand coded features

li iGeneralization

Page 5: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Temporal Difference Learning

Page 6: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Temporal Difference Learning

Page 7: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Key Observation

If we let

at time step t. Then at time step t+1,p p ,

Page 8: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Temporal Difference Learning

If we let λ = 0, then, we get, , g

Widrow-Hoff rule

Makes Yt closer to Yt+1t t+1

Page 9: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Disadvantage

Requires lots of trainingq g

Self-playShort-term pathologies

Randomization

Page 10: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Setup

Board: 64 element vector4+1 = black

0 = empty

hi-1 = white

Corresponds to human representation

NetworkNetwork64 inputs

30 hidden nodes – Sigmoid activation

Single outputGoal: predict final game score

Page 11: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Setup

TD lambda learninggLambda = 0.3

Learning rate = 0.005

R d i d Reward is endgame score

Move selectionEvaluate every legal 1 ply moveEvaluate every legal 1 ply move

Choose randomly with exponential weight

Page 12: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Player Handling

Two Neural Networks

Board inversionOn white’s move, invert board and score

Faster and superior learning

Page 13: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Training Data

Recall:Random play

Fixed opponent

D b lDatabase play

Self-play

I focused on:Database playDatabase play

Self-play

Page 14: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Opponent

Java Othellowww.luthman.nu/Othello/Othello.html

Variable levels corresponding to ply depth

Used as benchmark

Trained against

Page 15: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Database Training

Logistello databaseg120,000 games

Fastless than 30 minutes to train on the full set

Wins 10% games against a 1 ply opponent

0.4

0.5

0.6

0.1

0.2

0.3

0

0 20000 40000 60000 80000 100000 120000

Page 16: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Self-play

Extremely slow improvementy pEven after nearly 2,000,000 iterations almost no improvement

O l i % f i t l tOnly wins 1% of games against 1 ply opponent

0.5

0.6

0.3

0.4

0.5

0

0.1

0.2

0

0 100000 200000 300000 400000 500000

Page 17: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Two Ply Opponent

Opponent looks ahead one ply and chooses the best pp p ymove

Much slower by a factor of 6 or more

0.5

0.6

0.7

0.3

0.4

0.5

0

0.1

0.2

0 20000 40000 60000 80000 100000 120000

Page 18: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Website

For source code and reference materialwww.cs.hmc.edu/~gtucker/othello.html

Page 19: Term Project: Reinforcement learning applied to Othellogtucker/finalPresentation.pdf · What is reinforcement learning? yAfter a seqqguence of actions get a reward {Positive or negativeyTemporal

Conclusions

Board inversion should definitely be usedInitially, at least self-play is poorDatabase play significantly improves networkAsymmetric self-play is far superior to standard self-playPlaying a fixed opponent may be bestPlaying a fixed opponent may be best

Future WorkFuture WorkAdd in additional feature detectorsInvestigate more advanced depth play