
UNIVERSITY OF TENNESSEE

Implementation of Dot & Box Game Agent using Reinforcement Learning Technique

Submitted By:
Md. Badruddoja Majumder (Student ID# 000427702)
Gangotree Chakma (Student ID# 000428264)


1. Introduction:

Machine learning is a powerful tool for solving maze-like and analytical problems efficiently. It is useful when conventional algorithms are too complex to develop, and also in cases where the environment is not static. This adaptability is an important characteristic for handling new challenges. In our project we have utilized two separate machine learning algorithms to build an agent for the Dots and Boxes game.

2. Background of the project:

The project is based on Dots & Boxes, a simple game usually played with paper and pencil. It was first published by Édouard Lucas in 1889. Typically the game board consists of nine squares formed by four dots on each side (a 4 × 4 grid of dots), as shown in Figure 1a; however, game boards of arbitrary size can be formed. The game is mostly played by two players, but there is no upper limit on the number of players that can participate. Between two adjacent dots a player can draw either a vertical or a horizontal line if there is not already a line between them. Each time four lines complete a small box, a point is awarded to the player who drew the last line of that box, and that player must then draw another line. When all boxes in the grid have been completed, the player with the most points wins. If the game ends in a tie, the player who drew the first line loses.

We have implemented the Dots and Boxes game using two reinforcement learning algorithms: the Monte Carlo method and the Q-learning method.

2.1. Monte Carlo Method:
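As used in this project (see the flowchart in Section 3.4), the Monte Carlo method lets the agent play a complete episode, i.e. a full game, record the rewards it receives, and only at the end of the episode update the value of every state-action pair it visited using the observed return. A minimal tabular sketch of such an end-of-episode update is shown below; the names (update_monte_carlo, Q, counts) and the incremental-average form are illustrative assumptions, not the project's actual code.

```python
# Minimal every-visit Monte Carlo value update (illustrative sketch).
# Q maps (state, action) -> estimated return; counts tracks how many
# returns have been observed, so Q holds an incremental average.
from collections import defaultdict

Q = defaultdict(float)
counts = defaultdict(int)

def update_monte_carlo(episode, gamma=1.0):
    """episode: list of (state, action, reward) tuples, in the order played."""
    G = 0.0
    # Work backwards so G accumulates the discounted return from each step onward.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        counts[(state, action)] += 1
        # Incremental average of all returns seen for this state-action pair.
        Q[(state, action)] += (G - Q[(state, action)]) / counts[(state, action)]
```

Because nothing is updated until the episode ends, the Monte Carlo agent needs complete games before any learning takes place, in contrast with the per-step bootstrapped update of Q-learning described next.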

2.2. Q Learning Method: Q-learning is a reinforcement learning algorithm developed by Chris Watkins in 1989. Its main advantage is that it is simple and thus easy to implement; the core of the algorithm consists of only one line of code. In Q-learning, almost any problem can be organized into a set of situations called states. In our game, for instance, each different set of drawn lines makes up a different state, and the legal moves available in a state are called actions. The simplest form of Q-learning stores a value for each state-action pair in a matrix or table. The algorithm first checks the Q-value of every state it can reach in one step, then takes the maximum of these future values and incorporates it into the current Q-value. When feedback is given, the only value that is updated is the Q-value corresponding to the state-action pair that produced the feedback. The rule for updating Q-values is shown below:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \quad [1]

Here, s_t and a_t are the state and action at a given time t, r_{t+1} is the resulting reward, α is the learning rate and γ is the discount factor. The fact that only one value is updated when feedback is given gives Q-learning an interesting property: it takes a while for the feedback to propagate backwards through the table. The next time a certain state-action pair is evaluated, its value is updated with regard to the states it can lead to directly. Through repeated iterations the algorithm converges to a single stable solution.
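As a concrete illustration of equation [1], a minimal tabular Q-learning update could look like the sketch below. The names (q_table, alpha, gamma, q_update) and the dictionary-based table are assumptions made for illustration, not the project's actual implementation.

```python
# Minimal tabular Q-learning update implementing equation [1] (illustrative sketch).
from collections import defaultdict

q_table = defaultdict(float)          # maps (state, action) -> Q-value
alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def q_update(state, action, reward, next_state, next_actions):
    """Apply one Q-learning step after taking `action` in `state`."""
    # Best value achievable from the next state (0 if the game has ended).
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    # The core one-line update: move Q toward the bootstrapped target.
    q_table[(state, action)] += alpha * (reward + gamma * best_next
                                         - q_table[(state, action)])
```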

3. Project Design:

3.1. Design objective:

Dots & Boxes is a traditional board game that is normally played by two people at a time. Often one may not have anybody to play with, so a computer player is needed to act as the opponent. To make the game enjoyable, the computer player should be appropriately skilled. We use two reinforcement learning algorithms, the Monte Carlo method and Q-learning, to create such a computer player, and the aim is to analyze the performance and efficiency of this player when it faces different opponents. The project is designed to give an overview of how these opponents affect and improve the rate of progress and the end result of the reinforcement learning agent.

3.2. Design Challenges:

One of the most important design challenges was the state space size, i.e. the dimension of the Dots and Boxes matrix. The states are defined as game configurations, where each line is unique and can be either drawn or not drawn. In a normal-size Dots & Boxes game, the grid consists of 16 dots, so a grid can contain up to 24 lines. The number of states is thus 2^24 (about 16 million). An action is defined as drawing a line, so there are 24 different actions that may be chosen. Not all lines may be drawn in each state, however, since some lines have already been drawn; the mean number of lines that can be drawn in a state is 12. Hence the number of state-action pairs is 2^24 × 12 (about 200 million). Because it is difficult to work with such a huge amount of data, we chose a grid of 9 dots and 4 boxes, which reduces the number of state-action pairs to 2^12 × 6 (about 25 thousand).
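The sizing above follows directly from the line counts of each board; as a quick check (an illustrative calculation, not project code):

```python
# Quick check of the state and state-action counts quoted above.
lines_4x4, lines_3x3 = 24, 12          # lines on a 4x4-dot and a 3x3-dot board
print(2 ** lines_4x4)                  # 16777216 states (~16 million)
print(2 ** lines_4x4 * 12)             # ~201 million state-action pairs
print(2 ** lines_3x3 * 6)              # 24576 state-action pairs (~25 thousand)
```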

Another challenge was the exploration phase of the agent. The design lets the agent explore all possible states so that it can determine the highest-valued states and thus arrive at a better solution.

3.3. Technical Approach:

To solve the problem with reinforcement learning algorithms, we have defined the states, actions, agents and the board symmetry by the following method.

3.3.1. State: The whole board configuration represents the game's state. All the lines that can be drawn on the board by connecting two adjacent dots are numbered 1, 2, 3, …, 12 (for a 3 by 3 board). Every drawn line is represented as a 1 at the corresponding index of a row matrix.


Figure: States

This row matrix is essentially the state for the game. If no lines have been drawn yet, i.e. at the beginning of the game, the state is S = [0 0 0 0 0 0 0 0 0 0 0 0]. If all of the lines have been drawn, i.e. at the end of the game, the state is S = [1 1 1 1 1 1 1 1 1 1 1 1]. The total number of possible states is 2^12.
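A minimal sketch of this representation, using a tuple of 12 binary flags (the helper names are illustrative assumptions, not the project's code):

```python
# State = 12 binary flags, one per numbered line on the 3x3-dot board.
NUM_LINES = 12

def initial_state():
    """No lines drawn yet: S = [0 0 0 ... 0]."""
    return (0,) * NUM_LINES

def draw_line(state, line_index):
    """Return the new state after drawing line `line_index` (1-based)."""
    s = list(state)
    s[line_index - 1] = 1
    return tuple(s)
```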

3.3.2. Actions: In a particular game state, the lines still available to draw form the action space for that state. If the game is in state S = [1 0 1 1 0 0 1 1 1 1 0 0], the available actions are drawing line 2, 5, 6, 11 or 12. At the beginning of the game there are 12 possible actions, and at the end of the game there are no more actions. The average number of state-action pairs for this 3 by 3 configuration of the game is 2^12 × 6 = 24576.
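Enumerating the action space from a state vector is then straightforward, continuing the illustrative representation sketched above:

```python
def available_actions(state):
    """1-based indices of lines that have not been drawn yet."""
    return [i + 1 for i, drawn in enumerate(state) if drawn == 0]

# Example from the text: lines 2, 5, 6, 11 and 12 remain available.
print(available_actions((1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0)))  # [2, 5, 6, 11, 12]
```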

3.3.3. Agents: The game is played by two agents. These agents can be of three types.

i. Random agent

ii. Reinforcement Learning agent

iii. Graphical agent
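Of these, the random agent simply picks one of the currently available lines uniformly at random; a one-function sketch, building on the illustrative helpers above, could be:

```python
import random

def random_agent_move(state):
    """The 'random agent': choose an undrawn line uniformly at random."""
    return random.choice(available_actions(state))
```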


3.3.4. Symmetry: There are 4 axes of symmetry on the board, so every state has four additional symmetrical states. If we number all the lines of the board in the four corresponding ways, all of the resulting states indicate the same state. We can leverage this idea to make the agent learn about the states of the board more quickly: whenever the agent visits a state, it updates all of the symmetrical states as well.

Figure: Symmetry
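One way to implement this is to precompute, for each symmetry, the permutation that maps line numbers of the original board to line numbers of the reflected board, and then apply every value update to all equivalent state-action pairs at once. The sketch below assumes such permutations are available; LINE_PERMUTATIONS is a placeholder, since the concrete index mappings depend on how the 12 lines are numbered.

```python
# Each permutation maps position i of the symmetric state to a position of the
# original state: sym_state[i] = state[perm[i]]. Only the identity is listed
# here; the four reflections would be added once the line numbering is fixed.
LINE_PERMUTATIONS = [
    tuple(range(12)),
]

def update_with_symmetry(q_table, state, action, delta):
    """Apply the same value change to every symmetric copy of (state, action)."""
    for perm in LINE_PERMUTATIONS:
        sym_state = tuple(state[p] for p in perm)
        sym_action = perm.index(action - 1) + 1   # remap the 1-based line number too
        q_table[(sym_state, sym_action)] += delta
```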


3.4. Algorithm: The flowchart of the working principle of the Monte Carlo method is given below.

Figure: Flowchart of the Monte Carlo method. The agent and its opponent alternate moves, with a player moving again after completing a box and rewards being saved as they occur; when all lines on the board have been drawn, the episode ends and all state-action values visited by the agent are updated using the Monte Carlo method.
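A sketch of this episode loop is shown below. It is illustrative only: the env object, its reset/step/done methods, and the policy functions are assumed helpers, while the turn handling (a player who completes a box moves again) and the end-of-episode update follow the flowchart.

```python
def play_episode(env, agent_policy, opponent_policy):
    """One training game following the flowchart: play to the end, then update."""
    state = env.reset()
    episode = []                       # (state, action, reward) visited by the agent
    agent_turn = True
    while not env.done():
        policy = agent_policy if agent_turn else opponent_policy
        action = policy(state)
        next_state, reward, completed_box = env.step(action)
        if agent_turn:
            episode.append((state, action, reward))
        # A player who completes a box moves again; otherwise the turn passes.
        if not completed_box:
            agent_turn = not agent_turn
        state = next_state
    update_monte_carlo(episode)        # end of episode: Monte Carlo update (Section 2.1)
    return episode
```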


The flowchart of the working principle of the Q-learning method is given below.

Figure: Flowchart of the Q-learning method. The agent and its opponent alternate moves in the same way as in the Monte Carlo flowchart; at the end of the episode, the state-action values visited by the agent are updated using the Q-learning rule of equation [1].


3.5. A typical Game flow:

Figure: A typical game flow, shown as screenshots of steps 1 through 5.


4. Experimental Results:

Training against self-play: After 20000 training games using self-play, the agent had visited only around 3500 states, whereas it had visited close to 15000 states when playing against a random opponent without leveraging symmetry.

Because the number of visited states is very small compared to the overall state-action space, the agent knows nothing at all about the unvisited states. In this case the agent's performance against a strong player is expected to be low. To verify this, we arranged 10 games in which the agent played against a human opponent. The result is as follows:

Won: 1
Lost: 7
Drawn: 2

Figure: Winning percentage of the agent in every 10 games against a human opponent. These games were played after each round of training games against the random opponent.


Figure 2: States visited by the agent after every 1000 games.

Figure 3: Number of visited states after every 1000 games played against a copy of itself.
