Dynamic Programming and Reinforcement Learning applied to Tetris Game
TRANSCRIPT
[Page 1]
Dynamic Programming and Reinforcement Learning applied to Tetris game
Suelen Goularte Carvalho
Inteligência Artificial 2015
[Page 2]
Tetris
[Page 3]
Tetris
✓ Board 20 × 10
✓ 7 types of tetrominoes (pieces)
✓ Pieces move down, left, or right
✓ Pieces can be rotated
[Page 4]
Tetris One-Piece Controller
Player knows: ✓ board ✓ current piece.
[Page 5]
Tetris Two-Piece Controller
Player knows: ✓ board ✓ current piece ✓ next piece
[Page 6]
Tetris Evaluation
One-Piece Controller
Two-Piece Controller
[Page 7]
How many possibilities do we have just here?
[Page 8]
Tetris indeed contains a huge number of board configurations. Finding the strategy that maximizes the average score is an NP-complete problem!
— Building Controllers for Tetris, 2009
7.0 × 2^199 ≃ 5.6 × 10^60
[Page 9]
Tetris Complexity
[Page 10]
Tetris is a problem of sequential decision making under uncertainty. In the context of dynamic programming and stochastic control, the most important object is the cost-to-go function, which evaluates the expected future cost from the current state.
— Feature-Based Methods for Large Scale Dynamic Programming
[Page 11]
[Diagram: from state Si, candidate moves offer immediate rewards of 1000, 2500, 3000, 4000, 5000, and 7000, leading to future rewards of 9000 and 13000; the move with the best immediate reward is not the one with the best future reward. Immediate reward vs. future reward.]
[Page 12]
7.0 × 2^199 ≃ 5.6 × 10^60
It is essentially impossible to compute, or even store, the value of the cost-to-go function at every possible state.
— Feature-Based Methods for Large Scale Dynamic Programming
[Page 13]
A compact representation alleviates the time and space costs of dynamic programming, which employs an exhaustive look-up table, storing one value per state.
— Feature-Based Methods for Large Scale Dynamic Programming
S = {s1, s2, …, sn} → V = {v1, v2, …, vm}, where m < n
[Page 14]
For example, if the state i represents the number of customers in a queueing
system, a possible and often interesting feature f is defined by f(0) = 0 and f(i) = 1 if i > 0. Such a feature focuses on whether
a queue is empty or not.
— Feature-Based Methods for Large Scale Dynamic Programming
[Page 15]
— Feature-Based Methods for Large Scale Dynamic Programming
Feature-based method
S = {s1, s2, …, sn} → V = {v1, v2, …, vm}, where m < n
[Page 16]
— Feature-Based Methods for Large Scale Dynamic Programming
Features:
★ Height of the current wall.
★ Number of holes.
H = {0, ..., 20}, L = {0, ..., 200}.
Feature extraction F : S → H × L, on the 10 × 20 board.
Feature-based method
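The two slide features (wall height and number of holes) can be sketched in Python. The function name and the board encoding (a list of 20 rows of 10 cells, 1 = filled, row 0 at the top) are illustrative assumptions, not part of the cited paper:

```python
def extract_features(board):
    """Map a Tetris board to the two slide features:
    wall height (in H) and number of holes (in L)."""
    rows, cols = len(board), len(board[0])
    # height of the current wall: distance from the bottom row
    # to the highest filled cell (0 for an empty board)
    height = 0
    for r in range(rows):
        if any(board[r]):
            height = rows - r
            break
    # holes: empty cells with at least one filled cell above them
    holes = 0
    for c in range(cols):
        covered = False
        for r in range(rows):
            if board[r][c]:
                covered = True
            elif covered:
                holes += 1
    return height, holes
```

This is the F : S → H × L map in miniature: the huge set of boards collapses onto at most 21 × 201 feature pairs.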
[Page 17]
Using a feature-based evaluation function works better than just choosing the move that realizes the highest immediate reward.
— Building Controllers for Tetris, 2009
[Page 18]
Example of features
— Building Controllers for Tetris, 2009
[Page 19]
...The problem of building a Tetris controller comes down to building a good evaluation function. Ideally,
this function should return high values for the good decisions and
low values for the bad ones.
— Building Controllers for Tetris, 2009
![Page 20: Dynamic Programming and Reinforcement Learning applied to Tetris Game](https://reader034.vdocuments.site/reader034/viewer/2022051318/5881e6411a28ab36088b6183/html5/thumbnails/20.jpg)
Reinforcement Learning context, algorithms aim at
tuning the weights such that the evaluation function approximates well the
optimal expected future score from each state.
— Building Controllers for Tetris, 2009
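A weighted, feature-based evaluation of this kind can be sketched as follows; the weights and the candidate feature vectors are purely illustrative, not the values used by any published controller:

```python
def evaluate(features, weights):
    # linear evaluation function: V(s) = sum_i w_i * f_i(s)
    return sum(w * f for w, f in zip(weights, features))

def best_move(candidates, weights):
    # index of the move whose resulting board evaluates highest
    return max(range(len(candidates)),
               key=lambda i: evaluate(candidates[i], weights))

# hypothetical weights penalizing wall height and holes
weights = [-1.0, -4.0]
# (height, holes) of the board after each candidate move (made-up values)
candidates = [(5, 2), (7, 0), (4, 1)]
print(best_move(candidates, weights))  # -> 1
```

Tuning the two weights is exactly what the RL algorithms described in the quote are doing, at much larger scale.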
[Page 21]
Reinforcement Learning
[Page 22]
Reinforcement Learning by The Big Bang Theory
https://www.youtube.com/watch?v=tV7Zp2B_mt8&list=PLAF3D35931B692F5C
[Page 23]
Reinforcement Learning
Imagine playing a new game whose rules you do not know; after roughly a hundred moves, your opponent announces: "You lost!". In short, that is reinforcement learning.
[Page 24]
Supervised Learning
input: 1 2 3 4 5 6 7 8 …
output: 1 4 9 16 25 36 49 64 …
y = f(x) → function approximation
https://www.youtube.com/watch?v=Ki2iHgKxRBo&list=PLAwxTw4SYaPl0N6-e1GvyLp5-MUMUjOKo
Map inputs to outputs: f(x) = x²
labels scores well
[Page 25]
Unsupervised Learning
[Figure: unlabeled points (x's and o's) grouped into two clusters]
f(x) → clusters description
clusters scores well
[Page 26]
Reinforcement Learning
[Diagram: the Agent takes an Action in the Environment, which returns a Reward and a State]
behaviors scores well
[Page 27]
Reinforcement Learning
✓ Agents take actions in an environment and receive rewards
✓ Goal is to find the policy π that maximizes rewards
✓ Inspired by research into psychology and animal learning
[Page 28]
Reinforcement Learning Model
Given: S, a set of states; A, a set of actions; T(s, a, s′) ~ P(s′ | s, a), a transition model; R, a reward function.
[Diagram: state Si with immediate rewards 1000, 2500, 3000, 4000, 5000, 7000 and future rewards 9000, 13000]
Find: π(s) = a policy that maximizes rewards
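Given such an (S, A, T, R) model, the policy can be found by value iteration, a standard dynamic-programming method. The two-state MDP below is a made-up toy for illustration, not Tetris:

```python
# Toy value iteration for the (S, A, T, R) model above.
S = [0, 1]
A = ["stay", "go"]
# T[s][a] = list of (next_state, probability)
T = {0: {"stay": [(0, 1.0)], "go": [(1, 1.0)]},
     1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]}}
R = {0: {"stay": 0.0, "go": 1.0},
     1: {"stay": 2.0, "go": 0.0}}
gamma = 0.9  # discount factor for future rewards

def q(s, a, V):
    # expected immediate reward plus discounted future reward
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])

V = {s: 0.0 for s in S}
for _ in range(100):  # Bellman backups until (approximate) convergence
    V = {s: max(q(s, a, V) for a in A) for s in S}

# greedy policy: pick the action that maximizes expected return
policy = {s: max(A, key=lambda a: q(s, a, V)) for s in S}
print(policy)  # -> {0: 'go', 1: 'stay'}
```

The catch, as the next slide notes, is that this table-based approach needs one value per state, which is hopeless at Tetris scale.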
[Page 29]
This demands heavy computation, processing, and memory.
[Page 30]
Dynamic Programming
[Page 31]
Dynamic Programming
Solving a problem by breaking it down into simpler subproblems, solving each subproblem just once, and storing their solutions.
https://en.wikipedia.org/wiki/Dynamic_programming
[Page 32]
[Diagram: the optimal path from A to G passes through B; the subpaths A → B and B → G are themselves optimal paths]
Supporting property: Optimal Substructure
[Page 33]
Fibonacci Sequence
0 1 1 2 3 5 8 13 21
Each number is the sum of the two preceding numbers.
[Page 34]
Recursive formula: f(n) = f(n-1) + f(n-2)
n = 0 1 2 3 4 5 6 7 8
v = 0 1 1 2 3 5 8 13 21
Fibonacci Sequence
[Page 35]
Fibonacci
n = 0 1 2 3 4 5 6 7 8
v = 0 1 1 2 3 5 8 13 21
f(6) = f(6-1) + f(6-2) = f(5) + f(4) = 5 + 3 = 8
[Page 36]
Fibonacci Sequence - Normal computation
f(n) = f(n-1) + f(n-2)
[Diagram: naive recursion tree for f(6), expanding every call down to f(1) and f(0)]
[Page 37]
Fibonacci Sequence - Normal computation
[Diagram: the same recursion tree for f(6), with repeated subtrees highlighted]
O(2^n) time
[Page 38]
18 of 25 Nodes Are Repeated Calculations!
[Page 39]
Dictionary m; m[0] = 0, m[1] = 1

integer fib(n):
    if m[n] == null:
        m[n] = fib(n-1) + fib(n-2)
    return m[n]

Fibonacci Sequence - Dynamic Programming
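A runnable Python version of the memoized pseudocode above, using the standard-library `functools.lru_cache` in place of the dictionary m:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # plays the role of the dictionary m
def fib(n: int) -> int:
    if n < 2:  # base cases: m[0] = 0, m[1] = 1
        return n
    return fib(n - 1) + fib(n - 2)  # each value is computed only once

print(fib(8))  # -> 21
```

With memoization, the 25-node tree from the previous slide shrinks to one computation per distinct n: O(n) time instead of exponential.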
[Page 40]
Fibonacci Sequence - Dynamic Programming
[Diagram: recursion tree for f(5); memo table (index 0–5) with values 0, 1 filled in]
[Page 41]
Fibonacci Sequence - Dynamic Programming
[Diagram: memo table grows to 0, 1, 1 (1+0=1)]
[Page 42]
Fibonacci Sequence - Dynamic Programming
[Diagram: memo table grows to 0, 1, 1, 2 (1+1=2)]
[Page 43]
Fibonacci Sequence - Dynamic Programming
[Diagram: memo table grows to 0, 1, 1, 2, 3 (2+1=3)]
[Page 44]
Fibonacci Sequence - Dynamic Programming
[Diagram: memo table complete: 0, 1, 1, 2, 3, 5 (3+2=5)]
O(1) memory, O(n) running time
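The O(n)-time, O(1)-memory bottom-up computation the slide refers to can be sketched as:

```python
def fib(n: int) -> int:
    # keep only the last two values: O(n) time, O(1) memory
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print([fib(i) for i in range(9)])  # -> [0, 1, 1, 2, 3, 5, 8, 13, 21]
```

Since each table entry depends only on the previous two, the full table never needs to be stored.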
[Page 45]
Some scores over time…
Tsitsiklis and van Roy (1996): 31 (100 games played)
Bertsekas and Tsitsiklis (1996): 3200 (100 games played)
Kakade (2001): 6800 (without specifying how many game scores are averaged, though)
Farias and van Roy (2006): 4700 (90 games played)
— Building Controllers for Tetris, 2009
[Page 46]
Dellacherie (Fahey, 2003): one-piece controller, tuned by hand, 56 games played: 660 thousand. Current best!
Dellacherie (Fahey, 2003): two-piece controller with some original features whose weights were tuned by hand; only 1 game was played and it took a week: 7.2 million.
— Building Controllers for Tetris, 2009
[Page 47]
Experiment…
[Page 48]
Experiment
— Feature-Based Methods for Large Scale Dynamic Programming
An experienced human Tetris player would take about 3 minutes to eliminate 30 rows.
[Page 49]
Experiment cont.
20 players. 3 games each. 3 minutes per game.
Average obtained: 24 points.
[Page 50]
Experiment cont.
Player 7 (me), game 1
1000 points ≈ 1 row
[Page 51]
Experiment cont.
• Average: 24 points every 3 minutes.
• That is, 5,760 points per 12 hours of continuous play.
• A human player only begins to approach the algorithms' performance, after some optimizations, after roughly 8 hours of continuous play.
[Page 52]
Conclusion…
[Page 53]
Dynamic Programming: optimizes the use of computational power.
Reinforcement Learning: optimizes the weights applied to the features.
Tetris: uses a feature-based approach to maximize the score.
[Page 54]
Questions?
Suelen Goularte Carvalho
Inteligência Artificial
2015