Dynamic Programming and Reinforcement Learning applied to Tetris Game
TRANSCRIPT
[Page 1]
Dynamic Programming and Reinforcement Learning applied to Tetris game
Suelen Goularte Carvalho
Inteligência Artificial 2015
[Page 2]
Tetris
[Page 3]
Tetris
✓ Board 20 × 10
✓ 7 types of tetrominoes (pieces)
✓ Pieces move down, left, or right
✓ Pieces can be rotated
[Page 4]
Tetris One-Piece Controller
Player knows: ✓ board ✓ current piece.
[Page 5]
Tetris Two-Piece Controller
Player knows: ✓ board ✓ current piece ✓ next piece
[Page 6]
Tetris Evaluation
One-Piece Controller
Two-Piece Controller
[Page 7]
How many possibilities do we have just here?
[Page 8]
Tetris indeed contains a huge number of board configurations. Finding the strategy that maximizes the average score is an NP-complete problem!
— Building Controllers for Tetris, 2009
7.0 × 2^199 ≃ 5.6 × 10^60
[Page 9]
Tetris Complexity
[Page 10]
Tetris is a problem of sequential decision making under uncertainty. In the context of dynamic programming and stochastic control, the most important object is the cost-to-go function, which evaluates the expected future cost from the current state.
— Feature-Based Methods for Large Scale Dynamic Programming
[Page 11]
[Diagram: from state Si, candidate moves offer immediate rewards of 1000, 2500, 3000, 4000, 5000, and 7000, leading to future rewards of 9000 and 13000; the move with the best immediate reward is not the one with the best future reward. Immediate reward vs. future reward.]
[Page 12]
7.0 × 2^199 ≃ 5.6 × 10^60
It is essentially impossible to compute, or even store, the value of the cost-to-go function at every possible state.
— Feature-Based Methods for Large Scale Dynamic Programming
[Page 13]
A compact representation alleviates the time and space costs of dynamic programming, which employs an exhaustive look-up table, storing one value per state.
— Feature-Based Methods for Large Scale Dynamic Programming
S = {s1, s2, …, sn} → V = {v1, v2, …, vm}, where m < n
[Page 14]
For example, if the state i represents the number of customers in a queueing
system, a possible and often interesting feature f is defined by f(0) = 0 and f(i) = 1 if i > 0. Such a feature focuses on whether
a queue is empty or not.
— Feature-Based Methods for Large Scale Dynamic Programming
[Page 15]
— Feature-Based Methods for Large Scale Dynamic Programming
Feature-based method
S = {s1, s2, …, sn} → V = {v1, v2, …, vm}, where m < n
[Page 16]
— Feature-Based Methods for Large Scale Dynamic Programming
Features:
★ Height of the current wall.
★ Number of holes.
H = {0, ..., 20}, L = {0, ..., 200}.
Feature extraction F : S → H × L, on the 10 × 20 board.
Feature-based method
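The two slide features (wall height and number of holes) can be sketched in Python. The function name and the board encoding (a list of 20 rows of 10 cells, 1 = filled, row 0 at the top) are illustrative assumptions, not part of the cited paper:

```python
def extract_features(board):
    """Map a Tetris board to the two slide features:
    wall height (in H) and number of holes (in L)."""
    rows, cols = len(board), len(board[0])
    # height of the current wall: distance from the bottom row
    # to the highest filled cell (0 for an empty board)
    height = 0
    for r in range(rows):
        if any(board[r]):
            height = rows - r
            break
    # holes: empty cells with at least one filled cell above them
    holes = 0
    for c in range(cols):
        covered = False
        for r in range(rows):
            if board[r][c]:
                covered = True
            elif covered:
                holes += 1
    return height, holes
```

This is the F : S → H × L map in miniature: the huge set of boards collapses onto at most 21 × 201 feature pairs.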
[Page 17]
Using a feature-based evaluation function works better than just choosing the move that realizes the highest immediate reward.
— Building Controllers for Tetris, 2009
[Page 18]
Example of features
— Building Controllers for Tetris, 2009
[Page 19]
...The problem of building a Tetris controller comes down to building a good evaluation function. Ideally,
this function should return high values for the good decisions and
low values for the bad ones.
— Building Controllers for Tetris, 2009
![Page 20: Dynamic Programming and Reinforcement Learning applied to Tetris Game](https://reader034.vdocuments.site/reader034/viewer/2022051318/5881e6411a28ab36088b6183/html5/thumbnails/20.jpg)
Reinforcement Learning context, algorithms aim at
tuning the weights such that the evaluation function approximates well the
optimal expected future score from each state.
— Building Controllers for Tetris, 2009
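A weighted, feature-based evaluation of this kind can be sketched as follows; the weights and the candidate feature vectors are purely illustrative, not the values used by any published controller:

```python
def evaluate(features, weights):
    # linear evaluation function: V(s) = sum_i w_i * f_i(s)
    return sum(w * f for w, f in zip(weights, features))

def best_move(candidates, weights):
    # index of the move whose resulting board evaluates highest
    return max(range(len(candidates)),
               key=lambda i: evaluate(candidates[i], weights))

# hypothetical weights penalizing wall height and holes
weights = [-1.0, -4.0]
# (height, holes) of the board after each candidate move (made-up values)
candidates = [(5, 2), (7, 0), (4, 1)]
print(best_move(candidates, weights))  # -> 1
```

Tuning the two weights is exactly what the RL algorithms described in the quote are doing, at much larger scale.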
[Page 21]
Reinforcement Learning
[Page 22]
Reinforcement Learning by The Big Bang Theory
https://www.youtube.com/watch?v=tV7Zp2B_mt8&list=PLAF3D35931B692F5C
[Page 23]
Reinforcement Learning
Imagine playing a new game whose rules you do not know; after roughly a hundred moves, your opponent announces: "You lost!". In short, that is reinforcement learning.
[Page 24]
Supervised Learning
input: 1 2 3 4 5 6 7 8 …
output: 1 4 9 16 25 36 49 64 …
y = f(x) → function approximation
https://www.youtube.com/watch?v=Ki2iHgKxRBo&list=PLAwxTw4SYaPl0N6-e1GvyLp5-MUMUjOKo
Map inputs to outputs: f(x) = x²
labels scores well
[Page 25]
Unsupervised Learning
[Figure: unlabeled points (x's and o's) grouped into two clusters]
f(x) → clusters description
clusters scores well
[Page 26]
Reinforcement Learning
[Diagram: the Agent takes an Action in the Environment, which returns a Reward and a State]
behaviors scores well
[Page 27]
Reinforcement Learning
✓ Agents take actions in an environment and receive rewards
✓ Goal is to find the policy π that maximizes rewards
✓ Inspired by research into psychology and animal learning
[Page 28]
Reinforcement Learning Model
Given: S, a set of states; A, a set of actions; T(s, a, s′) ~ P(s′ | s, a), a transition model; R, a reward function.
[Diagram: state Si with immediate rewards 1000, 2500, 3000, 4000, 5000, 7000 and future rewards 9000, 13000]
Find: π(s) = a policy that maximizes rewards
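Given such an (S, A, T, R) model, the policy can be found by value iteration, a standard dynamic-programming method. The two-state MDP below is a made-up toy for illustration, not Tetris:

```python
# Toy value iteration for the (S, A, T, R) model above.
S = [0, 1]
A = ["stay", "go"]
# T[s][a] = list of (next_state, probability)
T = {0: {"stay": [(0, 1.0)], "go": [(1, 1.0)]},
     1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]}}
R = {0: {"stay": 0.0, "go": 1.0},
     1: {"stay": 2.0, "go": 0.0}}
gamma = 0.9  # discount factor for future rewards

def q(s, a, V):
    # expected immediate reward plus discounted future reward
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])

V = {s: 0.0 for s in S}
for _ in range(100):  # Bellman backups until (approximate) convergence
    V = {s: max(q(s, a, V) for a in A) for s in S}

# greedy policy: pick the action that maximizes expected return
policy = {s: max(A, key=lambda a: q(s, a, V)) for s in S}
print(policy)  # -> {0: 'go', 1: 'stay'}
```

The catch, as the next slide notes, is that this table-based approach needs one value per state, which is hopeless at Tetris scale.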
[Page 29]
This demands heavy computation, processing, and memory.
[Page 30]
Dynamic Programming
[Page 31]
Dynamic Programming
Solving a problem by breaking it down into simpler subproblems, solving each subproblem just once, and storing their solutions.
https://en.wikipedia.org/wiki/Dynamic_programming
[Page 32]
[Diagram: the optimal path from A to G passes through B; the subpaths A → B and B → G are themselves optimal paths]
Supporting property: Optimal Substructure
[Page 33]
Fibonacci Sequence
0 1 1 2 3 5 8 13 21
Each number is the sum of the two preceding numbers.
[Page 34]
Recursive formula: f(n) = f(n-1) + f(n-2)
n = 0 1 2 3 4 5 6 7 8
v = 0 1 1 2 3 5 8 13 21
Fibonacci Sequence
[Page 35]
Fibonacci
n = 0 1 2 3 4 5 6 7 8
v = 0 1 1 2 3 5 8 13 21
f(6) = f(6-1) + f(6-2) = f(5) + f(4) = 5 + 3 = 8
[Page 36]
Fibonacci Sequence - Normal computation
f(n) = f(n-1) + f(n-2)
[Diagram: naive recursion tree for f(6), expanding every call down to f(1) and f(0)]
[Page 37]
Fibonacci Sequence - Normal computation
[Diagram: the same recursion tree for f(6), with repeated subtrees highlighted]
O(2^n) time
[Page 38]
18 of 25 Nodes Are Repeated Calculations!
[Page 39]
Dictionary m; m[0] = 0, m[1] = 1

integer fib(n):
    if m[n] == null:
        m[n] = fib(n-1) + fib(n-2)
    return m[n]

Fibonacci Sequence - Dynamic Programming
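A runnable Python version of the memoized pseudocode above, using the standard-library `functools.lru_cache` in place of the dictionary m:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # plays the role of the dictionary m
def fib(n: int) -> int:
    if n < 2:  # base cases: m[0] = 0, m[1] = 1
        return n
    return fib(n - 1) + fib(n - 2)  # each value is computed only once

print(fib(8))  # -> 21
```

With memoization, the 25-node tree from the previous slide shrinks to one computation per distinct n: O(n) time instead of exponential.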
[Page 40]
Fibonacci Sequence - Dynamic Programming
[Diagram: recursion tree for f(5); memo table (index 0–5) with values 0, 1 filled in]
[Page 41]
Fibonacci Sequence - Dynamic Programming
[Diagram: memo table grows to 0, 1, 1 (1+0=1)]
[Page 42]
Fibonacci Sequence - Dynamic Programming
[Diagram: memo table grows to 0, 1, 1, 2 (1+1=2)]
[Page 43]
Fibonacci Sequence - Dynamic Programming
[Diagram: memo table grows to 0, 1, 1, 2, 3 (2+1=3)]
[Page 44]
Fibonacci Sequence - Dynamic Programming
[Diagram: memo table complete: 0, 1, 1, 2, 3, 5 (3+2=5)]
O(1) memory, O(n) running time
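The O(n)-time, O(1)-memory bottom-up computation the slide refers to can be sketched as:

```python
def fib(n: int) -> int:
    # keep only the last two values: O(n) time, O(1) memory
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print([fib(i) for i in range(9)])  # -> [0, 1, 1, 2, 3, 5, 8, 13, 21]
```

Since each table entry depends only on the previous two, the full table never needs to be stored.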
[Page 45]
Some scores over time…
Tsitsiklis and van Roy (1996): 31 (100 games played)
Bertsekas and Tsitsiklis (1996): 3200 (100 games played)
Kakade (2001): 6800 (without specifying how many game scores are averaged, though)
Farias and van Roy (2006): 4700 (90 games played)
— Building Controllers for Tetris, 2009
[Page 46]
Dellacherie (Fahey, 2003): one-piece controller, tuned by hand, 56 games played: 660 thousand. Current best!
Dellacherie (Fahey, 2003): two-piece controller with some original features whose weights were tuned by hand; only 1 game was played and it took a week: 7.2 million.
— Building Controllers for Tetris, 2009
[Page 47]
Experiment…
[Page 48]
Experiment
— Feature-Based Methods for Large Scale Dynamic Programming
An experienced human Tetris player would take about 3 minutes to eliminate 30 rows.
[Page 49]
Experiment cont.
20 players. 3 games each. 3 minutes per game.
Average obtained: 24 points.
[Page 50]
Experiment cont.
Player 7 (me), game 1
1000 points ≈ 1 row
[Page 51]
Experiment cont.
• Average: 24 points every 3 minutes.
• That is, 5,760 points per 12 hours of continuous play.
• A human player only begins to approach the algorithms' performance, after some optimizations, after roughly 8 hours of continuous play.
[Page 52]
Conclusion…
[Page 53]
Dynamic Programming: optimizes the use of computational power.
Reinforcement Learning: optimizes the weights applied to the features.
Tetris: uses a feature-based approach to maximize the score.
[Page 54]
Questions?
Suelen Goularte Carvalho
Inteligência Artificial
2015