Reinforcement Learning and Markov Decision Processes: A Quick Introduction
Hector Munoz-Avila
Stephen Lee-Urban
www.cse.lehigh.edu/~munoz/InSyTe
Outline
- Introduction
- Adaptive Game AI; domination games in Unreal Tournament©
- Reinforcement Learning
- Adaptive Game AI with Reinforcement Learning; RETALIATE: architecture and algorithm
- Empirical Evaluation
- Final Remarks: main lessons
Introduction
Adaptive Game AI, Unreal Tournament, Reinforcement Learning
Adaptive AI in Games
| | Without (shipped) learning: non-stochastic | Without (shipped) learning: stochastic | With learning: offline | With learning: online |
|---|---|---|---|---|
| Symbolic (FOL, etc.) | Scripts | HTN planning | Trained decision tree | |
| Sub-symbolic (weights, etc.) | Stored NNs | Genetic alg. | RL offline | RL online |
In this class: Using Reinforcement Learning to accomplish Online Learning of Game AI for Team based First-Person Shooters
HTNbots: we presented this before
Lee-Urban et al, ICAPS-2007
http://www.youtube.com/watch?v=yO9CcEujJ64
Adaptive Game AI and Learning

Learning: motivation
- Combinatorial explosion of possible situations:
  - Tactics (e.g., the competing team's tactics)
  - Game worlds (e.g., the map where the game is played)
  - Game modes (e.g., domination, capture the flag)
- Little time for development

Learning: the "cons"
- The Game AI becomes difficult to control and predict
- Difficult to test
Unreal Tournament© (UT)
Online FPS developed by Epic Games Inc., released in 1999
Six gameplay modes including team deathmatch and domination games
Gamebots: a client-server architecture for controlling bots, started by the U.S.C. Information Sciences Institute (ISI)
UT Domination Games
- A number of fixed domination locations.
- Ownership: a location belongs to the team of the last player to step into it.
- Scoring: a team is awarded one point for every five seconds a location remains under its control.
- Winning: the first team to reach a pre-determined score (e.g., 50) wins.
(top-down view)
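The scoring rules above can be sketched as a tiny simulation. This is a hypothetical sketch, not code from the actual game: the function name, data layout, and timeline representation are illustrative choices.

```python
# Hypothetical sketch of the UT domination scoring rules.
SCORE_LIMIT = 50      # first team to reach this wins
POINT_INTERVAL = 5    # seconds a location must be held per point

def play_domination(ownership_per_second):
    """ownership_per_second: list of dicts (one per second of play),
    each mapping a location name to the owning team (or None)."""
    scores = {}
    held = {}  # location -> (owning team, consecutive seconds held)
    for tick in ownership_per_second:
        for loc, team in tick.items():
            if team is None:
                continue
            prev_team, secs = held.get(loc, (team, 0))
            secs = secs + 1 if prev_team == team else 1
            held[loc] = (team, secs)
            if secs % POINT_INTERVAL == 0:   # one point per 5 seconds held
                scores[team] = scores.get(team, 0) + 1
                if scores[team] >= SCORE_LIMIT:
                    return team, scores      # winner declared
    return None, scores                       # time ran out, no winner
```

For example, a team holding one location for 12 seconds earns 2 points (at the 5- and 10-second marks).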
Reinforcement Learning
Some Introductory RL Videos
- http://demo.viidea.com/ijcai09_paduraru_rle/
- http://www.youtube.com/watch?v=NR99Hf9Ke2c
- http://demo.viidea.com/ijcai09_littman_rlrl/
Reinforcement Learning
Agents learn policies through rewards and punishments
Policy - Determines what action to take from a given state (or situation)
The agent's goal is to maximize returns (example).

Tabular techniques: we maintain a "Q-table":
Q-table: State × Action → value
The DOM Game
(Map figure: domination points, walls, spawn points.)

Let's write on the blackboard: a policy for this map and a potential Q-table.
Example of a Q-table (rows: states; columns: actions)

The figure marks a "good" action, a "bad" action, and the best action identified so far for state "EFE" (the enemy controls 2 DOM points).
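As a concrete fragment of such a table, a Python dict keyed by (state, action) pairs works. The values below are made up for illustration; they are not the numbers from the slide.

```python
# Toy fragment of a Q-table for state "EFE" (enemy controls DOM points 1
# and 3). An action sends each of the three bots to a location.
Q = {
    ("EFE", ("L1", "L2", "L3")): 0.9,  # "good": cover every DOM point
    ("EFE", ("L2", "L2", "L2")): 0.1,  # "bad": all three bots camp loc 2
    ("EFE", ("L1", "L1", "L3")): 0.5,  # untried: still at its initial value
}

def best_action(Q, state):
    """Best action identified so far for a state (greedy lookup)."""
    entries = [(a, v) for (s, a), v in Q.items() if s == state]
    return max(entries, key=lambda av: av[1])[0]
```

Here `best_action(Q, "EFE")` returns `("L1", "L2", "L3")`, the highest-valued entry for that state.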
Reinforcement Learning Problem (Q-table: rows are states, columns are actions)

How can we identify, for every state, which is the BEST action to take over the long run?
Let us model the problem of finding the best build order for a Zerg rush as a reinforcement learning problem.
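One possible starting point for this exercise is sketched below. The state features, discretization, actions, and reward are all illustrative choices, not a canonical StarCraft model; many other formulations are equally valid.

```python
# Hypothetical MDP formulation of the Zerg-rush build-order problem.
ACTIONS = ["build_drone", "build_spawning_pool", "build_zergling", "wait"]

def make_state(minerals, drones, zerglings, has_pool):
    # State abstraction: discretize minerals into buckets of 50 (capped)
    # so the Q-table stays small.
    return (min(minerals // 50, 5), drones, zerglings, has_pool)

def reward(state, next_state):
    # Reward progress toward the rush: each new zergling is worth +1.
    return next_state[2] - state[2]
```

A good in-class discussion point: which features can be dropped without losing the ability to represent a good rush policy?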
Adaptive Game AI with RL
RETALIATE (Reinforced Tactic Learning in Agent-Team Environments)
The RETALIATE Team
- Controls two or more UT bots
- Commands bots to execute actions through the GameBots API
- The UT server provides sensory (state and event) information about the UT world and controls all gameplay
- Gamebots acts as middleware between the UT server and the game AI

Architecture (figure): UT ↔ GameBots API ↔ RETALIATE with its plug-in bots; the opponent team connects through its own plug-in bots.
The RETALIATE Algorithm (flowchart):
1. Initialize/restore the state-action table and the initial state.
2. Begin game.
3. Observe state.
4. Choose: with probability ε, a random applicable action; with probability 1 − ε, the applicable action with the maximum value in the state-action table.
5. Execute action.
6. Calculate the reward and update the state-action table.
7. If the game is over, stop; otherwise, return to step 3.
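The "Choose" step of the flowchart is ε-greedy action selection, which can be sketched in a few lines. The default of 0.5 for unseen pairs matches the table's initial value; the function and parameter names are illustrative.

```python
import random

def choose_action(Q, state, applicable, epsilon=0.1):
    """epsilon-greedy choice: with probability epsilon take a random
    applicable action; otherwise take the action with the maximum value
    in the state-action table (unseen pairs default to the initial 0.5)."""
    if random.random() < epsilon:
        return random.choice(applicable)
    return max(applicable, key=lambda a: Q.get((state, a), 0.5))
```

With `epsilon=0` this reduces to the purely greedy policy; with `epsilon=1` it is purely exploratory.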
Initialization
- Game model: n is the number of domination points; a state is (Owner1, Owner2, …, Ownern), where each owner is Team 1, Team 2, …, or None.
- Actions: m is the number of bots in the team; an action is (goto1, goto2, …, gotom), where each goto targets loc 1, loc 2, …, or loc n.
- For all states s and all actions a: Q[s, a] ← 0.5.
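This initialization step can be sketched directly from the slide's description. The function and owner names are illustrative; only the state/action shapes and the 0.5 initial value come from the slide.

```python
from itertools import product

def init_q_table(n_locations, n_bots, owners=("Team1", "Team2", "None")):
    """Build the initial Q-table: states are ownership tuples
    (Owner1, ..., Ownern), actions assign each bot a goto-location,
    and every entry starts at 0.5."""
    states = list(product(owners, repeat=n_locations))
    actions = list(product(range(1, n_locations + 1), repeat=n_bots))
    return {(s, a): 0.5 for s in states for a in actions}
```

For 3 locations and 3 bots this gives 27 states × 27 actions = 729 entries, matching the table-size analysis later in the talk.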
Rewards and Utilities
- U(s) = F(s) − E(s), where F(s) is the number of friendly locations and E(s) is the number of enemy-controlled locations
- R = U(s′) − U(s)
- Standard Q-learning (Sutton & Barto, 1998): Q(s, a) ← Q(s, a) + α(R + γ max_a′ Q(s′, a′) − Q(s, a))
The step-size parameter α was set to 0.2, and the discount-rate parameter γ was set close to 0.9. Thus, the most recent state-reward pairs are weighted more heavily than earlier state-reward pairs.
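Putting the utility, reward, and update rule together gives a one-step update sketch. The function names are illustrative; the formulas and the α = 0.2, γ = 0.9 settings come from the slide, and unseen table entries default to the 0.5 initial value.

```python
ALPHA, GAMMA = 0.2, 0.9   # step-size and discount-rate from the slide

def utility(state):
    """U(s) = F(s) - E(s): friendly minus enemy-held domination points.
    state is a tuple of owners, e.g. ('F', 'E', 'N')."""
    return state.count("F") - state.count("E")

def q_update(Q, s, a, s2, actions):
    """One Q-learning step: R = U(s') - U(s), then the standard update."""
    r = utility(s2) - utility(s)
    best_next = max(Q.get((s2, a2), 0.5) for a2 in actions)
    q = Q.get((s, a), 0.5)
    Q[(s, a)] = q + ALPHA * (r + GAMMA * best_next - q)
    return Q[(s, a)]
```

For example, capturing one point from an all-enemy state (utility −3 → −1) gives R = 2, so a fresh entry moves from 0.5 to 0.5 + 0.2·(2 + 0.9·0.5 − 0.5) = 0.89.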
State Information and Actions
Sensory information from the GameBots API includes: position (x, y, z), player scores, team scores, domination location ownership, map, time limit, score limit, max # teams, max team size, navigation (path nodes, …), reachability, items (id, type, location, …), and events (hear, incoming, …).

Bot actions include: SetWalk, RunTo, Stop, Jump, Strafe, TurnTo, Rotate, Shoot, ChangeWeapon, StopShoot.
Managing (State × Action) Growth

Our table:
- States: ({E,F,N}, {E,F,N}, {E,F,N}) = 3^3 = 27
- Actions: ({L1, L2, L3}, …) = 3^3 = 27
- 27 × 27 = 729 entries; generally, 3^#loc × #loc^#bot

Adding health, discretized (high, med, low):
- States: (…, {h,m,l}) = 27 × 3 = 81
- Actions: ({L1, L2, L3, Health}, …) = 4^3 = 64
- 81 × 64 = 5184 entries; generally, 3^(#loc+1) × (#loc+1)^#bot

The number of locations and the size of the team frequently vary.
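The growth arithmetic above can be captured in a short helper, which makes it easy to see how quickly the table explodes as locations or bots are added. The function name and flag are illustrative.

```python
def table_size(n_loc, n_bot, with_health=False):
    """Q-table entries: 3^#loc states x #loc^#bot actions; adding a
    3-valued health feature multiplies states by 3 and gives each bot
    one extra "Health" action."""
    extra = 1 if with_health else 0
    states = 3 ** (n_loc + extra)
    actions = (n_loc + extra) ** n_bot
    return states * actions
```

With 3 locations and 3 bots this gives 729 entries, and 5184 with health, matching the slide; at 5 locations and 5 bots it is already 3^5 × 5^5 = 759,375.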
Empirical Evaluation
Opponents, Performance Curves, Videos
The Competitors

| Team Name | Description |
|---|---|
| HTNBot | HTN planning; discussed previously |
| OpportunisticBot | Bots go from one domination location to the next; if a location is under the control of the opponent's team, the bot captures it |
| PossessiveBot | Each bot is assigned a single domination location that it attempts to capture and hold for the whole game |
| GreedyBot | Attempts to recapture any location taken by the opponent |
| RETALIATE | Reinforcement learning |
Summary of Results

Against the opportunistic, possessive, and greedy control strategies, RETALIATE won all 3 games in the tournament. Within the first half of the first game, RETALIATE developed a competitive strategy.

[Chart: score (0–60) vs. game instances (1–10), over 5 runs of 10 games; curves for RETALIATE and the opportunistic, possessive, and greedy opponents.]
Summary of Results: HTNBots vs. RETALIATE (Round 1)

[Chart: score (−10 to 60) vs. time; curves for RETALIATE, HTNbots, and their difference.]
Summary of Results: HTNBots vs. RETALIATE (Round 2)

[Chart: score (−10 to 60) vs. time; curves for RETALIATE, HTNbots, and their difference.]
Video: Initial Policy
(top-down view: RETALIATE vs. opponent)
http://www.cse.lehigh.edu/~munoz/projects/RETALIATE/BadStrategy.wmv
Video: Learned Policy
(RETALIATE vs. opponent)
http://www.cse.lehigh.edu/~munoz/projects/RETALIATE/GoodStrategy.wmv
Final Remarks
Lessons Learned, Future Work
Final Remarks (1)
From our work with RETALIATE we learned the following lessons, beneficial to any real-world application of RL to these kinds of games:
- Separate individual bot behavior from team strategies.
- Model the problem of learning team tactics with a simple state formulation.
Final Remarks (2)
- It is very hard to predict all strategies beforehand. As a result, RETALIATE was able to find a weakness and exploit it, producing a winning strategy that HTNBots could not counter.
- On the other hand, HTNBots produced winning strategies against the other opponents from the beginning, while in some situations it took RETALIATE half a game.
- Tactics emerging from RETALIATE can be difficult to predict, so a game developer will have a hard time maintaining the game AI.
Thank you!
Questions?