Multiagent (Deep) Reinforcement Learning
Reinforcement learning
- The agent needs to learn to perform tasks in an environment
- It has no prior knowledge about the effects of its actions
- It aims to maximize its utility
Mountain Car problem
- A typical RL toy problem
- The agent (a car) has three actions: left, right, none
- Goal: get up the mountain (to the yellow flag)
- The engine is weak: the car cannot simply drive up to the right; it needs to gain speed by going downhill first (a minimal environment sketch follows below)
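A minimal random-agent sketch of this environment, assuming the classic (pre-0.26) OpenAI Gym API; Gym is listed in the references, and `MountainCar-v0` is its id for this task:

```python
import gym

# Mountain Car: observations are (position, velocity); actions are
# 0 = push left, 1 = no push, 2 = push right.
env = gym.make("MountainCar-v0")

obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # random policy; a learning agent replaces this
    obs, reward, done, info = env.step(action)
    total_reward += reward              # -1 per step until the flag is reached
print("episode return:", total_reward)
```

A random policy almost never reaches the flag; the episode simply ends at the step limit, which is why learning is needed at all.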
Reinforcement learning
Formally defined using a Markov decision process (MDP) $(S, A, R, p)$:
- $S_t \in S$: state space
- $A_t \in A$: action space
- $R_t \in R$: reward space
- $p(s', r \mid s, a)$: the probability that performing action $a$ in state $s$ leads to state $s'$ and gives reward $r$

The agent's goal is to maximize the discounted return
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = R_{t+1} + \gamma G_{t+1}$$
(a small sketch computing these returns follows below).

The agent learns its policy $\pi(A_t = a \mid S_t = s)$, which gives the probability of using action $a$ in state $s$.
- State value function: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- Action value function: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
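The recursion $G_t = R_{t+1} + \gamma G_{t+1}$ translates directly into a backwards pass over a recorded reward sequence; a minimal illustrative sketch:

```python
# Discounted returns computed backwards via G_t = R_{t+1} + gamma * G_{t+1};
# rewards[t] holds R_{t+1}, the reward received after step t.
def discounted_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))
# approx. [2.62, 1.8, 2.0]: e.g. G_0 = 1 + 0.9 * (0 + 0.9 * 2) = 2.62
```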
Q-Learning
Learns the $Q$ function directly using the Bellman equation:
$$Q(S_t, A_t) \leftarrow (1 - \alpha)\, Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \max_a Q(S_{t+1}, a) \right)$$
- During learning a sampling policy is used (e.g. the $\epsilon$-greedy policy: use a random action with probability $\epsilon$, otherwise choose the best action)
- Traditionally, $Q$ is represented as a (sparse) matrix
- Problems: in many tasks the state space (or action space) is continuous, so some kind of discretization must be performed; learning can also be unstable

A tabular sketch follows after this slide.

Watkins, Christopher J. C. H., and Peter Dayan. "Q-Learning." Machine Learning 8, no. 3-4 (1992): 279-292.
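A minimal tabular Q-learning sketch with $\epsilon$-greedy sampling; the `env` interface (`reset()`/`step()` returning a 3-tuple, discrete states, an `n_actions` attribute) is an assumption for illustration:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning; Q is a dict of per-state action-value lists,
    which stays sparse: only visited states get an entry."""
    Q = defaultdict(lambda: [0.0] * env.n_actions)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy sampling policy
            if random.random() < eps:
                a = random.randrange(env.n_actions)
            else:
                a = max(range(env.n_actions), key=lambda x: Q[s][x])
            s2, r, done = env.step(a)
            # Bellman update: Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
            s = s2
    return Q
```

For Mountain Car, `s` would be a discretized (position, velocity) pair, which is exactly the discretization problem mentioned above.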
Deep Q-Learning
The $Q$ function is represented as a deep neural network.
- Experience replay: previous experience (state, action, new state, reward) is stored in a replay buffer and used for training
- Target network: a separate network that is only rarely updated

Optimizes the loss function
$$L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \right)^2 \right]$$
- $\theta$, $\theta^-$: the parameters of the network and of the target network (a sketch of this loss follows below)

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (February 2015): 529-33. https://doi.org/10.1038/nature14236.
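A minimal sketch of this loss in PyTorch; the framework choice, layer sizes, and batch shapes are assumptions for illustration:

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)   # theta^-: a rarely updated copy
gamma = 0.99

def dqn_loss(s, a, r, s2, done):
    """s, s2: (B, 4) float; a: (B,) long; r, done: (B,) float,
    sampled from the replay buffer."""
    # Q(s, a; theta) for the actions actually taken
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():           # the target uses frozen parameters theta^-
        target = r + gamma * (1 - done) * target_net(s2).max(1).values
    return nn.functional.mse_loss(q, target)

# every N gradient steps:
# target_net.load_state_dict(q_net.state_dict())
```

Holding $\theta^-$ fixed between updates keeps the regression target stable, which together with experience replay is what made DQN trainable.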
Deep Q-Learning
Successfully used to play single-player Atari games.
- Complex input states: the video frames of the game
- Quite simple action space: discrete
- Rewards: changes in the game score
- Better than human-level performance: the human level was measured against an "expert" who played each game for around 20 episodes of at most 5 minutes, after 2 hours of practice per game
Actor-Critic Methods
The actor (policy) is trained using a gradient that depends on a critic (an estimate of a value function).
- The critic is a value function: after each action it checks whether things have gone better or worse than expected
- The evaluation is the TD error $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
- It is used to evaluate the action selected by the actor: if $\delta$ is positive (the outcome was better than expected), the probability of selecting $A_t$ should be strengthened, otherwise lowered

Both actor and critic can be approximated by neural networks (see the sketch below):
- Policy ($\pi(s, a)$) update: $\Delta\theta = \alpha \nabla_\theta \left( \log \pi_\theta(s, a) \right) q(s, a)$
- Value ($q(s, a)$) update: $\Delta w = \beta \left( R(s, a) + \gamma\, q(S_{t+1}, A_{t+1}) - q(S_t, A_t) \right) \nabla_w q(S_t, A_t)$

Works in continuous action spaces.
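A one-step actor-critic sketch in PyTorch for a discrete action space; it uses the TD error $\delta$ from above as the critic signal (the $q$-based variant of the updates works analogously), and the network sizes and shared optimizer are illustrative assumptions:

```python
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)
gamma = 0.99

def update(s, a, r, s2, done):
    """One transition: s, s2 are (4,) float tensors, a an int, r, done floats."""
    v = critic(s).squeeze()
    v2 = critic(s2).squeeze().detach()
    # TD error: did things go better or worse than expected?
    delta = r + gamma * (1 - done) * v2 - v
    logp = torch.log_softmax(actor(s), dim=-1)[a]
    # actor ascends grad log pi(a|s) * delta; critic minimizes delta^2
    loss = -logp * delta.detach() + delta.pow(2)
    opt.zero_grad(); loss.backward(); opt.step()
```

With a Gaussian policy head instead of the softmax, the same update works for continuous actions, which is the main advantage over Q-learning noted above.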
Multiagent Learning
Learning in multi-agent environments is more complex: the agent needs to coordinate with other agents.
- Example: level-based foraging, where the goal is to collect all items as fast as possible
- An item can be collected if the sum of the levels of the collecting agents is greater than the item's level
Goals of Learning
Minimax profile
- For zero-sum games: $(\pi_i, \pi_j)$ is a minimax profile if $U_i(\pi_i, \pi_j) = -U_j(\pi_i, \pi_j)$
- Gives guaranteed utility against a worst-case opponent

Nash equilibrium
- A profile $(\pi_1, \dots, \pi_n)$ is a Nash equilibrium if $\forall i\ \forall \pi_i'\colon U_i(\pi_i', \pi_{-i}) \le U_i(\pi)$
- No agent can improve its utility by unilaterally deviating from the profile (every agent plays a best response to the other agents); a sketch that checks this condition follows below

Correlated equilibrium
- Agents observe signals $x_i$ with a joint distribution $P(x_1, \dots, x_n)$ (e.g. a recommended action)
- A profile $(\pi_1, \dots, \pi_n)$ is a correlated equilibrium if no agent can improve its expected utility by deviating from the recommended actions
- A Nash equilibrium is a special type of correlated equilibrium: one with no correlation
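A small sketch that checks which pure strategy profiles of a two-player matrix game are Nash equilibria; the Prisoner's Dilemma payoffs used here are a textbook illustration, not from the slides:

```python
import numpy as np

# Row player's utilities U1[i, j] and column player's U2[i, j];
# action 0 = cooperate, action 1 = defect.
U1 = np.array([[3, 0], [5, 1]])
U2 = np.array([[3, 5], [0, 1]])

def is_nash(i, j):
    # no unilateral deviation may improve either player's utility
    return U1[i, j] >= U1[:, j].max() and U2[i, j] >= U2[i, :].max()

print([(i, j) for i in range(2) for j in range(2) if is_nash(i, j)])
# -> [(1, 1)]: mutual defection is the unique pure Nash equilibrium
```

Note that $(1, 1)$ is not Pareto-optimal here: the profile $(0, 0)$ gives both players strictly more, which illustrates the gap between equilibria and the optimality notions on the next slide.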
Goals of Learning
Pareto optimum
- A profile $(\pi_1, \dots, \pi_n)$ is Pareto-optimal if there is no other profile $\pi'$ such that $\forall i\colon U_i(\pi') \ge U_i(\pi)$ and $\exists i\colon U_i(\pi') > U_i(\pi)$
- One agent's utility cannot be improved without making another agent worse off

Social welfare and fairness
- The welfare of a profile is the sum of the agents' utilities; its fairness is the product of the utilities
- A profile is welfare- or fairness-optimal if it has the maximum possible welfare/fairness
No-regret
- Given a history $H^t = (a^0, \dots, a^{t-1})$, agent $i$'s regret for not having taken action $a_i$ is
$$R_i(a_i \mid H^t) = \sum_{\tau=0}^{t-1} \left[ u_i(a_i, a_{-i}^\tau) - u_i(a_i^\tau, a_{-i}^\tau) \right]$$
- A policy $\pi_i$ achieves no-regret if $\forall a_i\colon \lim_{t\to\infty} \frac{1}{t} R_i(a_i \mid H^t) \le 0$ (a sketch computing this average regret follows below)
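A small sketch that computes this average regret from a recorded history of joint actions; the matching-pennies payoff matrix and the short history are illustrative assumptions:

```python
import numpy as np

# Our utilities U[i, j] for playing i against opponent action j
# (matching pennies), plus a history of (our action, opponent action).
U = np.array([[1.0, -1.0], [-1.0, 1.0]])
history = [(0, 1), (1, 1), (0, 0)]

def regret(fixed_action):
    # sum_t u_i(a_i, a_-i^t) - u_i(a_i^t, a_-i^t)
    return sum(U[fixed_action, opp] - U[mine, opp] for mine, opp in history)

t = len(history)
for a in range(2):
    print(a, regret(a) / t)  # no-regret learners drive these values to <= 0
```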
Joint Action Learning
Learns $Q$-values for joint actions $a \in A$.
- A joint action of all agents is $a = (a_1, \dots, a_n)$, where $a_i$ is the action of agent $i$
$$Q^{t+1}(a^t, s^t) = (1 - \alpha)\, Q^t(a^t, s^t) + \alpha\, u_i^t$$
- $u_i^t$ is the utility received after the joint action $a^t$

Uses an opponent model to compute the expected utilities of actions:
- $E[a_i] = \sum_{a_{-i}} P(a_{-i})\, Q^{t+1}((a_i, a_{-i}), s^{t+1})$: joint action learning
- $E[a_i] = \sum_{a_{-i}} P(a_{-i} \mid a_i)\, Q^{t+1}((a_i, a_{-i}), s^{t+1})$: conditional joint action learning

The opponent models are estimated from the history as the relative frequencies of the actions played (conditional frequencies in CJAL); actions are then sampled $\epsilon$-greedily (see the sketch below).
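A stateless joint-action-learning sketch for two agents with an empirical opponent model; the three-action setup and the Laplace smoothing are illustrative assumptions (CJAL would condition the frequencies on the agent's own action):

```python
import numpy as np

n_actions = 3
Q = np.zeros((n_actions, n_actions))   # Q[a_i, a_j] over joint actions
opp_counts = np.ones(n_actions)        # smoothed opponent action counts

def expected_utilities():
    p_opp = opp_counts / opp_counts.sum()   # P(a_-i) as relative frequency
    return Q @ p_opp                        # E[a_i] = sum_{a_-i} P(a_-i) Q(a_i, a_-i)

def update(a_i, a_j, utility, alpha=0.1):
    Q[a_i, a_j] = (1 - alpha) * Q[a_i, a_j] + alpha * utility
    opp_counts[a_j] += 1

def act(eps=0.1):
    # epsilon-greedy over expected utilities
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(expected_utilities()))
```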
Policy Hill Climbing
Learns the policy $\pi_i$ directly, by hill-climbing in policy space:
- $\pi_i^{t+1}(s^t, a_i^t) = \pi_i^t(s^t, a_i^t) + \delta$ if $a_i^t$ is the best action according to $Q(s^t, a_i^t)$
- $\pi_i^{t+1}(s^t, a_i^t) = \pi_i^t(s^t, a_i^t) - \frac{\delta}{|A_i| - 1}$ otherwise
- The parameter $\delta$ is adaptive: larger when winning and smaller when losing (a sketch of the update follows below)
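A sketch of the hill-climbing step for a single state; the clip-and-renormalize used to keep $\pi$ a valid distribution is a simplification chosen for illustration:

```python
import numpy as np

def phc_update(pi, q_s, delta):
    """Move probability mass toward the action that is greedy w.r.t. q_s."""
    best = int(np.argmax(q_s))
    n = len(pi)
    pi = pi.copy()
    for a in range(n):
        if a == best:
            pi[a] += delta            # strengthen the greedy action...
        else:
            pi[a] -= delta / (n - 1)  # ...taking mass evenly from the others
    pi = np.clip(pi, 0.0, None)       # keep it a valid distribution
    return pi / pi.sum()

pi = np.array([1/3, 1/3, 1/3])
print(phc_update(pi, np.array([0.0, 1.0, 0.2]), delta=0.1))
# -> approx. [0.283, 0.433, 0.283]
```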
Counterfactual Multi-agent Policy Gradients
Centralized training with decentralized execution (more information is available during training).
- The critic conditions on the current observed state and the actions of all agents
- The actors condition only on their own observed state
- Credit assignment is based on difference rewards: the reward of agent $i$ is roughly the difference between the reward the system received for the joint action $a$ and the reward it would have received had agent $i$ used a default action; this requires assigning default actions to agents
- COMA avoids default actions by marginalizing over all possible actions of agent $i$ (a sketch of this counterfactual baseline follows below)
- Used to train micro-management of units in StarCraft

Foerster, Jakob, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. "Counterfactual Multi-Agent Policy Gradients." ArXiv:1705.08926 [Cs], May 24, 2017. http://arxiv.org/abs/1705.08926.
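A sketch of the COMA-style counterfactual advantage for one agent, holding the other agents' actions fixed; the policy and critic values here are illustrative numbers:

```python
import numpy as np

def coma_advantage(q_values, pi_i, a_i):
    """q_values[a] = Q(s, (a, a_-i)) with the other agents' actions fixed;
    pi_i = agent i's policy in s; a_i = the action agent i actually took."""
    # counterfactual baseline: marginalize agent i's action out under pi_i
    baseline = np.dot(pi_i, q_values)
    return q_values[a_i] - baseline

pi = np.array([0.2, 0.5, 0.3])
q = np.array([1.0, 0.0, 2.0])
print(coma_advantage(q, pi, a_i=2))  # 2.0 - 0.8 = 1.2
```

This advantage replaces the difference-rewards signal in the policy gradient, so no default action ever has to be chosen.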
Ad hoc Teamwork
Typically, the whole team of agents is provided by a single organization/team.
- There is some pre-coordination (communication, coordination, ...)

Ad hoc teamwork
- Teams of agents provided by different organizations need to cooperate
- RoboCup Drop-In Competition: teams are mixed from players of different original teams
- Many algorithms are not suitable for ad hoc teamwork: they need many iterations of the game, while only a limited amount of time is typically available
- Many are designed for self-play (all agents use the same strategy), whereas in ad hoc teamwork there is no control over the other agents
Ad hoc Teamwork
Type-based methods
- Assume a set of possible types of other agents
- Based on the interaction history, compute a belief over the types of the other agents
- Play own actions based on these beliefs (a belief-update sketch follows below)
- Types can additionally be parameterized
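A minimal Bayesian belief update over teammate types; the two hypothetical types and the observed actions are illustrative assumptions:

```python
import numpy as np

def update_belief(belief, type_policies, state, action):
    # P(type | history) ~ P(action | state, type) * P(type | previous history)
    likelihoods = np.array([p(state, action) for p in type_policies])
    belief = belief * likelihoods
    return belief / belief.sum()

# two hypothetical teammate types: "mostly plays action 0" vs. uniform
types = [lambda s, a: 0.9 if a == 0 else 0.1,
         lambda s, a: 0.5]
belief = np.array([0.5, 0.5])
for state, action in [(None, 0), (None, 0)]:   # teammate played action 0 twice
    belief = update_belief(belief, types, state, action)
print(belief)   # mass shifts toward the first type, approx. [0.76, 0.24]
```

The agent then best-responds to the belief-weighted mixture of the type policies.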
Other problems in MAL
Analysis of emergent behaviors
- Typically no new learning algorithms; instead, single-agent learning algorithms are evaluated in multi-agent environments
- Emergent language: agents learn to use a language
- E.g. the signaling game: two agents are shown two images; one of them (the sender) is told which is the target and can send a message (from a fixed vocabulary) to the receiver; both agents receive a positive reward if the receiver identifies the correct image

Learning communication
- Agents can typically exchange vectors of numbers for communication
- Maximization of a shared utility by means of communication in a partially observable environment

Learning cooperation

Agents modelling agents
References and Further Reading
- Foerster, Jakob, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. "Counterfactual Multi-Agent Policy Gradients." ArXiv:1705.08926 [Cs], May 24, 2017. http://arxiv.org/abs/1705.08926.
- Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (February 2015): 529-33. https://doi.org/10.1038/nature14236.
- Albrecht, Stefano, and Peter Stone. "Multiagent Learning - Foundations and Recent Trends." http://www.cs.utexas.edu/~larg/ijcai17_tutorial/
  A nice presentation about general multi-agent learning (slides available).
- OpenAI Gym. https://gym.openai.com/
  Environments for reinforcement learning.
- Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. "Continuous Control with Deep Reinforcement Learning." ArXiv:1509.02971 [Cs, Stat], September 9, 2015. http://arxiv.org/abs/1509.02971.
  An actor-critic method for reinforcement learning with continuous actions.
- Hernandez-Leal, Pablo, Bilal Kartal, and Matthew E. Taylor. "Is Multiagent Deep Reinforcement Learning the Answer or the Question? A Brief Survey." ArXiv:1810.05587 [Cs], October 12, 2018. http://arxiv.org/abs/1810.05587.
  A survey on multiagent deep reinforcement learning.
- Lazaridou, Angeliki, Alexander Peysakhovich, and Marco Baroni. "Multi-Agent Cooperation and the Emergence of (Natural) Language." ArXiv:1612.07182 [Cs], December 21, 2016. http://arxiv.org/abs/1612.07182.
  Emergence of language in multiagent communication.