
Multiagent (Deep) Reinforcement Learning
MARTIN PILÁT (MARTIN.PILAT@MFF.CUNI.CZ)

Reinforcement Learning
The agent needs to learn to perform tasks in an environment

No prior knowledge about the effects of its actions

Maximizes its utility

Mountain Car problem (→)
◦ Typical RL toy problem

◦ Agent (car) has three actions – left, right, none

◦ Goal – get up the mountain (yellow flag)

◦ Weak engine – cannot just go to the right, needs to gain speed by going downhill first

Reinforcement Learning
Formally defined using a Markov Decision Process (MDP) (S, A, R, p)

◦ s_t ∈ S – state space

◦ a_t ∈ A – action space

◦ r_t ∈ R – reward space

◦ p(s′, r | s, a) – probability that performing action a in state s leads to state s′ and gives reward r

Agent's goal: maximize the discounted return G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = R_{t+1} + γ G_{t+1} (see the sketch below)

Agent learns its policy π(A_t = a | S_t = s)
◦ Gives the probability of taking action a in state s

State value function: V_π(s) = E_π[G_t | S_t = s]

Action value function: Q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
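To make the recursive form of the return concrete, here is a tiny Python sketch (illustrative only, not part of the slides) that computes G_t for every step of a finite episode:

```python
# Compute discounted returns G_t = R_{t+1} + gamma * G_{t+1} for a finite episode.
# Illustrative sketch; rewards[t] is taken to be R_{t+1}, the reward received after step t.

def discounted_returns(rewards, gamma=0.99):
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards so that G_t = R_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # ~[0.81, 0.9, 1.0]
```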

Q-Learning
Learns the Q function directly using the Bellman equation

๐‘„ ๐‘ ๐‘ก , ๐‘Ž๐‘ก โ† 1 โˆ’ ๐›ผ ๐‘„ ๐‘ ๐‘ก , ๐‘Ž๐‘ก + ๐›ผ(๐‘Ÿ๐‘ก + ๐›พmax๐‘Ž๐‘„ ๐‘ ๐‘ก+1, ๐‘Ž )

During learning, a sampling policy is used (e.g. the ε-greedy policy – use a random action with probability ε, otherwise choose the best action)

Traditionally, Q is represented as a (sparse) matrix

Problems
◦ In many problems, the state space (or action space) is continuous → must perform some kind of discretization

◦ Can be unstable

Watkins, Christopher J. C. H., and Peter Dayan. "Q-Learning." Machine Learning 8, no. 3–4 (1992): 279–292.
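As an illustration of the update rule above, here is a minimal tabular Q-learning sketch with ε-greedy sampling. It assumes a Gym-style environment reduced to `reset()` and `step(action)` returning `(next_state, reward, done)`, a discrete (or already discretized) state space, and illustrative hyperparameter values:

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning with an eps-greedy sampling policy."""
    Q = defaultdict(float)  # sparse "matrix": maps (state, action) -> value

    def greedy(s):
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # eps-greedy: random action with probability eps, otherwise the best action
            a = random.randrange(n_actions) if random.random() < eps else greedy(s)
            s_next, r, done = env.step(a)  # simplified Gym-like interface (assumption)
            # Q(s,a) <- (1 - alpha) Q(s,a) + alpha (r + gamma max_a' Q(s',a'))
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in range(n_actions))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```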

Deep Q-Learning
The Q function is represented as a deep neural network

Experience replay
◦ Stores previous experience (state, action, new state, reward) in a replay buffer, which is used for training

Target network
◦ A separate network that is updated only rarely

Optimizes the loss function (see the sketch below)

L(θ) = E[(r + γ max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ))²]

◦ θ, θ⁻ – parameters of the network and of the target network

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (February 2015): 529–33. https://doi.org/10.1038/nature14236.
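A rough PyTorch sketch of this loss, assuming `q_net` and `target_net` are networks mapping a batch of states to per-action Q-values and `batch` is a sample from the replay buffer (all names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """L(theta) = E[(r + gamma * max_a' Q(s', a'; theta^-) - Q(s, a; theta))^2]."""
    s, a, r, s_next, done = batch  # tensors sampled from the replay buffer (assumption)

    # Q(s, a; theta) for the actions that were actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Bootstrapped target uses the rarely updated target network, without gradients
    with torch.no_grad():
        max_q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * max_q_next  # done is a 0/1 float tensor

    return F.mse_loss(q_sa, target)
```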

Deep Q-Learning
Successfully used to play single-player Atari games

Complex input states – video of the game

Action space quite simple – discrete

Rewards – changes in the game score

Better than human-level performance
◦ Human-level measured against an "expert" who played the game for around 20 episodes of max. 5 minutes, after 2 hours of practice for each game.

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (February 2015): 529–33. https://doi.org/10.1038/nature14236.

Actor-Critic Methods
The actor (policy) is trained using a gradient that depends on a critic (an estimate of the value function)

The critic is a value function
◦ After each action, it checks whether things went better or worse than expected

◦ The evaluation is the error δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

◦ It is used to evaluate the action selected by the actor
◦ If δ is positive (the outcome was better than expected), the probability of selecting a_t should be strengthened (otherwise lowered)

Both the actor and the critic can be approximated using a NN (see the sketch below)
◦ Policy (π(s, a)) update: Δθ = α ∇_θ(log π_θ(s, a)) q(s, a)

◦ Value (q(s, a)) update: Δw = β (R(s, a) + γ q(s_{t+1}, a_{t+1}) − q(s_t, a_t)) ∇_w q(s_t, a_t)

Works in continuous action spaces
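A schematic one-step actor-critic update in PyTorch, shown for the discrete-action case for brevity. It uses the TD error δ as the critic's evaluation of the chosen action (a common variant of the q(s, a)-weighted update above); `policy_net`, `value_net` and the optimizers are assumed, illustrative modules:

```python
import torch

def actor_critic_step(policy_net, value_net, opt_actor, opt_critic,
                      s, a, r, s_next, done, gamma=0.99):
    """One-step actor-critic update driven by the TD error delta."""
    with torch.no_grad():
        # delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t); done is a 0/1 float tensor
        td_target = r + gamma * value_net(s_next).squeeze(-1) * (1.0 - done)
        delta = td_target - value_net(s).squeeze(-1)

    # Critic: move V(s_t) towards the bootstrapped target (better/worse than expected)
    critic_loss = (td_target - value_net(s).squeeze(-1)).pow(2).mean()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor: strengthen log pi(a|s) when delta > 0, weaken it when delta < 0
    log_prob = torch.distributions.Categorical(logits=policy_net(s)).log_prob(a)
    actor_loss = -(log_prob * delta).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
```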

Multiagent Learning
Learning in multi-agent environments is more complex – the agents need to coordinate with each other

Example – level-based foraging (→)
◦ The goal is to collect all items as fast as possible

◦ Agents can collect an item if the sum of their levels is greater than the item's level

Goals of Learning
Minimax profile

◦ For zero-sum games – (π_i, π_j) is a minimax profile if U_i(π_i, π_j) = −U_j(π_i, π_j)

◦ Guaranteed utility against a worst-case opponent

Nash equilibrium
◦ A profile (π_1, …, π_n) is a Nash equilibrium if ∀i ∀π_i′: U_i(π_i′, π_{−i}) ≤ U_i(π) (see the sketch below)

◦ No agent can improve its utility by unilaterally deviating from the profile (every agent plays a best response to the other agents)

Correlated equilibrium
◦ Agents observe signals x_i with joint distribution ξ(x_1, …, x_n) (e.g. a recommended action)

◦ A profile (π_1, …, π_n) is a correlated equilibrium if no agent can improve its expected utility by deviating from the recommended actions

◦ NE is a special type of CE – one with no correlation
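To make the equilibrium definitions concrete, here is a small sketch that checks the Nash condition (no profitable unilateral deviation) for a pure-strategy profile of a two-player matrix game; the payoff matrices and profiles are made up for illustration:

```python
import numpy as np

def is_pure_nash(payoffs, profile):
    """payoffs[i]: payoff matrix of player i, indexed by (a_0, a_1);
    profile: a pure-strategy profile (a_0, a_1).
    Nash condition: no player gains by unilaterally deviating."""
    for i in (0, 1):
        current = payoffs[i][profile]
        for deviation in range(payoffs[i].shape[i]):
            alternative = list(profile)
            alternative[i] = deviation
            if payoffs[i][tuple(alternative)] > current:
                return False
    return True

# Prisoner's-dilemma-style payoffs (made up): action 1 dominates, so (1, 1) is the NE
p0 = np.array([[3, 0],
               [5, 1]])
p1 = p0.T
print(is_pure_nash((p0, p1), (1, 1)))  # True
print(is_pure_nash((p0, p1), (0, 0)))  # False: either player gains by switching to 1
```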

Goals of Learning
Pareto optimum

◦ A profile (π_1, …, π_n) is Pareto-optimal if there is no other profile π′ such that ∀i: U_i(π′) ≥ U_i(π) and ∃i: U_i(π′) > U_i(π)

◦ One agent cannot be made better off without making another agent worse off

Social Welfare & Fairness
◦ The welfare of a profile is the sum of the agents' utilities; its fairness is the product of their utilities

◦ A profile is welfare- or fairness-optimal if it has the maximum possible welfare/fairness

No-Regret
◦ Given history H^t = (a^0, …, a^{t−1}), agent i's regret for not having taken action a_i is (see the sketch below)

R_i(a_i | H^t) = Σ_{τ=0}^{t−1} [ u_i(a_i, a_{−i}^τ) − u_i(a_i^τ, a_{−i}^τ) ]

◦ Policy π_i achieves no-regret if ∀a_i: lim_{t→∞} (1/t) R_i(a_i | H^t) ≤ 0
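A direct transcription of the regret definition into Python (illustrative; `utility` stands for agent i's utility function u_i and is assumed to be given):

```python
def average_regret(history, a_i, utility):
    """history: list of (own_action, others_actions) pairs, i.e. (a_i^t, a_{-i}^t).
    utility(a, others): agent i's utility u_i for playing a against others (assumed given).
    Returns (1/t) R_i(a_i | H^t); a no-regret policy keeps this <= 0 for every a_i
    as t grows."""
    t = len(history)
    regret = sum(utility(a_i, others) - utility(own, others) for own, others in history)
    return regret / t
```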

Joint Action Learning
Learns Q-values for joint actions a ∈ A

◦ the joint action of all agents is a = (a_1, …, a_n), where a_i is the action of agent i

๐‘„๐‘ก+1 ๐‘Ž๐‘ก, ๐‘ ๐‘ก = 1 โˆ’ ๐›ผ ๐‘„๐‘ก ๐‘Ž๐‘ก, ๐‘ ๐‘ก + ๐›ผ๐‘ข๐‘–

๐‘ก

โ—ฆ ๐‘ข๐‘–๐‘ก - utility received after joint action ๐‘Ž๐‘ก

Uses an opponent model to compute expected utilities of actions (see the sketch below)
◦ E(a_i) = Σ_{a_{−i}} P(a_{−i}) Q^{t+1}((a_i, a_{−i}), s^{t+1}) – joint action learning

◦ E(a_i) = Σ_{a_{−i}} P(a_{−i} | a_i) Q^{t+1}((a_i, a_{−i}), s^{t+1}) – conditional joint action learning

Opponent models are estimated from the history as the relative frequencies of the actions played (conditional frequencies in CJAL)

ε-greedy sampling
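A sketch of the opponent-modelling step, assuming the joint-action values are stored in a dictionary `Q` keyed by `((a_i, a_minus_i), state)` and the opponent model is kept as simple frequency counts (all names illustrative):

```python
def expected_utility_jal(a_i, Q, state, opp_counts):
    """JAL: E(a_i) = sum_{a_-i} P(a_-i) * Q((a_i, a_-i), state).
    opp_counts[a_minus_i]: how often the other agents played a_minus_i."""
    total = sum(opp_counts.values())
    return sum((n / total) * Q[((a_i, a_mi), state)] for a_mi, n in opp_counts.items())

def expected_utility_cjal(a_i, Q, state, joint_counts):
    """CJAL: same, but with P(a_-i | a_i) estimated from the joint history.
    joint_counts[(own_action, a_minus_i)]: how often that pair occurred."""
    relevant = {a_mi: n for (own, a_mi), n in joint_counts.items() if own == a_i}
    total = sum(relevant.values())
    if total == 0:
        return 0.0  # no observations yet for this own action
    return sum((n / total) * Q[((a_i, a_mi), state)] for a_mi, n in relevant.items())
```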

Policy Hill Climbing
Learns the policy π_i directly

Hill-climbing in policy space

◦ π_i^{t+1}(s_i^t, a_i^t) = π_i^t(s_i^t, a_i^t) + δ,   if a_i^t is the best action according to Q(s^t, a_i^t)

◦ π_i^{t+1}(s_i^t, a_i^t) = π_i^t(s_i^t, a_i^t) − δ / (|A_i| − 1),   otherwise

Parameter δ is adaptive – smaller when winning and larger when losing
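The hill-climbing step written out directly; a sketch assuming `pi[s]` is a list of action probabilities and `Q[(s, a)]` a dictionary of action values, with a fixed δ and an added clipping/renormalization step so the result stays a valid distribution (a simplification of the full algorithm):

```python
def phc_update(pi, Q, s, n_actions, delta=0.01):
    """Policy hill climbing: shift probability mass towards the greedy action."""
    best = max(range(n_actions), key=lambda a: Q[(s, a)])
    for a in range(n_actions):
        if a == best:
            pi[s][a] = min(1.0, pi[s][a] + delta)
        else:
            pi[s][a] = max(0.0, pi[s][a] - delta / (n_actions - 1))
    # Clipping can break normalization, so renormalize to keep a valid distribution
    total = sum(pi[s])
    pi[s] = [p / total for p in pi[s]]
```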

Counterfactual Multi-agent Policy Gradients
Centralized training and decentralized execution (more information is available during training)

The critic conditions on the current observed state and the actions of all agents

The actors condition on their own observed states

Credit assignment – based on difference rewards (see the sketch below)
◦ The reward of agent i ~ the difference between the reward received by the system if joint action a was used and the reward received if agent i had used a default action
◦ Requires assigning default actions to agents

◦ COMA – instead marginalizes over all possible actions of agent i

Used to train micro-management of units in StarCraft

Foerster, Jakob, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. "Counterfactual Multi-Agent Policy Gradients." ArXiv:1705.08926 [Cs], May 24, 2017. http://arxiv.org/abs/1705.08926.
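A sketch of COMA's counterfactual baseline: the critic's value for the joint action actually taken, minus the expectation over agent i's own actions with the other agents' actions held fixed. `q_values` and `pi_i` are assumed inputs produced by the centralized critic and agent i's policy:

```python
import numpy as np

def coma_advantage(q_values, pi_i, taken_action):
    """Counterfactual advantage for agent i:
    A_i = Q(s, (a_i, a_-i)) - sum_{a'} pi_i(a') * Q(s, (a', a_-i)).
    q_values[a']: centralized critic's value when agent i plays a' and the other
                  agents keep the actions they actually took.
    pi_i[a']:     agent i's current probability of playing a'."""
    q_values = np.asarray(q_values, dtype=float)
    pi_i = np.asarray(pi_i, dtype=float)
    baseline = float(np.dot(pi_i, q_values))  # marginalizes out agent i's own action
    return float(q_values[taken_action]) - baseline
```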


Ad hoc Teamwork
Typically, the whole team of agents is provided by a single organization/team.

◦ There is some pre-coordination (communication, coordination, …)

Ad hoc teamwork
◦ Teams of agents provided by different organizations need to cooperate

◦ RoboCup Drop-In Competition – mixed players from different teams

◦ Many algorithms are not suitable for ad hoc teamwork
◦ They need many iterations of the game – typically only a limited amount of time is available

◦ They are designed for self-play (all agents use the same strategy) – there is no control over the other agents in ad hoc teamwork

Ad hoc Teamwork
Type-based methods

◦ Assume different types of agents

◦ Based on the interaction history, compute a belief over the types of the other agents

◦ Play own actions based on the beliefs

◦ Types can also be extended with parameters

Other problems in MAL
Analysis of emergent behaviors

◦ Typically no new learning algorithms; instead, single-agent learning algorithms are evaluated in a multi-agent environment

◦ Emergent language
◦ Agents learn to use some language

◦ E.g. the signaling game – two agents are shown two images; one of them (the sender) is told which image is the target and can send a message (from a fixed vocabulary) to the receiver; both agents receive a positive reward if the receiver identifies the correct image

Learning communication
◦ Agents can typically exchange vectors of numbers for communication

◦ Maximization of a shared utility by means of communication in a partially observable environment

Learning cooperation

Agent modelling agents

References and Further Reading
◦ Foerster, Jakob, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. "Counterfactual Multi-Agent Policy Gradients." ArXiv:1705.08926 [Cs], May 24, 2017. http://arxiv.org/abs/1705.08926.

◦ Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. "Human-Level Control through Deep Reinforcement Learning." Nature 518, no. 7540 (February 2015): 529–33. https://doi.org/10.1038/nature14236.

◦ Albrecht, Stefano, and Peter Stone. "Multiagent Learning – Foundations and Recent Trends." http://www.cs.utexas.edu/~larg/ijcai17_tutorial/

◦ A nice presentation about general multi-agent learning (slides available)

◦ OpenAI Gym. https://gym.openai.com/

◦ Environments for reinforcement learning

◦ Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. "Continuous Control with Deep Reinforcement Learning." ArXiv:1509.02971 [Cs, Stat], September 9, 2015. http://arxiv.org/abs/1509.02971.

◦ An actor-critic method for reinforcement learning with continuous actions

◦ Hernandez-Leal, Pablo, Bilal Kartal, and Matthew E. Taylor. "Is Multiagent Deep Reinforcement Learning the Answer or the Question? A Brief Survey." ArXiv:1810.05587 [Cs], October 12, 2018. http://arxiv.org/abs/1810.05587.

◦ A survey on multiagent deep reinforcement learning

◦ Lazaridou, Angeliki, Alexander Peysakhovich, and Marco Baroni. "Multi-Agent Cooperation and the Emergence of (Natural) Language." ArXiv:1612.07182 [Cs], December 21, 2016. http://arxiv.org/abs/1612.07182.

◦ Emergence of language in multiagent communication
