MULTI-AGENT REINFORCEMENT LEARNING
Sparse Interactions
REINFORCEMENT LEARNING
• Agent acting in an unknown environment, learning to maximise a numerical reward signal
[Figure: agent-environment loop. The agent sends action a(t) to the environment and receives state s(t+1) and reward r(t+1), producing the sequence s(t), a(t), r(t+1), s(t+1), a(t+1), r(t+2), s(t+2), a(t+2), ...]
MARKOV DECISION PROCESS
• SINGLE AGENT!!!
• States $S$ = set of states of the agent
• Actions $A$ = set of actions the agent can take
• Transition function $T : S \times A \times S \to [0, 1]$
• Reward function $R : S \times A \times S \to \mathbb{R}$
$$M = \langle S, A, T, R \rangle$$
Q-LEARNING
• Model-free reinforcement learning algorithm
• Stores Q-values for every state-action pair
• Update rule:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r_t + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
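The Q-learning update rule can be sketched in a few lines of Python. This is a minimal tabular illustration, not the code behind the slides: the function name `q_update`, the action labels, the state names and the values of the learning rate `alpha` and discount factor `gamma` are all assumptions made for the example.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new Q(s, a)."""
    # Value of the best action in the next state (the max in the update rule).
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    # Move Q(s, a) a fraction alpha towards the bootstrapped target.
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


actions = ["N", "E", "S", "W"]   # hypothetical action set for a grid world
Q = defaultdict(float)           # unseen state-action pairs start at 0.0
Q[("s1", "E")] = 2.0             # pretend the next state already has some value

new_value = q_update(Q, "s0", "E", r=1.0, s_next="s1", actions=actions)
```

With these numbers the target is 1.0 + 0.9 × 2.0 = 2.8, so Q(s0, E) moves from 0 to roughly 0.28.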
SIMPLE EXAMPLE
RL WITH BOLTZMANN EXPLORATION
[Figure: two plots of steps to goal vs. episodes for Agent 1 and Agent 2: one over 1000 episodes (y-axis 0 to 100) and one over 5000 episodes (y-axis 0 to 4000).]
RL WITH ε-GREEDY (ε = 0.9)
[Figure: two plots of steps to goal vs. episodes for Agent 1 and Agent 2: one over 1000 episodes and one over 5000 episodes (y-axis 0 to 100 in both).]
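The two exploration strategies compared in these slides can be sketched as follows. This is an illustrative version only: the function names and the `temperature` parameter are assumptions, and ε is treated here as the probability of a random action, which is a common convention; the slides do not say which convention ε = 0.9 follows.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; a higher temperature gives a more uniform distribution."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann(q_values, temperature, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

Boltzmann exploration prefers actions with higher Q-values even while exploring, whereas ε-greedy explores uniformly over all actions.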
MULTI-AGENT REINFORCEMENT LEARNING
• Agents influence each other
• Possibly conflicting interests
• Observations
• Expensive communication
[Figure: n agents interacting with one environment. The individual actions a1, a2, ..., an form the joint action a(t); the environment returns the joint state s(t+1) to every agent and an individual reward r1(t+1), r2(t+1), ..., rn(t+1) to each agent.]
MARKOV GAMES
• $n$: the number of agents
• $S = \{s_1, \ldots, s_N\}$: a finite set of states
• $A = A_1, \ldots, A_n$, with $A_k$ the action set of agent $k$
• $T : S \times A_1 \times \ldots \times A_n \times S \to [0, 1]$: the transition function
• $R_k : S \times A_1 \times \ldots \times A_n \times S \to \mathbb{R}$: the reward function of agent $k$
[Figure: timeline of joint transitions. In state s(t) the agents take actions a1(t), a2(t), ..., an(t), receive rewards r1(t+1), r2(t+1), ..., rn(t+1) and move to s(t+1); the same pattern continues to s(t+2), ...]
SPARSE INTERACTIONS
• 1 agent: transitions & rewards are only dependent on 1 agent
• 2 agents, far away and not interacting with each other: transitions & rewards are independent of state/action of other agents
• 2 agents, close to each other and interacting!!! i.e. transitions & rewards are dependent
[Figure: three grid worlds: one with a single goal G, and two with goals G1 and G2.]
SPARSE INTERACTIONS
• 2 agents, close to each other and interacting!!! i.e. transitions & rewards are dependent
Assumptions:
• Agents can do something useful alone
• Interactions are sparse
• e.g. air traffic control, automated warehouses, ...
TAXONOMY BASED ON STRATEGIC INTERACTIONS
[Table: columns Local state, Joint state; rows Independent actions, Joint action (view or selection). Entries: Single agent RL; Nash-Q, CE-Q, ...; SuperAgent; JAL; MMDP-ILA (Vrancx et al. 2008); MG-ILA (Vrancx et al. 2008).]
• State and actions must be communicated among agents
• State-action space is exponential in the number of agents
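To make the exponential growth of the state-action space concrete, the sketch below counts table entries for joint learners. The helper name `joint_table_size` is hypothetical; the inputs (9 or 43 local states, 4 actions per agent, 2 agents) are the environment sizes reported in the results tables later in these slides.

```python
def joint_table_size(n_local_states, n_local_actions, n_agents, joint_actions=True):
    """Return (#states, #actions, #Q-table entries) for a learner using the joint state."""
    # Every combination of the agents' local states is a distinct joint state.
    states = n_local_states ** n_agents
    # Joint-action learners also enumerate every combination of actions.
    actions = n_local_actions ** n_agents if joint_actions else n_local_actions
    return states, actions, states * actions


grid_game = joint_table_size(9, 4, 2)     # 81 joint states, 16 joint actions
larger_env = joint_table_size(43, 4, 2)   # 1849 joint states, 16 joint actions
```

An independent learner in the same grid game keeps only 9 × 4 = 36 entries, and each extra agent multiplies the joint counts again by the local sizes.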
TAXONOMY BASED ON STRATEGIC INTERACTIONS
[Table: columns Local state, Joint state; rows Independent actions, Joint action (view or selection). Entries: Single agent RL; Nash-Q, CE-Q, ...; SuperAgent; JAL; MMDP-ILA (Vrancx et al. 2008); MG-ILA (Vrancx et al. 2008).]
• Utile Coordination (Kok et al. 2005)
• Learning of Coordination (Melo et al. 2009)
• 2Observe (De Hauwere et al. 2009)
• CQ-Learning (De Hauwere et al. 2010)
• FCQ-Learning (De Hauwere et al. 2011)
INTUITION OF SPARSE INTERACTIONS
When should agents observe the state information of other agents to avoid coordination problems?
Can another agent influence me? (Is there influence from another agent?)
• No: act independently, as if single-agent.
• Yes: use a multi-agent technique to coordinate.
[Figure: grid worlds with goals G, G1 and G2.]
MODELING INTERACTIONS
• Dynamics of the system are a Markov game
• Model sparse interactions as a DEC-SIMDP (Melo et al., 2010)
$$\Gamma = \langle M_k, (M^{I,l}, S^{I,l}) \rangle$$
• $M_k$: MDP for each agent k in the absence of other agents (containing local states)
• $(M^{I,l}, S^{I,l})$: team Markov game for the local interaction between K agents in L interaction states (containing system states)
OUTLINE
• Learning of Coordination
• 2Observe
• CQ-Learning
• FCQ-Learning
• Transfer learning
Learning of Coordination
LEARNING OF COORDINATION
• Add Pseudo COORDINATE action
• External Active Perception
• Cost for coordination
THE ALGORITHM
RESULTS
2Observe
PROBLEM SETTING
• Learn when to act upon sensory input
• Adaptive obstacle avoidance
• Save energy
INTERACTIONS AS A FUNCTION
• State space contains sensor data
• Sensor information is only partly relevant
• Interaction area is relative to the agent
• Special kind of sparse interactions, modeled as a DEC-LIMDP (Section 4.2)
• $I_k : S_k \to S_1 \times \ldots \times S_M$
• Approximating this function using a generalized learning automaton: 2Observe
SOLUTION METHOD: 2OBSERVE
Can another agent influence me? (GLA approximating the interaction function)
• No: act independently, as if single-agent. Single agent Q-learning selecting actions based on local state information.
• Yes: use a multi-agent technique to coordinate. Communication protocol between the agents to avoid a collision in the next timestep.
EXPERIMENTAL SETTING
• Reach goal
• Avoid collisions
EXPERIMENTAL RESULTS (TUNNELTOGOAL)
[Figure: three plots over 10000 episodes. Steps to goal (y-axis 0 to 50) and collisions (y-axis 0 to 8) for Independent Q-learning, Joint-state learners, MMDP and 2Observe; and the number of 2Observe coordinations (y-axis 0 to 8).]
EXPERIMENTAL RESULTS (2) (TUNNELTOGOAL)
• Interactions are relative to the agent
• GLA can approximate this interaction area
CQ-Learning
PROBLEM SETTING
• Agents only interact where their policies interfere
• Locally adapt policy
[Figure: grid worlds with goals G, G1 and G2.]
REPRESENTATION IDEA
[Figure: local state space with states 1 to 9; states 4 and 6 are expanded into 4-1, 4-2, 4-3 and 6-1, 6-2. Arrows are labelled Expand and Generalise.]
SOLUTION METHOD: CQ-LEARNING
Can another agent influence me? (Statistical test on the rewards)
• No: act independently, as if single-agent. Single agent Q-learning selecting actions based on local state information.
• Yes: use a multi-agent technique to coordinate. Q-learning, based on the combination of local state information and the state information of another agent.
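The two branches of this solution method can be pictured as a lookup over two Q-tables: a local one, and an augmented one that only holds the situations flagged by the statistical test. The sketch below is an assumption-laden illustration; the table layout, the function names and the Q-values are hypothetical and not taken from the slides.

```python
def select_q_row(local_q, augmented_q, local_state, other_state):
    """Return the Q-values to act on for the current situation."""
    key = (local_state, other_state)
    if key in augmented_q:
        # Flagged situation: use Q-values conditioned on the other agent's state.
        return augmented_q[key]
    # Otherwise act as if alone, using only local state information.
    return local_q[local_state]


def greedy_action(q_row):
    """Pick the action with the highest Q-value in a row."""
    return max(q_row, key=q_row.get)


# Hypothetical values: moving East from state 4 is good alone,
# but bad when the other agent is in its state 3.
local_q = {4: {"E": 1.0, "W": 0.2}}
augmented_q = {(4, 3): {"E": -5.0, "W": 0.5}}

alone = greedy_action(select_q_row(local_q, augmented_q, 4, 1))
conflict = greedy_action(select_q_row(local_q, augmented_q, 4, 3))
```

Only flagged pairs get their own row, so the table stays close to the single-agent size instead of growing to the full joint state space.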
CQ-LEARNING : STATISTICAL TESTS
• Agents have been learning alone in the environment
• Agent k acts independently using only local state information ($s^k$) in a multi-agent environment
• Perform statistical test against baseline
• Samples its rewards, based on the state information of other agents & performs the same test

| Local state | Sampled rewards | Expected reward |
| --- | --- | --- |
| $s^k_1$ | 11.0, 10.0, 9.0, 10.0, 11.0, ... | 10.0 |
| $s^k_2$ | 20.0, 20.0, 20.0, 19.0, 20.0, ... | 20.0 |
| $s^k_3$ | 15.0, 15.0, 14.8, 15.0, 14.9, ... | 15.0 |
| $s^k_4$ | 10.0, 19.0, 9.0, 20.0, 20.0, ... | 20.0 |

Rewards sampled in $s^k_4$, split by the state of the other agent:

| State of agent l | Sampled rewards |
| --- | --- |
| $s^l_1$ | 20.0, 19.0, 20.0, ... |
| $s^l_2$ | 20.0, 20.0, 20.0, ... |
| $s^l_3$ | 10.0, 9.0, 10.0, ... |

Expand: $s^k_4 \Rightarrow \langle s^k_4, s^l_3 \rangle$
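The second step on this slide (sampling rewards per state of the other agent and repeating the test) can be sketched as below. The slides do not name the statistical test, so this example assumes Welch's t statistic with a fixed threshold; the baseline window and the threshold of 3.0 are hypothetical, while the per-state reward samples are the ones shown on the slide.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = mean(a) - mean(b)
    if se == 0.0:
        # Both samples are constant: identical means give 0, otherwise +/- infinity.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def states_to_add(baseline, rewards_by_other_state, threshold=3.0):
    """Return the other-agent states whose rewards differ from the baseline."""
    return [state for state, rewards in rewards_by_other_state.items()
            if abs(welch_t(baseline, rewards)) > threshold]


# Hypothetical single-agent baseline window for s^k_4 (expected reward 20.0).
baseline = [20.0, 20.0, 19.0, 20.0, 20.0]
# Rewards observed in s^k_4, grouped by the other agent's state (from the slide).
rewards_by_other_state = {
    "sl1": [20.0, 19.0, 20.0],
    "sl2": [20.0, 20.0, 20.0],
    "sl3": [10.0, 9.0, 10.0],
}

flagged = states_to_add(baseline, rewards_by_other_state)
```

Only `sl3` is flagged, which matches the expansion of the local state into the pair with the other agent's third state. A real implementation would more likely compare a p-value against a significance level than use a fixed threshold on the statistic.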
CQ-LEARNING BASELINE FOR STATISTICAL TESTS
• Initial rewards (sliding window) for a particular state-action pair: $W_k^1$
• Compare $W_k^1$ against $W_k^2$
[Figure: steps to goal vs. episodes (0 to 1000, y-axis 0 to 100) for Agent 1 and Agent 2, annotated with the windows $W_k^1$ and $W_k^2$.]
EXPERIMENTAL RESULTS (1)

Grid game 2 (min steps: 3)

| Algorithm | #states | #actions | #collisions | #steps |
| --- | --- | --- | --- | --- |
| Indep | 9 | 4 | 2.7 | 22.2 ± 17.9 |
| JS | 81 | 4 | 0.1 | 4.0 ± 0.2 |
| JSA | 81 | 16 | 0.0 | 4.7 ± 0.1 |
| LOC | 9.9 ± 0.5 | 5 | 0.1 | 4.0 ± 0.4 |
| CQ | 10 ± 0.0 | 4 | 0.0 | 3.6 ± 0.3 |
| CQ NI | 10.9 ± 2.0 | 4 | 0.1 | 4.0 ± 0.3 |

ISR (min steps: 4)

| Algorithm | #states | #actions | #collisions | #steps |
| --- | --- | --- | --- | --- |
| Indep | 43 | 4 | 0.4 | 9.3 ± 44.8 |
| JS | 1849 | 4 | 0.1 | 5.7 ± 1.6 |
| JSA | 1849 | 16 | 0.0 | 7.6 ± 1.4 |
| LOC | 51.3 ± 82.3 | 5 | 0.2 | 6.7 ± 7.5 |
| CQ | 49.0 ± 2.3 | 4 | 0.1 | 5.1 ± 0.7 |
| CQ NI | 49.9 ± 7.8 | 4 | 0.1 | 6.0 ± 1.9 |
EXPERIMENTAL RESULTS (2)
• Sample run
FCQ-Learning
PROBLEM SETTING
• Reflected in immediate reward signal
• Too late to solve the problem
[Figure: agents 1 and 2 shown in two orderings (1 2 and 2 1), labelled Reward: +20 and Reward: +10.]
DETECTING RELEVANT STATES
• Changes in reward signal are reflected in the Q-values
[Figure: plot over 500 episodes (x-axis: Episodes, y-axis range 0 to 20) with one curve per legend entry: (12,4), (13,4), (14,4), (15,4), (10,3), (9,2), (8,2), (7,2), (6,2), (1,3).]
FCQ-LEARNING STATISTICAL TESTS
• Agent k has been learning alone, and its Q-values have converged
• Agent k acts independently using only local state information ($s^k$) in a multi-agent environment
• Performs statistical test against the single agent Q-values
• Samples rewards using Monte Carlo and performs a comparison test to determine what information should be included

| Local state | Sampled values | Learned Q-value |
| --- | --- | --- |
| $s^k_1$ | 11.1, 10.9, 11.0, 11.1, 11.0, ... | 11.0 |
| $s^k_2$ | 20.0, 19.9, 19.9, 20.0, 20.0, ... | 20.0 |
| $s^k_3$ | 15.0, 15.0, 14.8, 15.0, 14.9, ... | 15.0 |
| $s^k_4$ | 20.0, 18.8, 17.4, 16.1, 15.9, ... | 20.0 |

Values sampled in $s^k_4$, split by the state of the other agent:

| State of agent l | Sampled values |
| --- | --- |
| $s^l_1$ | 20.0, 19.0, 20.0, ... |
| $s^l_2$ | 20.0, 20.0, 20.0, ... |
| $s^l_3$ | 10.0, 9.0, 10.0, ... |

Expand: $s^k_4 \Rightarrow \langle s^k_4, s^l_3 \rangle$
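The first test on this slide compares freshly sampled values in each local state against the converged single-agent Q-value. The slides do not say which test is used, so the sketch below assumes a one-sample Student t-test at the 5% level purely for illustration; the dictionary layout and function name are hypothetical, and the numbers are the five samples per state listed on the slide.

```python
import math
from statistics import mean, stdev


def one_sample_t(samples, reference):
    """One-sample t statistic of the samples against a fixed reference value."""
    se = stdev(samples) / math.sqrt(len(samples))
    diff = mean(samples) - reference
    if se == 0.0:
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


# Two-sided 5% critical value of Student's t with 4 degrees of freedom (5 samples).
T_CRIT = 2.776

learned_q = {"sk1": 11.0, "sk2": 20.0, "sk3": 15.0, "sk4": 20.0}
sampled = {
    "sk1": [11.1, 10.9, 11.0, 11.1, 11.0],
    "sk2": [20.0, 19.9, 19.9, 20.0, 20.0],
    "sk3": [15.0, 15.0, 14.8, 15.0, 14.9],
    "sk4": [20.0, 18.8, 17.4, 16.1, 15.9],
}

changed = [s for s in learned_q
           if abs(one_sample_t(sampled[s], learned_q[s])) > T_CRIT]
```

Under these assumptions only `sk4` deviates significantly from its learned Q-value, so it is the state whose samples would then be split by the other agent's state.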
EXPERIMENTAL RESULTS

Grid game 2

| Algorithm | #states | #actions | #collisions | #steps | reward |
| --- | --- | --- | --- | --- | --- |
| Indep | 9 | 4 | 2.4 ± 0.0 | 22.7 ± 30.4 | −24.3 ± 35.6 |
| JS | 81 | 4 | 0.1 ± 0.0 | 6.3 ± 0.3 | 18.2 ± 0.6 |
| LOC | 9.0 ± 0.0 | 5 | 1.8 ± 0.0 | 10.3 ± 2.7 | −6.8 ± 8.0 |
| FCQ | 19.4 ± 4.4 | 4 | 0.1 ± 0.0 | 8.1 ± 13.9 | 17.6 ± 3.7 |
| FCQ NI | 21.7 ± 3.1 | 4 | 0.1 ± 0.0 | 7.1 ± 6.9 | 17.9 ± 0.7 |

Bottleneck

| Algorithm | #states | #actions | #collisions | #steps | reward |
| --- | --- | --- | --- | --- | --- |
| Indep | 43 | 4 | n.a. | n.a. | n.a. |
| JS | 1849 | 4 | 0.0 ± 0.0 | 23.3 ± 30.8 | 13.1 ± 36.1 |
| LOC | 54.0 ± 0.8 | 5 | 1.7 ± 0.6 | 167.2 ± 19,345.1 | −157.5 ± 10,327.0 |
| FCQ | 124.5 ± 32.8 | 4 | 0.1 ± 0.0 | 17.3 ± 1.3 | 16.6 ± 0.4 |
| FCQ NI | 135.0 ± 88.7 | 4 | 0.2 ± 0.0 | 19.2 ± 5.6 | 15.4 ± 2.3 |
EXPERIMENTAL RESULTS
• Order to reach the goal:
  • Red Agent
  • Blue Agent
  • Green Agent
[Figure: environment with three +20 reward labels.]
Transfer Learning
TRANSFER LEARNING
“Transfer of learning occurs when learning in one context enhances (positive transfer) or undermines (negative transfer) a related performance in another context.”
(D. Perkins, G. Salomon, Transfer of Learning, 1992, International Encyclopedia of Education)
MOTIVATIONS FOR TRANSFER LEARNING
• Learning tabula rasa can be extremely slow
• Lots of data / time may be needed
• Every algorithm has biases: why use an uninformed bias?
• Humans always use past knowledge
• What knowledge is relevant?
• How can it be effectively leveraged?
TRANSFER LEARNING WITH 2OBSERVE
[Figure: the 2Observe algorithm shown for a source agent and a target agent. In each, a generalized learning automaton leads to single agent Q-learning when coordination is not needed and to coordination through communication when coordination is needed.]
Can another agent influence me? (GLA approximating the interaction function)
• No: act independently, as if single-agent. Single agent Q-learning selecting actions based on local state information.
• Yes: use a multi-agent technique to coordinate. Communication protocol between the agents to avoid a collision in the next timestep.
RESULTS
[Figure: three plots of # steps to goal vs. iterations (0 to 10000) for Agent 1, Agent 2 and Agent 3.]
RESULTS (COORDINATION)
[Figure: three plots of # collisions/coordinations vs. iterations (0 to 10000), each with one curve for Collisions and one for Coordinations.]
GENERALISATION WITH CQ-LEARNING
[Figure: neural network with labels Δ(x), Δ(y), a1, a2 and 0 | 1, next to the local state space (states 1 to 9, with states 4 and 6 expanded into 4-1, 4-2, 4-3 and 6-1, 6-2; arrows labelled Expand and Generalise). Caption: Generalisation learned with 2Observe.]
GENERALISATION WITH CQ-LEARNING
[Figure: neural network with labels Δ(x), Δ(y), a1, a2 and 0 | 1, next to the local state space (states 1 to 9, with states 4 and 6 expanded into 4-1, 4-2, 4-3 and 6-1, 6-2; arrows labelled Expand and Generalise). Caption: Generalisation learned with CQ-learning.]
GENERALISATION WITH CQ-LEARNING (2)
[Figure: two panels, Safe initialisation and Danger initialisation.]
GENERALISATION WITH CQ-LEARNING (2)
[Figure: four panels labelled EAST, WEST, NORTH and SOUTH.]
TRANSFER LEARNING WITH CQ-LEARNING
Source task:
• CQ-learning
• Augment: state space of agent k with state space of agent l (states 1 to 9 each)
• Generalise: rule learning system (Ripper)
• Transfer trained classifier + Qaug-table
Target task:
• Trained classifier
• Coordination is not needed: single agent Q-learning
• Coordination is needed: Q-learning initialised with Qaug-table
TRANSFER LEARNING WITH CQ-LEARNING (2)
RESULTS
[Figure: three plots of steps to goal (y-axis 0 to 300) vs. episodes (0 to 1000), comparing CQ-learning and Transfer learning.]
RESULTS (2)
[Figure: three plots of collisions (y-axis 0.0 to 2.0) vs. episodes (0 to 1000), comparing CQ-learning and Transfer learning.]
CONCLUSIONS
• In multi-agent environments with sparse interactions, learning these interaction states improves the learning process
• Interaction states can be learned through increased penalties for miscoordination
• GLA can approximate interaction areas relative to the agent
• Interaction states can be identified using statistical tests on the reward signal (immediate + future)
• Information about interaction states can be generalized and transferred between agents and environments