multi-agent reinforcement learningcq-learning : statistical tests 30 • agents have been learning...

MULTI-AGENT REINFORCEMENT LEARNING

1

Sparse Interactions

REINFORCEMENT LEARNING• Agent acting in an unknown environment,

learning to maximise a numerical reward signal

a(t)

s(t+1)

r(t+1)

Environment

s(t) s(t+1)a(t)

r(t+1)

s(t+2)

r(t+2)

a(t+1) a(t+2)...

2

MARKOV DECISION PROCESS• SINGLE AGENT!!!

• States = set of states of the agent• Actions = set of actions the agent can take• Transition function• Reward function

M = �S, A, T,R⇥

S

A

T : S �A⇥ S

R : S ⇥A⇥ S ! R

3

Q-LEARNING• model-free, reinforcement learning algorithm• Stores Q-values for every state-action pair• Update rule:

Q(s, a) = Q(s, a) + �

�rt + ⇥argmax

a�Q(s�, a�)�Q(s, a)

⇥

4

SIMPLE EXAMPLE

5

RL WITH BOLTZMANN EXPLORATION

0 200 400 600 800 1000

020

4060

8010

0

episodes

step

s to

goa

l

Agent 1Agent 2

0 1000 2000 3000 4000 5000

010

0020

0030

0040

00

episodes

step

s to

goa

l

Agent 1Agent 2

6

RL WITH Ε-GREEDY (Ε = 0.9)

0 200 400 600 800 1000

020

4060

8010

0

episodes

step

s to

goa

l

Agent 1Agent 2

0 1000 2000 3000 4000 5000

020

4060

8010

0

episodes

step

s to

goa

l

Agent 1Agent 2

7

MULTI-AGENT REINFORCEMENT LEARNING

• Agents influence each other• Possibly conflicting interests

Environment

a1

a2 joint action a(t)

...an

r1(t+1)

s(t+1)

s(t+1)

s(t+1)

r2(t+1)

rn(t+1)

joint state s(t+1)reward r(t+1)

8

• Observations• Expensive communication

MARKOV GAMES

• the number of agents• a finite set of states• with Ak the action set of agent k• the transition function• the reward function of agent k

s(t) s(t+1)

a1(t)a2(t)

...an(t)

r1(t+1)r2(t+1)

...rn(t+1)

...s(t+2)

r1(t+2)r2(t+2)

...rn(t+2)

a1(t+1)a2(t+1)

...an(t+1)

a1(t+2)a2(t+2)

...an(t+2)

9

n

S = s1, . . . , sN

A = A1, . . . , AN

T = S ⇥A1 ⇥ . . .⇥AN ⇥ S ! [0, 1]

Rk = S ⇥A1 ⇥ . . .⇥AN ⇥ S ! R

SPARSE INTERACTIONS1 agentTransitions & rewards are only dependent on 1 agent

2 agentsFar away and not interacting with each other Transitions & rewards are independent of state/action of other agents

2 agentsClose to each other and interacting!!!i.e. transitions & rewards are dependent

G

G2 G1

G2 G1

10

SPARSE INTERACTIONS

2 agentsClose to each other and interacting!!!i.e. transitions & rewards are dependentG2 G1

Assumptions:Agents can do something useful aloneInteractions are sparsef.i. Air traffic control, automated warehouses, ...

10

TAXONOMY BASED ON STRATEGIC INTERACTIONS

Local state Joint state

Independentactions

Joint action(view or selection)

Single agent RL

Nash-Q, CE-Q,...SuperAgent

JAL

MMDP-ILA (Vrancx et al. 2008) MG-ILA (Vrancx et al. 2008)

State and actions must be communicated among agentsState-action space is exponential in the number of agents

11

TAXONOMY BASED ON STRATEGIC INTERACTIONS

Local state Joint state

Independentactions

Joint action(view or selection)

Single agent RL

Nash-Q, CE-Q,...SuperAgent

JAL

MMDP-ILA (Vrancx et al. 2008) MG-ILA (Vrancx et al. 2008)

Utile Coordination (Kok et al. 2005)

Learning of Coordination (Melo et al. 2009)2Observe (De Hauwere et al. 2009)

CQ-Learning (De Hauwere et al. 2010)FCQ-Learning (De Hauwere et al. 2011)

11

INTUITION OF SPARSE INTERACTIONS

When should agents observe the state information of other agents to avoid coordination problems?

Can another agent influence me?

Act independently, as if single-agent. Use a multi-agent technique to coordinate.

No Yes

G G2 G1 G2 G1

12

Is there influence from another agent?Is there influence from another agent?

MODELING INTERACTIONS• Dynamics of the system are a Markov game• Model sparse interactions as a DEC-SIMDP (Melo

et al., 2010)� =

�Mk, (M I,l, SI,l)

�

{MDP for each agent k in the

absence of other agents (containing local states)

G G2 G1

{

Team Markov game for the local interaction between K agents in L

interaction states (containing system states)

G2 G1

13

OUTLINE

Learning of Coordination2Observe

CQ-LearningFCQ-Learning

Transfer learning

14

Learning of Coordination

15

LEARNING OF COORDINATION• Add Pseudo COORDINATE action• External Active Perception• Cost for coordination

16

THE ALGORITHM

17

RESULTS

18

2Observe

19

PROBLEM SETTING

• Learn when to act upon sensory input• Adaptive obstacle avoidance• Save energy

20

INTERACTIONS AS A FUNCTION• State space contains sensor data• Sensor information is only partly relevant• Interaction area is relative to the agent• Special kind of sparse interactions,

modeled as a DEC-LIMDP (Section 4.2)• Ik: Sk → S1 x ... x SM• Approximating this function using a generalized

learning automaton: 2Observe

21



No Yes

SOLUTION METHOD: 2OBSERVE

22

GLA approximating theInteraction function

Single agent Q-learning selecting actions based on local state information

Communication protocol between the agents to avoid a collision in the next timestep

EXPERIMENTAL SETTING

• Reach goal• Avoid collisions

23

EXPERIMENTAL RESULTS (TUNNELTOGOAL)

0 2000 4000 6000 8000 10000

010

2030

4050

episodes

step

s to

goa

l

Independent Q−learningJoint−state learnersMMDP2Observe

0 2000 4000 6000 8000 10000

02

46

8

episodes

collis

ions

Independent Q−learningJoint−state learnersMMDP2Observe

0 2000 4000 6000 8000 10000

02

46

8

episodes

coor

dina

tions

2Observe coordinations

24

EXPERIMENTAL RESULTS (2) (TUNNELTOGOAL)

• Interactions are relative to the agent• GLA can approximate this interaction area

25

CQ-Learning

G2 G1

26

PROBLEM SETTING

• Agents only interact where their policies interfere• Locally adapt policy

G G2 G1 G2 G1

27

REPRESENTATION IDEA

Expand

Generalise

32

7 98

5

1

4 6

4-1 4-2 4-3 6-1 6-2

28



No Yes

SOLUTION METHOD: CQ-LEARNING

Statistical test on the rewards

29

Single agent Q-learning selecting actions based on local state information

Q-learning, based on the combination of local state information and the state information of another

agent

CQ-LEARNING : STATISTICAL TESTS

30

• Agents have been learning alone in the environment

• Agent k acts independently using only local state information (sk) in a multi-agent environment

• Perform statistical test against baseline

• Samples its rewards, based on the state information of other agents & performs the same test

sk1 sk2 sk3 sk411.010.09.010.011.0

20.020.020.019.020.0

15.015.014.815.014.9

10.019.09.020.020.0

... ... ... ...

10.0 20.0 15.0 20.0

20.019.020.0

20.020.020.0

10.0

9.010.0

... ... ...

i

sk ⇒ � sk , sl �4 4 3

sk4

sl1 sl

2 sl3

Expand

Expected reward:

30

CQ-LEARNING BASELINE FOR STATISTICAL TESTS

0 200 400 600 800 1000

020

4060

8010

0

episodes

step

s to

goa

l

Agent 1Agent 2

31

CQ-LEARNING BASELINE FOR STATISTICAL TESTS

Initial rewards (sliding window)

for a particular state action pair :

Wk1

Compare Wk against Wk1 2

0 200 400 600 800 1000

020

4060

8010

0

episodes

step

s to

goa

l

Agent 1Agent 2

31

Wk2

EXPERIMENTAL RESULTS (1)Env Alg #states #actions #coll #steps

Grid game 2 Indep 9 4 2.7 22.2± 17.9JS 81 4 0.1 4.0± 0.2

(min steps: 3) JSA 81 16 0.0 4.7± 0.1LOC 9.9± 0.5 5 0.1 4.0± 0.4CQ 10± 0.0 4 0.0 3.6± 0.3

CQ NI 10.9± 2.0 4 0.1 4.0± 0.3

Env Alg #states #actions #coll #steps

ISR Indep 43 4 0.4 9.3± 44.8JS 1849 4 0.1 5.7± 1.6

(min steps: 4) JSA 1849 16 0.0 7.6± 1.4LOC 51.3± 82.3 5 0.2 6.7± 7.5CQ 49.0± 2.3 4 0.1 5.1± 0.7

CQ NI 49.9± 7.8 4 0.1 6.0± 1.9

32

EXPERIMENTAL RESULTS (2)• Sample run

G2 G1

33

FCQ-Learning

G2 G1

34

PROBLEM SETTING

• Reflected in immediate reward signal• Too late to solve the problem

1 2 2 1

Reward: +20 Reward: +10

35

DETECTING RELEVANT STATES

• Changes in reward signal are reflected in the Q-values

0 50 100 150 200 250 300 350 400 450 5000

2

4

6

8

10

12

14

16

18

20

Episodes

��

(12,4)(13,4)(14,4)(15,4)(10,3)(9,2)(8,2)(7,2)(6,2)(1,3)

36

G

FCQ-LEARNING STATISTICAL TESTS

37

sk1 sk2 sk3 sk411.110.911.011.111.0

20.019.919.920.020.0

15.015.014.815.014.9

20.018.817.416.115.9

... ... ... ...11.0 20.0 15.0 20.0

20.019.020.0

20.020.020.0

10.0

9.010.0

... ... ...

sk4

sl1 sl

2 sl3

• Agent k has been learning alone, and its Q-values have converged

• Agent k acts independently using only local state information (sk) in a multi-agent environment

• Performs statistical test against the single agent Q-values

• Samples rewards monte carlo and perform a comparison test to determine what information should be included

sk ⇒ � sk , sl �4 4 3

Expand

Learned Q-value:

EXPERIMENTAL RESULTSEnvironment Algorithm #states #actions #collisions #steps reward

Grid game 2 Indep 9 4 2.4± 0.0 22.7± 30.4 �24.3± 35.6JS 81 4 0.1± 0.0 6.3± 0.3 18.2± 0.6

LOC 9.0± 0.0 5 1.8± 0.0 10.3± 2.7 �6.8± 8.0FCQ 19.4± 4.4 4 0.1± 0.0 8.1± 13.9 17.6± 3.7

FCQ NI 21.7± 3.1 4 0.1± 0.0 7.1± 6.9 17.9± 0.7

Environment Algorithm #states #actions #collisions #steps reward

Bottleneck Indep 43 4 n.a. n.a. n.a.JS 1849 4 0.0± 0.0 23.3± 30.8 13.1± 36.1

LOC 54.0± 0.8 5 1.7± 0.6 167.2± 19, 345.1 �157.5± 10, 327.0FCQ 124.5± 32.8 4 0.1± 0.0 17.3± 1.3 16.6± 0.4

FCQ NI 135.0± 88.7 4 0.2± 0.0 19.2± 5.6 15.4± 2.3

38

EXPERIMENTAL RESULTS

• Order to reach the goal:• Red Agent• Blue Agent• Green Agent

+20+20+20

39

Transfer LearningGeneralized learning

automaton

Single agent Q-learning Coordination through communication

Coordination is not needed

Coordination is needed

Generalized learning automaton


Source agent Target agent

2Observe algorithm 2Observe algorithm



40

TRANSFER LEARNING

“Transfer of learning occurs when learning in one context enhances (positive transfer) or undermines (negative transfer) a

related performance in another context.”

(D. Perkins, G. Salomon, Transfer of Learning, 1992, International Encyclopedia of Education)

41

MOTIVATIONS FOR TRANSFER LEARNING

• Learning tabula rasa can be extremely slow• Lots of data / time may be needed• Every algorithm has biases: why use an

uninformed bias?• Humans always use past knowledge

• What knowledge is relevant?• How can it be effectively leveraged?

42

TRANSFER LEARNING WITH 2OBSERVE

43







Source agent Target agent

2Observe algorithm 2Observe algorithm





No Yes

Is there influence from another

Single agent Q-learningselecting actions based on

local state information

Communication protocolbetween the agents to avoid

a collision in the next timestep

GLA approximating theInteraction function

RESULTS

0 2000 4000 6000 8000 10000

05

1015

20

Iterations

# st

eps

to g

oal

Agent 1Agent 2Agent 3

0 2000 4000 6000 8000 10000

05

1015

2025

Iterations

# st

eps

to g

oal


0 2000 4000 6000 8000 10000

05

1015

2025

Iterations#

step

s to

goa

l


44

RESULTS (COORDINATION)

0 2000 4000 6000 8000 10000

02

46

810

12

Iterations

# C

ollis

ions

/Coo

rdin

atio

ns

CollisionsCoordinations

0 2000 4000 6000 8000 10000

02

46

810

Iterations

# C

ollis

ions

/Coo

rdin

atio

ns


0 2000 4000 6000 8000 10000

02

46

810

Iterations#

Col

lisio

ns/C

oord

inat

ions


45

GENERALISATION WITH CQ-LEARNING

Neural network

Δ(x)

Δ(y)

a1

a2

0 | 1

32

7 98

5

1

4 6

4-1 4-2 4-3 6-1 6-2

Expand

Generalise

Generalisation learned with 2Observe

Local state space

GENERALISATION WITH CQ-LEARNING

Neural network

Δ(x)

Δ(y)

a1

a2

0 | 1

32

7 98

5

1

4 6

4-1 4-2 4-3 6-1 6-2

Expand

Generalise

Generalisation learned with CQ-learning

Local state space

GENERALISATION WITH CQ-LEARNING (2)

Safe initialisation Danger initialisation

47

GENERALISATION WITH CQ-LEARNING (2)

EASTWEST

NORTH

SOUTH

48

TRANSFER LEARNING WITH CQ-LEARNING

Augment

32

7 98

5

1

4 6

32

7 98

5

1

4 6

State space Agent k State space Agent l

Generalise

Rule learningsystem (Ripper)

Transfer trained classi er

+ Qaug-table

Source task

CQ-learning

Target task

Trained classi er

Single agent Q-learningQ-learning initialised

with Qaug-table



49

TRANSFER LEARNING WITH CQ-LEARNING (2)

50

RESULTS

0 200 400 600 800 1000

050

100

150

200

250

300

episodes

step

s to

goa

l

CQ−learningTransfer learning

0 200 400 600 800 1000

050

100

150

200

250

300

episodesst

eps

to g

oal


0 200 400 600 800 1000

050

100

150

200

250

300

episodes

step

s to

goa

l


51

RESULTS (2)

0 200 400 600 800 1000

0.0

0.5

1.0

1.5

2.0

episodesco

llisio

ns


0 200 400 600 800 1000

0.0

0.5

1.0

1.5

2.0

episodes

collis

ions


0 200 400 600 800 1000

0.0

0.5

1.0

1.5

2.0

episodes

collis

ions


52

CONCLUSIONS• In multi-agent environments with sparse interactions,

learning these interaction states improves the learning process

• Interaction states can be learned through increased penalties for miscoordination

• GLA can approximate interaction areas relative to the agent• Interaction states can be identified using statistical tests on

the reward signal (immediate + future)• Information about interaction states can be generalized and

transferred between agents and environments

53

multi-agent reinforcement learningcq-learning : statistical tests 30 • agents have been learning...

Documents