TRANSCRIPT
SIC-MMAB: Synchronisation involves communication
Etienne Boursier Vianney Perchet
MLMDA Seminar, November 2019
Overview
Multiplayer bandits problem
SIC-MMAB
Contradiction with lower bounds
Dynamic setting
Related works
Multiplayer bandits problem
Introduction
Motivation: Cognitive Radio (5G)
Optimize spectrum access for Primary and Secondary users:
  when a Primary user is on channel k → priority over Secondary users
  when several Secondary users are on the same channel: interference/collision

Goal for Secondary users: find and communicate on the best channels
1 / 29
Bandit game at round t ∈ {1, . . . , T}: K arms with means µ1, . . . , µK

A single player:
  pulls arm π(t) given the past
  observes reward X_π(t)(t), i.i.d. with X_k(t) ∼ B(µ_k) in [0, 1]

Cognitive radio interpretation: X_k(t) = 0 if a Primary user is on channel k, and 1 otherwise

Multiplayer Bandit game at round t ∈ {1, . . . , T}: K arms, M players

Players 1, . . . , M pull arms simultaneously; rewards are as above. When several players pull the same arm (e.g. Players 2 and 3 both pulling arm 2), they collide and observe reward 0 instead of X_2(t).

2 / 29
Model: Multiplayer Multi-Armed Bandits
K arms with Bernoulli rewards X_k(t) ∼ B(µ_k); w.l.o.g. µ1 ≥ µ2 ≥ . . . ≥ µK

M ≤ K players pull arms π^j(t) simultaneously for t = 1, . . . , T
Decentralized: players cannot communicate & M is unknown
Player j gets reward r^j(t) = X_{π^j(t)}(t) · 1{no collision on π^j(t)}

Regret: R_T = T ∑_{k=1}^{M} µ_k − E_µ[ ∑_{t=1}^{T} ∑_{j=1}^{M} r^j(t) ]
3 / 29
Feedback/sensing settings
r^j(t) = X_{π^j(t)}(t) · 1{no collision on π^j(t)}

Collision sensing: observe r^j(t) and 1{no collision on π^j(t)}
No sensing: observe only r^j(t)
Statistic sensing: observe r^j(t) and X_{π^j(t)}(t)
4 / 29
Collision Sensing: SIC-MMAB
Centralized case
Players communicate (for free) → no collision
Combinatorial bandits, tight bound [Anantharam et al., 1987, Komiyama et al., 2015]:

Regret in ∑_{k>M} log(T) / (µ_M − µ_k)
5 / 29
Lower bounds

Centralized lower bound: ∑_{k>M} log(T) / (µ_M − µ_k)
[Anantharam et al., 1987]

Decentralized lower bound: M ∑_{k>M} log(T) / (µ_M − µ_k)
[Liu and Zhao, 2010] [Besson and Kaufmann, 2018]
(struck out on the slide: SIC-MMAB beats this bound)

Decentralized ∼ Centralized

How is this possible?

6 / 29
Main trick
Observation: 1{no collision on k} ∈ {0, 1} can be seen as a bit sent between players

→ force collisions during communication rounds

When player i talks to player j:
  collide with j to send a 1 bit
  do not collide to send a 0 bit

Players communicate their empirical means to each other → centralization
Sublogarithmic number of communication rounds
7 / 29
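The trick above fits in a few lines of code. A minimal sketch, assuming collision-sensing feedback; the function names are illustrative, not from the paper:

```python
def sender_action(bit, receiver_arm, other_arm):
    """Send one bit: pull the receiver's arm to collide (bit 1),
    or any other arm to avoid a collision (bit 0)."""
    return receiver_arm if bit == 1 else other_arm

def receiver_decode(no_collision):
    """The receiver reads its collision indicator 1{no collision}."""
    return 0 if no_collision else 1

# One communication round: the receiver sits on arm 3.
arm = sender_action(1, receiver_arm=3, other_arm=0)
no_collision = (arm != 3)
assert receiver_decode(no_collision) == 1
```

The bit costs the sender and receiver at most one reward each, which is why a sublogarithmic number of such rounds does not affect the leading regret term.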
Algorithm structure
Algorithm 1: SIC-MMAB
  Initialization phase
  for p = 1, . . . , ∞ do
    Exploration phase p, for 2^p rounds
    Communication phase p
    Accept/reject (sub)-optimal arms
  end
  Exploitation phase: pull optimal arms until T
8 / 29
Initialization phase

Orthogonalize players: Musical Chairs for K log(T) rounds [Rosenski et al., 2016]
  Sample an arm k uniformly at random
  If collision → continue sampling
  If no collision → stick to arm k until round K log(T)
With probability 1 − M/T, all players end on different arms

Compute M and the rank j: Sequential Hopping
  the player on arm k waits for 2k rounds, then hops for 2(K − k) rounds
  M − 1 = total number of collisions observed
  j − 1 = number of collisions during the first 2k rounds

9 / 29
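The Sequential Hopping step can be checked with a toy simulation of the schedule above (wait 2k rounds, then hop one arm per round); the helper names are mine, not the paper's:

```python
def position(k, t, K):
    """Arm occupied at round t (0-indexed, over 2K rounds) by the player
    that ended initialization on arm k (1-indexed)."""
    if t < 2 * k:
        return k                              # waiting period
    return (k - 1 + (t - 2 * k) + 1) % K + 1  # then hop one arm per round

def estimate_M_and_rank(k, others, K):
    """Count this player's collisions: the total gives M - 1, and the
    collisions during the first 2k rounds give the rank j - 1."""
    total = rank_count = 0
    for t in range(2 * K):
        me = position(k, t, K)
        if any(position(o, t, K) == me for o in others):
            total += 1
            if t < 2 * k:
                rank_count += 1
    return total + 1, rank_count + 1          # (M, rank j)

K, arms = 5, [1, 3, 4]                        # 3 players ended on arms 1, 3, 4
for idx, k in enumerate(sorted(arms)):
    others = [a for a in arms if a != k]
    assert estimate_M_and_rank(k, others, K) == (3, idx + 1)
```

Each pair of players collides exactly once during the 2K rounds, which is what makes both counts exact.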
Exploration phase p
Each player explores each arm 2^p times
Players start at different positions, given by their ranks
Sequential hopping → no collision

Player j gathers statistics on arm k:
  S^j_k(p) total reward over T^j_k(p) pulls
10 / 29
Communication phase p
Player i communicates S^i_k(p) ∈ [2^p] to player j:
  encoded in p bits, e.g. (0, 1, 0, . . . , 0)
  sent in p rounds: (no coll., coll., no coll., . . . , no coll.)

Players communicate one at a time
They know when and how to do so, thanks to their ranks j
Possible quantization for non-binary rewards

Length of comm. phase p: ≈ K M² p rounds
11 / 29
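The p-bit transmission can be sketched as follows; a minimal illustration with made-up helper names, assuming the receiver sits on a known arm during the p rounds:

```python
def to_bits(value, p):
    """Binary expansion of a statistic in [0, 2^p) on p bits, MSB first."""
    return [(value >> (p - 1 - i)) & 1 for i in range(p)]

def sender_arms(bits, receiver_arm, idle_arm):
    """Arm pulled by the sender in each of the p rounds: collide on the
    receiver's arm for a 1, stay on an idle arm for a 0."""
    return [receiver_arm if b else idle_arm for b in bits]

def receiver_decode(collision_flags):
    """Rebuild the statistic from the p collision indicators."""
    value = 0
    for flag in collision_flags:
        value = (value << 1) | flag
    return value

p, S = 5, 19                                  # statistic S^i_k(p) in [2^p]
arms = sender_arms(to_bits(S, p), receiver_arm=2, idle_arm=7)
flags = [1 if a == 2 else 0 for a in arms]    # receiver's collision feedback
assert receiver_decode(flags) == S
```

Scheduling who sends to whom at which round follows deterministically from the ranks computed during initialization.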
Algorithm structure
Algorithm 1 (recap): SIC-MMAB
  Initialization phase
  for p = 1, . . . , ∞ do
    Exploration phase p
    Communication phase p
    Accept/reject (sub)-optimal arms
  end
  Exploitation phase
12 / 29
Accept/Eliminate (sub)-optimal arms
All players now have the same centralized empirical means µ̂_k

Concentration inequality (Hoeffding): with high probability,
  |µ_k − µ̂_k| ≤ √(2 log(T) / T_k(p))

→ arm k is detected better than arm l if:
  µ̂_k − √(2 log(T) / T_k(p)) ≥ µ̂_l + √(2 log(T) / T_l(p))

This happens after ≈ log(T) / (µ_k − µ_l)² pulls
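The separation test is easy to state in code; a sketch with illustrative names and numbers:

```python
import math

def conf_width(T, pulls):
    """Hoeffding confidence width sqrt(2 log(T) / pulls)."""
    return math.sqrt(2 * math.log(T) / pulls)

def detected_better(mu_hat_k, pulls_k, mu_hat_l, pulls_l, T):
    """Arm k is detected better than arm l once the two confidence
    intervals separate."""
    return mu_hat_k - conf_width(T, pulls_k) >= mu_hat_l + conf_width(T, pulls_l)

T = 10**6
# With a gap of 0.2, 50 pulls per arm are not enough to separate...
assert not detected_better(0.7, 50, 0.5, 50, T)
# ...but on the order of log(T)/gap^2 pulls are.
assert detected_better(0.7, 6000, 0.5, 6000, T)
```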
13 / 29
Accept/Eliminate (sub)-optimal arms
Arm k is sub-optimal if M arms are detected better → eliminated from the set to explore
Arm k is optimal if K − M arms are detected worse → attributed to the player with the largest rank

→ exploration ends after N ≈ log( log(T) / (µ_M − µ_{M+1})² ) phases
14 / 29
Regret bound
Initialization: M × length ≈ M K log(T)

Communication: M × ∑_{p=1}^{N} p M² K ≈ M³ K log²( log(T) / (µ_M − µ_{M+1})² )

Exploration: centralized regret bound ∑_{k>M} log(T) / (µ_M − µ_k)

Low probability events: o(log(T))

Total regret:
  R_T ≲ ∑_{k>M} log(T) / (µ_M − µ_k) + M K log(T)
15 / 29
Contradiction with lower bounds
Contradict the lower bound?
Recall:
  Lower bound:  M ∑_{k>M} log(T) / (µ_M − µ_k)
  SIC-MMAB:       ∑_{k>M} log(T) / (µ_M − µ_k) + K M log(T)

Why this contradiction?
  Lower bound proofs assumed that the best algorithms do not collide
  Wrong: SIC-MMAB deduces a lot of information from collisions
  Decentralized is as hard as centralized
16 / 29
Towards a better model?

SIC-MMAB uses unrealistic/undesired communication protocols
It abuses a loophole in the model that allows them
→ need for a better model, without such a loophole

Which model assumption went wrong? Collision sensing?

17 / 29
No sensing setting
Assumption: known lower bound on the means, µ_k ≥ µ_min > 0

Observation: a bit can be sent with high probability in log(T)/µ_min rounds

Algo 1: SIC-MMAB with log(T)/µ_min comm. rounds instead of 1
  comm. regret becomes M³ K (log(T)/µ_min) log²(log(T))

Algo 2: limited & different communication
  do not communicate statistics, but only when an arm is found (sub)-optimal
  regret in M ∑_{k>M} log(T) / (µ_M − µ_k) + (M K² / µ_min) log(T)
18 / 29
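A sketch of the observation above, without sensing: to send a 1, the sender collides on the receiver's arm for the whole window (forcing 0 rewards); to send a 0, it stays away, and the receiver sees a positive reward with high probability. Window length and names are illustrative:

```python
import math

def receive_bit_no_sensing(rewards):
    """Decode without collision sensing: a '1' (sender colliding all
    window) forces every reward to 0; any positive reward reveals a '0'."""
    return 1 if all(r == 0 for r in rewards) else 0

mu_min, T = 0.3, 10**4
n = math.ceil(math.log(T) / mu_min)   # ~31 rounds per bit
# bit = 1: the sender collides every round, the receiver sees only 0s
assert receive_bit_no_sensing([0] * n) == 1
# bit = 0: rewards are Bernoulli(mu_k); w.h.p. at least one is positive
assert receive_bit_no_sensing([0] * 10 + [1] + [0] * (n - 11)) == 0
```

A false 1 requires n consecutive Bernoulli(µ_k) zeros, which has probability at most (1 − µ_min)^n ≤ 1/T for this window length.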
Towards a better model?

SIC-MMAB uses unrealistic/undesired communication protocols
It abuses a loophole in the model that allows such protocols

Which model assumption went wrong?
  collision sensing?
  cooperative players? (work in progress)
  synchronisation between players? → more realistic dynamic model

19 / 29
Dynamic setting: DYN-MMAB
Dynamic Model
Asynchronicity assumption: player j enters the game at an unknown time τ^j ∈ [T] and stays until T.

Varying & unknown set of players M(t)

No synchronisation ⇒ similar communication protocols are not possible
No Sensing setting
20 / 29
A dynamic algorithm
Only 2 different states:
  Exploration: sample an arm uniformly at random
  Exploitation: occupy some optimal arm until T

Three difficulties:
  1. Detect arms occupied by other players
  2. Estimate the best available arm
  3. Start occupying the best available arm
21 / 29
Detect occupied arms
If arm k is occupied, it only yields 0 rewards
If arm k is not occupied, it yields a positive reward with probability µ_k (1 − 1/K)^{M_t − 1} ≥ µ_k / e

For an occupied arm k:
  if µ_k is tightly estimated: after ≈ e log(T)/µ_k successive 0s, k is assumed occupied
  otherwise, µ̂_k quickly drops to 0 and k becomes sub-optimal
22 / 29
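The run-of-zeros test can be sketched as below (threshold from the slide; the function name is mine). A free arm returns a positive reward with probability at least µ_k/e per round, so a zero run of length e·log(T)/µ_k has probability at most 1/T:

```python
import math

def occupied_after_zeros(mu_k, T, zero_run):
    """Declare arm k occupied once the run of successive 0 rewards
    exceeds e * log(T) / mu_k: under a free arm, such a run has
    probability at most (1 - mu_k/e)^run <= 1/T."""
    return zero_run >= math.e * math.log(T) / mu_k

T = 10**6
# threshold for mu_k = 0.5 is about e * log(10^6) / 0.5 ~ 75 rounds
assert not occupied_after_zeros(0.5, T, 20)
assert occupied_after_zeros(0.5, T, 100)
```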
Estimate available arms
Players sample uniformly at random ⇒ E[r_k(t)] = µ_k (1 − 1/K)^{M_t − 1}

Player estimates γ_t µ_k, where γ_t = (1/t) ∑_{s=τ^j+1}^{τ^j+t} (1 − 1/K)^{M_s}

µ_k ≥ µ_l ⇐⇒ γ_t µ_k ≥ γ_t µ_l
Concentration inequalities hold for γ_t µ_k (while k is still free)
γ_t ≥ 1/e ⇒ estimating γ_t µ_k instead of µ_k takes roughly the same time

Player detects the best available arm k after time O( K log(T) / (µ_k − µ_{k+1})² )
23 / 29
Occupy best available arm
Once an arm is detected as the best available → try to occupy it
Continue sampling uniformly at random:
  positive reward → occupy that arm
  only 0 rewards? → detect it as occupied, continue exploring until the next available arm

At some point, the player succeeds in occupying an arm while all better arms are occupied
24 / 29
Regret bound
New regret definition:
  R_T = ∑_{t=1}^{T} ∑_{k=1}^{|M(t)|} µ_k − E_µ[ ∑_{t=1}^{T} ∑_{j∈M(t)} r^j(t) ]

Dynamic regret bound:
  R_T ≲ M K log(T) / ∆̄²_M   (detection of optimal arms)
      + M² K log(T) / µ_M    (detection of occupied arms)
  with ∆̄_M = min_{k≤M} (µ_k − µ_{k+1})

Drawback: quadratic dependence in 1/∆̄_M (due to uniform sampling)
25 / 29
Some related works (in random order)
Adversarial case
[Bubeck et al., 2019] considered adversarial rewards X_k(t)
√T regret for 2 players

Uses the communication trick to coordinate the players:
  one switches arms with high frequency
  the other with low frequency
26 / 29
Improving SIC-MMAB

Heterogeneous case: [Boursier et al., 2019]
  Arm means µ^j_k differ between players
  Improved comm. protocol: a leader gathers the information and decides for the others
  Do not eliminate arms, but player-arm pairs (j, k)

Optimal algorithm for the homogeneous case: [Proutiere and Wang, 2019]
  initialization in constant time (in T)
  exploration only by the leader
  regret ≤ ∑_{k>M} log(T) / (µ_M − µ_k) + o(log(T))

Confirms: decentralized is as hard as centralized

27 / 29
Other recent works
Heterogeneous case:
  similar protocols [Tibrewal et al., 2019]
  implicit comm. through Markov chains [Bistritz and Leshem, 2018]
  arms have preferences over players [Liu et al., 2019]

No sensing [Lugosi and Mehrabian, 2018]
Collision only implies a drop in reward [Magesh and Veeravalli, 2019]
28 / 29
Recap & Open questions
Recap:
  Synchronisation allows communication protocols
  This contradicts previous lower bounds: decentralized ∼ centralized
  Synchronisation is a loophole in the model and has to be removed
  More realistic dynamic model: first logarithmic-regret algorithm

Open questions:
  Is the dynamic setting the right choice?
  Room for improvement in hard settings (statistic sensing, adversarial rewards, heterogeneous, dynamic, etc.)
Thank you!
29 / 29
References I
Anantharam, V., Varaiya, P., and Walrand, J. (1987).
Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays - Part I: i.i.d. rewards.
IEEE Transactions on Automatic Control, 32(11):968–976.
Besson, L. and Kaufmann, E. (2018).
Multi-player bandits revisited.
In Algorithmic Learning Theory, Lanzarote, Spain.
Bistritz, I. and Leshem, A. (2018).
Distributed multi-player bandits: a game of thrones approach.
In Advances in Neural Information Processing Systems, pages 7222–7232.
Boursier, E., Kaufmann, E., Mehrabian, A., and Perchet, V. (2019).
A practical algorithm for multiplayer bandits when arm means vary among players.
arXiv preprint arXiv:1902.01239.
References II
Bubeck, S., Li, Y., Peres, Y., and Sellke, M. (2019).
Non-stochastic multi-player multi-armed bandits: optimal rate with collision information, sublinear without.
arXiv preprint arXiv:1904.12233.
Komiyama, J., Honda, J., and Nakagawa, H. (2015).
Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays.
In International Conference on Machine Learning, pages 1152–1161.
Liu, K. and Zhao, Q. (2010).Distributed learning in multi-armed bandit with multiple players.IEEE Transactions on Signal Processing, 58(11):5667–5681.
Liu, L., Mania, H., and Jordan, M. (2019).Competing bandits in matching markets.arXiv preprint arXiv:1906.05363.
References III
Lugosi, G. and Mehrabian, A. (2018).Multiplayer bandits without observing collision information.arXiv preprint arXiv:1808.08416.
Magesh, A. and Veeravalli, V. (2019).
Multi-player multi-armed bandits with non-zero rewards on collisions for uncoordinated spectrum access.
arXiv preprint arXiv:1910.09089.
Proutiere, A. and Wang, P. (2019).An optimal algorithm in multiplayer multi-armed bandits.
Rosenski, J., Shamir, O., and Szlak, L. (2016).
Multi-player bandits: a musical chairs approach.
In International Conference on Machine Learning, pages 155–163.
References IV
Tibrewal, H., Patchala, S., Hanawal, M., and Darak, S. (2019).
Distributed learning and optimal assignment in multiplayer heterogeneous networks.
In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pages 1693–1701. IEEE.