TRANSCRIPT
SIC-MMAB: Synchronisation involves communication
Etienne Boursier Vianney Perchet
MLMDA Seminar, November 2019
Overview
Multiplayer bandits problem
SIC-MMAB
Contradiction with lower bounds
Dynamic setting
Related works
Multiplayer bandits problem
Introduction
Motivation: Cognitive Radio (5G)
Optimize spectrum access for Primary and Secondary users:
  when a Primary user is on channel k → priority over Secondary users
  when several Secondary users are on the same channel: interference/collision

Goal for Secondary users: find and communicate on the best channels
1 / 29
Bandit game at round t ∈ {1, . . . , T}: K arms with means µ1, . . . , µK

A single player:
  pulls arm π(t) given the past
  observes reward X_π(t)(t), i.i.d. with X_k(t) ∼ B(µ_k) in [0, 1]

Cognitive radio interpretation: X_k(t) = 0 if a Primary user is on channel k, and 1 otherwise

Multiplayer Bandit game at round t ∈ {1, . . . , T}: K arms, M players

Players 1, . . . , M pull arms simultaneously; rewards are as above. When several players pull the same arm (e.g. Players 2 and 3 both pulling arm 2), they collide and observe reward 0 instead of X_2(t).

2 / 29
Model: Multiplayer Multi-Armed Bandits
K arms with Bernoulli rewards X_k(t) ∼ B(µ_k); w.l.o.g. µ1 ≥ µ2 ≥ . . . ≥ µK

M ≤ K players pull arms π^j(t) simultaneously for t = 1, . . . , T
Decentralized: players cannot communicate & M is unknown
Player j gets reward r^j(t) = X_{π^j(t)}(t) · 1{no collision on π^j(t)}

Regret: R_T = T ∑_{k=1}^{M} µ_k − E_µ[ ∑_{t=1}^{T} ∑_{j=1}^{M} r^j(t) ]
3 / 29
Feedback/sensing settings
r^j(t) = X_{π^j(t)}(t) · 1{no collision on π^j(t)}

Collision sensing: observe r^j(t) and 1{no collision on π^j(t)}
No sensing: observe only r^j(t)
Statistic sensing: observe r^j(t) and X_{π^j(t)}(t)
4 / 29
Collision Sensing: SIC-MMAB
Centralized case
Players communicate (for free) → no collision
Combinatorial bandits, tight bound [Anantharam et al., 1987, Komiyama et al., 2015]:

Regret in ∑_{k>M} log(T) / (µ_M − µ_k)
5 / 29
Lower bounds

Centralized lower bound: ∑_{k>M} log(T) / (µ_M − µ_k)
[Anantharam et al., 1987]

Decentralized lower bound: M ∑_{k>M} log(T) / (µ_M − µ_k)
[Liu and Zhao, 2010] [Besson and Kaufmann, 2018]
(struck out on the slide: SIC-MMAB beats this bound)

Decentralized ∼ Centralized

How is this possible?

6 / 29
Main trick
Observation: 1{no collision on k} ∈ {0, 1} can be seen as a bit sent between players

→ force collisions during communication rounds

When player i talks to player j:
  collide with j to send a 1 bit
  do not collide to send a 0 bit

Players communicate their empirical means to each other → centralization
Sublogarithmic number of communication rounds
7 / 29
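The trick above fits in a few lines of code. A minimal sketch, assuming collision-sensing feedback; the function names are illustrative, not from the paper:

```python
def sender_action(bit, receiver_arm, other_arm):
    """Send one bit: pull the receiver's arm to collide (bit 1),
    or any other arm to avoid a collision (bit 0)."""
    return receiver_arm if bit == 1 else other_arm

def receiver_decode(no_collision):
    """The receiver reads its collision indicator 1{no collision}."""
    return 0 if no_collision else 1

# One communication round: the receiver sits on arm 3.
arm = sender_action(1, receiver_arm=3, other_arm=0)
no_collision = (arm != 3)
assert receiver_decode(no_collision) == 1
```

The bit costs the sender and receiver at most one reward each, which is why a sublogarithmic number of such rounds does not affect the leading regret term.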
Algorithm structure
Algorithm 1: SIC-MMAB
  Initialization phase
  for p = 1, . . . , ∞ do
    Exploration phase p, for 2^p rounds
    Communication phase p
    Accept/reject (sub)-optimal arms
  end
  Exploitation phase: pull optimal arms until T
8 / 29
Initialization phase

Orthogonalize players: Musical Chairs for K log(T) rounds [Rosenski et al., 2016]
  Sample an arm k uniformly at random
  If collision → continue sampling
  If no collision → stick to arm k until round K log(T)
With probability 1 − M/T, all players end on different arms

Compute M and the rank j: Sequential Hopping
  the player on arm k waits for 2k rounds, then hops for 2(K − k) rounds
  M − 1 = total number of collisions observed
  j − 1 = number of collisions during the first 2k rounds

9 / 29
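The Sequential Hopping step can be checked with a toy simulation of the schedule above (wait 2k rounds, then hop one arm per round); the helper names are mine, not the paper's:

```python
def position(k, t, K):
    """Arm occupied at round t (0-indexed, over 2K rounds) by the player
    that ended initialization on arm k (1-indexed)."""
    if t < 2 * k:
        return k                              # waiting period
    return (k - 1 + (t - 2 * k) + 1) % K + 1  # then hop one arm per round

def estimate_M_and_rank(k, others, K):
    """Count this player's collisions: the total gives M - 1, and the
    collisions during the first 2k rounds give the rank j - 1."""
    total = rank_count = 0
    for t in range(2 * K):
        me = position(k, t, K)
        if any(position(o, t, K) == me for o in others):
            total += 1
            if t < 2 * k:
                rank_count += 1
    return total + 1, rank_count + 1          # (M, rank j)

K, arms = 5, [1, 3, 4]                        # 3 players ended on arms 1, 3, 4
for idx, k in enumerate(sorted(arms)):
    others = [a for a in arms if a != k]
    assert estimate_M_and_rank(k, others, K) == (3, idx + 1)
```

Each pair of players collides exactly once during the 2K rounds, which is what makes both counts exact.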
Exploration phase p
Each player explores each arm 2^p times
Players start at different positions, given by their ranks
Sequential hopping → no collision

Player j gathers statistics on arm k:
  S^j_k(p) total reward over T^j_k(p) pulls
10 / 29
Communication phase p
Player i communicates S^i_k(p) ∈ [2^p] to player j:
  encoded in p bits, e.g. (0, 1, 0, . . . , 0)
  sent in p rounds: (no coll., coll., no coll., . . . , no coll.)

Players communicate one at a time
They know when and how to do so, thanks to their ranks j
Possible quantization for non-binary rewards

Length of comm. phase p: ≈ K M² p rounds
11 / 29
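The p-bit transmission can be sketched as follows; a minimal illustration with made-up helper names, assuming the receiver sits on a known arm during the p rounds:

```python
def to_bits(value, p):
    """Binary expansion of a statistic in [0, 2^p) on p bits, MSB first."""
    return [(value >> (p - 1 - i)) & 1 for i in range(p)]

def sender_arms(bits, receiver_arm, idle_arm):
    """Arm pulled by the sender in each of the p rounds: collide on the
    receiver's arm for a 1, stay on an idle arm for a 0."""
    return [receiver_arm if b else idle_arm for b in bits]

def receiver_decode(collision_flags):
    """Rebuild the statistic from the p collision indicators."""
    value = 0
    for flag in collision_flags:
        value = (value << 1) | flag
    return value

p, S = 5, 19                                  # statistic S^i_k(p) in [2^p]
arms = sender_arms(to_bits(S, p), receiver_arm=2, idle_arm=7)
flags = [1 if a == 2 else 0 for a in arms]    # receiver's collision feedback
assert receiver_decode(flags) == S
```

Scheduling who sends to whom at which round follows deterministically from the ranks computed during initialization.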
Algorithm structure
Algorithm 1 (recap): SIC-MMAB
  Initialization phase
  for p = 1, . . . , ∞ do
    Exploration phase p
    Communication phase p
    Accept/reject (sub)-optimal arms
  end
  Exploitation phase
12 / 29
Accept/Eliminate (sub)-optimal arms
All players now have the same centralized empirical means µ̂_k

Concentration inequality (Hoeffding): with high probability,
  |µ_k − µ̂_k| ≤ √(2 log(T) / T_k(p))

→ arm k is detected better than arm l if:
  µ̂_k − √(2 log(T) / T_k(p)) ≥ µ̂_l + √(2 log(T) / T_l(p))

This happens after ≈ log(T) / (µ_k − µ_l)² pulls
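The separation test is easy to state in code; a sketch with illustrative names and numbers:

```python
import math

def conf_width(T, pulls):
    """Hoeffding confidence width sqrt(2 log(T) / pulls)."""
    return math.sqrt(2 * math.log(T) / pulls)

def detected_better(mu_hat_k, pulls_k, mu_hat_l, pulls_l, T):
    """Arm k is detected better than arm l once the two confidence
    intervals separate."""
    return mu_hat_k - conf_width(T, pulls_k) >= mu_hat_l + conf_width(T, pulls_l)

T = 10**6
# With a gap of 0.2, 50 pulls per arm are not enough to separate...
assert not detected_better(0.7, 50, 0.5, 50, T)
# ...but on the order of log(T)/gap^2 pulls are.
assert detected_better(0.7, 6000, 0.5, 6000, T)
```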
13 / 29
Accept/Eliminate (sub)-optimal arms
Arm k is sub-optimal if M arms are detected better → eliminated from the set to explore
Arm k is optimal if K − M arms are detected worse → attributed to the player with the largest rank

→ exploration ends after N ≈ log( log(T) / (µ_M − µ_{M+1})² ) phases
14 / 29
Regret bound
Initialization: M × length ≈ M K log(T)

Communication: M × ∑_{p=1}^{N} p M² K ≈ M³ K log²( log(T) / (µ_M − µ_{M+1})² )

Exploration: centralized regret bound ∑_{k>M} log(T) / (µ_M − µ_k)

Low probability events: o(log(T))

Total regret:
  R_T ≲ ∑_{k>M} log(T) / (µ_M − µ_k) + M K log(T)
15 / 29
Contradiction with lower bounds
Contradict the lower bound?
Recall:
  Lower bound:  M ∑_{k>M} log(T) / (µ_M − µ_k)
  SIC-MMAB:       ∑_{k>M} log(T) / (µ_M − µ_k) + K M log(T)

Why this contradiction?
  Lower bound proofs assumed that the best algorithms do not collide
  Wrong: SIC-MMAB deduces a lot of information from collisions
  Decentralized is as hard as centralized
16 / 29
Towards a better model?

SIC-MMAB uses unrealistic/undesired communication protocols
It abuses a loophole in the model that allows them
→ need for a better model, without such a loophole

Which model assumption went wrong? Collision sensing?

17 / 29
No sensing setting
Assumption: known lower bound on the means, µ_k ≥ µ_min > 0

Observation: a bit can be sent with high probability in log(T)/µ_min rounds

Algo 1: SIC-MMAB with log(T)/µ_min comm. rounds instead of 1
  comm. regret becomes M³ K (log(T)/µ_min) log²(log(T))

Algo 2: limited & different communication
  do not communicate statistics, but only when an arm is found (sub)-optimal
  regret in M ∑_{k>M} log(T) / (µ_M − µ_k) + (M K² / µ_min) log(T)
18 / 29
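A sketch of the observation above, without sensing: to send a 1, the sender collides on the receiver's arm for the whole window (forcing 0 rewards); to send a 0, it stays away, and the receiver sees a positive reward with high probability. Window length and names are illustrative:

```python
import math

def receive_bit_no_sensing(rewards):
    """Decode without collision sensing: a '1' (sender colliding all
    window) forces every reward to 0; any positive reward reveals a '0'."""
    return 1 if all(r == 0 for r in rewards) else 0

mu_min, T = 0.3, 10**4
n = math.ceil(math.log(T) / mu_min)   # ~31 rounds per bit
# bit = 1: the sender collides every round, the receiver sees only 0s
assert receive_bit_no_sensing([0] * n) == 1
# bit = 0: rewards are Bernoulli(mu_k); w.h.p. at least one is positive
assert receive_bit_no_sensing([0] * 10 + [1] + [0] * (n - 11)) == 0
```

A false 1 requires n consecutive Bernoulli(µ_k) zeros, which has probability at most (1 − µ_min)^n ≤ 1/T for this window length.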
Towards a better model?

SIC-MMAB uses unrealistic/undesired communication protocols
It abuses a loophole in the model that allows such protocols

Which model assumption went wrong?
  collision sensing?
  cooperative players? (work in progress)
  synchronisation between players? → more realistic dynamic model

19 / 29
Dynamic setting: DYN-MMAB
Dynamic Model
Asynchronicity assumption: player j enters the game at an unknown time τ^j ∈ [T] and stays until T.

Varying & unknown set of players M(t)

No synchronisation ⇒ similar communication protocols are not possible
No Sensing setting
20 / 29
A dynamic algorithm
Only 2 different states:
  Exploration: sample an arm uniformly at random
  Exploitation: occupy some optimal arm until T

Three difficulties:
  1. Detect arms occupied by other players
  2. Estimate the best available arm
  3. Start occupying the best available arm
21 / 29
Detect occupied arms
If arm k is occupied, it only yields 0 rewards
If arm k is not occupied, it yields a positive reward with probability µ_k (1 − 1/K)^{M_t − 1} ≥ µ_k / e

For an occupied arm k:
  if µ_k is tightly estimated: after ≈ e log(T)/µ_k successive 0s, k is assumed occupied
  otherwise, µ̂_k quickly drops to 0 and k becomes sub-optimal
22 / 29
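The run-of-zeros test can be sketched as below (threshold from the slide; the function name is mine). A free arm returns a positive reward with probability at least µ_k/e per round, so a zero run of length e·log(T)/µ_k has probability at most 1/T:

```python
import math

def occupied_after_zeros(mu_k, T, zero_run):
    """Declare arm k occupied once the run of successive 0 rewards
    exceeds e * log(T) / mu_k: under a free arm, such a run has
    probability at most (1 - mu_k/e)^run <= 1/T."""
    return zero_run >= math.e * math.log(T) / mu_k

T = 10**6
# threshold for mu_k = 0.5 is about e * log(10^6) / 0.5 ~ 75 rounds
assert not occupied_after_zeros(0.5, T, 20)
assert occupied_after_zeros(0.5, T, 100)
```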
Estimate available arms
Players sample uniformly at random ⇒ E[r_k(t)] = µ_k (1 − 1/K)^{M_t − 1}

Player estimates γ_t µ_k, where γ_t = (1/t) ∑_{s=τ^j+1}^{τ^j+t} (1 − 1/K)^{M_s}

µ_k ≥ µ_l ⇐⇒ γ_t µ_k ≥ γ_t µ_l
Concentration inequalities hold for γ_t µ_k (while k is still free)
γ_t ≥ 1/e ⇒ estimating γ_t µ_k instead of µ_k takes roughly the same time

Player detects the best available arm k after time O( K log(T) / (µ_k − µ_{k+1})² )
23 / 29
Occupy best available arm
Once an arm is detected as the best available → try to occupy it
Continue sampling uniformly at random:
  positive reward → occupy that arm
  only 0 rewards? → detect it as occupied, continue exploring until the next available arm

At some point, the player succeeds in occupying an arm while all better arms are occupied
24 / 29
Regret bound
New regret definition:
  R_T = ∑_{t=1}^{T} ∑_{k=1}^{|M(t)|} µ_k − E_µ[ ∑_{t=1}^{T} ∑_{j∈M(t)} r^j(t) ]

Dynamic regret bound:
  R_T ≲ M K log(T) / ∆̄²_M   (detection of optimal arms)
      + M² K log(T) / µ_M    (detection of occupied arms)
  with ∆̄_M = min_{k≤M} (µ_k − µ_{k+1})

Drawback: quadratic dependence in 1/∆̄_M (due to uniform sampling)
25 / 29
Some related works (in random order)
Adversarial case
[Bubeck et al., 2019] considered adversarial rewards X_k(t)
√T regret for 2 players

Uses the communication trick to coordinate the players:
  one switches arms with high frequency
  the other with low frequency
26 / 29
Improving SIC-MMAB

Heterogeneous case: [Boursier et al., 2019]
  Arm means µ^j_k differ between players
  Improved comm. protocol: a leader gathers the information and decides for the others
  Do not eliminate arms, but player-arm pairs (j, k)

Optimal algorithm for the homogeneous case: [Proutiere and Wang, 2019]
  initialization in constant time (in T)
  exploration only by the leader
  regret ≤ ∑_{k>M} log(T) / (µ_M − µ_k) + o(log(T))

Confirms: decentralized is as hard as centralized

27 / 29
Other recent works
Heterogeneous case:
  similar protocols [Tibrewal et al., 2019]
  implicit comm. through Markov chains [Bistritz and Leshem, 2018]
  arms have preferences over players [Liu et al., 2019]

No sensing [Lugosi and Mehrabian, 2018]
Collision only implies a drop in reward [Magesh and Veeravalli, 2019]
28 / 29
Recap & Open questions
Recap:
  Synchronisation allows communication protocols
  This contradicts previous lower bounds: decentralized ∼ centralized
  Synchronisation is a loophole in the model and has to be removed
  More realistic dynamic model: first logarithmic-regret algorithm

Open questions:
  Is the dynamic setting the right choice?
  Room for improvement in hard settings (statistic sensing, adversarial rewards, heterogeneous, dynamic, etc.)
Thank you!
29 / 29
References I
Anantharam, V., Varaiya, P., and Walrand, J. (1987).
Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays - Part I: i.i.d. rewards.
IEEE Transactions on Automatic Control, 32(11):968–976.
Besson, L. and Kaufmann, E. (2018).
Multi-player bandits revisited.
In Algorithmic Learning Theory, Lanzarote, Spain.
Bistritz, I. and Leshem, A. (2018).
Distributed multi-player bandits: a game of thrones approach.
In Advances in Neural Information Processing Systems, pages 7222–7232.
Boursier, E., Kaufmann, E., Mehrabian, A., and Perchet, V. (2019).
A practical algorithm for multiplayer bandits when arm means vary among players.
arXiv preprint arXiv:1902.01239.
References II
Bubeck, S., Li, Y., Peres, Y., and Sellke, M. (2019).
Non-stochastic multi-player multi-armed bandits: optimal rate with collision information, sublinear without.
arXiv preprint arXiv:1904.12233.
Komiyama, J., Honda, J., and Nakagawa, H. (2015).
Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays.
In International Conference on Machine Learning, pages 1152–1161.
Liu, K. and Zhao, Q. (2010).Distributed learning in multi-armed bandit with multiple players.IEEE Transactions on Signal Processing, 58(11):5667–5681.
Liu, L., Mania, H., and Jordan, M. (2019).Competing bandits in matching markets.arXiv preprint arXiv:1906.05363.
References III
Lugosi, G. and Mehrabian, A. (2018).Multiplayer bandits without observing collision information.arXiv preprint arXiv:1808.08416.
Magesh, A. and Veeravalli, V. (2019).
Multi-player multi-armed bandits with non-zero rewards on collisions for uncoordinated spectrum access.
arXiv preprint arXiv:1910.09089.
Proutiere, A. and Wang, P. (2019).An optimal algorithm in multiplayer multi-armed bandits.
Rosenski, J., Shamir, O., and Szlak, L. (2016).
Multi-player bandits: a musical chairs approach.
In International Conference on Machine Learning, pages 155–163.
References IV
Tibrewal, H., Patchala, S., Hanawal, M., and Darak, S. (2019).
Distributed learning and optimal assignment in multiplayer heterogeneous networks.
In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pages 1693–1701. IEEE.