
Nonstationary Stochastic Multiarmed Bandits: UCB Policies and Minimax Regret

Lai Wei Vaibhav Srivastava

Abstract—We study the nonstationary stochastic Multi-Armed Bandit (MAB) problem in which the distribution of rewards associated with each arm is assumed to be time-varying and the total variation in the expected rewards is subject to a variation budget. The regret of a policy is defined by the difference in the expected cumulative rewards obtained using the policy and using an oracle that selects the arm with the maximum mean reward at each time. We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget. We extend Upper-Confidence Bound (UCB)-based policies with three different approaches, namely, periodic resetting, sliding observation window and discount factor, and show that they are order-optimal with respect to the minimax regret, i.e., the minimum worst-case regret achieved by any policy. We also relax the sub-Gaussian assumption on reward distributions and develop robust versions of the proposed policies that can handle heavy-tailed reward distributions and maintain their performance guarantees.

Index Terms—Nonstationary multiarmed bandit, variation budget, minimax regret, upper-confidence bound, heavy-tailed distributions.

I. INTRODUCTION

UNCERTAINTY and nonstationarity of the environment are two of the major barriers in decision-making problems across scientific disciplines, including engineering, economics, social science, neuroscience, and ecology. An efficient strategy in such environments requires balancing several tradeoffs, including exploration-versus-exploitation, i.e., choosing between the most informative and the empirically most rewarding alternatives, and remembering-versus-forgetting, i.e., using more but possibly outdated information or using less but recent information.

The stochastic MAB problem is a canonical formulation of the exploration-versus-exploitation tradeoff. In an MAB problem, an agent selects one from $K$ options at each time and receives a reward associated with it. The reward sequence at each option is assumed to be an unknown i.i.d. random process. The MAB formulation has been applied in many scientific and technological areas. For example, it is used for opportunistic spectrum access in communication networks, wherein the arm models the availability of a channel [1], [2]. In the MAB formulation of online learning for demand response [3], [4], an aggregator calls upon a subset of users (arms) who have an unknown response to the request to reduce their loads. MAB

This work was supported by NSF Award IIS-1734272.

L. Wei and V. Srivastava are with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48823 USA. e-mail: [email protected]; [email protected].

formulation has also been used in robotic foraging and surveillance [5]–[8] and acoustic relay positioning for underwater communication [9], wherein the information gain at different sites is modeled as rewards from arms. Besides, contextual bandits are widely used in recommender systems [10], [11], wherein the acceptance of a recommendation corresponds to the reward from an arm. The stationarity assumption in classic MAB problems limits their utility in these applications since channel usage, robot working environments and people's preferences are inherently uncertain and evolving. In this paper, we relax this assumption and study non-stationary stochastic MAB problems.

Robbins [12] formulated the objective of the stochastic MAB problem as minimizing the regret, that is, the loss in expected cumulative rewards caused by failing to select the best arm every time. In their seminal work, Lai and Robbins [13], followed by Burnetas and Katehakis [14], established a logarithmic problem-dependent asymptotic lower bound on the regret achieved by any policy, which has a leading constant determined by the underlying reward distributions. A general method of constructing UCB rules for parametric families of reward distributions is also presented in [13], and the associated policy is shown to attain the logarithmic lower bound. Several subsequent UCB-based algorithms [15], [16] with efficient finite-time performance have been proposed.

The adversarial MAB [17] is a paradigmatic nonstationary problem. In this model, the bounded reward sequence at each arm is arbitrary. The performance of a policy is evaluated using the weak regret, which is the difference in the cumulative reward of a policy compared against the best single-action policy. An $\Omega(\sqrt{KT})$ lower bound on the weak regret and a near-optimal policy, Exp3, are also presented in [17]. While being able to capture nonstationarity, the generality of the reward model in adversarial MAB makes the investigation of globally optimal policies very challenging.

The nonstationary stochastic MAB can be viewed as a compromise between the stationary stochastic MAB and the adversarial MAB. It maintains the stochastic nature of the reward sequence while allowing some degree of nonstationarity in the reward distributions. Instead of the weak regret analyzed in adversarial MAB, a strong notion of regret, defined with respect to the best arm at each time step, is studied in these problems. A broadly studied nonstationary problem is the piecewise-stationary MAB, wherein the reward distributions are piecewise stationary. To deal with the remembering-versus-forgetting tradeoff, the idea of using a discount factor to compute the UCB index is proposed in [18]. Garivier and Moulines [19] present and analyze Discounted UCB (D-UCB) and Sliding-Window UCB (SW-UCB), in which they compute the UCB


using discounted sampling history and recent sampling history, respectively. They pointed out that if the number of change points $N_T$ is available, both algorithms can be tuned to achieve regret close to the $\Omega(\sqrt{K N_T T})$ regret lower bound.

In our earlier work [20], near-optimal regret is achieved using deterministic sequencing of exploration and exploitation with limited memory. Other works handle the change of reward distributions in an adaptive manner by adopting change-point detection techniques [21]–[25].

A more general nonstationary problem is studied in [26], wherein the cumulative maximum variation in mean rewards is subject to a variation budget $V_T$. Additionally, the authors in [26] establish an $\Omega\big((K V_T)^{\frac{1}{3}} T^{\frac{2}{3}}\big)$ minimax regret lower bound and propose the Rexp3 policy. In their subsequent work [27], they tune the Exp3.S policy from [17] to achieve near-optimal worst-case regret. Discounted Thompson Sampling (DTS) [28] has also been shown to have good experimental performance within this general framework. However, we are not aware of any analytic regret bounds for the DTS algorithm.

In this paper, we follow the more general nonstationary stochastic MAB formulation in [26] and design UCB-based policies that achieve efficient performance in environments with sub-Gaussian as well as heavy-tailed rewards. We focus on UCB-based policies instead of EXP3-type policies because EXP3-type policies require bounded rewards and have large variance in cumulative rewards [17]. Additionally, by using a robust mean estimator, UCB-based policies for light-tailed rewards can be extended to handle heavy-tailed reward distributions, which exist in many domains such as social networks [29] and financial markets [30]. The major contributions of this work are:

• Assuming the variation density $V_T/T$ is known, we extend MOSS [31] to design Resetting MOSS (R-MOSS) and Sliding-Window MOSS (SW-MOSS). Also, we show D-UCB can be tuned to solve the problem.

• With rigorous analysis, we show that R-MOSS and SW-MOSS achieve the exact order-optimal minimax regret and D-UCB achieves near-optimal worst-case regret.

• We relax the bounded or sub-Gaussian assumption on the rewards required by Rexp3 and SW-UCB and design policies robust to heavy-tailed rewards. We show that the theoretical guarantees on the worst-case regret can be maintained by the robust policies.

The remainder of the paper is organized as follows. We formulate the nonstationary stochastic MAB with a variation budget in Section II and review some preliminaries in Section III. In Section IV, we present and analyze three UCB policies: R-MOSS, SW-MOSS and D-UCB. We present and analyze algorithms for nonstationary heavy-tailed bandits in Section V. We complement the theoretical results with numerical illustrations in Section VI and conclude this work in Section VII.

II. PROBLEM FORMULATION

We consider a nonstationary stochastic MAB problem with $K$ arms and a horizon length $T$. Let $\mathcal{K} := \{1, \ldots, K\}$ be the set of arms and $\mathcal{T} := \{1, \ldots, T\}$ be the sequence of time slots. The reward sequence $\{X_t^k\}_{t \in \mathcal{T}}$ for each arm $k \in \mathcal{K}$ is composed of independent samples from a potentially time-varying probability distribution function sequence $f_T^k := \{f_t^k(x)\}_{t \in \mathcal{T}}$. We refer to the set $F_T^K = \{f_T^k \mid k \in \mathcal{K}\}$ containing the reward distribution sequences at all arms as the environment. Let $\mu_t^k = \mathbb{E}[X_t^k]$. Then, the total variation of $F_T^K$ is defined by

$$v\big(F_T^K\big) := \sum_{t=1}^{T-1} \sup_{k \in \mathcal{K}} \big|\mu_{t+1}^k - \mu_t^k\big|, \qquad (1)$$

which captures the non-stationarity of the environment. We focus on the class of non-stationary environments that have total variation within a variation budget $V_T \geq 0$, which is defined by

$$\mathcal{E}(V_T, T, K) := \big\{ F_T^K \mid v\big(F_T^K\big) \leq V_T \big\}.$$

At each time slot $t \in \mathcal{T}$, a decision-making agent selects an arm $\varphi_t \in \mathcal{K}$ and receives an associated random reward $X_t^{\varphi_t}$. The objective is to maximize the expected value of the cumulative reward $S_T := \sum_{t=1}^T X_t^{\varphi_t}$. We assume that $\varphi_t$ is selected based upon past observations $\{X_s^{\varphi_s}, \varphi_s\}_{s=1}^{t-1}$ following some policy $\rho$. Specifically, $\rho$ determines the conditional distribution

$$\mathbb{P}_\rho\big(\varphi_t = k \mid \{X_s^{\varphi_s}, \varphi_s\}_{s=1}^{t-1}\big)$$

at each time $t \in \{1, \ldots, T-1\}$. If $\mathbb{P}_\rho(\cdot)$ takes binary values, we call $\rho$ deterministic; otherwise, it is called stochastic.

Let the expected reward from the best arm at time $t$ be $\mu_t^* = \max_{k \in \mathcal{K}} \mu_t^k$. Then, maximizing the expected cumulative reward is equivalent to minimizing the regret defined by

$$R_T^\rho := \sum_{t=1}^T \mu_t^* - \mathbb{E}_\rho[S_T] = \mathbb{E}_\rho\Big[\sum_{t=1}^T \big(\mu_t^* - \mu_t^{\varphi_t}\big)\Big],$$

where the expectation is with respect to the different realizations of $\varphi_t$ that depend on the obtained rewards through policy $\rho$.

Note that the performance of a policy $\rho$ differs with different $F_T^K \in \mathcal{E}(V_T, T, K)$. For a fixed variation budget $V_T$ and a policy $\rho$, the worst-case regret is the regret with respect to the worst possible choice of environment, i.e.,

$$R_{\mathrm{worst}}^\rho(V_T, T, K) = \sup_{F_T^K \in \mathcal{E}(V_T, T, K)} R_T^\rho.$$

In this paper, we aim at designing policies to minimize the worst-case regret. The optimal worst-case regret achieved by any policy is called the minimax regret, and is defined by

$$\inf_\rho \sup_{F_T^K \in \mathcal{E}(V_T, T, K)} R_T^\rho.$$

We will study the nonstationary MAB problem under the following two classes of reward distributions:

Assumption 1 (Sub-Gaussian reward). For any $k \in \mathcal{K}$ and any $t \in \mathcal{T}$, the distribution $f_t^k(x)$ is $1/2$ sub-Gaussian, i.e.,

$$\forall \lambda \in \mathbb{R}: \quad \mathbb{E}\big[\exp\big(\lambda(X_t^k - \mu_t^k)\big)\big] \leq \exp\Big(\frac{\lambda^2}{8}\Big).$$

Moreover, for any arm $k \in \mathcal{K}$ and any time $t \in \mathcal{T}$, $\mathbb{E}[X_t^k] \in [a, a+b]$, where $a \in \mathbb{R}$ and $b > 0$.

Assumption 2 (Heavy-tailed reward). For any arm $k \in \mathcal{K}$ and any time $t \in \mathcal{T}$, $\mathbb{E}\big[(X_t^k)^2\big] \leq 1$.
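To make the definitions above concrete, the following Python sketch computes the total variation $v(F_T^K)$ of a mean-reward matrix and the regret of an arm sequence against the per-step oracle. It is illustrative only; the array layout and the sinusoidal example environment are our own choices, not part of the formulation.

import numpy as np

def total_variation(mu):
    # mu[t, k] is the mean reward of arm k at time slot t + 1; shape (T, K).
    # v(F) = sum_{t=1}^{T-1} sup_k |mu_{t+1}^k - mu_t^k|, as in (1).
    return np.abs(np.diff(mu, axis=0)).max(axis=1).sum()

def pseudo_regret(mu, arms):
    # Pseudo-regret of one realized arm sequence against the per-step oracle;
    # the regret R_T^rho in the text is the expectation of this quantity over the policy.
    T = mu.shape[0]
    return mu.max(axis=1).sum() - mu[np.arange(T), arms].sum()

# Example environment: K = 3 arms with slowly varying sinusoidal means.
T, K = 1000, 3
t = np.arange(T)[:, None]
mu = 0.5 + 0.3 * np.sin(0.002 * np.pi * t + 2 * np.pi * np.arange(K) / 3)
print(total_variation(mu))                          # must stay within the budget V_T
print(pseudo_regret(mu, np.zeros(T, dtype=int)))    # always playing arm 1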


III. PRELIMINARIES

In this section, we review existing minimax regret lower bounds and minimax policies from the literature. These results apply to both sub-Gaussian and heavy-tailed rewards. The discussion is made first for $V_T = 0$. Then, we show how the minimax regret lower bound for $V_T = 0$ can be extended to establish the minimax regret lower bound for $V_T > 0$. To this end, we review two UCB algorithms for the stationary stochastic MAB problem: UCB1 and MOSS. In the later sections, they are extended to design a variety of policies that match the minimax regret lower bound for $V_T > 0$.

A. Lower Bound for Minimax Regret when VT = 0

In the setting of $V_T = 0$, for each arm $k \in \mathcal{K}$, $\mu_t^k$ is identical for all $t \in \mathcal{T}$. In stationary stochastic MAB problems, the rewards from each arm $k \in \mathcal{K}$ are independent and identically distributed, so they belong to the environment set $\mathcal{E}(0, T, K)$. According to [32], if $V_T = 0$, the minimax regret is no smaller than $\frac{1}{20}\sqrt{KT}$. This result is closely related to the standard logarithmic lower bound on regret for stationary stochastic MAB problems as discussed below. Consider a scenario in which there is a unique best arm and all other arms have identical mean rewards such that the gap between optimal and suboptimal mean rewards is $\Delta$. From [33], for such a stationary stochastic MAB problem,

$$R_T^\rho \geq C_1 \frac{K}{\Delta}\ln\Big(\frac{T\Delta^2}{K}\Big) + C_2\frac{K}{\Delta}, \qquad (2)$$

for any policy $\rho$, where $C_1$ and $C_2$ are some positive constants. It needs to be noted that for $\Delta = \sqrt{K/T}$, the above lower bound becomes $C_2\sqrt{KT}$, which matches the lower bound $\frac{1}{20}\sqrt{KT}$.

B. Lower Bound for Minimax Regret when VT > 0

In the setting of $V_T > 0$, we recall here the minimax regret lower bound for nonstationary stochastic MAB problems.

Lemma 1 (Minimax Lower Bound: $V_T > 0$ [26]). For the non-stationary MAB problem with $K$ arms, time horizon $T$ and variation budget $V_T \in [1/K, T/K]$,

$$\inf_\rho \sup_{F_T^K \in \mathcal{E}(V_T, T, K)} R_T^\rho \geq C (K V_T)^{\frac{1}{3}} T^{\frac{2}{3}},$$

where $C \in \mathbb{R}_{>0}$ is some constant.

To understand this lower bound, consider the following non-stationary environment. The horizon $T$ is partitioned into epochs of length $\tau = \big\lceil K^{\frac{1}{3}} (T/V_T)^{\frac{2}{3}} \big\rceil$. In each epoch, the reward distribution sequences are stationary and all the arms have identical mean rewards except for the unique best arm. Let the gap in the mean be $\Delta = \sqrt{K/\tau}$. The index of the best arm switches at the end of each epoch following some unknown rule. So, the total variation is no greater than $\Delta T/\tau$, which satisfies the variation budget $V_T$. Besides, for any policy $\rho$, we know from (2) that the worst-case regret in each epoch is no less than $C_2\sqrt{K\tau}$. Summing up the regret over all the epochs, the minimax regret is lower bounded by $T/\tau \times C_2\sqrt{K\tau}$, which is consistent with Lemma 1.
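For completeness, the arithmetic behind this construction is recorded below; it simply substitutes $\tau$ and $\Delta$ into the two expressions in the paragraph above, up to the ceiling in $\tau$ and with constants absorbed into $C_2$.

$$\tau = \Big\lceil K^{\frac{1}{3}}\big(T/V_T\big)^{\frac{2}{3}}\Big\rceil, \quad \Delta = \sqrt{K/\tau} \;\Longrightarrow\; \frac{\Delta T}{\tau} = \frac{\sqrt{K}\,T}{\tau^{3/2}} \approx V_T,$$

$$\frac{T}{\tau}\,C_2\sqrt{K\tau} = C_2\,\frac{\sqrt{K}\,T}{\sqrt{\tau}} \approx C_2\,K^{\frac{1}{3}} V_T^{\frac{1}{3}} T^{\frac{2}{3}} = C_2\,(K V_T)^{\frac{1}{3}} T^{\frac{2}{3}}.$$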

C. UCB Algorithms in Stationary Environments

The family of UCB algorithms uses the principle called optimism in the face of uncertainty. In these policies, at each time slot, a UCB index, which is a statistical index composed of both a mean reward estimate and the associated uncertainty measure, is computed at each arm, and the arm with the maximum UCB is picked. Within the family of UCB algorithms, two state-of-the-art algorithms for the stationary stochastic MAB problem are UCB1 [15] and MOSS [31]. Let $n_k(t)$ be the number of times arm $k$ is sampled until time $t-1$, and $\bar{\mu}_{k,n_k(t)}$ be the associated empirical mean. Then, UCB1 computes the UCB index for each arm $k$ at time $t$ as

$$g_{k,t}^{\mathrm{UCB1}} = \bar{\mu}_{k,n_k(t)} + \sqrt{\frac{2\ln t}{n_k(t)}}.$$

It has been proved in [15] that, for the stationary stochastic MAB problem, UCB1 satisfies

$$R_T^{\mathrm{UCB1}} \leq 8\sum_{k:\Delta_k>0}\frac{\ln T}{\Delta_k} + \Big(1+\frac{\pi^2}{3}\Big)\sum_{k=1}^K \Delta_k,$$

where $\Delta_k$ is the difference in the mean rewards from arm $k$ and the best arm. In [31], a simple variant of this result is given by selecting values for $\Delta_k$ to maximize the upper bound, resulting in

$$\sup_{F_T^K \in \mathcal{E}(0,T,K)} R_T^{\mathrm{UCB1}} \leq 10\sqrt{(K-1)T\ln T}.$$

Comparing this result with the lower bound on the minimax regret discussed in Section III-A, there exists an extra factor $\sqrt{\ln T}$. This issue has been resolved by the MOSS algorithm. With prior knowledge of the horizon length $T$, the UCB index for MOSS is expressed as

$$g_{k,t}^{\mathrm{MOSS}} = \bar{\mu}_{k,n_k(t)} + \sqrt{\frac{\max\big(\ln\big(\frac{T}{K n_k(t)}\big), 0\big)}{n_k(t)}}.$$

We now recall the worst-case regret upper bound for MOSS.

Lemma 2 (Worst-case regret upper bound for MOSS [31]). For the stationary stochastic MAB problem ($V_T = 0$), the worst-case regret of the MOSS algorithm satisfies

$$\sup_{F_T^K \in \mathcal{E}(0,T,K)} R_T^{\mathrm{MOSS}} \leq 49\sqrt{KT}.$$
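As a concrete illustration of the two indices, the Python sketch below evaluates them from vectors of empirical means and pull counts. It is a minimal sketch of the formulas above, not the authors' implementation, and it assumes every arm has been pulled at least once.

import numpy as np

def ucb1_index(mean_hat, n, t):
    # UCB1 index: empirical mean + sqrt(2 ln t / n_k(t)); assumes n > 0 for every arm.
    return mean_hat + np.sqrt(2.0 * np.log(t) / n)

def moss_index(mean_hat, n, horizon, K):
    # MOSS index: empirical mean + sqrt(max(ln(horizon / (K n_k(t))), 0) / n_k(t)).
    bonus = np.maximum(np.log(horizon / (K * n)), 0.0)
    return mean_hat + np.sqrt(bonus / n)

# A UCB policy then plays arm = int(np.argmax(moss_index(mean_hat, n, T, K))).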

IV. UCB ALGORITHMS FOR SUB-GAUSSIAN NONSTATIONARY STOCHASTIC MAB PROBLEMS

In this section, we extend UCB1 and MOSS to design nonstationary UCB policies for scenarios with $V_T > 0$. Three different techniques are employed, namely periodic resetting, sliding observation window and discount factor, to deal with the remembering-forgetting tradeoff. The proposed algorithms are analyzed to provide guarantees on the worst-case regret. We show that their performance matches closely with the lower bound in Lemma 1.

The following notation is used in later discussions. Let $N = \lceil T/\tau\rceil$, for some $\tau \in \{1,\ldots,T\}$, and let $\mathcal{T}_1,\ldots,\mathcal{T}_N$ be a partition of the time slots $\mathcal{T}$, where each epoch $\mathcal{T}_i$ has length $\tau$ except possibly $\mathcal{T}_N$. In particular,

$$\mathcal{T}_i = \big\{1+(i-1)\tau, \ldots, \min(i\tau, T)\big\}, \quad i \in \{1,\ldots,N\}.$$

Let the maximum mean reward within $\mathcal{T}_i$ be achieved at time $\tau_i \in \mathcal{T}_i$ and arm $\kappa_i$, i.e., $\mu_{\tau_i}^{\kappa_i} = \max_{t\in\mathcal{T}_i}\mu_t^*$. We define the variation within $\mathcal{T}_i$ as

$$v_i := \sum_{t\in\mathcal{T}_i}\sup_{k\in\mathcal{K}}\big|\mu_{t+1}^k - \mu_t^k\big|,$$

where we trivially assign $\mu_{T+1}^k = \mu_T^k$ for all $k\in\mathcal{K}$. Let $\mathbb{1}\{\cdot\}$ denote the indicator function and $|\cdot|$ denote the cardinality of the set, if its argument is a set, and the absolute value if its argument is a real number.

A. Resetting MOSS Algorithm

Periodic resetting is an effective technique to preserve the freshness and authenticity of the information history. It has been employed in [26] to modify Exp3 and design the Rexp3 policy for nonstationary stochastic MAB problems. We extend this approach to MOSS and propose the nonstationary policy Resetting MOSS (R-MOSS). In R-MOSS, after every $\tau$ time slots, the sampling history is erased and MOSS is restarted. The pseudo-code is provided in Algorithm 1, and its performance in terms of the worst-case regret is established below.

Algorithm 1: R-MOSS
Input: $V_T \in \mathbb{R}_{\geq 0}$ and $T \in \mathbb{N}$
Set: $\tau = \big\lceil K^{1/3}(T/V_T)^{2/3}\big\rceil$
Output: sequence of arm selection
1 while $t \leq T$ do
2   if $\mathrm{mod}(t,\tau) = 0$ then
3     Restart the MOSS policy;

Theorem 3. For the sub-Gaussian nonstationary MAB problem with $K$ arms, time horizon $T$, variation budget $V_T > 0$, and $\tau = \big\lceil K^{\frac{1}{3}}(T/V_T)^{\frac{2}{3}}\big\rceil$, the worst-case regret of R-MOSS satisfies

$$\sup_{F_T^K\in\mathcal{E}(V_T,T,K)} R_T^{\text{R-MOSS}} \in O\big((KV_T)^{\frac{1}{3}}T^{\frac{2}{3}}\big).$$

Sketch of the proof. Note that one run of MOSS takes place in each epoch. For epoch $\mathcal{T}_i$, define the set of bad arms for R-MOSS by

$$\mathcal{B}_i^R := \big\{ k \in \mathcal{K} \mid \mu_{\tau_i}^{\kappa_i} - \mu_{\tau_i}^k \geq 2 v_i \big\}. \qquad (3)$$

Notice that for any $t_1, t_2 \in \mathcal{T}_i$,

$$\big| \mu_{t_1}^k - \mu_{t_2}^k \big| \leq v_i, \quad \forall k \in \mathcal{K}. \qquad (4)$$

Therefore, for any $t \in \mathcal{T}_i$, we have

$$\mu_t^* - \mu_t^{\varphi_t} \leq \mu_{\tau_i}^{\kappa_i} - \mu_t^{\varphi_t} \leq \mu_{\tau_i}^{\kappa_i} - \mu_{\tau_i}^{\varphi_t} + v_i.$$

Then, the regret from $\mathcal{T}_i$ can be bounded as follows,

$$\mathbb{E}\Big[\sum_{t \in \mathcal{T}_i} \mu_t^* - \mu_t^{\varphi_t}\Big] \leq |\mathcal{T}_i| v_i + \mathbb{E}\Big[\sum_{t \in \mathcal{T}_i} \mu_{\tau_i}^{\kappa_i} - \mu_{\tau_i}^{\varphi_t}\Big] \leq 3 |\mathcal{T}_i| v_i + S_i, \qquad (5)$$

where $S_i = \mathbb{E}\big[\sum_{t \in \mathcal{T}_i} \sum_{k \in \mathcal{B}_i^R} \mathbb{1}\{\varphi_t = k\} \big(\mu_{\tau_i}^{\kappa_i} - \mu_{\tau_i}^{\varphi_t} - 2 v_i\big)\big]$.

Now, we have decoupled the problem, enabling us to generalize the analysis of MOSS in the stationary environment [31] to bound $S_i$. We only specify the generalization steps and skip the details for brevity.

First, notice that inequality (4) indicates that for any $k \in \mathcal{B}_i^R$ and any $t \in \mathcal{T}_i$,

$$\mu_t^{\kappa_i} \geq \mu_{\tau_i}^{\kappa_i} - v_i \quad \text{and} \quad \mu_t^k \leq \mu_{\tau_i}^k + v_i.$$

So, at any $t \in \mathcal{T}_i$, $\bar{\mu}_{\kappa_i, n_{\kappa_i}(t)}$ concentrates around a value no smaller than $\mu_{\tau_i}^{\kappa_i} - v_i$, and $\bar{\mu}_{k, n_k(t)}$ concentrates around a value no greater than $\mu_{\tau_i}^k + v_i$ for any $k \in \mathcal{B}_i^R$. Also $\mu_{\tau_i}^{\kappa_i} - v_i \geq \mu_{\tau_i}^k + v_i$ due to the definition in (3).

In the analysis of MOSS in the stationary environment [31], the UCB of each suboptimal arm is compared with the best arm, and each selection of a suboptimal arm $k$ contributes $\Delta_k$ to the regret. Here, we can apply a similar analysis by comparing the UCB of each arm $k \in \mathcal{B}_i^R$ with $\kappa_i$, and each selection of arm $k \in \mathcal{B}_i^R$ contributes $(\mu_{\tau_i}^{\kappa_i} - v_i) - (\mu_{\tau_i}^k + v_i)$ to $S_i$. Accordingly, we borrow the upper bound in Lemma 2 to get $S_i \leq 49\sqrt{K|\mathcal{T}_i|}$.

Substituting the upper bound on $S_i$ into (5) and summing over all the epochs, we conclude that

$$\sup_{F_T^K \in \mathcal{E}(V_T, T, K)} R_T^{\text{R-MOSS}} \leq 3\tau V_T + \sum_{i=1}^N 49\sqrt{K\tau},$$

which implies the theorem.

The upper bound in Theorem 3 is of the same order as the lower bound in Lemma 1. So, the worst-case regret of R-MOSS is order optimal.
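A minimal Python sketch of R-MOSS is given below. It follows Algorithm 1, restarting MOSS at the beginning of every epoch of length $\tau$; the reward oracle pull(k, t) is hypothetical, and using $\tau$ as the per-epoch horizon inside the MOSS index is our reading of the restart.

import math
import numpy as np

def r_moss(pull, K, T, V_T):
    # Resetting MOSS (Algorithm 1): erase the history and restart MOSS every tau slots.
    tau = math.ceil(K ** (1 / 3) * (T / V_T) ** (2 / 3))
    arms = []
    for t in range(1, T + 1):
        if (t - 1) % tau == 0:              # beginning of a new epoch: restart MOSS
            n = np.zeros(K)                 # pull counts within the current epoch
            s = np.zeros(K)                 # reward sums within the current epoch
        if np.any(n == 0):                  # initialize the epoch by pulling each arm once
            k = int(np.argmin(n))
        else:
            mean_hat = s / n
            # MOSS index with the epoch length tau as the per-epoch horizon (our reading).
            bonus = np.sqrt(np.maximum(np.log(tau / (K * n)), 0.0) / n)
            k = int(np.argmax(mean_hat + bonus))
        x = pull(k, t)                      # hypothetical reward oracle
        n[k] += 1
        s[k] += x
        arms.append(k)
    return arms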

B. Sliding-Window MOSS Algorithm

We have shown that periodic resetting coarsely adapts the stationary policy to a nonstationary setting. However, it is inefficient to entirely remove the sampling history at the restarting points, and the regret accumulates quickly close to these points. In [19], a sliding observation window is used to erase the outdated information smoothly and utilize the information history more efficiently. The authors proposed the SW-UCB algorithm that intends to solve the MAB problem with piecewise-stationary mean rewards. We show that a similar approach can also deal with the general nonstationary environment with a variation budget. In contrast to SW-UCB, we integrate the sliding-window technique with MOSS instead of UCB1 and achieve the order-optimal worst-case regret.

Let the sliding observation window at time $t$ be $\mathcal{W}_t := \{\max(1, t-\tau), \ldots, t-1\}$. Then, the associated mean estimator is given by

$$\bar{\mu}_{n_k(t)}^k = \frac{1}{n_k(t)}\sum_{s\in\mathcal{W}_t} X_s\mathbb{1}\{\varphi_s = k\}, \qquad n_k(t) = \sum_{s\in\mathcal{W}_t}\mathbb{1}\{\varphi_s = k\}.$$


Algorithm 2: SW-MOSS
Input: $V_T \in \mathbb{R}_{>0}$, $T \in \mathbb{N}$ and $\eta > 1/2$
Set: $\tau = \big\lceil K^{1/3}(T/V_T)^{2/3}\big\rceil$
Output: sequence of arm selection
1 Pick each arm once.
2 while $t \leq T$ do
3   Compute statistics within $\mathcal{W}_t = \{\max(1, t-\tau),\ldots,t-1\}$:
    $\bar{\mu}_{n_k(t)}^k = \frac{1}{n_k(t)}\sum_{s\in\mathcal{W}_t} X_s\mathbb{1}\{\varphi_s = k\}$, $\quad n_k(t) = \sum_{s\in\mathcal{W}_t}\mathbb{1}\{\varphi_s = k\}$;
4   Pick arm
    $\varphi_t = \arg\max_{k\in\mathcal{K}} \bar{\mu}_{n_k(t)}^k + \sqrt{\eta\max\big(\ln\big(\frac{\tau}{K n_k(t)}\big),0\big)/n_k(t)}$;

For each arm $k \in \mathcal{K}$, define the UCB index for SW-MOSS by

$$g_t^k = \bar{\mu}_{n_k(t)}^k + c_{n_k(t)}, \qquad c_{n_k(t)} = \sqrt{\frac{\eta\max\big(\ln\big(\frac{\tau}{K n_k(t)}\big),0\big)}{n_k(t)}},$$

where $\eta > 1/2$ is a tunable parameter. With these notations, SW-MOSS is defined in Algorithm 2.
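The sliding-window statistics and the SW-MOSS index can be maintained incrementally, as in the Python sketch below; the deque bookkeeping and the handling of arms with no samples in the window are our own choices, and pull(k, t) is a hypothetical reward oracle.

import math
from collections import deque
import numpy as np

def sw_moss(pull, K, T, V_T, eta=1.0):
    # Sliding-Window MOSS (Algorithm 2): statistics over the last tau observations.
    tau = math.ceil(K ** (1 / 3) * (T / V_T) ** (2 / 3))
    window = deque()                         # (arm, reward) pairs currently inside W_t
    n = np.zeros(K)                          # n_k(t): pulls of arm k within the window
    s = np.zeros(K)                          # reward sums within the window
    arms = []
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1                        # pick each arm once
        elif np.any(n == 0):
            k = int(np.argmin(n))            # an arm absent from the window has an unbounded index
        else:
            mean_hat = s / n
            bonus = np.sqrt(eta * np.maximum(np.log(tau / (K * n)), 0.0) / n)
            k = int(np.argmax(mean_hat + bonus))
        x = pull(k, t)                       # hypothetical reward oracle
        window.append((k, x)); n[k] += 1; s[k] += x
        if len(window) > tau:                # the oldest observation leaves the window
            k_old, x_old = window.popleft()
            n[k_old] -= 1; s[k_old] -= x_old
        arms.append(k)
    return arms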

To analyze SW-MOSS, we will use the following concentration bound for sub-Gaussian random variables.

Fact 1 (Maximal Hoeffding inequality [34]). Let $X_1,\ldots,X_n$ be a sequence of independent $1/2$ sub-Gaussian random variables. Define $d_i := X_i - \mu_i$; then for any $\delta > 0$,

$$\mathbb{P}\Big(\exists m\in\{1,\ldots,n\}: \sum_{i=1}^m d_i \geq \delta\Big) \leq \exp\big(-2\delta^2/n\big) \quad\text{and}\quad \mathbb{P}\Big(\exists m\in\{1,\ldots,n\}: \sum_{i=1}^m d_i \leq -\delta\Big) \leq \exp\big(-2\delta^2/n\big).$$

At time $t$, for each arm $k\in\mathcal{K}$ define

$$M_t^k := \frac{1}{n_k(t)}\sum_{s\in\mathcal{W}_t}\mu_s^k\mathbb{1}\{\varphi_s = k\}.$$

Now, we are ready to present concentration bounds for the sliding-window empirical mean $\bar{\mu}_{n_k(t)}^k$.

Lemma 4. For any arm $k\in\mathcal{K}$ and any time $t\in\mathcal{T}$, if $\eta > 1/2$, then for any $x > 0$ and $l \geq 1$, the probability of the event $A := \{\bar{\mu}_{n_k(t)}^k + c_{n_k(t)} \leq M_t^k - x,\ n_k(t) \geq l\}$ is no greater than

$$\frac{(2\eta)^{\frac{3}{2}}}{\ln(2\eta)}\frac{K}{\tau x^2}\exp\big(-x^2 l/\eta\big). \qquad (6)$$

The probability of the event $B := \{\bar{\mu}_{n_k(t)}^k - c_{n_k(t)} \geq M_t^k + x,\ n_k(t) \geq l\}$ is also upper bounded by (6).

Proof. For any $t\in\mathcal{T}$, let $u_{ti}^k$ be the $i$-th time slot when arm $k$ is selected within $\mathcal{W}_t$ and let $d_{ti}^k = X_{u_{ti}^k}^k - \mu_{u_{ti}^k}^k$. Note that

$$\mathbb{P}(A) \leq \mathbb{P}\Big(\exists m\in\{l,\ldots,\tau\}: \frac{1}{m}\sum_{i=1}^m d_{ti}^k \leq -x - c_m\Big).$$

Let $a = \sqrt{2\eta}$ such that $a > 1$. We now apply a peeling argument [35, Sec 2.2] with the geometric grid $a^s l < m \leq a^{s+1}l$ over $\{l,\ldots,\tau\}$. Since $c_m$ is monotonically decreasing in $m$,

$$\mathbb{P}\Big(\exists m\in\{l,\ldots,\tau\}: \frac{1}{m}\sum_{i=1}^m d_{ti}^k \leq -x - c_m\Big) \leq \sum_{s\geq0}\mathbb{P}\Big(\exists m\in[a^s l, a^{s+1}l): \sum_{i=1}^m d_{ti}^k \leq -a^s l\,\big(x + c_{a^{s+1}l}\big)\Big).$$

According to Fact 1, the above summand is no greater than

$$\sum_{s\geq0}\mathbb{P}\Big(\exists m\in[1, a^{s+1}l): \sum_{i=1}^m d_{ti}^k \leq -a^s l\,\big(x + c_{a^{s+1}l}\big)\Big) \leq \sum_{s\geq0}\exp\Big(-2\frac{a^{2s}l^2}{\lfloor a^{s+1}l\rfloor}\big(x^2 + c_{a^{s+1}l}^2\big)\Big)$$
$$\leq \sum_{s\geq0}\exp\Big(-2a^{s-1}lx^2 - \frac{2\eta}{a^2}\ln\Big(\frac{\tau}{K a^{s+1}l}\Big)\Big) = \sum_{s\geq1}\frac{K l a^s}{\tau}\exp\big(-2a^{s-2}lx^2\big).$$

Let $b = 2x^2 l/a^2$. It follows that

$$\sum_{s\geq1}\frac{K l a^s}{\tau}\exp(-b a^s) \leq \frac{K l}{\tau}\int_0^{+\infty} a^{y+1}\exp\big(-b a^y\big)dy = \frac{K l a}{\tau\ln(a)}\int_1^{+\infty}\exp(-bz)\,dz \quad (\text{where we set } z = a^y) = \frac{K l a\, e^{-b}}{\tau b\ln(a)},$$

which concludes the bound for the probability of event $A$. By using the upper tail bound, a similar result holds for event $B$.

We now leverage Lemma 4 to get an upper bound on the worst-case regret for SW-MOSS.

Theorem 5. For the nonstationary MAB problem with $K$ arms, time horizon $T$, variation budget $V_T > 0$ and $\tau = \big\lceil K^{\frac{1}{3}}(T/V_T)^{\frac{2}{3}}\big\rceil$, the worst-case regret of SW-MOSS satisfies

$$\sup_{F_T^K\in\mathcal{E}(V_T,T,K)} R_T^{\text{SW-MOSS}} \in O\big((KV_T)^{\frac{1}{3}}T^{\frac{2}{3}}\big).$$

Proof. The proof consists of the following five steps.

Step 1: Recall that $v_i$ is the variation within $\mathcal{T}_i$. Here, we trivially assign $\mathcal{T}_0 = \emptyset$ and $v_0 = 0$. Then, for each $i\in\{1,\ldots,N\}$, let

$$\Delta_i^k := \mu_{\tau_i}^{\kappa_i} - \mu_{\tau_i}^k - 2v_{i-1} - 2v_i, \quad \forall k\in\mathcal{K}.$$

Define the set of bad arms for SW-MOSS in $\mathcal{T}_i$ as

$$\mathcal{B}_i^{SW} := \big\{k\in\mathcal{K} \mid \Delta_i^k \geq \varepsilon\big\},$$

where we assign $\varepsilon = 4\sqrt{e\eta K/\tau}$.

Step 2: We decouple the regret in this step. For any $t\in\mathcal{T}_i$, since $|\mu_t^k - \mu_{\tau_i}^k| \leq v_i$ for any $k\in\mathcal{K}$, it satisfies that

$$\mu_t^* - \mu_t^{\varphi_t} \leq \mu_{\tau_i}^{\kappa_i} - \mu_t^{\varphi_t} \leq \mu_{\tau_i}^{\kappa_i} - \mu_{\tau_i}^{\varphi_t} + v_i \leq \mathbb{1}\{\varphi_t\in\mathcal{B}_i^{SW}\}\big(\Delta_i^{\varphi_t} - \varepsilon\big) + 2v_{i-1} + 3v_i + \varepsilon.$$


Then we get the following inequalities,

$$\sum_{t\in\mathcal{T}}\mu_t^* - \mu_t^{\varphi_t} \leq \sum_{i=1}^N\sum_{t\in\mathcal{T}_i}\mathbb{1}\{\varphi_t\in\mathcal{B}_i^{SW}\}\big(\Delta_i^{\varphi_t} - \varepsilon\big) + 2v_{i-1} + 3v_i + \varepsilon \leq 5\tau V_T + T\varepsilon + \sum_{i=1}^N\sum_{t\in\mathcal{T}_i}\mathbb{1}\{\varphi_t\in\mathcal{B}_i^{SW}\}\big(\Delta_i^{\varphi_t} - \varepsilon\big). \qquad (7)$$

To continue, we take a decomposition inspired by the analysis of MOSS in [31] below,

$$\sum_{t\in\mathcal{T}_i}\mathbb{1}\{\varphi_t\in\mathcal{B}_i^{SW}\}\big(\Delta_i^{\varphi_t} - \varepsilon\big) \leq \sum_{t\in\mathcal{T}_i}\mathbb{1}\Big\{\varphi_t\in\mathcal{B}_i^{SW},\ g_t^{\kappa_i} > M_t^{\kappa_i} - \frac{\Delta_i^{\varphi_t}}{4}\Big\}\Delta_i^{\varphi_t} \qquad (8)$$
$$+ \sum_{t\in\mathcal{T}_i}\mathbb{1}\Big\{\varphi_t\in\mathcal{B}_i^{SW},\ g_t^{\kappa_i} \leq M_t^{\kappa_i} - \frac{\Delta_i^{\varphi_t}}{4}\Big\}\big(\Delta_i^{\varphi_t} - \varepsilon\big), \qquad (9)$$

where the summand (8) describes the regret when arm $\kappa_i$ is fairly estimated and the summand (9) quantifies the regret incurred by underestimating arm $\kappa_i$.

Step 3: In this step, we bound $\mathbb{E}[(8)]$. Since $g_t^{\varphi_t} \geq g_t^{\kappa_i}$,

$$(8) \leq \sum_{t\in\mathcal{T}_i}\mathbb{1}\Big\{\varphi_t\in\mathcal{B}_i^{SW},\ g_t^{\varphi_t} > M_t^{\kappa_i} - \frac{\Delta_i^{\varphi_t}}{4}\Big\}\Delta_i^{\varphi_t} = \sum_{k\in\mathcal{B}_i^{SW}}\sum_{t\in\mathcal{T}_i}\mathbb{1}\Big\{\varphi_t = k,\ g_t^k > M_t^{\kappa_i} - \frac{\Delta_i^k}{4}\Big\}\Delta_i^k. \qquad (10)$$

Notice that for any $t\in\mathcal{T}_{i-1}\cup\mathcal{T}_i$,

$$\big|\mu_t^k - \mu_{\tau_i}^k\big| \leq v_{i-1} + v_i, \quad \forall k\in\mathcal{K}.$$

It indicates that an arm $k\in\mathcal{B}_i^{SW}$ is at least $\Delta_i^k$ worse in mean reward than arm $\kappa_i$ at any time slot $t\in\mathcal{T}_{i-1}\cup\mathcal{T}_i$. Since $\mathcal{W}_t\subset\mathcal{T}_{i-1}\cup\mathcal{T}_i$ for any $t\in\mathcal{T}_i$,

$$M_t^{\kappa_i} - M_t^k \geq \Delta_i^k \geq \varepsilon, \quad \forall k\in\mathcal{B}_i^{SW}.$$

It follows that

$$(10) \leq \sum_{k\in\mathcal{B}_i^{SW}}\sum_{t\in\mathcal{T}_i}\mathbb{1}\Big\{\varphi_t = k,\ g_t^k > M_t^k + \frac{3\Delta_i^k}{4}\Big\}\Delta_i^k. \qquad (11)$$

Let $t_{ks}^i$ be the $s$-th time slot when arm $k$ is selected within $\mathcal{T}_i$. Then, for any $k\in\mathcal{B}_i^{SW}$,

$$\sum_{t\in\mathcal{T}_i}\mathbb{1}\Big\{\varphi_t = k,\ g_t^k > M_t^k + \frac{3\Delta_i^k}{4}\Big\} = \sum_{s\geq1}\mathbb{1}\Big\{g_{t_{ks}^i}^k > M_{t_{ks}^i}^k + \frac{3\Delta_i^k}{4}\Big\} \leq l_i^k + \sum_{s\geq l_i^k+1}\mathbb{1}\Big\{g_{t_{ks}^i}^k > M_{t_{ks}^i}^k + \frac{3\Delta_i^k}{4}\Big\}, \qquad (12)$$

where we set $l_i^k = \Big\lceil \eta\big(\frac{4}{\Delta_i^k}\big)^2 \ln\Big(\frac{\tau}{\eta K}\big(\frac{\Delta_i^k}{4}\big)^2\Big)\Big\rceil$. Since $\Delta_i^k \geq \varepsilon$ for $k\in\mathcal{B}_i^{SW}$, we have

$$l_i^k \geq \Big\lceil \eta\big(4/\Delta_i^k\big)^2 \ln\Big(\frac{\tau}{\eta K}(\varepsilon/4)^2\Big)\Big\rceil \geq \eta\big(4/\Delta_i^k\big)^2,$$

where the second inequality follows by substituting $\varepsilon = 4\sqrt{e\eta K/\tau}$. Additionally, since $t_{k1}^i,\ldots,t_{k,s-1}^i \in \mathcal{W}_{t_{ks}^i}$, we get $n_k(t_{ks}^i) \geq s-1$. Furthermore, since $c_m$ is monotonically decreasing with $m$,

$$c_{n_k(t_{ks}^i)} \leq c_{l_i^k} \leq \sqrt{\frac{\eta}{l_i^k}\ln\Big(\frac{\tau}{\eta K}\Big(\frac{\Delta_i^k}{4}\Big)^2\Big)} \leq \frac{\Delta_i^k}{4},$$

for $s \geq l_i^k + 1$. Therefore,

$$(12) \leq l_i^k + \sum_{s\geq l_i^k+1}\mathbb{1}\Big\{g_{t_{ks}^i}^k - 2c_{n_k(t_{ks}^i)} > M_{t_{ks}^i}^k + \frac{\Delta_i^k}{4}\Big\}.$$

By applying Lemma 4 and considering $n_k(t_{ks}^i) \geq s-1$,

$$\sum_{s\geq l_i^k+1}\mathbb{P}\Big\{g_{t_{ks}^i}^k - 2c_{n_k(t_{ks}^i)} > M_{t_{ks}^i}^k + \frac{\Delta_i^k}{4}\Big\} \leq \sum_{s\geq l_i^k}\frac{(2\eta)^{\frac{3}{2}}}{\ln(2\eta)}\frac{K}{\tau}\Big(\frac{4}{\Delta_i^k}\Big)^2\exp\Big(-\frac{s}{\eta}\Big(\frac{\Delta_i^k}{4}\Big)^2\Big)$$
$$\leq \int_{l_i^k-1}^{+\infty}\frac{(2\eta)^{\frac{3}{2}}}{\ln(2\eta)}\frac{K}{\tau}\Big(\frac{4}{\Delta_i^k}\Big)^2\exp\Big(-\frac{y}{\eta}\Big(\frac{\Delta_i^k}{4}\Big)^2\Big)dy \leq \frac{(2\eta)^{\frac{3}{2}}}{\ln(2\eta)}\frac{\eta K}{\tau}\Big(\frac{4}{\Delta_i^k}\Big)^4. \qquad (13)$$

Let $h(x) = \frac{16\eta}{x}\ln\big(\frac{\tau x^2}{16\eta K}\big)$, which achieves its maximum at $x = 4e\sqrt{\eta K/\tau}$. Combining (13), (12), (11), and (10), we obtain

$$\mathbb{E}[(8)] \leq \sum_{k\in\mathcal{B}_i^{SW}}\frac{(2\eta)^{\frac{3}{2}}}{\ln(2\eta)}\frac{\eta K}{\tau}\frac{256}{(\Delta_i^k)^3} + l_i^k\Delta_i^k \leq \sum_{k\in\mathcal{B}_i^{SW}}\frac{(2\eta)^{\frac{3}{2}}}{\ln(2\eta)}\frac{\eta K}{\tau}\frac{256}{(\Delta_i^k)^3} + h(\Delta_i^k) + \Delta_i^k$$
$$\leq \sum_{k\in\mathcal{B}_i^{SW}}\frac{(2\eta)^{\frac{3}{2}}}{\ln(2\eta)}\frac{\eta K}{\tau}\frac{256}{\varepsilon^3} + h\big(4e\sqrt{\eta K/\tau}\big) + b \leq \Big(\frac{2.6\eta}{\ln(2\eta)} + 3\sqrt{\eta}\Big)\sqrt{K\tau} + Kb.$$

Step 4: In this step, we bound $\mathbb{E}[(9)]$. When the event $\{\varphi_t\in\mathcal{B}_i^{SW},\ g_t^{\kappa_i} \leq M_t^{\kappa_i} - \Delta_i^{\varphi_t}/4\}$ happens, we know

$$\Delta_i^{\varphi_t} \leq 4M_t^{\kappa_i} - 4g_t^{\kappa_i} \quad\text{and}\quad g_t^{\kappa_i} \leq M_t^{\kappa_i} - \frac{\varepsilon}{4}.$$

Thus, we have

$$\mathbb{1}\Big\{\varphi_t\in\mathcal{B}_i^{SW},\ g_t^{\kappa_i} \leq M_t^{\kappa_i} - \frac{\Delta_i^{\varphi_t}}{4}\Big\}\big(\Delta_i^{\varphi_t} - \varepsilon\big) \leq \mathbb{1}\Big\{g_t^{\kappa_i} \leq M_t^{\kappa_i} - \frac{\varepsilon}{4}\Big\}\times\big(4M_t^{\kappa_i} - 4g_t^{\kappa_i} - \varepsilon\big) := Y.$$


Since $Y$ is a nonnegative random variable, its expectation can be computed involving only its cumulative distribution function:

$$\mathbb{E}[Y] = \int_0^{+\infty}\mathbb{P}(Y > x)\,dx \leq \int_0^{+\infty}\mathbb{P}\big(4M_t^{\kappa_i} - 4g_t^{\kappa_i} - \varepsilon \geq x\big)\,dx = \int_\varepsilon^{+\infty}\mathbb{P}\big(4M_t^{\kappa_i} - 4g_t^{\kappa_i} > x\big)\,dx \leq \int_\varepsilon^{+\infty}\frac{16(2\eta)^{\frac{3}{2}}}{\ln(2\eta)}\frac{K}{\tau x^2}\,dx = \frac{16(2\eta)^{\frac{3}{2}}}{\ln(2\eta)}\frac{K}{\tau\varepsilon}.$$

Hence, $\mathbb{E}[(9)] \leq 16(2\eta)^{\frac{3}{2}}K|\mathcal{T}_i|/\big(\ln(2\eta)\,\tau\varepsilon\big)$.

Step 5: With the bounds on $\mathbb{E}[(8)]$ and $\mathbb{E}[(9)]$ from the previous steps,

$$\mathbb{E}[(7)] \leq 5\tau V_T + T\varepsilon + N\Big(\frac{2.6\eta}{\ln(2\eta)} + 3\sqrt{\eta}\Big)\sqrt{K\tau} + NKb + \frac{16(2\eta)^{\frac{3}{2}}}{\ln(2\eta)}\frac{KT}{\tau\varepsilon} \leq C(KV_T)^{\frac{1}{3}}T^{\frac{2}{3}}$$

for some constant $C$, which concludes the proof.

We have shown that SW-MOSS also enjoys order-optimal worst-case regret. One drawback of the sliding-window method is that all sampling history within the observation window needs to be stored. Since the window size is selected to be $\tau = \big\lceil K^{\frac{1}{3}}(T/V_T)^{\frac{2}{3}}\big\rceil$, a large memory is needed for a large horizon length $T$. The next policy resolves this problem.

C. Discounted UCB Algorithm

The discount factor is widely used in estimators to forget old information and put more attention on recent information. In [19], such an estimator is used together with UCB1 to solve the piecewise-stationary MAB problem, and the resulting policy is called Discounted UCB (D-UCB). Here, we tune D-UCB to work in the nonstationary environment with variation budget $V_T$. Specifically, the mean estimator used is the discounted empirical average given by

$$\bar{\mu}_{\gamma,t}^k = \frac{1}{n_{\gamma,t}^k}\sum_{s=1}^{t-1}\gamma^{t-s}\mathbb{1}\{\varphi_s = k\}X_s, \qquad n_{\gamma,t}^k = \sum_{s=1}^{t-1}\gamma^{t-s}\mathbb{1}\{\varphi_s = k\},$$

where $\gamma = 1 - K^{-\frac{1}{3}}(T/V_T)^{-\frac{2}{3}}$ is the discount factor. Besides, the UCB is designed as $g_t^k = \bar{\mu}_{\gamma,t}^k + 2c_{\gamma,t}^k$, where $c_{\gamma,t}^k = \sqrt{\xi\ln(\tau)/n_{\gamma,t}^k}$ for some constant $\xi > 1/2$. The pseudo-code for D-UCB is reproduced in Algorithm 3. It can be noticed that the memory size is only related to the number of arms, so D-UCB requires small memory.

Algorithm 3: D-UCB
Input: $V_T \in \mathbb{R}_{>0}$, $T \in \mathbb{N}$ and $\xi > 1/2$
Set: $\gamma = 1 - K^{-1/3}(T/V_T)^{-2/3}$
Output: sequence of arm selection
1 for $t \in \{1,\ldots,K\}$ do
    Pick arm $\varphi_t = t$ and set $n_t \leftarrow \gamma^{K-t}$ and $\bar{\mu}_t \leftarrow X_t^t$;
2 while $t \leq T$ do
    Pick arm $\varphi_t = \arg\max_{k\in\mathcal{K}} \bar{\mu}_k + 2\sqrt{\xi\ln(\tau)/n_k}$;
    For each arm $k\in\mathcal{K}$, set $n_k \leftarrow \gamma n_k$;
    Set $n_{\varphi_t} \leftarrow n_{\varphi_t} + 1$ and $\bar{\mu}_{\varphi_t} \leftarrow \bar{\mu}_{\varphi_t} + \frac{1}{n_{\varphi_t}}\big(X_t^{\varphi_t} - \bar{\mu}_{\varphi_t}\big)$;

To proceed with the analysis, we review a concentration inequality for the discounted empirical average, which is an extension of the Chernoff-Hoeffding bound. Let

$$M_{\gamma,t}^k := \frac{1}{n_{\gamma,t}^k}\sum_{s=1}^{t-1}\gamma^{t-s}\mathbb{1}\{\varphi_s = k\}\mu_s^k.$$

Then, the following fact is a corollary of [19, Theorem 18].

Fact 2 (A Hoeffding-type inequality for the discounted empirical average with a random number of summands). For any $t\in\mathcal{T}$ and for any $k\in\mathcal{K}$, the probability of the event $A = \big\{\bar{\mu}_{\gamma,t}^k - M_{\gamma,t}^k \geq \delta/\sqrt{n_{\gamma,t}^k}\big\}$ is no greater than

$$\big\lceil\log_{1+\lambda}(\tau)\big\rceil\exp\big(-2\delta^2\big(1-\lambda^2/16\big)\big) \qquad (14)$$

for any $\delta > 0$ and $\lambda > 0$. The probability of the event $B = \big\{\bar{\mu}_{\gamma,t}^k - M_{\gamma,t}^k \leq -\delta/\sqrt{n_{\gamma,t}^k}\big\}$ is also upper bounded by (14).

Theorem 6. For the nonstationary MAB problem with $K$ arms, time horizon $T$, variation budget $V_T > 0$, and $\gamma = 1 - K^{-\frac{1}{3}}(T/V_T)^{-\frac{2}{3}}$, if $\xi > 1/2$, the worst-case regret of D-UCB satisfies

$$\sup_{F_T^K\in\mathcal{E}(V_T,T,K)} R_T^{\text{D-UCB}} \leq C\ln(T)(KV_T)^{\frac{1}{3}}T^{\frac{2}{3}}.$$

Proof. We establish the theorem in four steps.

Step 1: In this step, we analyze $\big|\mu_{\tau_i}^k - M_{\gamma,t}^k\big|$ at some time slot $t\in\mathcal{T}_i$. Let $\tau' = \log_\gamma\big((1-\gamma)\xi\ln(\tau)/b^2\big)$ and take $t-\tau'$ as a dividing point; then we obtain

$$\big|\mu_{\tau_i}^k - M_{\gamma,t}^k\big| \leq \frac{1}{n_{\gamma,t}^k}\sum_{s=1}^{t-1}\gamma^{t-s}\mathbb{1}\{\varphi_s=k\}\big|\mu_{\tau_i}^k - \mu_s^k\big| \leq \frac{1}{n_{\gamma,t}^k}\sum_{s\leq t-\tau'}\gamma^{t-s}\mathbb{1}\{\varphi_s=k\}\big|\mu_{\tau_i}^k - \mu_s^k\big| \qquad (15)$$
$$+ \frac{1}{n_{\gamma,t}^k}\sum_{s\geq t-\tau'}^{t-1}\gamma^{t-s}\mathbb{1}\{\varphi_s=k\}\big|\mu_{\tau_i}^k - \mu_s^k\big|. \qquad (16)$$

Since $\mu_t^k\in[a,a+b]$ for all $t\in\mathcal{T}$, we have $(15) \leq b$. Also,

$$(15) \leq \frac{1}{n_{\gamma,t}^k}\sum_{s\leq t-\tau'} b\gamma^{t-s} \leq \frac{b\gamma^{\tau'}}{(1-\gamma)n_{\gamma,t}^k} = \frac{\xi\ln(\tau)}{b\,n_{\gamma,t}^k}.$$

Accordingly, we get

$$(15) \leq \min\Big(b, \frac{\xi\ln(\tau)}{b\,n_{\gamma,t}^k}\Big) \leq \sqrt{\frac{\xi\ln(\tau)}{n_{\gamma,t}^k}}.$$

Furthermore, for any $t\in\mathcal{T}_i$,

$$(16) \leq \max_{s\in[t-\tau', t-1]}\big|\mu_{\tau_i}^k - \mu_s^k\big| \leq \sum_{j=i-n'}^{i} v_j,$$


where $n' = \lceil\tau'/\tau\rceil$ and $v_j$ is the variation within $\mathcal{T}_j$. So we conclude that for any $t\in\mathcal{T}_i$,

$$\big|\mu_{\tau_i}^k - M_{\gamma,t}^k\big| \leq c_{\gamma,t}^k + \sum_{j=i-n'}^{i} v_j, \quad \forall k\in\mathcal{K}. \qquad (17)$$

Step 2: Within partition $\mathcal{T}_i$, let

$$\Delta_i^k = \mu_{\tau_i}^{\kappa_i} - \mu_{\tau_i}^k - 2\sum_{j=i-n'}^{i} v_j,$$

and define a subset of bad arms as

$$\mathcal{B}_i^D = \big\{k\in\mathcal{K} \mid \Delta_i^k \geq \varepsilon'\big\},$$

where we select $\varepsilon' = 4\sqrt{\xi\gamma^{1-\tau}K\ln(\tau)/\tau}$. Since $|\mu_t^k - \mu_{\tau_i}^k| \leq v_i$ for any $t\in\mathcal{T}_i$ and for any $k\in\mathcal{K}$,

$$\sum_{t\in\mathcal{T}}\mu_t^* - \mu_t^{\varphi_t} \leq \sum_{i=1}^N\sum_{t\in\mathcal{T}_i}\mu_{\tau_i}^{\kappa_i} - \mu_{\tau_i}^{\varphi_t} + v_i \leq \tau V_T + \sum_{i=1}^N\sum_{t\in\mathcal{T}_i}\Big[\mathbb{1}\{\varphi_t\in\mathcal{B}_i^D\}\Delta_i^{\varphi_t} + 2\sum_{j=i-n'}^{i}v_j + \varepsilon'\Big]$$
$$\leq (2n'+3)\tau V_T + N\varepsilon'\tau + \sum_{i=1}^N\sum_{k\in\mathcal{B}_i^D}\Delta_i^k\sum_{t\in\mathcal{T}_i}\mathbb{1}\{\varphi_t = k\}. \qquad (18)$$

Step 3: In this step, we bound $\mathbb{E}\big[\Delta_i^k\sum_{t\in\mathcal{T}_i}\mathbb{1}\{\varphi_t = k\}\big]$ for an arm $k\in\mathcal{B}_i^D$. Let $t_i^k(l)$ be the $l$-th time slot arm $k$ is selected within $\mathcal{T}_i$. From the arm selection policy, we get $g_t^{\varphi_t} \geq g_t^{\kappa_i}$, which results in

$$\sum_{t\in\mathcal{T}_i}\mathbb{1}\{\varphi_t = k\} \leq l_i^k + \sum_{t\in\mathcal{T}_i}\mathbb{1}\big\{g_t^k \geq g_t^{\kappa_i},\ t > t_i^k(l_i^k)\big\}, \qquad (19)$$

where we pick $l_i^k = \big\lceil 16\xi\gamma^{1-\tau}\ln(\tau)/(\Delta_i^k)^2\big\rceil$. Note that $g_t^k \geq g_t^{\kappa_i}$ being true means at least one of the following holds:

$$\bar{\mu}_{\gamma,t}^k \geq M_{\gamma,t}^k + c_{\gamma,t}^k, \qquad (20)$$
$$\bar{\mu}_{\gamma,t}^{\kappa_i} \leq M_{\gamma,t}^{\kappa_i} - c_{\gamma,t}^{\kappa_i}, \qquad (21)$$
$$M_{\gamma,t}^{\kappa_i} + c_{\gamma,t}^{\kappa_i} < M_{\gamma,t}^k + 3c_{\gamma,t}^k. \qquad (22)$$

For any $t\in\mathcal{T}_i$, since every sample before $t$ within $\mathcal{T}_i$ has a weight greater than $\gamma^{\tau-1}$, if $t > t_i^k(l_i^k)$,

$$c_{\gamma,t}^k = \sqrt{\frac{\xi\ln(\tau)}{n_{\gamma,t}^k}} \leq \sqrt{\frac{\xi\ln(\tau)}{\gamma^{\tau-1}l_i^k}} \leq \frac{\Delta_i^k}{4}.$$

Combining it with (17) yields

$$M_{\gamma,t}^{\kappa_i} - M_{\gamma,t}^k \geq \mu_{\tau_i}^{\kappa_i} - \mu_{\tau_i}^k - c_{\gamma,t}^{\kappa_i} - c_{\gamma,t}^k - 2\sum_{j=i-n'}^{i}v_j \geq \Delta_i^k - c_{\gamma,t}^{\kappa_i} - c_{\gamma,t}^k \geq 3c_{\gamma,t}^k - c_{\gamma,t}^{\kappa_i},$$

which indicates that (22) is false. As $\xi > 1/2$, we select $\lambda = 4\sqrt{1-1/(2\xi)}$ and apply Fact 2 to get

$$\mathbb{P}\big((20)\text{ is true}\big) \leq \big\lceil\log_{1+\lambda}(\tau)\big\rceil\tau^{-2\xi(1-\lambda^2/16)} \leq \frac{\big\lceil\log_{1+\lambda}(\tau)\big\rceil}{\tau}.$$

The probability of (21) being true shares the same bound. Then, it follows from (19) that $\mathbb{E}\big[\Delta_i^k\sum_{t\in\mathcal{T}_i}\mathbb{1}\{\varphi_t = k\}\big]$ is upper bounded by

$$\Delta_i^k l_i^k + \Delta_i^k\sum_{t\in\mathcal{T}_i}\mathbb{P}\big((20)\text{ or }(21)\text{ is true}\big) \leq \frac{16\xi\gamma^{1-\tau}\ln(\tau)}{\Delta_i^k} + \Delta_i^k + 2\Delta_i^k\big\lceil\log_{1+\lambda}(\tau)\big\rceil \leq \frac{16\xi\gamma^{1-\tau}\ln(\tau)}{\varepsilon'} + b + 2b\big\lceil\log_{1+\lambda}(\tau)\big\rceil, \qquad (23)$$

where we use $\varepsilon' \leq \Delta_i^k \leq b$ in the last step.

where we use ε′ ≤ ∆ki ≤ b in the last step.

Step 4: From (18) and (23), and plugging in the value of ε′,an easy computation results in

RD-UCBT ≤(2n′ + 3)τVT + 8N

√ξγ1−τKτ ln(τ)

+ 2Nb+ 2Nb log1+λ (τ) ,

where the dominating term is (2n′ + 3)τVT . Considering

τ ′ =ln((1− γ)ξ ln(τ)/b2

)ln γ

≤− ln

((1− γ)ξ ln(τ)/b2

)1− γ

,

we get n′ ≤ C ′ ln(T ) for some constant C ′. Hence there existssome absolute constant C such that

RD-UCBT ≤ C ln(T )(KVT )

13T

23 .

Although the discount factor method requires less memory, there exists an extra factor $\ln(T)$ in the upper bound on the worst-case regret for D-UCB compared with the minimax regret. This is due to the fact that the discount factor method does not entirely cut off outdated sampling history like the periodic resetting or sliding-window techniques.
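The small memory footprint of D-UCB comes from the recursive form of the discounted statistics, which the following Python sketch makes explicit. It mirrors Algorithm 3 with a hypothetical reward oracle pull(k, t) and an arbitrary choice $\xi = 0.6 > 1/2$.

import math
import numpy as np

def d_ucb(pull, K, T, V_T, xi=0.6):
    # Discounted UCB (Algorithm 3): only K discounted counts and means are stored.
    gamma = 1.0 - K ** (-1 / 3) * (T / V_T) ** (-2 / 3)
    tau = math.ceil(K ** (1 / 3) * (T / V_T) ** (2 / 3))
    n = np.zeros(K)                  # discounted pull counts n_{gamma,t}^k
    mu = np.zeros(K)                 # discounted empirical means
    arms = []
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1                # initialization: pick each arm once
        else:
            k = int(np.argmax(mu + 2.0 * np.sqrt(xi * np.log(tau) / n)))
        x = pull(k, t)               # hypothetical reward oracle
        n *= gamma                   # discount every arm's count
        n[k] += 1.0
        mu[k] += (x - mu[k]) / n[k]  # incremental update of the discounted mean
        arms.append(k)
    return arms

Discounting all counts and then applying the incremental mean update to the pulled arm reproduces the updates listed in Algorithm 3, so the discounted empirical average never has to be recomputed from the full history.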

V. UCB POLICIES FOR HEAVY-TAILED NONSTATIONARY STOCHASTIC MAB PROBLEMS

In this section, we propose and analyze UCB algorithms for the non-stationary stochastic MAB problem with heavy-tailed rewards defined in Assumption 2. We first recall a minimax policy for the stationary heavy-tailed MAB problem called Robust MOSS [36]. We then extend it to the nonstationary setting and design the resetting robust MOSS and sliding-window robust MOSS algorithms.

A. Background on Robust MOSS algorithm for the stationary heavy-tailed MAB problem

The Robust MOSS algorithm handles stationary heavy-tailed MAB problems in which the rewards have finite moments of order $1+\epsilon$, for $\epsilon\in(0,1]$. For simplicity, as stated in Assumption 2, we restrict our discussion to $\epsilon = 1$.

Robust MOSS uses a saturated empirical mean instead of the empirical mean. Let $n_k(t)$ be the number of times that arm $k$ has been selected until time $t-1$. Pick $a > 1$ and let $h(m) = a^{\lfloor\log_a(m)\rfloor+1}$. Let the saturation limit at time $t$ be defined by

$$B_{n_k(t)} := \sqrt{\frac{h(n_k(t))}{\ln^+\Big(\frac{T}{K\,h(n_k(t))}\Big)}},$$


where $\ln^+(x) := \max(\ln x, 1)$. Then, the saturated empirical mean estimator is defined by

$$\hat{\mu}_{n_k(t)} := \frac{1}{n_k(t)}\sum_{s=1}^{t-1}\mathbb{1}\{\varphi_s = k\}\,\mathrm{sat}\big(X_s, B_{n_k(t)}\big), \qquad (24)$$

where $\mathrm{sat}(X_s, B_m) := \mathrm{sign}(X_s)\min\big\{|X_s|, B_m\big\}$. The Robust MOSS algorithm initializes by selecting each arm once and subsequently, at each time $t$, selects the arm that maximizes the following upper confidence bound

$$g_{n_k(t)}^k = \hat{\mu}_{n_k(t)} + (1+\zeta)c_{n_k(t)},$$

where $c_{n_k(t)} = \sqrt{\ln^+\big(\frac{T}{K n_k(t)}\big)/n_k(t)}$, $\zeta$ is a positive constant such that $\psi(2\zeta/a) \geq 2a/\zeta$, and $\psi(x) = (1+1/x)\ln(1+x) - 1$. Note that for $x\in(0,\infty)$, the function $\psi(x)$ is monotonically increasing in $x$.
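The saturated estimator and the robust index can be computed as in the Python sketch below. It follows the definitions above for the stationary setting (with the constant $T/K$ inside $\ln^+$), uses the values $a = 1.1$ and $\zeta = 2.2$ from the later experiments as illustrative defaults, and assumes the arm has been sampled at least once.

import math
import numpy as np

def ln_plus(x):
    # ln^+(x) = max(ln x, 1).
    return max(math.log(x), 1.0)

def h_grid(m, a):
    # h(m) = a^(floor(log_a m) + 1).
    return a ** (math.floor(math.log(m, a)) + 1)

def saturated_mean(samples, horizon, K, a):
    # Saturated empirical mean (24): samples are truncated at the saturation limit B_{n_k(t)}.
    samples = np.asarray(samples, dtype=float)
    m = len(samples)                                        # assumes m >= 1
    B = math.sqrt(h_grid(m, a) / ln_plus(horizon / (K * h_grid(m, a))))
    return float(np.mean(np.sign(samples) * np.minimum(np.abs(samples), B)))

def robust_moss_index(samples, horizon, K, a=1.1, zeta=2.2):
    # Robust MOSS index: saturated mean + (1 + zeta) * c_{n_k(t)}.
    m = len(samples)
    c = math.sqrt(ln_plus(horizon / (K * m)) / m)
    return saturated_mean(samples, horizon, K, a) + (1.0 + zeta) * c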

B. Resetting robust MOSS for the non-stationary heavy-tailed MAB problem

Similarly to R-MOSS, Resetting Robust MOSS (R-RMOSS) restarts Robust MOSS after every $\tau$ time slots. For the stationary heavy-tailed MAB problem, it has been shown in [36] that the worst-case regret of Robust MOSS belongs to $O(\sqrt{KT})$. This result, along with an analysis similar to that for R-MOSS in Theorem 3, yields the following theorem for R-RMOSS. For brevity, we skip the proof.

Theorem 7. For the nonstationary heavy-tailed MAB problem with $K$ arms, horizon $T$, variation budget $V_T > 0$ and $\tau = \big\lceil K^{\frac{1}{3}}(T/V_T)^{\frac{2}{3}}\big\rceil$, if $\psi(2\zeta/a) \geq 2a/\zeta$, the worst-case regret of R-RMOSS satisfies

$$\sup_{F_T^K\in\mathcal{E}(V_T,T,K)} R_T^{\text{R-RMOSS}} \in O\big((KV_T)^{\frac{1}{3}}T^{\frac{2}{3}}\big).$$

C. Sliding-window robust MOSS for the non-stationary heavy-tailed MAB problem

In Sliding-Window Robust MOSS (SW-RMOSS), $n_k(t)$ and $\hat{\mu}_{n_k(t)}$ are computed from the sampling history within $\mathcal{W}_t$, and $c_{n_k(t)} = \sqrt{\ln^+\big(\frac{\tau}{K n_k(t)}\big)/n_k(t)}$. To analyze SW-RMOSS, we want to establish a property similar to Lemma 4 to bound the probability of an arm being under- or over-estimated. Toward this end, we need the following properties of the truncated random variable.

Lemma 8. Let $X$ be a random variable with expected value $\mu$ and $\mathbb{E}[X^2] \leq 1$. Let $d := \mathrm{sat}(X,B) - \mathbb{E}[\mathrm{sat}(X,B)]$. Then for any $B > 0$, it satisfies (i) $|d| \leq 2B$, (ii) $\mathbb{E}[d^2] \leq 1$, (iii) $\big|\mathbb{E}[\mathrm{sat}(X,B)] - \mu\big| \leq 1/B$.

Proof. Property (i) follows immediately from the definition of $d$, and property (ii) follows from

$$\mathbb{E}[d^2] \leq \mathbb{E}\big[\mathrm{sat}^2(X,B)\big] \leq \mathbb{E}\big[X^2\big].$$

To see property (iii), since

$$\mu = \mathbb{E}\big[X\big(\mathbb{1}\{|X| \leq B\} + \mathbb{1}\{|X| > B\}\big)\big],$$

one has

$$\big|\mathbb{E}[\mathrm{sat}(X,B)] - \mu\big| \leq \mathbb{E}\big[\big(|X| - B\big)\mathbb{1}\{|X| > B\}\big] \leq \mathbb{E}\big[|X|\mathbb{1}\{|X| > B\}\big] \leq \mathbb{E}\big[X^2/B\big].$$

Moreover, we will also use a maximal Bennett-type inequality as shown in the following.

Lemma 9 (Maximal Bennett's inequality [37]). Let $\{X_i\}_{i\in\{1,\ldots,n\}}$ be a sequence of bounded random variables with support $[-B, B]$, where $B \geq 0$. Suppose that $\mathbb{E}[X_i \mid X_1,\ldots,X_{i-1}] = \mu_i$ and $\mathrm{Var}[X_i \mid X_1,\ldots,X_{i-1}] \leq v$. Let $S_m = \sum_{i=1}^m (X_i - \mu_i)$ for any $m\in\{1,\ldots,n\}$. Then, for any $\delta \geq 0$,

$$\mathbb{P}\big(\exists m\in\{1,\ldots,n\}: S_m \geq \delta\big) \leq \exp\Big(-\frac{\delta}{B}\psi\Big(\frac{B\delta}{nv}\Big)\Big), \qquad \mathbb{P}\big(\exists m\in\{1,\ldots,n\}: S_m \leq -\delta\big) \leq \exp\Big(-\frac{\delta}{B}\psi\Big(\frac{B\delta}{nv}\Big)\Big).$$

Now, we are ready to establish a concentration property for the saturated sliding-window empirical mean.

Lemma 10. For any arm $k\in\{1,\ldots,K\}$ and any $t\in\{K+1,\ldots,T\}$, if $\psi(2\zeta/a) \geq 2a/\zeta$, the probability of either event $A = \{g_t^k \leq M_t^k - x,\ n_k(t) \geq l\}$ or event $B = \{g_t^k - 2c_{n_k(t)} \geq M_t^k + x,\ n_k(t) \geq l\}$, for any $x > 0$ and any $l \geq 1$, is no greater than

$$\frac{2a}{\beta^2\ln(a)}\frac{K}{\tau x^2}\Big(\beta x\sqrt{h(l)/a} + 1\Big)\exp\Big(-\beta x\sqrt{h(l)/a}\Big),$$

where $\beta = \psi\big(2\zeta/a\big)/(2a)$.

Proof. Recall that $u_{ti}^k$ is the $i$-th time slot when arm $k$ is selected within $\mathcal{W}_t$. Since $c_m$ is monotonically decreasing in $m$, $1/B_m = c_{h(m)} \leq c_m$ due to $h(m) \geq m$. Then, it follows from property (iii) in Lemma 8 that

$$\mathbb{P}(A) \leq \mathbb{P}\Big(\exists m\in\{l,\ldots,\tau\}: \hat{\mu}_m^k \leq \sum_{i=1}^m\frac{\mu_{u_{ti}^k}^k}{m} - (1+\zeta)c_m - x\Big) \leq \mathbb{P}\Big(\exists m\in\{l,\ldots,\tau\}: \sum_{i=1}^m\frac{d_{tim}^k}{m} \leq \frac{1}{B_m} - (1+\zeta)c_m - x\Big)$$
$$\leq \mathbb{P}\Big(\exists m\in\{l,\ldots,\tau\}: \frac{1}{m}\sum_{i=1}^m d_{tim}^k \leq -x - \zeta c_m\Big), \qquad (25)$$

where $d_{tim}^k = \mathrm{sat}\big(X_{u_{ti}^k}^k, B_m\big) - \mathbb{E}\big[\mathrm{sat}\big(X_{u_{ti}^k}^k, B_m\big)\big]$. Recall that we select $a > 1$. Again, we apply a peeling argument with the geometric grid $a^s \leq m < a^{s+1}$ over the time interval $\{l,\ldots,\tau\}$. Let $s_0 = \lfloor\log_a(l)\rfloor$. Since $c_m$ is monotonically decreasing with $m$,

$$(25) \leq \sum_{s\geq s_0}\mathbb{P}\Big(\exists m\in[a^s, a^{s+1}): \sum_{i=1}^m d_{tim}^k \leq -a^s\big(x + \zeta c_{a^{s+1}}\big)\Big).$$


For all $m\in[a^s, a^{s+1})$, since $B_m = B_{a^s}$, from Lemma 8 we know $\big|d_{tim}^k\big| \leq 2B_{a^s}$ and $\mathrm{Var}\big[d_{tim}^k\big] \leq 1$. Continuing from the previous step, we apply Lemma 9 to get

$$(25) \leq \sum_{s\geq s_0}\exp\Big(-\frac{a^s(x+\zeta c_{a^{s+1}})}{2B_{a^s}}\,\psi\Big(\frac{2B_{a^s}(x+\zeta c_{a^{s+1}})}{a}\Big)\Big)$$
(since $\psi(x)$ is monotonically increasing)
$$\leq \sum_{s\geq s_0}\exp\Big(-\frac{a^s(x+\zeta c_{a^{s+1}})}{2B_{a^s}}\,\psi\Big(\frac{2\zeta}{a}B_{a^s}c_{a^{s+1}}\Big)\Big)$$
(substituting $c_{a^{s+1}}$, $B_{a^s}$ and using $h(a^s) = a^{s+1}$)
$$= \sum_{s\geq s_0+1}\exp\Big(-a^s\Big(\frac{x}{B_{a^{s-1}}}+\zeta c_{a^s}^2\Big)\frac{\psi(2\zeta/a)}{2a}\Big)$$
(since $\zeta\psi(2\zeta/a) \geq 2a$)
$$\leq \frac{K}{\tau}\sum_{s\geq s_0+1} a^s\exp\Big(-a^s\frac{x}{B_{a^{s-1}}}\frac{\psi(2\zeta/a)}{2a}\Big). \qquad (26)$$

Let $b = x\psi\big(2\zeta/a\big)/(2a)$. Since $\ln^+(x) \geq 1$ for all $x > 0$,

$$(26) \leq \frac{K}{\tau}\sum_{s\geq s_0+1} a^s\exp\big(-b\sqrt{a^s}\big) \leq \frac{K}{\tau}\int_{s_0+1}^{+\infty} a^y\exp\big(-b\sqrt{a^{y-1}}\big)dy = \frac{K}{\tau}\,a\int_{s_0}^{+\infty} a^y\exp\big(-b\sqrt{a^y}\big)dy$$
$$= \frac{K}{\tau}\frac{2a}{\ln(a)b^2}\int_{b\sqrt{a^{s_0}}}^{+\infty} z\exp(-z)\,dz \quad (\text{where } z = b\sqrt{a^y}) \leq \frac{K}{\tau}\frac{2a}{\ln(a)b^2}\big(b\sqrt{a^{s_0}}+1\big)\exp\big(-b\sqrt{a^{s_0}}\big),$$

which concludes the proof.

With Lemma 10, the upper bound on the worst-case regret for SW-RMOSS in the nonstationary heavy-tailed MAB problem can be analyzed similarly to Theorem 5.

Theorem 11. For the nonstationary heavy-tailed MAB problem with $K$ arms, time horizon $T$, variation budget $V_T > 0$ and $\tau = \big\lceil K^{\frac{1}{3}}(T/V_T)^{\frac{2}{3}}\big\rceil$, if $\psi(2\zeta/a) \geq 2a/\zeta$, the worst-case regret of SW-RMOSS satisfies

$$\sup_{F_T^K\in\mathcal{E}(V_T,T,K)} R_T^{\text{SW-RMOSS}} \leq C(KV_T)^{\frac{1}{3}}T^{\frac{2}{3}}.$$

Sketch of the proof. The procedure is similar to the proof of Theorem 5. The key difference is due to the nuance between the concentration properties of the mean estimators. Neglecting the leading constants, the probability upper bound in Lemma 4 has a factor $\exp(-x^2 l/\eta)$, compared with $\big(\beta x\sqrt{h(l)/a}+1\big)\exp\big(-\beta x\sqrt{h(l)/a}\big)$ in Lemma 10. Since both factors are no greater than 1, by simply replacing $\eta$ with $(1+\zeta)^2$ and taking similar calculations in every step except inequality (13), comparable bounds that differ only in leading constants can be obtained. Applying Lemma 10, we revise the computation of (13) as follows,

$$\sum_{s\geq l_i^k+1}\mathbb{P}\Big\{g_{t_{ks}^i}^k - 2c_{n_k(t_{ks}^i)} > M_{t_{ks}^i}^k + \frac{\Delta_i^k}{4}\Big\} \leq \sum_{s\geq l_i^k} C'\Big(\frac{\beta\Delta_i^k}{4}\sqrt{\frac{h(s)}{a}}+1\Big)\exp\Big(-\frac{\beta\Delta_i^k}{4}\sqrt{\frac{h(s)}{a}}\Big)$$
$$\leq \int_{l_i^k-1}^{+\infty} C'\Big(\frac{\beta\Delta_i^k}{4}\sqrt{\frac{y}{a}}+1\Big)\exp\Big(-\frac{\beta\Delta_i^k}{4}\sqrt{\frac{y}{a}}\Big)dy \leq \frac{6a}{\beta^2}\,\frac{2a}{\beta^2\ln(a)}\,\frac{K}{\tau}\Big(\frac{4}{\Delta_i^k}\Big)^4, \qquad (27)$$

where $C' = \frac{2aK}{\beta^2\ln(a)\tau}\big(\frac{4}{\Delta_i^k}\big)^2$. The second inequality is due to the fact that $(x+1)\exp(-x)$ is monotonically decreasing in $x$ for $x\in[0,\infty)$ and $h(s) > s$. In the last inequality, we change the lower limit of the integration from $l_i^k-1$ to $0$, since $l_i^k \geq 1$, and plug in the value of $C'$. Compared with (13), this upper bound only varies in the constant multiplier, and so does the worst-case regret upper bound.

Remark 1. The benefit of the discount factor method is that it is memory friendly. This advantage is lost if the truncated empirical mean is used. As $n_k(t)$ could both increase and decrease with time, the truncation point could both grow and decline, so all sampling history needs to be recorded. It remains an open problem how to effectively use a discount factor in a nonstationary heavy-tailed MAB problem.

VI. NUMERICAL EXPERIMENTS

We complement the theoretical results in the previous sections with two Monte Carlo experiments. For the light-tailed setting, we compare R-MOSS, SW-MOSS and D-UCB in this paper with other state-of-the-art policies. For the heavy-tailed setting, we test the robustness of R-RMOSS and SW-RMOSS against both heavy-tailed rewards and nonstationarity. Each result in this section is derived by running the designated policies 500 times. The parameter selections for the compared policies are strictly coherent with the referred literature.

A. Bernoulli Nonstationary Stochastic MAB Experiment

To evaluate the performance of different policies, we consider two nonstationary environments, as shown in Figs. 1a and 1b, which both have 3 arms with nonstationary Bernoulli rewards. The success probability sequence at each arm is a Brownian motion in environment 1 and a sinusoidal function of time $t$ in environment 2. The variation budget $V_T$ is 8.09 and 3, respectively.

The growth of regret in Figs. 1c and 1d shows that the UCB-based policies (R-MOSS, SW-MOSS, and D-UCB) maintain their superior performance against adversarial-bandit-based policies (Rexp3 and Exp3.S) for stochastic bandits even in nonstationary settings, especially R-MOSS and SW-MOSS. Besides, DTS outperforms the other policies when the best arm does not switch, while each switch of the best arm seems to incur a larger regret accumulation for DTS, which results in a larger regret compared with SW-MOSS and R-MOSS.


Fig. 1: Comparison of different policies. (a) Environment 1; (b) Environment 2; (c) Regrets for environment 1; (d) Regrets for environment 2.

B. Heavy-tailed Nonstationary Stochastic MAB Experiment

Again we consider the 3-armed bandit problem with sinusoidal mean rewards. In particular, for each arm $k\in\{1,2,3\}$,

$$\mu_t^k = 0.3\sin\big(0.001\pi t + 2k\pi/3\big), \quad t\in\{1,\ldots,5000\}.$$

Thus, the variation budget is 3. Besides, the mean reward is contaminated by additive sampling noise $\nu$, where $|\nu|$ is a generalized Pareto random variable and the sign of $\nu$ has equal probability of being "+" and "−". So the probability distribution of $X_t^k$ is

$$f_t^k(x) = \frac{1}{2\sigma}\Big(1 + \frac{\xi\big|x-\mu_t^k\big|}{\sigma}\Big)^{-\frac{1}{\xi}-1} \quad \text{for } x\in(-\infty,+\infty).$$

We select $\xi = 0.4$ and $\sigma = 0.23$ such that Assumption 2 is satisfied. We select $a = 1.1$ and $\zeta = 2.2$ for both R-RMOSS and SW-RMOSS such that the condition $\psi(2\zeta/a) \geq 2a/\zeta$ is met.
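For reference, the reward generator used in this experiment can be sketched in Python as follows; the inverse-CDF sampling of the generalized Pareto magnitude and the fixed seed are our own implementation choices, consistent with the two-sided density above.

import numpy as np

rng = np.random.default_rng(0)
XI, SIGMA = 0.4, 0.23            # generalized Pareto shape and scale from the experiment
T, K = 5000, 3

def mean_reward(k, t):
    # Sinusoidal mean of arm k in {1, 2, 3} at time t.
    return 0.3 * np.sin(0.001 * np.pi * t + 2 * k * np.pi / 3)

def sample_reward(k, t):
    # Mean reward plus symmetric heavy-tailed noise nu, with |nu| ~ GPD(XI, SIGMA).
    u = rng.random()
    magnitude = SIGMA / XI * ((1.0 - u) ** (-XI) - 1.0)   # inverse-CDF sample of the GPD
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return mean_reward(k, t) + sign * magnitude

rewards_arm1 = [sample_reward(1, t) for t in range(1, T + 1)]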

Fig. 2: Performance with heavy-tailed rewards. (a) Regret; (b) Histogram of $R_T$.

Fig. 2a shows that the RMOSS-based policies slightly outperform the MOSS-based policies in heavy-tailed settings. Comparing the estimated histograms of $R_T$ for the different policies in Fig. 2b, R-RMOSS and SW-RMOSS have better consistency and a smaller probability that a particular realization of the regret deviates significantly from the mean value.

VII. CONCLUSION

We studied the general nonstationary stochastic MAB problem with a variation budget and provided three UCB-based policies for the problem. Our analysis showed that the proposed policies enjoy worst-case regret that is within a constant factor of the minimax regret lower bound. Besides, the sub-Gaussian assumption on reward distributions is relaxed to define the nonstationary heavy-tailed MAB problem. We showed that the order-optimal worst-case regret can be maintained by extending the previous policies to robust versions.

There are several possible avenues for future research. In this paper, we relied on passive methods to balance the remembering-versus-forgetting tradeoff. The general idea is to keep taking in new information and removing outdated information. Parameter-free active approaches that adaptively detect and react to environment changes are promising alternatives and may result in better experimental performance. Also, extensions from the single decision-maker to distributed multiple decision-makers are of interest. Another possible direction is the nonstationary version of rested and restless bandits.

REFERENCES

[1] A. B. H. Alaya-Feki, E. Moulines, and A. LeCornec, “Dynamic spectrum access with non-stationary multi-armed bandit,” in IEEE Workshop on Signal Processing Advances in Wireless Communications, 2008, pp. 416–420.
[2] A. Anandkumar, N. Michael, A. K. Tang, and A. Swami, “Distributed algorithms for learning and cognitive medium access with logarithmic regret,” IEEE Journal on Selected Areas in Communications, vol. 29, no. 4, pp. 731–745, 2011.
[3] Y. Li, Q. Hu, and N. Li, “A reliability-aware multi-armed bandit approach to learn and select users in demand response,” Automatica, vol. 119, p. 109015, 2020.
[4] D. Kalathil and R. Rajagopal, “Online learning for demand response,” in Annual Allerton Conference on Communication, Control, and Computing, 2015, pp. 218–222.
[5] J. R. Krebs, A. Kacelnik, and P. Taylor, “Test of optimal sampling by foraging great tits,” Nature, vol. 275, no. 5675, pp. 27–31, 1978.
[6] V. Srivastava, P. Reverdy, and N. E. Leonard, “On optimal foraging and multi-armed bandits,” in Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 2013, pp. 494–499.
[7] ——, “Surveillance in an abruptly changing world via multiarmed bandits,” in IEEE Conference on Decision and Control, 2014, pp. 692–697.
[8] C. Baykal, G. Rosman, S. Claici, and D. Rus, “Persistent surveillance of events with unknown, time-varying statistics,” in IEEE International Conference on Robotics and Automation, 2017, pp. 2682–2689.
[9] M. Y. Cheung, J. Leighton, and F. S. Hover, “Autonomous mobile acoustic relay positioning as a multi-armed bandit with switching costs,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, Nov. 2013, pp. 3368–3373.
[10] D. Agarwal, B.-C. Chen, P. Elango, N. Motgi, S.-T. Park, R. Ramakrishnan, S. Roy, and J. Zachariah, “Online models for content optimization,” in Advances in Neural Information Processing Systems, 2009, pp. 17–24.
[11] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in International Conference on World Wide Web, 2010, pp. 661–670.
[12] H. Robbins, “Some aspects of the sequential design of experiments,” Bulletin of the American Mathematical Society, vol. 58, no. 5, pp. 527–535, 1952.
[13] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985.
[14] A. N. Burnetas and M. N. Katehakis, “Optimal adaptive policies for sequential allocation problems,” Advances in Applied Mathematics, vol. 17, no. 2, pp. 122–142, 1996.
[15] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine Learning, vol. 47, no. 2, pp. 235–256, 2002.
[16] A. Garivier and O. Cappe, “The KL-UCB algorithm for bounded stochastic bandits and beyond,” in Annual Conference on Learning Theory, 2011, pp. 359–376.
[17] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The nonstochastic multiarmed bandit problem,” SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002.
[18] L. Kocsis and C. Szepesvari, “Discounted UCB,” in 2nd PASCAL Challenges Workshop, vol. 2, 2006.
[19] A. Garivier and E. Moulines, “On upper-confidence bound policies for switching bandit problems,” in International Conference on Algorithmic Learning Theory. Springer, 2011, pp. 174–188.
[20] L. Wei and V. Srivastava, “On abruptly-changing and slowly-varying multiarmed bandit problems,” in American Control Conference, Milwaukee, WI, Jun. 2018, pp. 6291–6296.
[21] C. Hartland, N. Baskiotis, S. Gelly, M. Sebag, and O. Teytaud, “Change point detection and meta-bandits for online learning in dynamic environments,” in Conference Francophone sur l’Apprentissage Automatique, Grenoble, France, Jul. 2007, pp. 237–250.
[22] F. Liu, J. Lee, and N. Shroff, “A change-detection based framework for piecewise-stationary multi-armed bandit problem,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[23] L. Besson and E. Kaufmann, “The generalized likelihood ratio test meets klucb: an improved algorithm for piece-wise non-stationary bandits,” arXiv preprint arXiv:1902.01575, 2019.
[24] Y. Cao, Z. Wen, B. Kveton, and Y. Xie, “Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit,” in International Conference on Artificial Intelligence and Statistics, 2019, pp. 418–427.
[25] J. Mellor and J. Shapiro, “Thompson sampling in switching environments with Bayesian online change detection,” in Artificial Intelligence and Statistics, 2013, pp. 442–450.
[26] O. Besbes and Y. Gur, “Stochastic multi-armed-bandit problem with non-stationary rewards,” in Advances in Neural Information Processing Systems, 2014, pp. 199–207.
[27] O. Besbes, Y. Gur, and A. Zeevi, “Optimal exploration–exploitation in a multi-armed bandit problem with non-stationary rewards,” Stochastic Systems, vol. 9, no. 4, pp. 319–337, 2019.
[28] V. Raj and S. Kalyani, “Taming non-stationary bandits: A Bayesian approach,” arXiv preprint arXiv:1707.09727, 2017.
[29] R. Albert and A.-L. Barabasi, “Statistical mechanics of complex networks,” Reviews of Modern Physics, vol. 74, no. 1, p. 47, 2002.
[30] M. Vidyasagar, “Law of large numbers, heavy-tailed distributions, and the recent financial crisis,” in Perspectives in Mathematical System Theory, Control, and Signal Processing. Springer, 2010, pp. 285–295.
[31] J. Audibert and S. Bubeck, “Minimax policies for adversarial and stochastic bandits,” in Annual Conference on Learning Theory, Montreal, Canada, Jun. 2009, pp. 217–226.
[32] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “Gambling in a rigged casino: The adversarial multi-armed bandit problem,” in IEEE Annual Foundations of Computer Science, 1995, pp. 322–331.
[33] S. Mannor and J. N. Tsitsiklis, “The sample complexity of exploration in the multi-armed bandit problem,” Journal of Machine Learning Research, vol. 5, no. Jun, pp. 623–648, 2004.
[34] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963.
[35] S. Bubeck, “Bandits games and clustering foundations,” Theses, Universite des Sciences et Technologie de Lille - Lille I, 2010. [Online]. Available: https://tel.archives-ouvertes.fr/tel-00845565
[36] L. Wei and V. Srivastava, “Minimax policy for heavy-tailed bandits,” IEEE Control Systems Letters, vol. 5, no. 4, pp. 1423–1428, 2021.
[37] X. Fan, I. Grama, and Q. Liu, “Hoeffding’s inequality for supermartingales,” Stochastic Processes and their Applications, vol. 122, no. 10, pp. 3545–3559, 2012.