
An Adaptive Algorithm for Finite Stochastic Partial Monitoring

    Gabor Bartok [email protected]

    Navid Zolghadr [email protected]

    Csaba Szepesvari [email protected]

    Department of Computing Science, University of Alberta, AB, Canada, T6G 2E8

    Abstract

We present a new anytime algorithm that achieves near-optimal regret for any instance of finite stochastic partial monitoring. In particular, the new algorithm achieves the minimax regret, within logarithmic factors, for both easy and hard problems. For easy problems, it additionally achieves logarithmic individual regret. Most importantly, the algorithm is adaptive in the sense that if the opponent strategy is in an "easy region" of the strategy space then the regret grows as if the problem was easy. As an implication, we show that under some reasonable additional assumptions, the algorithm enjoys an Õ(√T) regret in Dynamic Pricing, proven to be hard by Bartok et al. (2011).

1. Introduction

Partial monitoring can be cast as a sequential game played by a learner and an opponent. In every time step, the learner chooses an action and simultaneously the opponent chooses an outcome. Then, based on the action and the outcome, the learner suffers some loss and receives some feedback. Neither the outcome nor the loss is revealed to the learner. Thus, a partial-monitoring game with N actions and M outcomes is defined by the pair G = (L, H), where L ∈ ℝ^{N×M} is the loss matrix, and H ∈ Σ^{N×M} is the feedback matrix over some arbitrary set of symbols Σ. These matrices are announced to both the learner and the opponent before the game starts. At time step t, if I_t ∈ N = {1, 2, . . . , N} and J_t ∈ M = {1, 2, . . . , M} denote the (possibly random) choices of the learner and the opponent, respectively, then the loss suffered by the learner in that time step is L[I_t, J_t], while the feedback received is H[I_t, J_t].

(Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).)
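To make the protocol concrete, the following minimal Python sketch plays out a single round, using the 3-action, 3-outcome game of Section 5.1 as the instance; the variable names and the random learner are illustrative only, not part of the paper.

import numpy as np

rng = np.random.default_rng(0)

L = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]])           # loss matrix, N x M (Section 5.1)
H = np.array([['a', 'b', 'b'],
              ['b', 'a', 'b'],
              ['b', 'b', 'a']])     # feedback matrix, N x M

p = np.array([0.2, 0.5, 0.3])       # stochastic opponent strategy (illustrative)

I_t = rng.integers(L.shape[0])      # learner's action (here: chosen at random)
J_t = rng.choice(L.shape[1], p=p)   # outcome, drawn i.i.d. from p

loss = L[I_t, J_t]                  # suffered, but NOT revealed to the learner
feedback = H[I_t, J_t]              # the only information the learner receives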

The goal of the learner (or player) is to minimize his cumulative loss ∑_{t=1}^T L[I_t, J_t]. The performance of the learner is measured in terms of the regret, defined as the excess cumulative loss he suffers compared to that of the best fixed action in hindsight:

R_T = ∑_{t=1}^T L[I_t, J_t] - min_{i∈N} ∑_{t=1}^T L[i, J_t].

The regret usually grows with the time horizon T. What distinguishes a successful learner from an unsuccessful one is the growth rate of the regret. A regret linear in T means that the learner does not approach the performance of the optimal action. On the other hand, if the growth rate is sublinear, it is said that the learner can learn the game.

In this paper we restrict our attention to stochastic games, adding the extra assumption that the opponent generates the outcomes with a sequence of independent and identically distributed random variables. This distribution will be called the opponent strategy. As for the player, a player strategy (or algorithm) is a (possibly random) function from the set of feedback sequences (observation histories) to the set of actions.

In stochastic games, we use a slightly different notion of regret: we compare the cumulative loss with that of the action with the lowest expected loss,

R_T = ∑_{t=1}^T L[I_t, J_t] - T · min_{i∈N} E[L[i, J_1]].

The hardness of a game is defined in terms of the minimax expected regret (or minimax regret for short):

R_T(G) = min_A max_{p∈Δ_M} E[R_T],

where Δ_M is the space of opponent strategies, and A is any strategy of the player. In other words, the


minimax regret is the worst-case expected regret of the best algorithm.

A question of major importance is how the minimax regret scales with the parameters of the game, such as the time horizon T, the number of actions N, and the number of outcomes M. In the stochastic setting, another measure of hardness is worth studying, namely the individual or problem-dependent regret, defined as the expected regret given a fixed opponent strategy.

    1.1. Related work

Two special cases of partial monitoring have been extensively studied for a long time: full-information games, where the feedback carries enough information for the learner to infer the outcome for any action-outcome pair, and bandit games, where the learner receives the loss of the chosen action as feedback. Since Vovk (1990) and Littlestone & Warmuth (1994) we know that for full-information games, the minimax regret scales as Θ(√(T log N)). For bandit games, the minimax regret has been proven to scale as Θ(√(NT)) (Audibert & Bubeck, 2009).¹ The individual regret of these kinds of games has also been studied: Auer et al. (2002) showed that given any opponent strategy, the expected regret can be upper bounded by c ∑_{i∈N: Δ_i≠0} (1/Δ_i) log T, where Δ_i is the expected difference between the loss of action i and an optimal action.

Finite partial monitoring problems were introduced by Piccolboni & Schindelhauer (2001). They proved that a game is either "hopeless" (that is, its minimax regret scales linearly with T), or the regret can be upper bounded by O(T^{3/4}). They also give a characterization of hopeless games: namely, a game is hopeless if it does not satisfy the global observability condition (see Definition 5 in Section 2). Their upper bound for non-hopeless games was tightened to O(T^{2/3}) by Cesa-Bianchi et al. (2006), who also showed that there exists a game with a matching lower bound.

Cesa-Bianchi et al. (2006) posed the problem of characterizing partial-monitoring games with minimax regret less than Θ(T^{2/3}). This problem has since been solved. The first steps towards classifying partial-monitoring games were made by Bartok et al. (2010), who characterized almost all games with two outcomes. They proved that there are only four categories: games with minimax regret 0, Θ̃(√T), Θ(T^{2/3}), and Θ(T), named trivial, easy, hard, and hopeless, respectively.² They also found that there exist games that are easy, but cannot easily be transformed to a bandit or full-information game. Later, Bartok et al. (2011) proved the same results for finite stochastic partial monitoring, with any finite number of outcomes. The condition that separates easy games from hard games is the local observability condition (see Definition 6). The algorithm Balaton introduced there works by eliminating actions that are thought to be suboptimal with high confidence. They conjectured in their paper that the same classification holds for non-stochastic games, without changing the condition. Recently, Foster & Rakhlin (2011) designed the algorithm NeighborhoodWatch, which proves this conjecture to be true. Foster & Rakhlin prove an upper bound on a stronger notion of regret, called internal regret.

¹The Exp3 algorithm due to Auer et al. (2003) achieves almost the same regret, with an extra logarithmic term.

²Note that these results do not concern the growth rate in terms of other parameters (like N).

    1.2. Contributions

In this paper, we extend the results of Bartok et al. (2011). We introduce a new algorithm, called CBP for "Confidence Bound Partial monitoring", with various desirable properties. First of all, while Balaton only works on easy games, CBP can be run on any non-hopeless game, and it achieves (up to logarithmic factors) the minimax regret rates both for easy and hard games (see Corollaries 3 and 2). Furthermore, it also achieves logarithmic problem-dependent regret for easy games (see Corollary 1). It is also an anytime algorithm, meaning that it does not have to know the time horizon, nor does it have to use the doubling trick, to achieve the desired performance.

The final, and potentially most impactful, aspect of our algorithm is that through additional assumptions on the set of opponent strategies, the minimax regret of even hard games can be brought down to Õ(√T)! While this statement may seem to contradict the result of Bartok et al. (2011), in fact it does not; for the precise statement, see Theorem 2. We call this property "adaptiveness" to emphasize that the algorithm does not even have to know that the set of opponent strategies is restricted.

    2. Definitions and notations

Recall from the introduction that an instance of partial monitoring with N actions and M outcomes is defined by the pair of matrices L ∈ ℝ^{N×M} and H ∈ Σ^{N×M}, where Σ is an arbitrary set of symbols. In each round t, the opponent chooses an outcome J_t ∈ M and simultaneously the learner chooses an action I_t ∈ N. Then,


the feedback H[I_t, J_t] is revealed and the learner suffers the loss L[I_t, J_t]. It is important to note that the loss is not revealed to the learner.

As mentioned previously, in this paper we deal with stochastic opponents only. In this case, the choice of the opponent is governed by a sequence J_1, J_2, . . . of i.i.d. random variables. The distribution of these variables, p ∈ Δ_M, is called an opponent strategy, where Δ_M, also called the probability simplex, is the set of all distributions over the M outcomes. It is easy to see that, given opponent strategy p, the expected loss of action i can be expressed as ℓ_i^⊤ p, where ℓ_i is defined as the column vector consisting of the i-th row of L.
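For instance (a Python sketch; the strategy p below is an arbitrary example, not taken from the paper), the expected losses of all actions under p are the entries of Lp, and an optimal action is any minimizer:

import numpy as np

L = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]])           # loss matrix of the game in Section 5.1

p = np.array([0.2, 0.5, 0.3])       # an opponent strategy in the simplex

expected_losses = L @ p             # entry i equals ell_i^T p: [0.7, 0.8, 0.5]
optimal_action = int(np.argmin(expected_losses))   # index 2, i.e., action 3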

The following definitions, taken from Bartok et al. (2011), are essential for understanding how the structure of L and H determines the hardness of a game.

Action i is called optimal under strategy p if its expected loss is not greater than that of any other action i′ ∈ N; that is, ℓ_i^⊤ p ≤ ℓ_{i′}^⊤ p. Determining which actions are optimal under which opponent strategies yields the cell decomposition³ of the probability simplex Δ_M:

Definition 1 (Cell decomposition). For every action i ∈ N, let C_i = {p ∈ Δ_M : action i is optimal under p}. The sets C_1, . . . , C_N constitute the cell decomposition of Δ_M.

Now we can define the following important properties of actions:

Definition 2 (Properties of actions). Action i is called dominated if C_i = ∅. If an action is not dominated then it is called non-dominated. Action i is called degenerate if it is non-dominated and there exists an action i′ such that C_i ⊊ C_{i′}. If an action is neither dominated nor degenerate then it is called Pareto-optimal. The set of Pareto-optimal actions is denoted by P.

From the definition of cells we see that a cell is either empty or a closed polytope. Furthermore, Pareto-optimal actions have (M-1)-dimensional cells. The following definition, important for our algorithm, also uses the dimensionality of polytopes:

Definition 3 (Neighbors). Two Pareto-optimal actions i and j are neighbors if C_i ∩ C_j is an (M-2)-dimensional polytope. Let N be the set of unordered pairs over N that contains neighboring action-pairs. The neighborhood action set of two neighboring actions i, j is defined as N⁺_{i,j} = {k ∈ N : C_i ∩ C_j ⊆ C_k}.

³The concept of cell decomposition also appears in Piccolboni & Schindelhauer (2001).

Note that the neighborhood action set N⁺_{i,j} naturally contains i and j. If N⁺_{i,j} contains some other action k, then either C_k = C_i, C_k = C_j, or C_k = C_i ∩ C_j.

In general, the elements of the feedback matrix H can be arbitrary symbols. Nevertheless, the nature of the symbols themselves does not matter in terms of the structure of the game. What determines the feedback structure of a game is the occurrence of identical symbols in each row of H. To standardize the feedback structure, the signal matrix is defined for each action:

Definition 4. Let s_i be the number of distinct symbols in the i-th row of H and let σ_1, . . . , σ_{s_i} ∈ Σ be an enumeration of those symbols. Then the signal matrix S_i ∈ {0, 1}^{s_i×M} of action i is defined as S_i[k, l] = I{H[i,l] = σ_k}.

The idea of this definition is that if p ∈ Δ_M is the opponent's strategy then S_i p gives the distribution over the symbols underlying action i. In fact, it is also true that observing H[I_t, J_t] is equivalent to observing the vector S_{I_t} e_{J_t}, where e_k is the k-th unit vector of the standard basis of ℝ^M. From now on we assume without loss of generality that the learner's observation at time step t is the random vector Y_t = S_{I_t} e_{J_t}. Note that the dimensionality of this vector depends on the action chosen by the learner, namely Y_t ∈ ℝ^{s_{I_t}}.
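As an illustration, a signal matrix can be computed directly from the corresponding row of H. The helper below is a sketch with names of our choosing, applied to the feedback matrix of the game in Section 5.1:

import numpy as np

H = np.array([['a', 'b', 'b'],
              ['b', 'a', 'b'],
              ['b', 'b', 'a']])     # feedback matrix from Section 5.1

def signal_matrix(H, i):
    """S_i[k, l] = I{H[i, l] = sigma_k}, for an enumeration of row i's symbols."""
    symbols = sorted(set(H[i]))     # sigma_1, ..., sigma_{s_i}
    return np.array([[int(H[i, l] == s) for l in range(H.shape[1])]
                     for s in symbols])

S1 = signal_matrix(H, 0)            # action 1 (row index 0): [[1, 0, 0], [0, 1, 1]]
p = np.array([0.2, 0.5, 0.3])
print(S1 @ p)                       # symbol distribution under action 1: [0.2, 0.8]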

The following two definitions play a key role in classifying partial-monitoring games based on their difficulty.

Definition 5 (Global observability (Piccolboni & Schindelhauer, 2001)). A partial-monitoring game (L, H) admits the global observability condition if, for all pairs i, j of actions, ℓ_i - ℓ_j ∈ ⊕_{k∈N} Im S_k^⊤.

Definition 6 (Local observability (Bartok et al., 2011)). A pair of neighboring actions i, j is said to be locally observable if ℓ_i - ℓ_j ∈ ⊕_{k∈N⁺_{i,j}} Im S_k^⊤. We denote by L ⊆ N the set of locally observable pairs of actions (the pairs are unordered). A game satisfies the local observability condition if every pair of neighboring actions is locally observable, i.e., if L = N.

The main result of Bartok et al. (2011) is that locally observable games have Õ(√T) minimax regret. It is easy to see that local observability implies global observability. Also, from Piccolboni & Schindelhauer (2001) we know that if global observability does not hold then the game has linear minimax regret. From now on, we only deal with games that admit the global observability condition.

A collection of the concepts and symbols introduced in this section is shown in Table 1.


Table 1. List of basic symbols

N, M ∈ ℕ: number of actions and outcomes
N = {1, . . . , N}: set of actions
Δ_M ⊂ ℝ^M: M-dimensional simplex, the set of opponent strategies
p ∈ Δ_M: opponent strategy
L ∈ ℝ^{N×M}: loss matrix
H ∈ Σ^{N×M}: feedback matrix
ℓ_i ∈ ℝ^M: ℓ_i = L[i, :]^⊤, loss vector underlying action i
C_i ⊆ Δ_M: cell of action i (Definition 1)
P ⊆ N: set of Pareto-optimal actions (Definition 2)
N ⊆ N²: set of unordered neighboring action-pairs (Definition 3)
N⁺_{i,j} ⊆ N: neighborhood action set of {i, j} ∈ N (Definition 3)
S_i ∈ {0, 1}^{s_i×M}: signal matrix of action i (Definition 4)
L ⊆ N: set of locally observable action pairs (Definition 6)
V_{i,j} ⊆ N: observer actions underlying {i, j} ∈ N (Definition 7)
v_{i,j,k} ∈ ℝ^{s_k} (k ∈ V_{i,j}): observer vectors (Definition 7)
W_i ∈ ℝ: confidence width of action i ∈ N (Definition 7)

    3. The proposed algorithm

Our algorithm builds on the core idea underlying the algorithm Balaton of Bartok et al. (2011), so we start with a brief review of Balaton. Balaton uses sweeps to successively eliminate suboptimal actions. This is done by estimating the differences between the expected losses of pairs of actions, i.e., δ_{i,j} = (ℓ_i - ℓ_j)^⊤ p (i, j ∈ N). In fact, Balaton exploits that it suffices to keep track of δ_{i,j} for neighboring pairs of actions (i.e., for action pairs i, j such that {i, j} ∈ N). This is because if an action i is suboptimal, it will have a neighbor j that has a smaller expected loss, and so action i will get eliminated when δ_{i,j} is checked. Now, to estimate δ_{i,j} for some {i, j} ∈ N, one observes that under the local observability condition it holds that ℓ_i - ℓ_j = ∑_{k∈N⁺_{i,j}} S_k^⊤ v_{i,j,k} for some vectors v_{i,j,k} ∈ ℝ^{s_k}. This yields that δ_{i,j} = (ℓ_i - ℓ_j)^⊤ p = ∑_{k∈N⁺_{i,j}} v_{i,j,k}^⊤ S_k p. Since σ_k := S_k p is the vector of the distribution of symbols under action k, which can be estimated by σ̂_k(t), the empirical frequencies of the individual symbols observed under k up to time t, Balaton uses ∑_{k∈N⁺_{i,j}} v_{i,j,k}^⊤ σ̂_k(t) to estimate δ_{i,j}. Since none of the actions in N⁺_{i,j} can get eliminated before one of {i, j} gets eliminated, the estimate of δ_{i,j} gets refined until one of {i, j} is eliminated.
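The estimation step can be sketched in a few lines of Python. The sketch assumes signal matrices S_k and a pair {i, j} whose loss difference decomposes as ℓ_i - ℓ_j = ∑_k S_k^⊤ v_{i,j,k}; solving for the observer vectors by least squares is merely one convenient choice, and all helper names are ours:

import numpy as np

def solve_observer_vectors(S_list, ell_diff):
    """Find vectors v_k with sum_k S_k^T v_k = ell_diff (assumes a solution exists)."""
    A = np.vstack(S_list)                        # stacked signal matrices
    v, *_ = np.linalg.lstsq(A.T, ell_diff, rcond=None)
    sizes = np.cumsum([S.shape[0] for S in S_list])[:-1]
    return np.split(v, sizes)                    # one v_k per observer action k

def estimate_delta(v_list, freq_list):
    """delta-hat_{i,j} = sum_k v_k^T sigma-hat_k(t), from empirical frequencies."""
    return sum(v @ f for v, f in zip(v_list, freq_list))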

The essence of why Balaton achieves low regret is as follows. When i is not a neighbor of the optimal action i*, one can show that it will be eliminated before all neighbors j "between" i and i* get eliminated; thus, the contribution of such far actions to the regret is minimal. When i is a neighbor of i*, it will be eliminated in time proportional to Δ_i^{-2}, where Δ_i := δ_{i,i*}. Thus the contribution to the regret of such an action is proportional to 1/Δ_i. It also holds that the contribution to the regret of i cannot be larger than Δ_i T. Hence, the contribution of i to the regret is at most min(Δ_i T, 1/Δ_i) ≤ √T, where the last step uses min(x, y) ≤ √(xy).

When some pairs {i, j} ∈ N are not locally observable, one needs to use actions other than those in N⁺_{i,j} to construct an estimate of δ_{i,j}. Under global observability, ℓ_i - ℓ_j = ∑_{k∈V_{i,j}} S_k^⊤ v_{i,j,k} for an appropriate subset V_{i,j} ⊆ N and an appropriate set of vectors v_{i,j,·}. Thus, if the actions in V_{i,j} are kept in play, one can estimate the difference δ_{i,j} as before, using ∑_{k∈V_{i,j}} v_{i,j,k}^⊤ σ̂_k(t).

    This motivates the following definition:

Definition 7 (Observer sets and observer vectors). The observer set V_{i,j} ⊆ N underlying a pair of neighboring actions {i, j} ∈ N is a set of actions such that ℓ_i - ℓ_j ∈ ⊕_{k∈V_{i,j}} Im S_k^⊤. The observer vectors (v_{i,j,k})_{k∈V_{i,j}} are defined to satisfy the equation ℓ_i - ℓ_j = ∑_{k∈V_{i,j}} S_k^⊤ v_{i,j,k}; in particular, v_{i,j,k} ∈ ℝ^{s_k}. In what follows, the choice of the observer sets and vectors is restricted so that V_{i,j} = V_{j,i} and v_{i,j,k} = -v_{j,i,k}. Furthermore, the observer set V_{i,j} is constrained to be a superset of N⁺_{i,j}, and in particular when a pair {i, j} is locally observable, V_{i,j} = N⁺_{i,j} must hold. Finally, for any action k ∈ ∪_{{i,j}∈N} V_{i,j}, let W_k = max_{i,j: k∈V_{i,j}} ‖v_{i,j,k}‖_∞ be the confidence width of action k.
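For illustration, given a table mapping each neighboring pair {i, j} to its observer vectors, the confidence widths W_k are a simple maximum of infinity norms. The sketch below assumes a particular data layout of our choosing:

import numpy as np

def confidence_widths(observer_vectors, N):
    """W_k = max over pairs {i, j} with k in V_{i,j} of ||v_{i,j,k}||_inf."""
    W = np.zeros(N)
    for vs in observer_vectors.values():   # vs maps each k in V_{i,j} to v_{i,j,k}
        for k, v in vs.items():
            W[k] = max(W[k], np.max(np.abs(v)))
    return W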

The reason for the particular choice V_{i,j} = N⁺_{i,j} for locally observable pairs {i, j} is that we plan to use V_{i,j} (and the vectors v_{i,j,·}) in the case of locally observable pairs, too. For non-locally observable pairs, the whole action set N is always a valid observer set (thus, V_{i,j} can be found). However, whenever possible, it is better to use a smaller set. The actual choice of V_{i,j} (and v_{i,j,k}) is postponed until the effect of this choice on the regret becomes clear.

With the observer sets, the basic idea of the algorithm becomes the following: (i) eliminate the suboptimal actions in successive sweeps; (ii) in each sweep, enrich the set of remaining actions P(t) by adding the observer actions underlying the remaining neighboring pairs {i, j} ∈ N(t): V(t) = ∪_{{i,j}∈N(t)} V_{i,j}; (iii) explore the actions in P(t) ∪ V(t) to update the symbol frequency estimate vectors σ̂_k(t). Another refinement is to eliminate the sweeps so as to give the algorithm an advantageous anytime property. This can be achieved by selecting only one action in each step. We propose that the chosen action be the one that maximizes the reduction of the remaining uncertainty.

This algorithm can be shown to enjoy Õ(√T) regret for locally observable games. However, if we run it on a non-locally observable game and the opponent strategy is on C_i ∩ C_j for some {i, j} ∈ N \ L, it will suffer linear regret! The reason is that if both actions i and j are optimal, and thus never get eliminated, the algorithm will choose actions from V_{i,j} \ N⁺_{i,j} too often. Furthermore, even if the opponent strategy is not on the boundary, the regret can be too high: say action i is optimal but Δ_j is small, while {i, j} ∈ N \ L. Then a third action k ∈ V_{i,j} with large Δ_k will be chosen proportionally to 1/Δ_j² times, causing high regret. To combat this, we restrict the frequency with which an action can be used for information-seeking purposes. For this, we introduce the set of rarely chosen actions,

R(t) = {k ∈ N : n_k(t) ≤ η_k f(t)},

where η_k ∈ ℝ, f : ℕ → ℝ are tuning parameters to be chosen later. Then, the set of actions available at time t is restricted to P(t) ∪ N⁺(t) ∪ (V(t) ∩ R(t)), where N⁺(t) = ∪_{{i,j}∈N(t)} N⁺_{i,j}. We will show that with these modifications, the algorithm achieves O(T^{2/3}) regret in the general case, while it will also be shown to achieve an Õ(√T) regret when the opponent uses a benign strategy. Pseudocode for the algorithm is given in Algorithm 1.

Algorithm 1 CBP
Input: L, H, α, η_1, . . . , η_N, f = f(·)
Calculate P, N, V_{i,j}, v_{i,j,k}, W_k
for t = 1 to N do
    Choose I_t = t and observe Y_t            {Initialization}
    n_{I_t} ← 1                               {# times the action is chosen}
    ν_{I_t} ← Y_t                             {Cumulative observations}
end for
for t = N + 1, N + 2, . . . do
    for each {i, j} ∈ N do
        δ̃_{i,j} ← ∑_{k∈V_{i,j}} v_{i,j,k}^⊤ ν_k / n_k          {Loss diff. estimate}
        c_{i,j} ← ∑_{k∈V_{i,j}} ‖v_{i,j,k}‖_∞ √(α log t / n_k)   {Confidence}
        if |δ̃_{i,j}| ≥ c_{i,j} then
            halfSpace(i, j) ← sgn δ̃_{i,j}
        else
            halfSpace(i, j) ← 0
        end if
    end for
    [P(t), N(t)] ← getPolytope(P, N, halfSpace)
    N⁺(t) = ∪_{{i,j}∈N(t)} N⁺_{i,j}
    V(t) = ∪_{{i,j}∈N(t)} V_{i,j}
    R(t) = {k ∈ N : n_k(t) ≤ η_k f(t)}
    S(t) = P(t) ∪ N⁺(t) ∪ (V(t) ∩ R(t))
    Choose I_t = argmax_{i∈S(t)} W_i²/n_i and observe Y_t
    ν_{I_t} ← ν_{I_t} + Y_t
    n_{I_t} ← n_{I_t} + 1
end for

It remains to specify the function getPolytope. It gets the array halfSpace as input. The array halfSpace stores which neighboring action pairs have a confident estimate of the difference of their expected losses, along with the sign of the difference (if confident). Each of these confident pairs defines an open halfspace, namely

{p ∈ Δ_M : halfSpace(i, j) (ℓ_i - ℓ_j)^⊤ p > 0}.

The function getPolytope calculates the open polytope defined as the intersection of the above halfspaces. Then, for every i ∈ P, it checks whether C_i intersects the open polytope; if so, i will be an element of P(t). Similarly, for every {i, j} ∈ N, it checks whether C_i ∩ C_j intersects the open polytope and puts the pair in N(t) if it does.

Note that it is not enough to compute P(t) and then drop from N those pairs {k, l} where one of k or l is excluded from P(t): it is possible that the boundary C_k ∩ C_l between the cells of two actions k, l ∈ P(t) is included in the rejected region. For an illustration of cell decomposition and excluded cells, see Figure 1.

Figure 1. An example of cell decomposition (M = 3): (a) the cell decomposition; (b) gray indicates the excluded area.

Computational complexity. The computationally heavy parts of the algorithm are the initial calculation of the cell decomposition and the function getPolytope. All of these require linear programming. In the preprocessing phase we need to solve N + N² linear programs to determine cells and neighboring pairs of cells. Then, in every round, at most N² linear programs are needed. The algorithm can be sped up by caching previously solved linear programs.

4. Analysis of the algorithm

The first theorem in this section is an individual upper bound on the regret of CBP.

Theorem 1. Let (L, H) be an N by M partial-monitoring game. For a fixed opponent strategy p ∈ Δ_M, let Δ_i denote the difference between the expected loss of action i and an optimal action. For any time horizon T, algorithm CBP with parameters α > 1, η_k = W_k^{2/3}, f(t) = α^{1/3} t^{2/3} log^{1/3} t has expected regret

E[R_T] ≤ ∑_{{i,j}∈N} 2|V_{i,j}| (1 + 1/(2α - 2)) + ∑_{k=1}^N Δ_k
       + ∑_{k: Δ_k>0} (4 W_k² d_k² / Δ_k) α log T
       + ∑_{k∈V\N⁺} Δ_k min( (4 W_k² d_{l(k)}² / Δ_{l(k)}²) α log T, α^{1/3} W_k^{2/3} T^{2/3} log^{1/3} T )
       + ∑_{k∈V\N⁺} ( Δ_k α^{1/3} W_k^{2/3} T^{2/3} log^{1/3} T + 2 d_k α^{1/3} W^{2/3} T^{2/3} log^{1/3} T ),

where W = max_{k∈N} W_k, V = ∪_{{i,j}∈N} V_{i,j}, N⁺ = ∪_{{i,j}∈N} N⁺_{i,j}, and d_1, . . . , d_N are game-dependent constants.

The proof is omitted for lack of space.⁴ Here we give a short explanation of the different terms in the bound. The first term corresponds to the confidence interval failure event. The second term comes from the initialization phase of the algorithm. The remaining four terms come from categorizing the choices of the algorithm by two criteria: (1) Would I_t be different if R(t) were defined as R(t) = N? (2) Is I_t ∈ P(t) ∪ N⁺(t)? These two binary events lead to four different cases in the proof, resulting in the last four terms of the bound.

⁴For complete proofs we refer the reader to the supplementary material.

An implication of Theorem 1 is an upper bound on the individual regret of locally observable games:

Corollary 1. If G is locally observable, then

E[R_T] ≤ ∑_{{i,j}∈N} 2|V_{i,j}| (1 + 1/(2α - 2)) + ∑_{k=1}^N Δ_k + ∑_{k: Δ_k>0} (4 W_k² d_k² / Δ_k) α log T.

Proof. If a game is locally observable then V \ N⁺ = ∅, leaving the last two sums of the statement of Theorem 1 zero.

The following corollary is an upper bound on the minimax regret of any globally observable game.

Corollary 2. Let G be a globally observable game. Then there exists a constant c such that the expected regret can be upper bounded independently of the choice of p as

E[R_T] ≤ c T^{2/3} log^{1/3} T.

The following theorem is an upper bound on the minimax regret of any globally observable game against "benign" opponents. To state the theorem, we need a new definition. Let A be some subset of actions in G. We call A a point-local game in G if ∩_{i∈A} C_i ≠ ∅.

Theorem 2. Let G be a globally observable game. Let Θ ⊆ Δ_M be some subset of the probability simplex such that its topological closure Θ̄ satisfies Θ̄ ∩ C_i ∩ C_j = ∅ for every {i, j} ∈ N \ L. Then there exists a constant c such that for every p ∈ Θ, algorithm CBP with parameters α > 1, η_k = W_k^{2/3}, f(t) = α^{1/3} t^{2/3} log^{1/3} t achieves

E[R_T] ≤ c d_max √(b T log T),

where b is the size of the largest point-local game, and d_max is a game-dependent constant.

In a nutshell, the proof revisits the four cases of the proof of Theorem 1, and shows that the terms which would yield a T^{2/3} upper bound can be non-zero only for a limited number of time steps.


Remark 1. Note that the above theorem implies that CBP does not need any prior knowledge about Θ to achieve Õ(√T) regret. This is why we say our algorithm is adaptive.

An immediate implication of Theorem 2 is the following minimax bound for locally observable games:

Corollary 3. Let G be a locally observable finite partial-monitoring game. Then there exists a constant c such that for every p ∈ Δ_M,

E[R_T] ≤ c √(T log T).

Remark 2. The upper bounds in Corollaries 2 and 3 both have matching lower bounds up to logarithmic factors (Bartok et al., 2011), proving that CBP achieves near-optimal regret in both locally observable and non-locally observable games.

    5. Experiments

We demonstrate the results of the previous sections using instances of Dynamic Pricing, as well as a locally observable game. We compare the results of CBP to two other algorithms: Balaton (Bartok et al., 2011), which is, as mentioned earlier in the paper, the first algorithm that achieves Õ(√T) minimax regret for all locally observable finite stochastic partial-monitoring games; and FeedExp3 (Piccolboni & Schindelhauer, 2001), which achieves O(T^{2/3}) minimax regret on all non-hopeless finite partial-monitoring games, even against adversarial opponents.

    5.1. A locally observable game

The game we use to compare CBP and Balaton has 3 actions and 3 outcomes. The game is described by the loss and feedback matrices

L = [ 1 1 0 ]        H = [ a b b ]
    [ 0 1 1 ]            [ b a b ]
    [ 1 0 1 ],           [ b b a ].

We ran the algorithms 10 times for 15 different stochastic strategies. We averaged the results for each strategy and then took the pointwise maximum over the 15 strategies. Figure 2(a) shows the empirical minimax regret calculated in the way described above. In addition, Figure 2(b) shows the regret of the algorithms against one of the opponents, averaged over 100 runs. The results indicate that CBP outperforms both FeedExp and Balaton. We also observe that, although the asymptotic performance of Balaton is proven to be better than that of FeedExp, a larger constant factor makes Balaton lose against FeedExp even at time step ten million.

    5.2. Dynamic Pricing

In Dynamic Pricing, at every time step a seller (player) sets a price for his product while a buyer (opponent) secretly sets a maximum price he is willing to pay. The feedback for the seller is "buy" or "no-buy", while his loss is either a preset constant (no-buy) or the difference between the prices (buy). The finite version of the game can be described by the following matrices:

L = [ 0 1 ⋯ N-1 ]        H = [ y y ⋯ y ]
    [ c 0 ⋯ N-2 ]            [ n y ⋯ y ]
    [ ⋮    ⋱  ⋮  ]            [ ⋮    ⋱ ⋮ ]
    [ c ⋯ c  0  ],            [ n ⋯ n y ].

This game is not locally observable and thus it is hard (Bartok et al., 2011). Simple linear algebra gives that the locally observable action pairs are the consecutive actions (L = {{i, i+1} : i = 1, . . . , N-1}), while, quite surprisingly, all action pairs are neighbors.
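Both matrices follow a simple pattern and can be generated programmatically; the following sketch (function name ours) builds the instance used in the experiments below:

import numpy as np

def dynamic_pricing_game(N, c):
    """Loss is j - i on a "buy" (buyer's maximum price j covers the ask i),
    and the constant c on a "no-buy"; feedback is y/n accordingly."""
    i, j = np.indices((N, N))
    L = np.where(j >= i, j - i, c)
    H = np.where(j >= i, 'y', 'n')
    return L, H

L, H = dynamic_pricing_game(5, 2)   # the N = M = 5, c = 2 instance used below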

We compare CBP with FeedExp on Dynamic Pricing with N = M = 5 and c = 2. Since Balaton is undefined on non-locally observable games, we cannot include it in the comparison. To demonstrate the adaptiveness of CBP, we use two sets of opponent strategies. The benign setting is a set of opponents which are far away from "dangerous" regions, that is, from boundaries between cells of non-locally observable neighboring action pairs. The harsh setting, however, includes opponent strategies that are close to or on the boundary between two such actions. For each setting we maximize over 15 strategies and average over 10 runs. We also compare the individual regret of the two algorithms against one benign and one harsh strategy, averaged over 100 runs and plotted with 90 percent confidence intervals.

The results (shown in Figures 3 and 4) indicate that CBP has a significant advantage over FeedExp on benign settings. Nevertheless, for the harsh settings FeedExp slightly outperforms CBP, which we think is a reasonable price to pay for the benefit of adaptivity.

Figure 2. Comparing CBP with Balaton and FeedExp on the easy game: (a) pointwise maximum of the regret over 15 settings; (b) regret against one opponent strategy.

Figure 3. Comparing CBP and FeedExp on the benign setting of the Dynamic Pricing game: (a) pointwise maximum of the regret over 15 settings; (b) regret against one opponent strategy.

Figure 4. Comparing CBP and FeedExp on the harsh setting of the Dynamic Pricing game: (a) pointwise maximum of the regret over 15 settings; (b) regret against one opponent strategy.

References

Audibert, J.-Y. and Bubeck, S. Minimax policies for adversarial and stochastic bandits. In COLT 2009, Proceedings of the 22nd Annual Conference on Learning Theory, Montreal, Canada, June 18-21, 2009, pp. 217-226. Omnipress, 2009.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire,

R.E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2003.

Bartok, G., Pal, D., and Szepesvari, Cs. Toward a classification of finite partial-monitoring games. In ALT 2010, Canberra, Australia, pp. 224-238, 2010.

Bartok, G., Pal, D., and Szepesvari, Cs. Minimax regret of finite partial-monitoring games in stochastic environments. In COLT, July 2011.

Cesa-Bianchi, N., Lugosi, G., and Stoltz, G. Regret minimization under partial monitoring. In Information Theory Workshop (ITW 2006), Punta del Este, pp. 72-76. IEEE, 2006.

Foster, D.P. and Rakhlin, A. No internal regret via neighborhood watch. CoRR, abs/1108.6088, 2011.

Littlestone, N. and Warmuth, M.K. The weighted majority algorithm. Information and Computation, 108:212-261, 1994.

Piccolboni, A. and Schindelhauer, C. Discrete prediction games with arbitrary feedback and loss. In COLT 2001, pp. 208-223. Springer-Verlag, 2001.

Vovk, V.G. Aggregating strategies. In Annual Workshop on Computational Learning Theory, 1990.


Supplementary material for the paper "An adaptive algorithm for finite stochastic partial monitoring"

List of symbols used by the algorithm

I_t ∈ N: action chosen at time t
Y_t ∈ {0, 1}^{s_{I_t}}: observation at time t
δ̃_{i,j}(t) ∈ ℝ: estimate of δ_{i,j} = (ℓ_i - ℓ_j)^⊤ p ({i, j} ∈ N)
c_{i,j}(t) ∈ ℝ: confidence width for pair {i, j} ({i, j} ∈ N)
P(t) ⊆ N: plausible actions
N(t) ⊆ N²: set of admissible neighbors
N⁺(t) ⊆ N: ∪_{{i,j}∈N(t)} N⁺_{i,j}; admissible neighborhood actions
V(t) ⊆ N: ∪_{{i,j}∈N(t)} V_{i,j}; admissible information-seeking actions
R(t) ⊆ N: rarely sampled actions
S(t): P(t) ∪ N⁺(t) ∪ (V(t) ∩ R(t)); admissible actions

Proof of theorems

Here we give the proofs of the theorems and lemmas. For the convenience of the reader, we restate the theorems.

Theorem 1. Let (L, H) be an N by M partial-monitoring game. For a fixed opponent strategy p ∈ Δ_M, let Δ_i denote the difference between the expected loss of action i and an optimal action. For any time horizon T, algorithm CBP with parameters α > 1, η_k = W_k^{2/3}, f(t) = α^{1/3} t^{2/3} log^{1/3} t

has expected regret

E[R_T] ≤ ∑_{{i,j}∈N} 2|V_{i,j}| (1 + 1/(2α - 2)) + ∑_{k=1}^N Δ_k
       + ∑_{k: Δ_k>0} (4 W_k² d_k² / Δ_k) α log T
       + ∑_{k∈V\N⁺} Δ_k min( (4 W_k² d_{l(k)}² / Δ_{l(k)}²) α log T, α^{1/3} W_k^{2/3} T^{2/3} log^{1/3} T )
       + ∑_{k∈V\N⁺} ( Δ_k α^{1/3} W_k^{2/3} T^{2/3} log^{1/3} T + 2 d_k α^{1/3} W^{2/3} T^{2/3} log^{1/3} T ),

where W = max_{k∈N} W_k, V = ∪_{{i,j}∈N} V_{i,j}, N⁺ = ∪_{{i,j}∈N} N⁺_{i,j}, and d_1, . . . , d_N are game-dependent constants.

Proof. We use the convention that, for any variable x used by the algorithm, x(t) denotes the value of x at the end of time step t. For example, n_i(t) is the number of times action i is chosen up to and including time step t.

The proof is based on three lemmas. The first lemma shows that the estimate δ̃_{i,j}(t) is in the vicinity of δ_{i,j} with high probability.

Lemma 1. For any {i, j} ∈ N and t ≥ 1,

P( |δ̃_{i,j}(t) - δ_{i,j}| ≥ c_{i,j}(t) ) ≤ 2|V_{i,j}| t^{1-2α}.

If, for some t, i, j, the event whose probability is upper-bounded in Lemma 1 happens, we say that a confidence interval fails. Let G_t be the event that no confidence intervals fail in time step t and let B_t be its complement event. An immediate corollary of Lemma 1 is that the sum of the probabilities that some confidence interval fails is small:

∑_{t=1}^T P(B_t) ≤ ∑_{t=1}^T ∑_{{i,j}∈N} 2|V_{i,j}| t^{1-2α} ≤ ∑_{{i,j}∈N} 2|V_{i,j}| (1 + 1/(2α - 2)).   (1)

To prepare for the next lemma, we need some new notation. For the next definition we need to denote the dependence of the random sets P(t), N(t) on the outcomes from the underlying sample space Ω. For this, we will use the notation P_ω(t) and N_ω(t), ω ∈ Ω. With this, we define the set of plausible configurations to be

Ψ = ∪_{t≥1} { (P_ω(t), N_ω(t)) : ω ∈ G_t }.

Call π = (i_0, i_1, . . . , i_r) (r ≥ 0) a path in N ⊆ N² if {i_s, i_{s+1}} ∈ N for all 0 ≤ s ≤ r - 1 (when r = 0 there is no restriction on π). The path is said to start at i_0 and end at i_r. In what follows we denote by i* an optimal action under p (i.e., ℓ_{i*}^⊤ p ≤ ℓ_i^⊤ p holds for all actions i). The set of paths that connect i to i* and lie in N will be denoted by B_i(N). The next lemma shows that B_i(N) is non-empty whenever N is such that for some P, (P, N) ∈ Ψ:


Lemma 2. Take an action i and a plausible pair (P, N) ∈ Ψ such that i ∈ P. Then there exists a path that starts at i and ends at i* and lies in N.

For i ∈ P, define

d_i = max_{(P,N)∈Ψ: i∈P} min_{(i_0,...,i_r)∈B_i(N)} ∑_{s=1}^r |V_{i_{s-1},i_s}|.

According to the previous lemma, for each Pareto-optimal action i, the quantity d_i is well-defined and finite. The definition is extended to degenerate actions by defining d_i to be max(d_l, d_k), where k, l are such that i ∈ N⁺_{k,l}.

Let k(t) = argmax_{i∈P(t)∪V(t)} W_i²/n_i(t-1). When k(t) ≠ I_t, this happens because k(t) ∉ P(t) ∪ N⁺(t) and k(t) ∉ R(t); i.e., the action k(t) is a purely information-seeking action which has been sampled frequently. When this holds, we say that the "decaying exploration rule" is in effect at time step t. The corresponding event is denoted by D_t = {k(t) ≠ I_t}.

Lemma 3. Fix any t ≥ 1.

1. Take any action i. On the event G_t ∩ D_t,¹ from i ∈ P(t) ∪ N⁺(t) it follows that

Δ_i ≤ 2 d_i √( (α log t) / f(t) ) · max_{k∈N} ( W_k / √η_k ).

2. Take any action k. On the event G_t ∩ D_t^c, from I_t = k it follows that

n_k(t-1) ≤ min_{j∈P(t)∪N⁺(t)} (4 W_k² d_j² / Δ_j²) α log t.

¹Here and in what follows, all statements that start with "On event X" should be understood to hold almost surely on the event. However, to minimize clutter we will not add the qualifier "almost surely".

We are now ready to start the proof. By Wald's identity, we can rewrite the expected regret as follows:

E[R_T] = E[ ∑_{t=1}^T L[I_t, J_t] ] - ∑_{t=1}^T E[ L[i*, J_1] ]
       = ∑_{k=1}^N E[n_k(T)] Δ_k = ∑_{k=1}^N E[ ∑_{t=1}^T I{I_t=k} ] Δ_k
       = ∑_{k=1}^N E[ ∑_{t=1}^T I{I_t=k, B_t} ] Δ_k + ∑_{k=1}^N E[ ∑_{t=1}^T I{I_t=k, G_t} ] Δ_k.

Now,

∑_{k=1}^N E[ ∑_{t=1}^T I{I_t=k, B_t} ] Δ_k ≤ ∑_{k=1}^N E[ ∑_{t=1}^T I{I_t=k, B_t} ]   (because Δ_k ≤ 1)
  = E[ ∑_{t=1}^T ∑_{k=1}^N I{I_t=k, B_t} ] = E[ ∑_{t=1}^T I{B_t} ] = ∑_{t=1}^T P(B_t).


Hence,

E[R_T] ≤ ∑_{t=1}^T P(B_t) + ∑_{k=1}^N E[ ∑_{t=1}^T I{I_t=k, G_t} ] Δ_k.

Here, the first term can be bounded using (1). Let us thus consider the elements of the second sum:

E[ ∑_{t=1}^T I{I_t=k, G_t} ] Δ_k ≤ Δ_k + E[ ∑_{t=N+1}^T I{G_t, D_t^c, k∈P(t)∪N⁺(t), I_t=k} ] Δ_k   (2)
  + E[ ∑_{t=N+1}^T I{G_t, D_t^c, k∉P(t)∪N⁺(t), I_t=k} ] Δ_k   (3)
  + E[ ∑_{t=N+1}^T I{G_t, D_t, k∈P(t)∪N⁺(t), I_t=k} ] Δ_k   (4)
  + E[ ∑_{t=N+1}^T I{G_t, D_t, k∉P(t)∪N⁺(t), I_t=k} ] Δ_k.   (5)

The first Δ_k corresponds to the initialization phase of the algorithm, when every action gets chosen once. The next paragraphs are devoted to upper bounding the above four expressions (2)-(5). Note that if action k is optimal, that is, if Δ_k = 0, then all the terms are zero. Thus, we can assume from now on that Δ_k > 0.

Term (2): Consider the event G_t ∩ D_t^c ∩ {k ∈ P(t) ∪ N⁺(t)}. We use case 2 of Lemma 3 with the choice i = k. Thus, from I_t = k, we get that i = k ∈ P(t) ∪ N⁺(t) and so the conclusion of the lemma gives

n_k(t-1) ≤ A_k(t) := (4 W_k² d_k² / Δ_k²) α log t.

Therefore, we have

∑_{t=N+1}^T I{G_t, D_t^c, k∈P(t)∪N⁺(t), I_t=k}
  ≤ ∑_{t=N+1}^T I{I_t=k, n_k(t-1)≤A_k(t)} + ∑_{t=N+1}^T I{G_t, D_t^c, k∈P(t)∪N⁺(t), I_t=k, n_k(t-1)>A_k(t)}
  = ∑_{t=N+1}^T I{I_t=k, n_k(t-1)≤A_k(t)}
  ≤ A_k(T) = (4 W_k² d_k² / Δ_k²) α log T,

yielding

(2) ≤ (4 W_k² d_k² / Δ_k) α log T.

Term (3): Consider the event G_t ∩ D_t^c ∩ {k ∉ P(t) ∪ N⁺(t)}. We use case 2 of Lemma 3. The lemma gives that

n_k(t-1) ≤ min_{j∈P(t)∪N⁺(t)} (4 W_k² d_j² / Δ_j²) α log t.

We know that k ∈ V(t) = ∪_{{i,j}∈N(t)} V_{i,j}. Let Φ_t be the set of pairs {i, j} in N(t) ∩ N such that k ∈ V_{i,j}. For any {i, j} ∈ Φ_t we also have i, j ∈ P(t), and thus, if l_{i,j} = argmax_{l∈{i,j}} Δ_l, then

n_k(t-1) ≤ (4 W_k² d_{l_{i,j}}² / Δ_{l_{i,j}}²) α log t.

Therefore, if we define l(k) as the action with

Δ_{l(k)} = min{ Δ_{l_{i,j}} : {i, j} ∈ N, k ∈ V_{i,j} },

then it follows that

n_k(t-1) ≤ (4 W_k² d_{l(k)}² / Δ_{l(k)}²) α log t.

Note that Δ_{l(k)} can be zero, in which case we use the convention c/0 = ∞. Also, since k is not in P(t) ∪ N⁺(t), we have n_k(t-1) ≤ η_k f(t). Define

A_k(t) = min( (4 W_k² d_{l(k)}² / Δ_{l(k)}²) α log t, η_k f(t) ).

Then, with the same argument as in the previous case (and recalling that f(t) is increasing), we get

(3) ≤ Δ_k min( (4 W_k² d_{l(k)}² / Δ_{l(k)}²) α log T, η_k f(T) ).

We remark that without the concept of rarely sampled actions, the above term would scale with 1/Δ_{l(k)}², causing high regret. This is why the vanilla version of the algorithm fails on hard games.

Term (4): Consider the event G_t ∩ D_t ∩ {k ∈ P(t) ∪ N⁺(t)}. From case 1 of Lemma 3 we have that

Δ_k ≤ 2 d_k √( (α log t) / f(t) ) max_{j∈N} ( W_j / √η_j ).

Thus,

(4) ≤ 2 d_k √( α T² log T / f(T) ) max_{l∈N} ( W_l / √η_l ).


Proof of Theorem 2. As in the previous proof, we arrive at the same expression

E[ ∑_{t=1}^T I{I_t=k, G_t} ] Δ_k ≤ Δ_k + E[ ∑_{t=N+1}^T I{G_t, D_t^c, k∈P(t)∪N⁺(t), I_t=k} ] Δ_k   (2)
  + E[ ∑_{t=N+1}^T I{G_t, D_t^c, k∉P(t)∪N⁺(t), I_t=k} ] Δ_k   (3)
  + E[ ∑_{t=N+1}^T I{G_t, D_t, k∈P(t)∪N⁺(t), I_t=k} ] Δ_k   (4)
  + E[ ∑_{t=N+1}^T I{G_t, D_t, k∉P(t)∪N⁺(t), I_t=k} ] Δ_k,   (5)

where G_t and D_t denote the events that no confidence intervals fail, and that the decaying exploration rule is in effect at time step t, respectively.

From the condition on Θ we have that there exists a positive constant ρ_1 such that for every neighboring action pair {i, j} ∈ N \ L, max(Δ_i, Δ_j) ≥ ρ_1. We know from Lemma 3 that if D_t happens, then for any pair {i, j} ∈ N \ L it holds that

max(Δ_i, Δ_j) ≤ 4N √( (α log t) / f(t) ) max_{k∈N} ( W_k / √η_k ) =: g(t).

It follows that if t > g^{-1}(ρ_1), then the decaying exploration rule cannot be in effect. Therefore, terms (4) and (5) can be upper bounded by g^{-1}(ρ_1).

With the value ρ_1 defined in the previous paragraph, we have that for any action k ∈ V \ N⁺, Δ_{l(k)} ≥ ρ_1 holds, and therefore term (3) can be upper bounded by

(3) ≤ 4W² (4N²/ρ_1²) α log T,

using that d_k, defined in the proof of Theorem 1, is at most 2N. It remains to carefully upper bound term (2). For that, we first need a definition and a lemma. Let A_ε = {i ∈ N : Δ_i ≤ ε}.

Lemma 4. Let G = (L, H) be a finite partial-monitoring game and p ∈ Δ_M an opponent strategy. There exists a ρ_2 > 0 such that A_{ρ_2} is a point-local game in G.

To upper bound term (2), with ρ_2 introduced in the above lemma and λ > 0 to be specified later, we split the sum according to whether Δ_k ≤ λ, λ < Δ_k ≤ ρ_2, or Δ_k > ρ_2:

(2) = E[ ∑_{t=N+1}^T I{G_t, D_t^c, k∈P(t)∪N⁺(t), I_t=k} ] Δ_k ( I{Δ_k ≤ λ} + I{λ < Δ_k ≤ ρ_2} + I{Δ_k > ρ_2} ).


Let b be the number of actions in the largest point-local game. Putting everything together, we have

E[R_T] ≤ ∑_{{i,j}∈N} 2|V_{i,j}| (1 + 1/(2α - 2)) + g^{-1}(ρ_1) + ∑_{k=1}^N Δ_k
       + 16 W² (N³/ρ_1²) α log T + 32 W² (N³/ρ_2²) α log T
       + λT + 4 b W² d_max² (log T) / λ.

Now we choose λ to be

λ = 2 W d_max √( (b log T) / T ),

and we get

E[R_T] ≤ c_1 + c_2 log T + 4 W d_max √( b T log T ).

    Proof of lemmas

Lemma 1. For any {i, j} ∈ N and t ≥ 1,

P( |δ̃_{i,j}(t) - δ_{i,j}| ≥ c_{i,j}(t) ) ≤ 2|V_{i,j}| t^{1-2α}.

Proof.

P( |δ̃_{i,j}(t) - δ_{i,j}| ≥ c_{i,j}(t) )
  ≤ ∑_{k∈V_{i,j}} P( | v_{i,j,k}^⊤ ν_k(t-1)/n_k(t-1) - v_{i,j,k}^⊤ S_k p | ≥ ‖v_{i,j,k}‖_∞ √( (α log t) / n_k(t-1) ) )   (6)
  = ∑_{k∈V_{i,j}} ∑_{s=1}^{t-1} I{n_k(t-1)=s} P( | v_{i,j,k}^⊤ ν_k(t-1)/s - v_{i,j,k}^⊤ S_k p | ≥ ‖v_{i,j,k}‖_∞ √( (α log t) / s ) )   (7)
  ≤ ∑_{k∈V_{i,j}} 2 t^{1-2α}   (8)
  = 2 |V_{i,j}| t^{1-2α},

where in (6) we used the triangle inequality and the union bound, and in (8) we used Hoeffding's inequality.

Lemma 2. Take an action i and a plausible pair (P, N) ∈ Ψ such that i ∈ P. Then there exists a path that starts at i and ends at i* and lies in N.


Figure 1: The dashed line defines the feasible path (1, 5, 4, 3) through the cells C_1, . . . , C_6.

Proof. If (P, N) ∈ Ψ is a valid configuration, then there is a convex polytope Θ′ ⊆ Δ_M such that p ∈ Θ′, P = {i : dim(C_i ∩ Θ′) = M - 1} and N = {{i, j} : dim(C_i ∩ C_j ∩ Θ′) = M - 2}.

Let p′ be an arbitrary point of C_i ∩ Θ′. We enumerate the actions whose cells intersect the line segment connecting p′ and p, in the order in which they appear on the segment. We show that this sequence of actions i_0, . . . , i_r is a feasible path: it trivially holds that i_0 = i and that i_r is optimal, and it is also clear that consecutive actions in the sequence are in N. For an illustration, we refer the reader to Figure 1.

Next, we want to prove Lemma 3. For this, we need the following auxiliary result:

Lemma 5. Let action i be a degenerate action in the neighborhood action set N⁺_{k,l} of neighboring actions k and l. Then ℓ_i is a convex combination of ℓ_k and ℓ_l.

Proof. For simplicity, we rename the degenerate action i to action 3, while the other actions k, l will be called actions 1 and 2, respectively. Since action 3 is a degenerate action between actions 1 and 2, we have that

(p ∈ Δ_M and p ⊥ (ℓ_1 - ℓ_2)) ⟹ (p ⊥ (ℓ_1 - ℓ_3) and p ⊥ (ℓ_2 - ℓ_3)),

implying

(ℓ_1 - ℓ_2)^⊥ ⊆ (ℓ_1 - ℓ_3)^⊥ ∩ (ℓ_2 - ℓ_3)^⊥.

Using de Morgan's law (for orthogonal complements) we get

span(ℓ_1 - ℓ_2) ⊇ span(ℓ_1 - ℓ_3) + span(ℓ_2 - ℓ_3).

This implies that for any c_1, c_2 ∈ ℝ there exists a c_3 ∈ ℝ such that

c_3(ℓ_1 - ℓ_2) = c_1(ℓ_1 - ℓ_3) + c_2(ℓ_2 - ℓ_3),

and hence

ℓ_3 = ((c_1 - c_3)/(c_1 + c_2)) ℓ_1 + ((c_2 + c_3)/(c_1 + c_2)) ℓ_2,

showing that ℓ_3 is an affine combination of (i.e., collinear with) ℓ_1 and ℓ_2. We also know that there exists p_1 such that ℓ_1^⊤ p_1 < ℓ_2^⊤ p_1 and ℓ_1^⊤ p_1 < ℓ_3^⊤ p_1, and that there exists p_2 ∈ Δ_M such that ℓ_2^⊤ p_2 < ℓ_1^⊤ p_2 and ℓ_2^⊤ p_2 < ℓ_3^⊤ p_2. Using these facts and the linearity of the dot product, we get that ℓ_3 must be the middle point on the line, which means that ℓ_3 is indeed a convex combination of ℓ_1 and ℓ_2.

Lemma 3. Fix any t ≥ 1.

1. Take any action i. On the event G_t ∩ D_t, from i ∈ P(t) ∪ N⁺(t) it follows that

Δ_i ≤ 2 d_i √( (α log t) / f(t) ) · max_{k∈N} ( W_k / √η_k ).

2. Take any action k. On the event G_t ∩ D_t^c, from I_t = k it follows that

n_k(t-1) ≤ min_{j∈P(t)∪N⁺(t)} (4 W_k² d_j² / Δ_j²) α log t.

Proof. First we observe that for any neighboring action pair {i, j} ∈ N(t), on G_t it holds that δ_{i,j} ≤ 2c_{i,j}(t). Indeed, from {i, j} ∈ N(t) it follows that |δ̃_{i,j}(t)| ≤ c_{i,j}(t). Now, on G_t, δ_{i,j} ≤ δ̃_{i,j}(t) + c_{i,j}(t). Putting together the two inequalities we get δ_{i,j} ≤ 2c_{i,j}(t).

Now, fix some action i that is not dominated. We define the "parent action" i′ of i as follows: if i is not degenerate, then i′ = i; if i is degenerate, then we define i′ to be the Pareto-optimal action such that Δ_{i′} ≥ Δ_i and i is in the neighborhood action set of i′ and some other Pareto-optimal action. It follows from Lemma 5 that i′ is well-defined.

Consider case 1. On D_t, I_t ≠ k(t) = argmax_{j∈P(t)∪V(t)} W_j²/n_j(t-1). Therefore, k(t) ∉ R(t), i.e., n_{k(t)}(t-1) > η_{k(t)} f(t). Assume now that i ∈ P(t) ∪ N⁺(t). If i is degenerate, then i′ as defined in the previous paragraph is in P(t) (because the rejected regions in the algorithm are closed). In any case, by Lemma 2, there is a path (i_0, . . . , i_r) in N(t) that connects i′ to i* (i′ ∈ P(t) holds on G_t).


We have that

Δ_i ≤ Δ_{i′} = ∑_{s=1}^r δ_{i_{s-1},i_s}
  ≤ 2 ∑_{s=1}^r c_{i_{s-1},i_s}(t)
  = 2 ∑_{s=1}^r ∑_{j∈V_{i_{s-1},i_s}} ‖v_{i_{s-1},i_s,j}‖_∞ √( (α log t) / n_j(t-1) )
  ≤ 2 ∑_{s=1}^r ∑_{j∈V_{i_{s-1},i_s}} W_j √( (α log t) / n_j(t-1) )
  ≤ 2 d_{i′} W_{k(t)} √( (α log t) / n_{k(t)}(t-1) )
  ≤ 2 d_i W_{k(t)} √( (α log t) / (η_{k(t)} f(t)) ).

Upper bounding W_{k(t)}/√η_{k(t)} by max_{k∈N} W_k/√η_k, we obtain the desired bound.

Now, for case 2, take an action k, consider the event G_t ∩ D_t^c, and assume that I_t = k. On D_t^c, I_t = k(t). Thus, from I_t = k it follows that W_k/√(n_k(t-1)) ≥ W_j/√(n_j(t-1)) holds for all j ∈ P(t) ∪ V(t). Let

J_t = argmin_{j∈P(t)∪N⁺(t)} d_j²/Δ_j².

Now, similarly to the previous case, there exists a path (i_0, . . . , i_r) in N(t) from the parent action J_t′ ∈ P(t) of J_t to i*. Hence,

Δ_{J_t} ≤ Δ_{J_t′} = ∑_{s=1}^r δ_{i_{s-1},i_s}
  ≤ 2 ∑_{s=1}^r ∑_{j∈V_{i_{s-1},i_s}} W_j √( (α log t) / n_j(t-1) )
  ≤ 2 d_{J_t} W_k √( (α log t) / n_k(t-1) ),

implying

n_k(t-1) ≤ (4 W_k² d_{J_t}² / Δ_{J_t}²) α log t = min_{j∈P(t)∪N⁺(t)} (4 W_k² d_j² / Δ_j²) α log t.

This concludes the proof of Lemma 3.

Lemma 4. Let G = (L, H) be a finite partial-monitoring game and p ∈ Δ_M an opponent strategy. There exists a ρ_2 > 0 such that A_{ρ_2} is a point-local game in G.

Proof. For any (not necessarily neighboring) pair of actions {i, j}, the boundary between them is defined by the set B_{i,j} = {p ∈ Δ_M : (ℓ_i - ℓ_j)^⊤ p = 0}. We generalize this notion by introducing the ξ-margin: for any ξ ≥ 0, let

B^ξ_{i,j} = {p ∈ Δ_M : |(ℓ_i - ℓ_j)^⊤ p| ≤ ξ}.

It follows from the finiteness of the action set that there exists a ξ > 0 such that, for any set K of neighboring action pairs,

∩_{{i,j}∈K} B_{i,j} = ∅ ⟹ ∩_{{i,j}∈K} B^ξ_{i,j} = ∅.   (9)

Let ρ_2 = ξ/2 and let A = A_{ρ_2}. Then, for every pair i, j in A,

|(ℓ_i - ℓ_j)^⊤ p| = |Δ_i - Δ_j| ≤ Δ_i + Δ_j ≤ 2ρ_2 = ξ.

That is, p ∈ B^ξ_{i,j}. It follows that p ∈ ∩_{i,j∈A} B^ξ_{i,j}. This, together with (9), implies that A is a point-local game.