
Competing Prediction Algorithms

Omer Ben-Porat
Technion - Israel Institute of Technology
Haifa 32000, Israel

Moshe Tennenholtz
Technion - Israel Institute of Technology
Haifa 32000, Israel

arXiv:1806.01703v1 [cs.GT] 5 Jun 2018

Abstract

Prediction is a well-studied machine learning task, and prediction algorithms are core ingredients in online products and services. Despite their centrality in the competition between online companies that offer prediction-based products, the strategic use of prediction algorithms remains unexplored. The goal of this paper is to examine strategic use of prediction algorithms. We introduce a novel game-theoretic setting that is based on the PAC learning framework, where each player (i.e., a prediction algorithm in the competition) seeks to maximize the sum of points for which it produces an accurate prediction and the others do not. We show that algorithms aiming at generalization may wittingly mispredict some points to perform better than others on expectation. We analyze the empirical game, i.e. the game induced on a given sample, prove that it always possesses a pure Nash equilibrium, and show that every better-response learning process converges. Moreover, our learning-theoretic analysis suggests that players can, with high probability, learn an approximate pure Nash equilibrium for the whole population using a small number of samples.

1 Introduction

Prediction plays an important role in twenty-first century economics. An important example is the way online retailers advertise services and products tailored to predict individual taste. Companies collect massive amounts of data and employ sophisticated machine learning algorithms to discover patterns and seek connections between different user groups. A company can offer customized products, relying on user properties and past interactions, to outperform the one-size-fits-all approach. For instance, after examining a sufficient number of users and the articles they read, media websites promote future articles predicted as having a high probability of satisfying a particular user.

For revenue-seeking companies, prediction is another tool that can be exploited to increase revenue. When companies’ products are alike, the chance that a user will select the product of a particular company decreases. In this case a company may purposely avoid offering the user this product and offer an alternative one in order to maximize the chances of having its product selected. Despite the intuitive clarity of the tradeoff above and the enormous amount of work done on prediction in the machine learning and statistical learning communities, far too little attention has been paid to the study of prediction in the context of competition.

In this paper we introduce what is, to the best of our knowledge, a first-ever attempt to study how the selection of prediction algorithms is affected by strategic behavior in a competitive setting, using a game-theoretic lens. We consider a space of users, where each user is modeled as a triplet (x, y, t) of an instance, a label and a threshold, respectively. A user’s instance is a real vector that encodes his¹ properties; the label is associated with his taste, and the threshold is the “distance” he is willing to accept between a proposed product and his taste.

¹For ease of exposition, third-person singular pronouns are “he” for a user and “she” for a player.


Namely, the user associated with (x, y, t) embraces a customized product f(x) if |f(x) − y| is less than or equal to t. In such a case, the user is satisfied and willing to adopt the product. If a user is satisfied with several products (of several companies), he selects one uniformly at random. Indeed, the user model we adopt is aligned with the celebrated “satisficing” principle of Simon [18], and with other widely accepted models in the literature on choice prediction, e.g. the model of selection based on small samples [3, 7]. Several players are equipped with infinite strategy spaces, or hypothesis classes in learning-theoretic terminology. A player’s strategy space models the possible predictive functions she can employ. Players compete for the users, and a player’s payoff is the expected number of users who select her offer. To model uncertainty w.r.t. the users’ taste, we use the PAC-learning framework of Valiant [19]. We assume the user distribution is unknown, but the players have access to a sequence of examples, containing instances, labels and thresholds, with which they should optimize their payoffs w.r.t. the unknown underlying user distribution.

From a machine learning perspective we now face the challenge of what would be a good prediction-algorithm profile, i.e. a set of algorithms for the players such that no player would deviate from her algorithm assuming the others all stick to their algorithms. Indeed, such a profile of algorithms determines a pure Nash equilibrium (PNE) of prediction algorithms, a powerful solution concept which rarely exists in games. An important question in this regard is whether such a profile exists. An accompanying question is whether a learning dynamics in which players may change their prediction algorithms to better-respond to others would converge. Therefore, we ask:

● Does a PNE exist?
● Will the players be able to find it efficiently with high probability using a better-response dynamics?

We prove that the answer to both questions is yes. We first show that when the capacity of each strategy space is bounded (i.e., finite pseudo-dimension), players can learn payoffs from samples. Namely, we show that the payoff function of each player uniformly converges over all possible strategy profiles (that include strategies of the other players). Thus, with high probability a player’s payoff under any strategy profile is not too distant from her empirical payoff. Later, we show that an empirical PNE always exists, i.e., a PNE of the game induced on the empirical sample distribution. Moreover, we show that any learning dynamics in which players improve their payoff by more than a non-negligible quantity converges fast to an approximate PNE. Using the two latter results, we show an interesting property of the setting: the elementary idea of sampling and better-responding according to the empirical distribution until convergence leads to an approximate PNE of the game on the whole population. We analyze this learning process, and formalize the above intuition via an algorithm that runs in time polynomial in the instance parameters, and returns an approximate PNE with high probability. Finally, we discuss the case of infinite capacities, and demonstrate that non-learnability can occur even if the user distribution is known to all players.

Related work Interest in the intersection of game theory and machine learning has increased rapidly in recent years. Sample-empowered mechanism design [16] is a fruitful line of research. For example, [6, 8, 14] reconsider auctions where the auctioneer can sample from bidder valuation functions, thereby relaxing the assumption of prior knowledge of the bidder valuation distribution [15]. Empirical distributions also play a key role in other lines of research [1, 2, 11], where e.g. [2] show how to obtain an approximate equilibrium by sampling any mixed equilibrium. The PAC-learning framework proposed by Valiant [19] has also been extended by Blum et al. [5], who consider a collaborative game where players attempt to learn the same underlying prediction function, but each player has her own distribution over the space. In their work each player can sample from her own distribution, and the goal is to use information sharing among the players to reduce the sample complexity.

Our work is inspired by Dueling Algorithms [10]. Immorlica et al. analyze an optimization problem from the perspective of competition, rather than from the point of view of a single optimizer. Our model is also related to Competing Bandits [12]. Mansour et al. consider a competition between two bandit algorithms faced with the same sample, where users arrive one by one and choose between the two algorithms. In our work players also share the same sample, but we consider an offline setting and not an online one; infinite strategy spaces and not a finite set of actions; context in the form of a property vector for each user; and an arbitrary number of asymmetric players, where asymmetry is reflected in the strategy space of each player.

Most relevant to our work is [4]. The authors present a learning task where a newcomer agent is given a sequence of examples, and wishes to learn a best response to the players already on the market. They assume that the agent can sample triplets composed of an instance, a label and the current market prediction, and define the agent’s payoff as the proportion of points (associated with users) she predicts better than the other players. Indeed, [4] introduces a learning task incorporating an economic interpretation into the objective function of the (single) optimizer, but in fact does not provide any game-theoretic analysis. In comparison, this paper considers game-theoretic interaction between players, and its main contribution lies in the analysis of such interactions. Since learning dynamics consist of steps of unilateral deviations that improve the deviating player’s payoff, the Best Response Regression of Ben-Porat and Tennenholtz [4] can be thought of as an initial step toward this work.

Our contribution Our contribution is three-fold. First, we explicitly suggest that prediction algorithms, like other products on the market, are in competition. This novel view emphasizes the need for stability in prediction-based competition, similar to Hotelling’s stability in spatial competition [9].

Second, we introduce an extension of the PAC-learning framework for dealing with strategy profiles, each of which is a sequence of functions. We show a reduction from payoff maximization to loss minimization, which is later used to achieve bounds on the sample complexity for uniform convergence over the set of profiles. We also show that when players have approximate better-response oracles, they can learn an approximate PNE of the empirical game. The main technical contribution of this paper is an algorithm which, given ε, δ, samples a number of points polynomial in the game-instance parameters, runs any ε-better-response dynamics, and returns an ε-PNE with probability of at least 1 − δ.

Third, we consider games with at least one player with infinite pseudo-dimension. We show a game instance where each player could learn the best prediction function from her hypothesis class if she were alone in the game, yet a PNE of the empirical game does not generalize. This inability to learn emphasizes that strategic behavior can introduce further challenges to the machine learning community.

2 Problem definition

In this section we formalize the model. We begin with an informal introduction to elementary concepts in both game theory and learning theory that are used throughout the paper.

Game theory A non-cooperative game is composed of a set of players N = {1, ..., N}; a strategy space H_i for every player i; and a payoff function π_i: H_1 × ⋯ × H_N → R for every player i. The set H = H_1 × ⋯ × H_N contains all possible strategy combinations, and a tuple of strategies h = (h_1, ..., h_N) ∈ H is called a strategy profile, or simply a profile. We denote by h_{−i} the vector obtained by omitting the i-th component of h.

A strategy h′_i ∈ H_i is called a better response of player i with respect to a strategy profile h if π_i(h′_i, h_{−i}) > π_i(h). Similarly, h′_i is said to be an ε-better response of player i w.r.t. a strategy profile h if π_i(h′_i, h_{−i}) ≥ π_i(h) + ε, and a best response to h_{−i} if π_i(h′_i, h_{−i}) ≥ sup_{h_i∈H_i} π_i(h_i, h_{−i}).

We say that a strategy profile h is a pure Nash equilibrium (herein denoted PNE) if every player plays a best response under h. We say that a strategy profile h is an ε-PNE if no player has an ε-better response under h, i.e. for every player i it holds that π_i(h) ≥ sup_{h′_i∈H_i} π_i(h′_i, h_{−i}) − ε.

Learning theory Let F be a class of binary-valued functions F ⊆ {0,1}^X. Given a sequence S = (x_1, ..., x_m) ∈ X^m, we denote the restriction of F to S by F ∩ S = {(f(x_1), ..., f(x_m)) | f ∈ F}. The growth function of F, denoted Π_F: N → N, is defined as Π_F(m) = max_{S∈X^m} |F ∩ S|. We say that F shatters S if |F ∩ S| = 2^{|S|}. The Vapnik–Chervonenkis dimension of a binary function class is the cardinality of the largest set of points in X that can be shattered by F, VCdim(F) = max{m ∈ N: Π_F(m) = 2^m}.
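To make the shattering and growth-function definitions concrete, the following is a minimal Python sketch; representing a finite class F as an explicit list of callables is an assumption made purely for illustration:

from typing import Callable, Sequence

def restriction(F: Sequence[Callable], S: Sequence) -> set:
    """F ∩ S: the set of distinct behavior vectors of F on the points of S."""
    return {tuple(f(x) for x in S) for f in F}

def shatters(F: Sequence[Callable], S: Sequence) -> bool:
    """F shatters S iff |F ∩ S| = 2^|S| for a binary-valued class F."""
    return len(restriction(F, S)) == 2 ** len(S)

# toy usage: three threshold functions shatter one point but not two
F = [lambda x, c=c: int(x >= c) for c in (0.0, 1.5, 3.0)]
print(shatters(F, [1.0]), shatters(F, [1.0, 2.0]))  # True False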

Let H be a class of real-valued functions H ⊆ R^X. The restriction of H to S ∈ X^m is analogously defined: H ∩ S = {(h(x_1), ..., h(x_m)) | h ∈ H}. We say that H pseudo-shatters S if there exists r = (r_1, ..., r_m) ∈ R^m such that for every binary vector b = (b_1, ..., b_m) ∈ {−1,1}^m there exists h_b ∈ H for which sign(h_b(x_i) − r_i) = b_i for every i ∈ [m]. The pseudo-dimension of H is the cardinality of the largest set of points in X that can be pseudo-shattered by H,

Pdim(H) = max{m ∈ N: ∃S ∈ X^m such that S is pseudo-shattered by H}.

2.1 Model

We consider a set of users who are interested in a product provided by a set of competing players. Each user is associated with a vector (x, y, t), where x is the instance; y is the label; and t is the threshold that the user is willing to accept.

The players offer customized products to the users. When a user associated with a vector (x, y, t) approaches player i, she produces a prediction h_i(x). If |h_i(x) − y| is at most t, the user associated with (x, y, t) will grant one monetary unit to player i. Otherwise, that user will move on to another player. We assume that users approach players according to the uniform distribution, although our model and results support any distribution over player orderings. Player i has a set of possible strategies (prediction algorithms) H_i, from which she has to decide which one to use. Each player aims to maximize her expected payoff, and will act strategically to do so.

Formally, the game is a tuple ⟨Z, D, N, (H_i)_{i∈N}⟩ such that

1. Z is the examples domain Z = X × Y × T, where X ⊂ R^n is the instance domain; Y ⊂ R is the label domain; and T ⊂ R_{≥0} is the tolerance domain.

2. D is a probability distribution over Z = X × Y × T.

3. N is the set of players, with |N| = N. A strategy of player i is an element from H_i ⊆ Y^X. The space of all strategy profiles is denoted by H = ⨉_{i=1}^N H_i.

4. For z = (x, y, t) and a function g: X → Y, we define the indicator I(z, g) to be 1 if the distance between the value g predicts for x and y is at most t. Formally, I(z, g) = 1 if |g(x) − y| ≤ t, and 0 otherwise.

5. Given a strategy profile h = (h_1, ..., h_N) with h_i ∈ H_i for i ∈ {1, ..., N} and z = (x, y, t) ∈ Z, let w_i(z; h) = 0 if I(z, h_i) = 0, and w_i(z; h) = 1 / ∑_{i′=1}^N I(z, h_{i′}) otherwise. Note that w_i(z; h) represents the expected payoff of player i w.r.t. the user associated with z. The payoff of player i under h is the expectation over all users, and is defined by π_i(h) = E_{z∼D}[w_i(z; h)].

6. D is unknown to the players.

We assume players have access to a sequence of examples S. Given a game instance ⟨Z, D, N, (H_i)_{i∈N}⟩ and a sample S = {z_1, ..., z_m}, we denote by ⟨Z, S ∼ D^m, N, (H_i)_{i∈N}⟩ the empirical game: the game over the same N, H, Z and the uniform distribution over the known S ∈ Z^m. We denote the payoff of player i in the empirical game by

π_i^S(h) = E_{z∈S}[w_i(z; h)] = (1/m) ∑_{j=1}^m w_i(z_j; h).

When S is known from the context, we occasionally use the term empirical PNE to denote a PNE of the empirical game. Since the empirical game is a complete-information game, players can use the sample in order to optimize their payoffs.
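To make items 4–5 and the empirical payoff π_i^S concrete, here is a minimal Python sketch; representing users as (x, y, t) triplets and strategies as plain callables is an illustration-only assumption:

import numpy as np

def indicator(z, g):
    """I(z, g): 1 if the prediction g(x) is within tolerance t of the label y."""
    x, y, t = z
    return 1 if abs(g(x) - y) <= t else 0

def empirical_payoffs(S, profile):
    """pi_i^S(h) for every player i: each satisfied user contributes a unit of
    payoff split uniformly among all players whose predictions satisfy him."""
    payoffs = np.zeros(len(profile))
    for z in S:
        sat = np.array([indicator(z, h) for h in profile], dtype=float)
        if sat.sum() > 0:
            payoffs += sat / sat.sum()  # the vector (w_1(z;h), ..., w_N(z;h))
    return payoffs / len(S)

# toy usage: two players, three users
S = [(0.0, 0.0, 0.5), (1.0, 1.0, 0.5), (2.0, 0.0, 0.5)]
print(empirical_payoffs(S, [lambda x: 0.0, lambda x: x]))  # [0.5 0.5]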

The optimization problem of finding a best response in our model is intriguing in its own right and deserves future study. In this paper, we assume that each player i has a polynomial ε-better-response oracle. Namely, given a real number ε > 0, a strategy profile h and a sample S, we assume that each player i has an oracle that returns an ε-better response to h_{−i} if one exists, or answers false otherwise, and that runs in time poly(1/ε, m, N).²

²Notice that a best response can be found in constant time if H_i is of constant size. In addition, in Section C we leverage the algorithm proposed in [4], and show that it can compute a best response within the set of linear predictors efficiently when the input dimension (denoted by n in the model above) is constant. We also discuss situations where a better response cannot be computed efficiently in Section 5, and present the applicability of our models for these cases as well.


3 Meta algorithm and analysis

Throughout this section we assume the pseudo-dimension of H_i is finite, and we denote it by d_i, i.e. Pdim(H_i) = d_i < ∞. Our goal is to propose a generic method for finding an ε-PNE efficiently. The method is composed of two steps: first, it obtains a sample of sufficient size; afterwards, it runs an ε-better-response dynamics until convergence, and returns the obtained profile. The underlying idea is straightforward, but its analysis is non-trivial. In particular, we need to show two main claims:

• Given a sufficiently large sample S, the payoff of each player i in the empirical game is not too far from her payoff in the actual game, with high probability. This holds concurrently for all possible strategy profiles.

• An ε-PNE exists in every empirical game. Therefore, players can reach an ε-PNE of the empirical game fast, using their ε-better-response oracles.

These claims are made explicit in the forthcoming Subsections 3.1 and 3.2. We formalize the above discussion via Algorithm 1 in Subsection 3.3.

3.1 Uniform convergence in probability

We now bound the probability (over all choices of S) of having player i’s payoff (for an arbitrary i) greater or smaller than its empirical counterpart by more than ε. Notice that the restriction of H_i to an arbitrary sample S, i.e. H_i ∩ S, may be of infinite size. Nevertheless, the payoff function concerns the indicator function I only, and not the real-valued predictions produced by functions in H_i; therefore, we now analyze this binary function class.

Let F_i be the class of binary-valued functions F_i ⊆ {0,1}^Z given by

F_i ≝ {z ↦ I(z, h) | h ∈ H_i}. (1)

Notice that |F_i ∩ S| represents the effective size of H_i ∩ S with respect to the indicator function I. We already know that the pseudo-dimension of H_i is d_i. In Lemma 1 we bound the VC dimension of F_i in terms of the pseudo-dimension of H_i.

Lemma 1. VCdim(F_i) ≤ 10d_i.

After discovering the connection between the growth rates of H_i and F_i, we can progress to bounding the growth of the payoff function class F (which we define shortly).

For ease of notation, denote I(z, h) = (I(z, h_1), ..., I(z, h_N)). Similarly, let w(z; h) = (w_1(z; h), ..., w_N(z; h)). Note that there is a bijection I(z, h) ↦ w(z; h), which divides I(z, h) by its norm if the norm is greater than zero, and leaves it as is otherwise. Formally, there is a bijection M: {0,1}^N → {1, 1/2, ..., 1/N, 0}^N such that for every v ∈ {0,1}^N,

M(v) = 0 if ‖v‖ = 0, and M(v) = v/‖v‖ otherwise.

Let F be the class of functions from Z to {0,1}^N defined by

F ≝ {z ↦ I(z, h) | h ∈ H}.

Note that every element in F is a function from Z to {0,1}^N. The restriction of F to a sample S is defined by F ∩ S = {(I(z_1, h), ..., I(z_m, h)) | h ∈ H}. Due to the aforementioned bijection, every element in F ∩ S represents a distinct payoff vector of the empirical game; thus, bounding |F ∩ S| corresponds to bounding the number of distinct strategy profiles in the empirical game. Clearly,

|F ∩ S| = ∏_{i=1}^N |F_i ∩ S|.

The growth function of F, Π_F(m) = max_{S∈Z^m} |F ∩ S|, is therefore bounded as follows.


Lemma 2. Π_F(m) ≤ (em)^{10∑_{i=1}^N d_i}.

Next, we bound the probability of a player i’s payoff being “too far” from its empirical counterpart. The proof of Lemma 3 below goes along the path of Vapnik and Chervonenkis, introduced in [20]. Since in our case F is not a binary function class, a few modifications are needed.

Lemma 3. Let m be a positive integer, and let ε > 0. It holds that

Pr_{S∼D^m}(∃h: |π_i(h) − π_i^S(h)| ≥ ε) ≤ 4Π_F(2m) e^{−ε²m/8}.

The following Theorem 1 bounds the probability that any player i has a difference greater than ε between her payoff and her empirical payoff (over the selection of a sample S), uniformly over all possible strategy profiles. This follows by applying the union bound to the bound obtained in Lemma 3.

Theorem 1. Let m be a positive integer, and let ε > 0. It holds that

Pr_{S∼D^m}(∃i ∈ [N]: sup_{h∈H} |π_i(h) − π_i^S(h)| ≥ ε) ≤ 4N(2em)^{10∑_{i=1}^N d_i} e^{−ε²m/8}. (2)

3.2 Existence of a PNE in empirical games

In the previous subsection we bounded the probability of a payoff vector being too far from its counterpart in the empirical game. Notice, however, that this result implies nothing about the existence of a PNE or an approximate PNE: for a fixed S, even if sup_{h∈H} |π_i(h) − π_i^S(h)| < ε holds for every i, a player may still have a beneficial deviation. Therefore, the results of the previous subsection are only meaningful if we show that there exists a PNE in the empirical game, which is the goal of this subsection. We prove this existence using the notion of potential games [13].

A non-cooperative game is called a potential game if there exists a function Φ: H → R such that for every strategy profile h = (h_1, ..., h_N) ∈ H and every i ∈ [N], whenever player i switches from h_i to a strategy h′_i ∈ H_i, the change in her payoff function equals the change in the potential function, i.e.

Φ(h′_i, h_{−i}) − Φ(h_i, h_{−i}) = π_i(h′_i, h_{−i}) − π_i(h_i, h_{−i}).

Theorem 2 ([13, 17]). Every potential game with a finite strategy space possesses at least one PNE.

Obviously, in our setting the strategy space of a game instance ⟨Z, D, N, (H_i)_{i∈N}⟩ is typically infinite. Infinite potential games may also possess a PNE (as discussed in [13]), but in our case the distribution D is approximated from samples and the empirical game is finite, so no stronger claims are needed.

Lemma 4 below shows that every empirical game is a potential game.

Lemma 4. Every empirical game ⟨Z, S ∼ D^m, N, (H_i)_{i∈N}⟩ has a potential function.

As an immediate result of Theorem 2 and Lemma 4:

Corollary 1. Every empirical game ⟨Z, S ∼ D^m, N, (H_i)_{i∈N}⟩ possesses at least one PNE.

After establishing the existence of a PNE in the empirical game, we are interested in the rate at which it can be “learnt”. More formally, we are interested in the convergence rate of the dynamics between the players, where at every step one player deviates to one of her ε-better responses. Such dynamics do not necessarily converge in general games, but do converge in potential games. By examining the specific potential function in our class of (empirical) games, we can also bound the number of steps until convergence.

Lemma 5. Let ⟨Z, S ∼ D^m, N, (H_i)_{i∈N}⟩ be any empirical game instance. After at most O((log N)/ε) iterations of any ε-better-response dynamics, an ε-PNE is obtained.
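The dynamics referenced in Lemma 5 amounts to a simple loop. Below is a hedged Python sketch; the oracle interface (a callable returning an ε-better response, or None when none exists) is our assumption, mirroring the ε-better-response oracle of Section 2:

def better_response_dynamics(S, profile, oracles, eps):
    """Let players switch to eps-better responses on the empirical game over S
    until no player can improve by eps; by Lemma 5 this terminates within
    O(log(N)/eps) improvement steps and returns an empirical eps-PNE."""
    profile = list(profile)
    improved = True
    while improved:
        improved = False
        for i, oracle in enumerate(oracles):
            h_new = oracle(S, tuple(profile), i, eps)  # eps-better response or None
            if h_new is not None:
                profile[i] = h_new
                improved = True
    return tuple(profile)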

3.3 Learning ε-PNE with high probability

In this subsection we leverage the results of the previous Subsections 3.1 and 3.2 to devise Algorithm 1, which runs in polynomial time and returns an approximate equilibrium with high probability. More precisely, we show that Algorithm 1 returns an ε-PNE with probability of at least 1 − δ, and has time complexity poly(1/ε, m, N, log(1/δ), d). As in the previous subsections, we denote d = ∑_{i=1}^N d_i.

Algorithm 1: Approximate PNE w.h.p. via better-response dynamics
Input: δ, ε ∈ (0,1)
Output: a strategy profile h
1 set m = m_{ε/2, δ}  // the minimal integer m satisfying Equation (3)
2 sample S from D^m
3 execute any ε/2-better-response dynamics on S until convergence, and obtain a strategy profile h that is an empirical ε/2-PNE
4 return h

First, we bound the required sample size. Using standard algebraic manipulations on Equation (2), we obtain the following.

Lemma 6. Let ε, δ ∈ (0,1), and let

m ≥ (320d/ε²) log(160d/ε²) + (160d log(2e))/ε² + (16/ε²) log(4N/δ). (3)

With probability of at least 1 − δ over all possible samples S of size m, it holds that

∀i ∈ [N]: sup_{h∈H} |π_i(h) − π_i^S(h)| < ε.

Given ε, δ, we denote by m_{ε,δ} the minimal integer m satisfying Equation (3). Lemma 6 shows that m_{ε,δ} = O((d/ε²) log(d/ε²) + (1/ε²) log(N/δ)) samples are enough to have all empirical payoff vectors ε-close to their theoretical counterparts coordinate-wise (i.e., in the L∞ norm), with probability of at least 1 − δ.
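For concreteness, the right-hand side of Equation (3) translates directly into a sample-size routine; this is a sketch only, with d = ∑_{i=1}^N d_i supplied by the caller:

import math

def m_eps_delta(eps: float, delta: float, d: int, N: int) -> int:
    """The sample-size bound of Equation (3): this many samples suffice for
    eps-uniform convergence of all N payoff functions with probability at
    least 1 - delta (Lemma 6)."""
    return math.ceil((320 * d / eps**2) * math.log(160 * d / eps**2)
                     + 160 * d * math.log(2 * math.e) / eps**2
                     + (16 / eps**2) * math.log(4 * N / delta))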

Next, we relate an approximate PNE of the empirical game to an approximate PNE of the (actual) game.

Lemma 7. Let m ≥ m_{ε/4, δ} and let h be an ε/2-PNE in ⟨Z, S ∼ D^m, N, (H_i)_{i∈N}⟩. Then h is an ε-PNE with probability of at least 1 − δ.

Recall that Lemma 5 ensures that any ε-better-response dynamics converges to an ε-PNE of the empirical game within O((log N)/ε) iterations. In each such iteration a player calls her approximate better-response oracle, which is assumed to run in poly(1/ε, m, N) time. Altogether, given ε and δ, Algorithm 1 runs in poly(1/ε, N, log(1/δ), d) time, and returns an ε-PNE with probability of at least 1 − δ.
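Putting the pieces together, Algorithm 1 is the composition of the two sketches above with the bound of Equation (3); sample_from_D is a hypothetical stand-in for sampling access to the user distribution D:

def approximate_pne(eps, delta, d, N, sample_from_D, initial_profile, oracles):
    """Sketch of Algorithm 1: draw m_{eps/2, delta} examples, then run any
    eps/2-better-response dynamics to an empirical eps/2-PNE, which is an
    eps-PNE of the actual game with probability at least 1 - delta."""
    m = m_eps_delta(eps / 2, delta, d, N)  # line 1 of Algorithm 1
    S = sample_from_D(m)                   # assumption: m i.i.d. draws from D
    return better_response_dynamics(S, initial_profile, oracles, eps / 2)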

4 Learnability in games with infinite dimension

While Lemma 1 upper-bounds VCdim(F_i) as a function of Pdim(H_i), it is fairly easy to show that VCdim(F_i) ≥ Pdim(H_i) (see Claim B2 in the appendix). Therefore, if Pdim(H_i) is infinite, so is VCdim(F_i).

Classical results in learning theory suggest that if VCdim(F_i) = ∞, a best response on the sample may not generalize to an approximate best response w.h.p. To see this, imagine a “game” with one player, who seeks to maximize her payoff function. No Free Lunch theorems (see, e.g., [21]) imply that with a constant probability the player cannot get her payoff within a constant distance of the optimal payoff. We conclude that in general games, if a player has a strategy space with an infinite pseudo-dimension, she may not be able to learn. However, in the presence of such a player, can other players with a finite pseudo-dimension learn an approximate best response?

One typically shows non-learnability by constructing two distributions and proving that, with constant probability, an agent cannot tell which distribution produced the sample she obtained. These two distributions are constructed to be distant enough from each other, so the loss (or payoff, in our setting) is far from optimal by at least a constant. In our setting, however, players interact with each other, and player payoffs are a function of the whole strategy profile; thus, interesting phenomena occur even if the distribution D is known. In particular, Example 1 below demonstrates that in the infinite-dimension case, not every empirical PNE generalizes to an approximate PNE with high probability.

Example 1. Let D be a density function over Z = [0,2] × {0,1} × {1/2} as follows:

D(x, y, t) = 1/2 if 0 ≤ x < 1, y = 0, t = 1/2;  1/2 if 1 ≤ x ≤ 2, y = 1, t = 1/2;  and 0 otherwise.

In addition, for any finite-size subset S of Z in the support of D, denote

h_{S→0}(x) = 0 if (x, y, t) ∈ S for some y, t, and h_{S→0}(x) = 1_{1≤x≤2}(x) if (x, y, t) ∉ S for all y, t;
h_{S→1}(x) = 1 if (x, y, t) ∈ S for some y, t, and h_{S→1}(x) = 1_{1≤x≤2}(x) if (x, y, t) ∉ S for all y, t.

In other words, h_{S→0} labels with 0 every instance that appears in the sample S and every instance in the [0,1) segment. On the other hand, h_{S→1} labels with 1 every instance that appears in the sample S and every instance in the [1,2] segment. Denote

H_1 = {h_{S→0} | S ⊂ Z} ∪ {h_{S→1} | S ⊂ Z},

and let H_2 = H_3 = {1_{0≤x<1}, 1_{1≤x≤2}}. In this three-player game, consider the profile h = (h_1, h_2, h_3) such that h_1 = h_{S→0}, h_2 = h_3 = 1_{1≤x≤2}. Notice that the payoffs under h are defined as follows:

π_1(h) = (1/m) ∑_{j=1}^m (1 − y_j),   π_2(h) = π_3(h) = (1/(2m)) ∑_{j=1}^m y_j.

Observe that if 1/2 < (1/m) ∑_{j=1}^m y_j < 3/4, then h is an empirical PNE, since no player can improve her payoff. Notice, however, that π_3(h) = 1/6, yet π_3(1_{0≤x<1}, h_{−3}) = 1/4.

Since 1/2 < (1/m) ∑_{j=1}^m y_j < 3/4 holds with probability of at least 1/4 over all choices of S with |S| ≥ 15 (see Claim B4 in the appendix), this empirical equilibrium does not generalize to a (1/12)-PNE with probability of at least 1/4. This is true for any ε, δ ∈ (0,1); thus, an empirical PNE is not generalized to an approximate PNE w.h.p.

Another interesting point is that in Example 1 each player can trivially find a strategy that maximizes her payoff if she were alone, since D is known. Indeed, this inability to generalize from samples follows solely from strategic behavior. Notice that if player 3 has knowledge of H_1, she can infer that her strategy under h is sub-optimal. However, knowledge of the strategy spaces of other players is a heavy assumption: the better-response dynamics we discussed in Subsection 3.2 only assumed that each player can compute a better response.

5 Discussion

As mentioned in Section 2.1, our analysis assumes players have better-response oracles. In fact, our model and results are valid for a much more general scenario, as described next. Consider the case where players only have heuristics for finding a better response. After running heuristic better-response dynamics and obtaining a strategy profile, the payoffs with respect to the whole population are guaranteed to be close to their empirical counterparts, w.h.p.; therefore, our analysis is still meaningful even if players cannot maximize their empirical payoffs efficiently, as the bounds on the required sample size we obtained in Section 3 and the rate of convergence are relevant for this case as well.

The reader may wonder about a variation of our model where player payoffs are defined differently. For example, consider each user as granting one monetary unit to the player that offers the prediction closest to his label. This definition is in the spirit of Dueling Algorithms [10] and Best Response Regression [4]. Under this payoff function, and unlike in our model, an empirical PNE does not necessarily exist. Nevertheless, we believe that examining and understanding these scenarios is fundamental to the analysis of competing prediction algorithms, and deserves future work.


Acknowledgments

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement n° 740435).

References

[1] I. Althöfer. On sparse approximations to randomized strategies and convex combinations. Linear Algebra and its Applications, 199:339–355, 1994.

[2] Y. Babichenko, S. Barman, and R. Peretz. Empirical distribution of equilibrium play and its testing application. Mathematics of Operations Research, 42(1):15–29, 2016.

[3] G. Barron and I. Erev. Small feedback-based decisions and their limited correspondence to description-based decisions. Journal of Behavioral Decision Making, 16(3):215–233, 2003.

[4] O. Ben-Porat and M. Tennenholtz. Best response regression. In Advances in Neural Information Processing Systems, pages 1498–1507, 2017.

[5] A. Blum, N. Haghtalab, A. D. Procaccia, and M. Qiao. Collaborative PAC learning. In Advances in Neural Information Processing Systems, pages 2389–2398, 2017.

[6] R. Cole and T. Roughgarden. The sample complexity of revenue maximization. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 243–252. ACM, 2014.

[7] I. Erev, E. Ert, A. E. Roth, E. Haruvy, S. M. Herzog, R. Hau, R. Hertwig, T. Stewart, R. West, and C. Lebiere. A choice prediction competition: Choices from experience and from description. Journal of Behavioral Decision Making, 23(1):15–47, 2010.

[8] Y. A. Gonczarowski and N. Nisan. Efficient empirical revenue maximization in single-parameter auction environments. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, QC, Canada, June 19-23, 2017, pages 856–868, 2017. doi: 10.1145/3055399.3055427. URL http://doi.acm.org/10.1145/3055399.3055427.

[9] H. Hotelling. Stability in competition. The Economic Journal, 39(153):41–57, 1929.

[10] N. Immorlica, A. T. Kalai, B. Lucier, A. Moitra, A. Postlewaite, and M. Tennenholtz. Dueling algorithms. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 215–224. ACM, 2011.

[11] R. J. Lipton, E. Markakis, and A. Mehta. Playing large games using simple strategies. In Proceedings of the 4th ACM Conference on Electronic Commerce, pages 36–41. ACM, 2003.

[12] Y. Mansour, A. Slivkins, and Z. S. Wu. Competing bandits: Learning under competition. In 9th Innovations in Theoretical Computer Science Conference, ITCS 2018, January 11-14, 2018, Cambridge, MA, USA, pages 48:1–48:27, 2018. doi: 10.4230/LIPIcs.ITCS.2018.48. URL https://doi.org/10.4230/LIPIcs.ITCS.2018.48.

[13] D. Monderer and L. S. Shapley. Potential games. Games and Economic Behavior, 14(1):124–143, 1996.

[14] J. H. Morgenstern and T. Roughgarden. On the pseudo-dimension of nearly optimal auctions. In Advances in Neural Information Processing Systems, pages 136–144, 2015.

[15] R. B. Myerson. Optimal auction design. Mathematics of Operations Research, 6(1):58–73, 1981.

[16] N. Nisan and A. Ronen. Algorithmic mechanism design. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, pages 129–140. ACM, 1999.

[17] R. W. Rosenthal. A class of games possessing pure-strategy Nash equilibria. International Journal of Game Theory, 2(1):65–67, 1973.

[18] H. A. Simon. Rational choice and the structure of the environment. Psychological Review, 63(2):129, 1956.

[19] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[20] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity, pages 11–30. Springer, 2015.

[21] D. H. Wolpert and W. G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.

A Omitted proofs from Section 3

Proof of Lemma 1. First, we define two auxiliary classes of binary functions G≥, G≤ over pairs (x, r) ∈ X × R:

G≥ = {g_h^≥(x, r) = 1_{h(x)≥r} | h ∈ H_i},  G≤ = {g_h^≤(x, r) = 1_{h(x)≤r} | h ∈ H_i}. (4)

Claim A1. VCdim(G≥) = VCdim(G≤) = d_i.

The proof of Claim A1 appears in Section B. Next, we wish to bound the growth function of F_i using the growth functions of G≥ and G≤.

Claim A2. Π_{F_i}(m) ≤ Π_{G≥}(m) · Π_{G≤}(m).

The proof of Claim A2 appears in Section B. We are now ready for the final argument. By the Sauer–Shelah lemma, every m satisfying 2^m > Π_{F_i}(m) is an upper bound on VCdim(F_i). In particular, for m = 10d_i we have

√(2^{10d_i}) = 2^{5d_i} = (31 + 1)^{d_i} = ∑_{j=0}^{d_i} 31^j · 1^{d_i−j} · (d_i choose j) ≥ ∑_{j=0}^{d_i} (31d_i/j)^j (5)

≥ ∑_{j=0}^{d_i} (10ed_i/j)^j ≥ ∑_{j=0}^{d_i} (10d_i choose j) ≥ Π_{G≥}(10d_i) = Π_{G≤}(10d_i), (6)

where the first and third inequalities follow from Claim B1, and the last inequality follows from the Sauer–Shelah lemma together with Claim A1; therefore

Π_{F_i}(10d_i) ≤ Π_{G≥}(10d_i) · Π_{G≤}(10d_i) < 2^{10d_i}.

Proof of Lemma 2. Recall that the Sauer–Shelah lemma, combined with Lemma 1, implies that Π_{F_i}(m) ≤ (em/d)^d ≤ (em)^{10d_i}, where d = VCdim(F_i) ≤ 10d_i and m > d + 1. Since |F ∩ S| = ∏_{i=1}^N |F_i ∩ S|, we have

Π_F(m) ≤ ∏_{i=1}^N Π_{F_i}(m) ≤ ∏_{i=1}^N (em)^{10d_i} = (em)^{10∑_{i=1}^N d_i}.

Proof of Lemma 3. The proof follows closely the four steps in the proof of the classical uniform convergence theorem for binary functions (see, e.g., [20]). The only steps that need modification are steps 3 and 4, but we present the full proof for completeness.

Step 1 – Symmetrization: First, we want to show that

Pr_{S∼D^m}(∃h: |π_i(h) − π_i^S(h)| ≥ ε) ≤ 2 Pr_{(S,S′)∼D^{2m}}(∃h: |π_i^S(h) − π_i^{S′}(h)| ≥ ε/2). (7)

For each S, let h̃(S) be a profile for which |π_i(h̃(S)) − π_i^S(h̃(S))| ≥ ε if such a profile exists, and any other fixed profile in H otherwise. Notice that if |π_i(h̃(S)) − π_i^S(h̃(S))| ≥ ε and |π_i(h̃(S)) − π_i^{S′}(h̃(S))| ≤ ε/2, then |π_i^S(h̃(S)) − π_i^{S′}(h̃(S))| ≥ ε/2 (triangle inequality); thus,

Pr_{(S,S′)∼D^{2m}}(∃h: |π_i^S(h) − π_i^{S′}(h)| ≥ ε/2)
≥ Pr_{(S,S′)∼D^{2m}}(|π_i^S(h̃(S)) − π_i^{S′}(h̃(S))| ≥ ε/2)
≥ Pr_{(S,S′)∼D^{2m}}(|π_i(h̃(S)) − π_i^S(h̃(S))| ≥ ε ∩ |π_i(h̃(S)) − π_i^{S′}(h̃(S))| ≤ ε/2)
= E_{S∼D^m}[1(|π_i(h̃(S)) − π_i^S(h̃(S))| ≥ ε) · Pr_{S′|S}(|π_i(h̃(S)) − π_i^{S′}(h̃(S))| ≤ ε/2)]
≥ (1/2) Pr_{S∼D^m}(|π_i(h̃(S)) − π_i^S(h̃(S))| ≥ ε)
= (1/2) Pr_{S∼D^m}(∃h: |π_i(h) − π_i^S(h)| ≥ ε),

since S and S′ are independent and due to Claim B3.

Step 2 – Permutations: Denote by Γ_{2m} the set of all permutations of [2m] that swap j and m + j for the indices j in some subset of [m]. Namely,

Γ_{2m} = {σ ∈ Π([2m]) | ∀j ∈ [m]: σ(j) = j ∨ σ(j) = m + j; ∀j, k ∈ [2m]: σ(j) = k ⇔ σ(k) = j},

where Π([2m]) denotes the set of permutations over [2m]. In addition, for S = (z_1, ..., z_{2m}), let σ(S) = (z_{σ(1)}, ..., z_{σ(2m)}). Notice that for every σ ∈ Γ_{2m} it holds that

Pr_{(S,S′)∼D^{2m}}(∃h: |π_i^S(h) − π_i^{S′}(h)| ≥ ε/2) = Pr_{(S,S′)∼D^{2m}}(∃h: |π_i^{σ(S)}(h) − π_i^{σ(S′)}(h)| ≥ ε/2);

hence,

Pr_{(S,S′)∼D^{2m}}(∃h: |π_i^S(h) − π_i^{S′}(h)| ≥ ε/2)
= (1/2^m) ∑_{σ∈Γ_{2m}} Pr_{(S,S′)∼D^{2m}}(∃h: |π_i^{σ(S)}(h) − π_i^{σ(S′)}(h)| ≥ ε/2)
= (1/2^m) ∑_{σ∈Γ_{2m}} E_{(S,S′)∼D^{2m}}[1(∃h: |π_i^{σ(S)}(h) − π_i^{σ(S′)}(h)| ≥ ε/2)]
= E_{(S,S′)∼D^{2m}}[(1/2^m) ∑_{σ∈Γ_{2m}} 1(∃h: |π_i^{σ(S)}(h) − π_i^{σ(S′)}(h)| ≥ ε/2)]
= E_{(S,S′)∼D^{2m}}[Pr_{σ∈Γ_{2m}}(∃h: |π_i^{σ(S)}(h) − π_i^{σ(S′)}(h)| ≥ ε/2)]
≤ sup_{(S,S′)}[Pr_{σ∈Γ_{2m}}(∃h: |π_i^{σ(S)}(h) − π_i^{σ(S′)}(h)| ≥ ε/2)]. (8)

Step 3 – Reduction to a finite class: Fix (S, S′) and consider a random draw of σ ∈ Γ_{2m}. For each strategy profile h, the quantity |π_i^{σ(S)}(h) − π_i^{σ(S′)}(h)| is a random variable. Since |F ∩ S| bounds the number of distinct strategy profiles in the empirical game over S (see Subsection 3.1), there are at most Π_F(2m) such random variables. By the union bound,

Pr_{σ∈Γ_{2m}}(∃h: |π_i^{σ(S)}(h) − π_i^{σ(S′)}(h)| ≥ ε/2) ≤ Π_F(2m) · sup_{h∈H} Pr_{σ∈Γ_{2m}}(|π_i^{σ(S)}(h) − π_i^{σ(S′)}(h)| ≥ ε/2). (9)

Step 4 – Hoeffding’s inequality: By viewing the previous expression via Rademacher random variables, we have

Pr_{σ∈Γ_{2m}}(|π_i^{σ(S)}(h) − π_i^{σ(S′)}(h)| ≥ ε/2)
= Pr_{σ∈Γ_{2m}}((1/m) |∑_{j=1}^m (w_i(z_{σ(j)}; h) − w_i(z_{σ(j+m)}; h))| ≥ ε/2)
= Pr_{r∈{−1,1}^m}((1/m) |∑_{j=1}^m r_j (w_i(z_j; h) − w_i(z_{j+m}; h))| ≥ ε/2).

Observe that for every j it holds that r_j (w_i(z_j; h) − w_i(z_{j+m}; h)) ∈ [−1,1], and

E_{r_j∈{−1,1}}[r_j (w_i(z_j; h) − w_i(z_{j+m}; h))] = 0

holds due to symmetry. By applying Hoeffding’s inequality we obtain

Pr_{r∈{−1,1}^m}((1/m) |∑_{j=1}^m r_j (w_i(z_j; h) − w_i(z_{j+m}; h))| ≥ ε/2) ≤ 2e^{−mε²/8}. (10)

Finally, combining Equations (7), (8), (9) and (10) we derive the desired result.

Proof of Theorem 1. The theorem follows immediately by applying the union bound to the inequality obtained in Lemma 3 and by substituting Π_F(2m) according to Lemma 2.

Proof of Lemma 4. The lemma is proven by showing that the induced game has a potential function Φ: H → R. Namely, we show a function Φ such that for every i, h and h′_i it holds that

π_i(h) − π_i(h′_i, h_{−i}) = Φ(h) − Φ(h′_i, h_{−i}).

Denote by N(z; h) = ∑_{i′=1}^N I(z, h_{i′}) the number of players whose predictions satisfy z under h, and let Φ(h) = (1/m) ∑_{j=1}^m ∑_{k=1}^{N(z_j;h)} 1/k. Observe that

π_i(h) − π_i(h′_i, h_{−i}) = (1/m) ∑_{j=1}^m w_i(z_j; h) − (1/m) ∑_{j=1}^m w_i(z_j; h′_i, h_{−i})
= (1/m) ∑_{j=1}^m I(z_j, h_i)/N(z_j; h) − (1/m) ∑_{j=1}^m I(z_j, h′_i)/N(z_j; h′_i, h_{−i}) + (1/m) ∑_{j=1}^m ∑_{k=1}^{N(z_j;h_{−i})} 1/k − (1/m) ∑_{j=1}^m ∑_{k=1}^{N(z_j;h_{−i})} 1/k
= (1/m) ∑_{j=1}^m ∑_{k=1}^{N(z_j;h)} 1/k − (1/m) ∑_{j=1}^m ∑_{k=1}^{N(z_j;h′_i,h_{−i})} 1/k = Φ(h) − Φ(h′_i, h_{−i}),

where the last equality uses the identity I(z, h_i)/N(z; h) = ∑_{k=N(z;h_{−i})+1}^{N(z;h)} 1/k: if I(z, h_i) = 1 then N(z; h) = N(z; h_{−i}) + 1, and otherwise the two counts coincide and the term vanishes.
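The potential above is easy to evaluate numerically; the following sketch reuses the indicator helper from the Section 2 sketch, and the trailing comment shows how the potential property can be spot-checked against empirical_payoffs:

def potential(S, profile):
    """Phi(h) = (1/m) * sum_j sum_{k=1}^{N(z_j;h)} 1/k, where N(z;h) counts
    the players whose predictions satisfy z (proof of Lemma 4)."""
    total = 0.0
    for z in S:
        n_sat = sum(indicator(z, h) for h in profile)
        total += sum(1.0 / k for k in range(1, n_sat + 1))
    return total / len(S)

# For any unilateral deviation of player i from profile h to profile h':
#   potential(S, h_prime) - potential(S, h) should equal
#   empirical_payoffs(S, h_prime)[i] - empirical_payoffs(S, h)[i]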

Proof of Lemma 5. In each iteration of the dynamics it holds that

Φ(h′_i, h_{−i}) − Φ(h) = π_i(h′_i, h_{−i}) − π_i(h) ≥ ε. (11)

Notice that

Φ(h) = (1/m) ∑_{j=1}^m ∑_{k=1}^{N(z_j;h)} 1/k ≤ (1/m) ∑_{j=1}^m ∑_{k=1}^N 1/k ≤ ln N + 1. (12)

Since the potential is bounded by ln N + 1 and increases by at least ε per iteration throughout the dynamics, after at most (ln N + 1)/ε iterations it will reach its maximum value, thereby obtaining an ε-PNE.

Proof of Lemma 6. By Equation (2), we look for m that satisfies

4N(2em)^{10∑_{i=1}^N d_i} e^{−ε²m/8} ≤ δ

for the given ε, δ; thus

(2em)^{10∑_{i=1}^N d_i} e^{−ε²m/8} ≤ δ/(4N)
⇒ 10d log(2em) − ε²m/8 ≤ log(δ/(4N))
⇒ ε²m/8 ≥ 10d log(2em) − log(δ/(4N))
⇒ m ≥ (80d/ε²) log(2em) − (8/ε²) log(δ/(4N))
⇒ m ≥ (80d/ε²) log m + (80d log(2e))/ε² + (8/ε²) log(4N/δ). (13)

Next, we use the following standard claim.

Claim A3. Let a ≥ 1 and b > 0. If m ≥ 4a log(2a) + 2b, then m ≥ a log m + b.

Set a = 80d/ε² and b = (80d log(2e))/ε² + (8/ε²) log(4N/δ). Due to Claim A3, we know that every m that satisfies

m ≥ (320d/ε²) log(160d/ε²) + (160d log(2e))/ε² + (16/ε²) log(4N/δ)

also satisfies Equation (13).

Proof of Lemma 7. Notice that for every i and h′_i it holds that

π_i(h′_i, h_{−i}) − π_i(h) = π_i(h′_i, h_{−i}) − π_i^S(h′_i, h_{−i}) + π_i^S(h′_i, h_{−i}) − π_i(h)
≤ π_i(h′_i, h_{−i}) − π_i^S(h′_i, h_{−i}) + π_i^S(h) − π_i(h) + ε/2,

where the inequality holds since h is an ε/2-PNE of the empirical game; therefore, if π_i(h′_i, h_{−i}) − π_i(h) > ε then at least one of π_i(h′_i, h_{−i}) − π_i^S(h′_i, h_{−i}) > ε/4 or π_i^S(h) − π_i(h) > ε/4 must hold. Overall,

Pr_{S∼D^m}(h is not an ε-PNE) = Pr_{S∼D^m}(∃i ∈ [N], h′_i ∈ H_i: π_i(h′_i, h_{−i}) − π_i(h) > ε)
≤ Pr_{S∼D^m}(∃i ∈ [N], h′_i ∈ H_i: π_i(h′_i, h_{−i}) − π_i^S(h′_i, h_{−i}) > ε/4 or π_i^S(h) − π_i(h) > ε/4)
≤ Pr_{S∼D^m}(∃i ∈ [N]: sup_{h″∈H} |π_i(h″) − π_i^S(h″)| ≥ ε/4) ≤ δ,

where the last inequality holds since m ≥ m_{ε/4, δ}.

B Additional claims and proofs

Proof of Claim A1. We prove the claim for G≥; by symmetric arguments one can show it holds for G≤ as well.

Since Pdim(H_i) = d_i, for every m ≤ d_i there is a sample S = (x_1, ..., x_m) ∈ X^m and a witness r = (r_1, ..., r_m) ∈ R^m such that for every binary vector b ∈ {−1,1}^m there is a function h_b for which sign(h_b(x_j) − r_j) = b_j for all j ∈ [m]. Denote

S′ = ((x_1, r_1), ..., (x_m, r_m)),

and focus on a particular b ∈ {−1,1}^m. For every j ∈ [m] such that b_j = 1 we have

sign(h_b(x_j) − r_j) = 1 ⇒ h_b(x_j) − r_j > 0 ⇒ g_{h_b}^≥(x_j, r_j) = 1.

In addition, if b_j = −1 then

sign(h_b(x_j) − r_j) = −1 ⇒ h_b(x_j) − r_j < 0 ⇒ g_{h_b}^≥(x_j, r_j) = 0.

This is true for every b; therefore, we showed that G≥ shatters S′.

In the opposite direction, assume by contradiction that G≥ shatters S′ = ((x_1, r_1), ..., (x_m, r_m)) for m ≥ d_i + 1. Let H = {h_b}_{b∈{−1,1}^m} ⊂ H_i be a set of functions such that for every b there exists exactly one function h_b ∈ H satisfying g_{h_b}^≥(x_j, r_j) = 1 if b_j = 1 and g_{h_b}^≥(x_j, r_j) = 0 if b_j = −1. Notice that by the definition of the VC dimension such an H must exist, and that |H| = 2^m.

One cannot claim directly that H_i pseudo-shatters S = (x_1, ..., x_m) with witness r = (r_1, ..., r_m), since h_b(x_j) = r_j may hold for b_j = 1, but we need h_b(x_j) to be strictly greater than r_j; therefore, we construct a new witness: let

a_j = max_{h_b∈H, b_j=−1} h_b(x_j).

Notice that |H| is finite, so the maximum is well defined. In addition, a_j < r_j, since h_b(x_j) < r_j for every b such that b_j = −1 (recall that if b_j = −1 then g_{h_b}^≥(x_j, r_j) = 0).

Denote r* = (a + r)/2. Next, we claim that H_i pseudo-shatters S with the witness r*. Fix b ∈ {−1,1}^m. If b_j = −1,

g_{h_b}^≥(x_j, r_j) = 0 ⇒ h_b(x_j) ≤ a_j ⇒ h_b(x_j) < r*_j ⇒ g_{h_b}^≥(x_j, r*_j) = 0.

On the other hand, if b_j = 1, we have

g_{h_b}^≥(x_j, r_j) = 1 ⇒ h_b(x_j) ≥ r_j ⇒ h_b(x_j) > r*_j ⇒ g_{h_b}^≥(x_j, r*_j) = 1.

Combining these two equations, we get that sign(h_b(x_j) − r*_j) = b_j for all j ∈ [m]. Consequently, H_i pseudo-shatters S with witness r*; hence we obtained a contradiction.

Overall, we showed that VCdim(G≥) ≥ d_i and VCdim(G≥) ≤ d_i; hence VCdim(G≥) = d_i.

Overall, we showed that VCdim(G≥) ≥ di and VCdim(G≥) ≤ di; hence VCdim(G≥) = di.Proof of Claim A2. Denote by S = (xj , yj , tj)mj=1 ∈ Zm an arbitrary sample, and let Fi ∩ S be therestriction of Fi to S. Formally,

Fi ∩ S = {(fh(z1), . . . , fh(zm)) ∣ fh ∈ Fi} .In addition, denote by G≥ the restriction of G≥ to (xj , yj − tj)mj=1, and similarly let G≤ be therestriction of G≤ to (xj , yj + tj)mj=1. We now show a one-to-one mapping M ∶ Fi ∩ S → G≥ ×G≤,implying that ∣Fi ∩ S ∣ ≤ ∣G≥ ×G≤∣ (14)

holds, thereby proving the assertion. Notice that for every fh ∈ Fi such that fh(z) = 1 we haveI(z, h) = 1 for the corresponding h ∈Hi; thus

−tj ≤ h(xj) − yj ≤ tj ⇒ {1h(xj)≤yj+tj = 1

1h(xj)≥yj−tj = 1⇒ {g≤h(xj , yj + tj) = 1

g≥h(xj , yj − tj) = 1.3 (15)

Alternatively, if fh(zj) = 0, we have I(zj , h) = 0 and

(h(xj) − yj < −tj) ∨ (h(xj) − yj > tj)⇒ (1h(xj)<yj−tj = 1) ∨ (1h(xj)>yj+tj = 1)⇒ (1h(xj)≥yj−tj = 0) ∨ (1h(xj)≤yj+tj = 0)⇒ (g≥h(xj , yj − tj) = 0) ∨ (g≤h(xj , yj + tj) = 0). (16)

By Equations (15) and (16) we have

{fh(zj) = 1⇒ (g≤h(xj , yj + tj), g≥h(xj , yj − tj)) = (1,1)fh(zj) = 0⇒ (g≤h(xj , yj + tj), g≥h(xj , yj − tj)) ∈ {(0,0), (0,1), (1,0)} . (17)

We define the mapping M such that every vector (I(z1, h1), . . . ,I(zm, h1)) ∈ Fi ∩ S is mapped to

(g≥h(x1, y1 − t1), . . . , g≥h(xm, ym − tm), g≤h(x1, y1 + t1), . . . , g≤h(xm, ym + tm)) ∈ G≥ ×G≤.3Recall the definition of g≤h, g≥h in Equation (4).

14

Page 15: Competing Prediction Algorithms · In this section we formalize the model. We begin with an informal introduction to elementary concepts in both game theory and learning theory that

Namely, every vector obtained by applying fh on the sample S is mapped to the vector formed byconcatenating the two corresponding (same h) vectors from G≤ and G≥. Let b1,b2 ∈ Fi ∩ S suchthat b1j ≠ b2j for at least one index j ∈ [m], and w.l.o.g. let b1j = 1 . Since b1j = fh(zj) = I(zj , h),Equation (17) implies that M(b1)j =M(b1)j+m = 1, while at least one of {M(b2)j ,M(b2)j+m}equals zero, thus M(b1) ≠M(b2). Hence M is an injection.

Ultimately, notice that S is arbitrary, thus

ΠFi(m) = maxS∈Zm ∣Fi ∩ S ∣ ≤ ∣G≥ ×G≤∣ = ∣G≥∣ ⋅ ∣G≤∣ ≤ ΠG≥(m) ⋅ΠG≤(m).

Claim B1. (n/k)^k ≤ (n choose k) ≤ (en/k)^k.

Proof of Claim B1. We prove the two claims separately.

● (n/k)^k ≤ (n choose k): Fix n. We prove the claim by induction for k ≤ n. The assertion holds for k = 1. For k ≥ 2 and every m such that 0 < m < k ≤ n we have

k ≤ n ⇒ m/n ≤ m/k ⇒ 1 − m/k ≤ 1 − m/n ⇒ (k − m)/k ≤ (n − m)/n ⇒ n/k ≤ (n − m)/(k − m);

thus

(n/k)^k = (n/k) ⋯ (n/k) ≤ (n/k) · ((n − 1)/(k − 1)) ⋯ ((n − k + 1)/(k − k + 1)) = (n choose k).

● (n choose k) ≤ (en/k)^k: Since e^k = ∑_{i=0}^∞ k^i/i! (the Taylor expansion of e^k), we have e^k > k^k/k!; thus, 1/k! < (e/k)^k. As a result,

(n choose k) = (n · (n − 1) ⋯ (n − k + 1))/k! ≤ n^k/k! < (en/k)^k.
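Claim B1 is also easy to spot-check numerically; a quick sketch:

from math import comb, e

# exhaustively verify (n/k)^k <= C(n, k) <= (e*n/k)^k over a small range
for n in range(1, 40):
    for k in range(1, n + 1):
        assert (n / k) ** k <= comb(n, k) <= (e * n / k) ** k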

Claim B2. VCdim(F_i) ≥ Pdim(H_i).

Proof of Claim B2. Denote S = (x_1, ..., x_m) ∈ X^m and r ∈ R^m such that H_i pseudo-shatters S with witness r. We prove the claim by showing that we can construct an S′ ∈ Z^m that is shattered by F_i. For every binary vector b ∈ {−1,1}^m there exists h_b such that sign(h_b(x_j) − r_j) = b_j for every j ∈ [m]. Denote H = {h_b}_{b∈{−1,1}^m} ⊆ H_i, with |H| = 2^m. For every j such that r_j ≥ 0, let

y_j = min{0, min_{h_b∈H} h_b(x_j)}.

In addition, for every j such that r_j < 0, let

y_j = max{0, max_{h_b∈H} h_b(x_j)},

and denote S′ = ((x_j, y_j, |r_j| − sign(r_j) · y_j))_{j=1}^m.

We now show that F_i shatters S′. Fix an arbitrary b ∈ {−1,1}^m, and observe that in case r_j ≥ 0,

b_j = 1 ⇒ h_b(x_j) ≥ r_j ⇒ h_b(x_j) − y_j ≥ r_j − y_j ⇒ |h_b(x_j) − y_j| ≥ r_j − y_j;
b_j = −1 ⇒ h_b(x_j) < r_j ⇒ h_b(x_j) − y_j < r_j − y_j ⇒ |h_b(x_j) − y_j| < r_j − y_j, (18)

where the last implications hold since h_b(x_j) ≥ y_j. Alternatively, if r_j < 0 we have

b_j = 1 ⇒ h_b(x_j) ≥ r_j ⇒ −h_b(x_j) + y_j ≤ −r_j + y_j ⇒ |h_b(x_j) − y_j| ≤ |r_j| + y_j;
b_j = −1 ⇒ h_b(x_j) < r_j ⇒ −h_b(x_j) + y_j > −r_j + y_j ⇒ |h_b(x_j) − y_j| > |r_j| + y_j, (19)

where again the last implications hold since h_b(x_j) ≤ y_j. In case one of Equations (18) and (19) holds with equality, we can slightly shift r_j (as was done in the proof of Claim A1); hence we assume these are strict inequalities. The expression in Equation (18) corresponds to I((x_j, y_j, r_j − y_j), h_b), while that of Equation (19) corresponds to I((x_j, y_j, |r_j| + y_j), h_b).

This analysis applies for every b; hence, F_i shatters S′ as required.

Claim B3. For a given h it holds that

Pr_{S′∼D^m}(|π_i(h) − π_i^{S′}(h)| ≤ ε/2) ≥ 1/2. (20)

Proof of Claim B3. Recall Chebyshev’s inequality:

Pr(|X − E[X]| ≥ ε) ≤ Var(X)/ε².

Applying it to our problem, we get

Pr_{S′∼D^m}(|π_i(h) − π_i^{S′}(h)| ≥ ε/2) ≤ Var(π_i^{S′}(h))/(ε²/4).

Notice that π_i^{S′}(h) is the average of m independent random variables bounded in the [0,1] segment; hence, by Popoviciu’s inequality on variances we have

Var(π_i^{S′}(h)) ≤ 1/(4m).

Finally, for m ≥ 2/ε² it holds that

Pr_{S′∼D^m}(|π_i(h) − π_i^{S′}(h)| ≤ ε/2) = 1 − Pr_{S′∼D^m}(|π_i(h) − π_i^{S′}(h)| ≥ ε/2) ≥ 1 − (1/(4m))/(ε²/4) = 1 − 1/(ε²m) ≥ 1/2.

Claim B4. Let m ≥ 15, let (X_i)_{i=1}^m be a sequence of i.i.d. Bernoulli random variables with p = 1/2, and let X̄ = (1/m) ∑_{i=1}^m X_i. Then Pr(1/2 < X̄ < 3/4) ≥ 1/4.

Proof of Claim B4. By Hoeffding’s inequality we have Pr(X̄ ≥ p + ε) ≤ exp(−2ε²m). Therefore, taking ε = 1/4,

Pr(1/2 < X̄ < 3/4) = Pr(X̄ < 3/4) − Pr(X̄ ≤ 1/2) = 1 − Pr(X̄ ≥ 3/4) − 1/2 ≥ 1/2 − e^{−m/8} ≥ 1/4,

where the last inequality holds for m ≥ 15.
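A quick Monte Carlo sanity check of Claim B4 (a sketch; the constants match the claim):

import random

def estimate_claim_b4(m: int = 15, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate Pr(1/2 < mean of m fair coin flips < 3/4); Claim B4 asserts
    this probability is at least 1/4 for every m >= 15."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.randint(0, 1) for _ in range(m)) / m
        if 0.5 < mean < 0.75:
            hits += 1
    return hits / trials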

C Best response for the linear strategy space

In this section we build on the results of [4] to devise a best-response oracle for the linear hypothesis class. Let H_i be the linear hypothesis class; namely, H_i is the function class such that each h_i ∈ H_i corresponds to the mapping x ↦ h_i · x.⁴ Notice that under this representation, H_i = R^n.

From here on we assume that the dimension n of the input is fixed. By slightly modifying the algorithm given in [4], we show how player i can compute a best response against any h_{−i}. Indeed, when n is fixed, our proposed algorithm is guaranteed to run in polynomial time. In particular, the running time of our algorithm is independent of the hypothesis-class complexity of the other players.

⁴Note that now a strategy of player i is itself a vector. To emphasize the vector arithmetic we use bold notation also for the instances, i.e. x.


As a first step, we consider the Partial Vector Feasibility problem (PVF).

Problem: PARTIAL VECTOR FEASIBILITY (PVF)
Input: a sequence of examples S = (x_j, y_j, t_j)_{j=1}^m, and a vector v ∈ {1, a, b, 0}^m
Output: a point h_i ∈ R^n satisfying
• if v_j = 1, then |h_i · x_j − y_j| < t_j
• if v_j = a, then h_i · x_j − y_j > t_j  // above
• if v_j = b, then h_i · x_j − y_j < −t_j  // below
• if v_j = 0, there is no constraint for the j-th point
if such a point exists, and φ otherwise.

Note that PVF is solvable in polynomial time via linear programming.
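A hedged sketch of PVF as such a linear program follows (using scipy; approximating the strict inequalities with a small margin EPS is an implementation assumption):

import numpy as np
from scipy.optimize import linprog

EPS = 1e-6  # margin approximating the strict inequalities of PVF

def pvf(S, v):
    """Return some h in R^n meeting the constraints encoded by v
    (1 = satisfy, "a" = above, "b" = below, 0 = unconstrained), or None."""
    A, b = [], []
    for (x, y, t), vj in zip(S, v):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        if vj == 1:      # |h.x - y| < t: h.x <= y + t - EPS and -h.x <= -(y - t) - EPS
            A.append(x)
            b.append(y + t - EPS)
            A.append(-x)
            b.append(-(y - t) - EPS)
        elif vj == "a":  # h.x - y > t
            A.append(-x)
            b.append(-(y + t) - EPS)
        elif vj == "b":  # h.x - y < -t
            A.append(x)
            b.append(y - t - EPS)
    n = len(np.atleast_1d(S[0][0]))
    if not A:
        return np.zeros(n)  # every entry of v is 0: any point is feasible
    res = linprog(np.zeros(n), A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(None, None)] * n, method="highs")
    return res.x if res.success else None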

We are now ready to present the Best Linear Response (BLR) algorithm. BLR has three main steps:

1. compute the potential payoff from each point in the sample;
2. find all feasible subsets of points player i can satisfy concurrently (i.e., vectors in F_i ∩ S, where F_i is as defined in Equation (1));
3. return a strategy that achieves the highest possible payoff.

The first step consists of a straightforward computation. To motivate the second step, notice that if I(z_j, h_i) = 0, then either h_i · x_j − y_j > t_j or h_i · x_j − y_j < −t_j holds. Therefore, we identify all vectors v = (v_1, ..., v_m) ∈ {1, a, b}^m such that there exists h_i ∈ H_i with v_j = 1 if I(z_j, h_i) = 1; v_j = a if h_i · x_j − y_j > t_j (“above”); and v_j = b if h_i · x_j − y_j < −t_j (“below”). This is done by recursive partition of {1, a, b}^m, where in each iteration we consider only partial vectors, i.e. vectors with entries masked with “0” (see PVF). At the end of this step we have fully identified the set F_i ∩ S, and have further information regarding the zero entries of each vector in it, namely whether each corresponds to “above” or “below” in the aforementioned sense. Finally, we have all possible payoffs, so we pick a vector corresponding to the highest one. We then find a strategy that attains it by invoking PVF one last time.

The above discussion is formulated via the following algorithm.

Algorithm 2: BEST LINEAR RESPONSE (BLR)
Input: S = (x_j, y_j, t_j)_{j=1}^m, h_{−i}
Output: a best response to h_{−i}
1 for every j ∈ [m], w_j ← 1/(∑_{i′≠i} I(z_j, h_{i′}) + 1)  // player i gets w_j when satisfying z_j
2 v ← (0, ..., 0)  // v = (v_1, v_2, ..., v_m)
3 R_0 ← {v}
4 for j = 1 to m do
5   R_j ← ∅
6   for v ∈ R_{j−1} do
7     for α ∈ {1, a, b} do
8       if PVF(S, (v_{−j}, α)) ≠ φ then
9         add (v_{−j}, α) to R_j  // (v_{−j}, α) = (v_1, ..., v_{j−1}, α, v_{j+1}, ..., v_m)
10 v* ← arg max_{v∈R_m} ∑_{j=1}^m w_j · 1_{v_j=1}  // one such vector must exist
11 return PVF(S, v*)

Theorem 2 in [4] shows that the second step (the for-loop in line 4) is done in time poly(m), and that R_m is of polynomial size. The first step (line 1) and the last step (lines 10 and 11) are clearly executed in polynomial time. Overall, BLR runs in polynomial time. In addition, since it considers all possible distinct strategies using F_i ∩ S and takes the one with the highest payoff, it indeed returns the best linear response with respect to h_{−i} in the empirical game.
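Finally, a Python sketch of Algorithm 2 on top of the pvf helper above (representing h_{−i} by precomputed per-example satisfaction counts is an assumption made for brevity):

def blr(S, sat_counts):
    """Best Linear Response sketch. sat_counts[j] is the number of other
    players whose predictions already satisfy example j under h_{-i};
    line 1 of Algorithm 2 turns these counts into the weights w_j."""
    m = len(S)
    w = [1.0 / (k + 1) for k in sat_counts]  # payoff for satisfying z_j
    frontier = [tuple([0] * m)]              # R_0 = {(0, ..., 0)}
    for j in range(m):                       # lines 4-9: grow R_j
        nxt = []
        for v in frontier:
            for alpha in (1, "a", "b"):
                cand = v[:j] + (alpha,) + v[j + 1:]
                if pvf(S, cand) is not None:
                    nxt.append(cand)
        frontier = nxt
    # line 10: the feasible vector with the highest total payoff
    best = max(frontier, key=lambda v: sum(wj for wj, vj in zip(w, v) if vj == 1))
    return pvf(S, best)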
