


Byzantine Fault-Tolerance in Federated Local SGD under 2f-Redundancy

Nirupam Gupta¹, Thinh T. Doan², Nitin Vaidya³

Abstract—We consider the problem of Byzantine fault-tolerance in federated machine learning. In this problem, the system comprises multiple agents, each with local data, and a trusted centralized coordinator. In the fault-free setting, the agents collaborate with the coordinator to find a minimizer of the aggregate of their local cost functions defined over their local data. We consider a scenario where some agents (f out of N) are Byzantine faulty. Such agents need not follow a prescribed algorithm correctly, and may communicate arbitrary incorrect information to the coordinator. In the presence of Byzantine agents, a more reasonable goal for the non-faulty agents is to find a minimizer of the aggregate cost function of only the non-faulty agents. This particular goal is commonly referred to as exact fault-tolerance. Recent work has shown that exact fault-tolerance is achievable if and only if the non-faulty agents satisfy the property of 2f-redundancy. Under this property, techniques are known to impart exact fault-tolerance to the distributed implementation of the classical stochastic gradient-descent (SGD) algorithm. However, we do not know of any such techniques for the federated local SGD algorithm, a more commonly used method for federated machine learning. To address this issue, we propose a novel technique named comparative elimination (CE). We show that, under 2f-redundancy, the federated local SGD algorithm with CE can indeed obtain exact fault-tolerance in the deterministic setting, when the non-faulty agents can accurately compute gradients of their local cost functions. In the general stochastic case, when agents can only compute unbiased noisy estimates of their local gradients, our algorithm achieves approximate fault-tolerance with an approximation error proportional to the variance of the stochastic gradients and the fraction of Byzantine agents.

Index Terms—Federated optimization, Byzantine fault-tolerance, local gradient-descent

I. INTRODUCTION

We consider a distributed optimization framework where there are N agents communicating with a single coordinator. Associated with each agent i is a function q_i : R^d → R. The goal of the agents is to find x⋆ such that

    x⋆ ∈ arg min_{x ∈ R^d} ∑_{i=1}^{N} q_i(x),    (1)

where each q_i is given as

    q_i(x) ≜ E_{π_i}[Q_i(x; X_i)],    (2)

for some random variable X_i defined over a sample set X_i with a distribution π_i. We assume that each agent i only has access to a sequence of vectors {G_i(·)}, which can be either the actual gradients {∇q_i(·)} or stochastic gradients {∇Q_i(·, X_i)} of its cost function q_i(·). This is a common distributed machine learning setting, where a large amount of data is distributed across different agents (or machines). The goal is to design an algorithm that allows these agents to jointly minimize a loss function defined over their data, i.e., to solve the optimization problem (1).

¹EPFL, Switzerland ([email protected]). ²Virginia Tech., USA ([email protected]). ³Georgetown University, USA ([email protected]).

For solving this problem, we consider the federated local stochastic gradient-descent (local SGD) method, which has recently received significant attention due to its applications in distributed learning [1], [2]. In this method, the coordinator maintains an estimate of a solution defined in (1). This estimate is broadcast to all the agents, and each agent updates its copy of the estimate by running a number of local SGD steps. The agents send their locally updated estimates back to the coordinator. Finally, the coordinator averages the received estimates to obtain a new global estimate of x⋆. Eventually, if all the agents are non-faulty, the sequence of global estimates converges to a solution of (1). Since the agents only share their local estimates and not their data, federated local SGD is widely used in privacy-conscious distributed learning [1].

Our interest is to study the federated local SGD algorithm when up to f agents are Byzantine faulty [3]. Such faulty agents may behave arbitrarily, and their identity is a priori unknown. In particular, Byzantine faulty agents may collude and share incorrect information with the coordinator in order to corrupt the output of the algorithm, e.g., see [4]. We aim to design a new local SGD method that allows all the non-faulty agents to compute an exact minimum of the aggregate cost of the non-faulty agents, despite the presence of Byzantine agents. Specifically, we consider the problem of exact fault-tolerance defined below. For a set H, |H| denotes its cardinality.

Definition 1 (Exact fault-tolerance). Let H with |H| ≥ N − f be the set of non-faulty agents. A distributed optimization algorithm is said to have exact fault-tolerance if it allows all the non-faulty agents to compute

    x⋆_H ∈ arg min_{x ∈ R^d} ∑_{i ∈ H} q_i(x).    (3)

Since the identity of the Byzantine faulty agents is a priori unknown, in general, exact fault-tolerance is unachievable [5]. Indeed, recent work has shown that exact fault-tolerance can be achieved if and only if the non-faulty agents satisfy the property of 2f-redundancy, defined as follows [6], [7].

Definition 2 (2f-redundancy). A set of non-faulty agents H, with |H| ≥ N − f, is said to have 2f-redundancy if for any subset S ⊆ H with |S| ≥ N − 2f,

    arg min_{x ∈ R^d} ∑_{i ∈ S} q_i(x) = arg min_{x ∈ R^d} ∑_{i ∈ H} q_i(x).    (4)
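As a concrete illustration of Definition 2 (not part of the paper's analysis), consider quadratic costs q_i(x) = ½‖x − b_i‖², whose aggregate over any set S is minimized at the mean of the b_i for i ∈ S. In the homogeneous setting where every b_i equals the same point x⋆, the sketch below checks condition (4) for every subset of size N − 2f; the cost model and all names are our own illustrative choices.

```python
import itertools
import numpy as np

# Illustration of 2f-redundancy with quadratic costs q_i(x) = 0.5*||x - b_i||^2,
# whose aggregate over a set S is minimized at mean(b_i for i in S).  In the
# homogeneous data setting (all b_i equal to x_star), every subset of size
# N - 2f recovers the same minimizer, so condition (4) holds for each subset.
N, f, d = 7, 2, 3
rng = np.random.default_rng(0)
x_star = rng.normal(size=d)
b = np.tile(x_star, (N, 1))           # homogeneous data: b_i = x_star

honest = range(N - f)                 # |H| = N - f non-faulty agents
full_min = b[list(honest)].mean(axis=0)
for S in itertools.combinations(honest, N - 2 * f):
    subset_min = b[list(S)].mean(axis=0)
    assert np.allclose(subset_min, full_min)   # condition (4)
```

In the non-homogeneous case the same check would be run on whatever minimizer structure the agents' data induces; the assertion fails exactly when 2f-redundancy is violated.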

arXiv:2108.11769v1 [cs.DC] 26 Aug 2021


The 2f-redundancy property is critical to our algorithm, presented in Section II, for solving (3). This property implies that a minimizer of the aggregate cost of any N − 2f non-faulty agents is also a minimizer of the aggregate cost of all the non-faulty agents, and vice-versa. This seemingly contrived condition arises naturally, with high probability, in many practical applications of the distributed optimization problem, such as distributed hypothesis testing [8]–[11] and distributed learning [12]–[15]. In the context of distributed learning, when all agents have the same data-generating distribution (i.e., the homogeneous data setting), 2f-redundancy holds true trivially. In general, however, 2f-redundancy also holds true in the non-homogeneous data setting as long as we can solve the non-faulty distributed learning problem (3) using information (or data) from only N − 2f non-faulty agents. More details on 2f-redundancy, including a formal proof of its necessity, can be found in [6], [7], [16].

Prior work indicates that solving the exact fault-tolerance problem (3) is nontrivial even under the 2f-redundancy property, especially in the high-dimensional case, e.g., see [6], [11], [17]. This can be attributed to two main factors: (i) the identity of Byzantine faulty agents is a priori unknown, and (ii) smart Byzantine agents can inject malicious information without getting detected (e.g., see [18]). Although there exist techniques that impart provable exact fault-tolerance to the distributed implementation of the SGD algorithm (in which the agents simply share their local gradients, and do not maintain local estimates) [6], [13], [19], we do not know of such techniques for the federated local SGD algorithm. This motivates us to propose a technique named comparative elimination (CE) that provably robustifies the federated local SGD algorithm against Byzantine agents. In the special deterministic setting, we show that CE can provide exact fault-tolerance. In the generic stochastic setting, we achieve approximate fault-tolerance with an approximation error proportional to the variance of stochastic gradients and the fraction of Byzantine agents f/N.

Intuitively speaking, the CE technique allows the coordinator to mitigate the detrimental impact of potentially adversarial estimates sent by the Byzantine agents. Specifically, instead of simply averaging the agents' local estimates, the coordinator eliminates the f estimates that are farthest from the current global estimate maintained by the coordinator. Then, the remaining N − f agents' estimates are averaged to obtain the new global estimate. Details of our scheme, along with its formal fault-tolerance properties, are presented in Section II.

We present below a summary of our main contributions, and then discuss the related literature.

A. Main Contributions

We formally analyse the Byzantine robustness of the federated local SGD algorithm when coupled with the aforementioned technique of comparative elimination (CE). Specifically, assuming each non-faulty agent's cost function to be L-smooth, the aggregate non-faulty cost to be µ-strongly convex (but the local costs may only be convex), and the necessary condition of 2f-redundancy, we present the following results:
• In the deterministic case, when each non-faulty agent i updates its local estimates using actual gradients of its cost {∇q_i(·)}, the CE filter scheme guarantees exact fault-tolerance if f/|H| ≤ µ/(3L). Moreover, the convergence is linear, similar to the fault-free setting under strong convexity.
• In the stochastic case, when a non-faulty agent i can only compute unbiased noisy estimates of its local gradients {∇Q_i(·, X_i)}, we guarantee approximate fault-tolerance. Specifically, the sequence of global estimates {x_k} satisfies the following for all k:

    E[‖x_k − x⋆_H‖²] ≤ λ^k E[‖x_0 − x⋆_H‖²] + O(σ²α + σ²f/(N − f))

for some λ ∈ (0, 1), where α is a constant step-size of the algorithm and σ² bounds the variance of the stochastic gradients.

Specific details of our results are given in Section III.

B. Related Work

In recent years, several schemes have been proposed for Byzantine fault-tolerance in distributed implementations of the SGD algorithm. Prominent schemes include coordinate-wise trimmed mean (CWTM) [5], [11], [19], [20], multi-KRUM [13], geometric median-of-means (GMoM) [21], coordinate-wise median [19], [22], Bulyan [15], minimum-diameter averaging (MDA) [15], Phocas [23], Byzantine-robust stochastic aggregation (RSA) [24], signSGD with majority voting [25], and spectral decomposition based filters [26], [27]. Most of these works, with the exception of [5], [11], [17], [20], [22], [24], only consider the framework of distributed SGD wherein the agents send gradients of their local costs, and their results are not readily applicable to the federated local SGD framework that we consider. Nevertheless, these works suggest that the aforementioned schemes need not guarantee exact fault-tolerance in general, even in the deterministic setting with 2f-redundancy, unless further assumptions are made on the non-faulty agents' costs.

Although [5], [22], [28] implicitly show exact fault-tolerance properties of trimmed-mean and median, they only consider the scalar case where agents' cost functions are univariate. The extension of their results to higher dimensions is non-trivial and remains poorly understood, e.g., see [5], [11], [17], [29]. For instance, [5] considers a degenerate case wherein the agents' cost functions have a known common basis. The prior work [11] shows that CWTM can guarantee exact fault-tolerance when solving the distributed linear least squares problem, provided the agents' data satisfies a condition stronger than 2f-redundancy. In [20], they assume that the agents' costs can be decomposed into independent scalar strictly convex functions. Recently, [17] studied the fault-tolerance of CWTM in a peer-to-peer setting (a generalization of the federated model) for generic convex optimization problems; their results suggest that CWTM need not provide exact fault-tolerance even under 2f-redundancy.

When applied to the federated local SGD framework, some of the above schemes, including multi-KRUM, Bulyan, CWTM, GMoM and MDA, operate only on the local estimates sent by the agents and disregard the current global estimate maintained by the coordinator [30]. On the other hand, our proposed CE filter exploits the (supposed) closeness between the current global estimate and the non-faulty agents' locally updated estimates to improve robustness against Byzantine agents. The closeness between the global and non-faulty agents' local estimates exists due to the Lipschitz smoothness of the agents' local cost functions. This is a critical observation for the fault-tolerance property of our algorithm. Other works that also exploit this observation are RSA [24] and [31].

Recently, [32] has shown that the geometric median aggregation scheme provably provides improved Byzantine fault-tolerance compared to other aggregation schemes in the federated model. However, computing the geometric median is a challenging problem, as there does not exist a closed-form formula [33]. Moreover, existing numerical algorithms for computing the geometric median are only approximate, and computationally quite complex [34]. Other schemes, such as the verifiable coding in [35] and manual verification of information sent by agents [36], are not directly applicable to the commonly used federated framework, where inter-agent communication is absent or there are a large number of agents, and data privacy is a major concern.

Besides the federated local SGD framework, the proposed CE filter can also guarantee exact fault-tolerance in the distributed SGD framework, where agents share their gradients instead of estimates (references omitted to preserve the authors' anonymity). Also, for the distributed SGD framework, recent works have shown that momentum helps improve the fault-tolerance of a Byzantine-robust aggregation scheme [37], [38]. However, the adaptation of their results to the federated local SGD method is non-trivial and remains to be investigated.

II. LOCAL SGD UNDER BYZANTINE MODEL

We now present the proposed algorithm for solving (1) in the presence of at most f Byzantine faulty agents. We note that the Byzantine agents can observe the values of other agents and send arbitrary values to the coordinator. To handle this scenario, the main idea of our approach is a Byzantine-robust aggregation rule (or filter), named the comparative elimination (CE) filter, which is implemented at the coordinator. This filter, together with local SGD, forms our proposed method, formally presented in Algorithm 1 for solving (3).

In Algorithm 1, each agent i maintains a local variable x^i, and the coordinator maintains x, the average of these x^i. At any iteration k ≥ 0, agent i receives x_k from the coordinator and initializes its iterate x^i_{k,0} = x_k. Here x^i_{k,t} denotes the iterate at iteration k and local time t ∈ [0, . . . , T − 1] at agent i. Agent i then runs a number T of local SGD steps using time-varying step sizes α_k and its local direction G_i(x^i_{k,t}), which can be either the actual gradient ∇q_i(x^i_{k,t}) or a stochastic estimate ∇Q_i(x^i_{k,t}; X^i_{k,t}) of its gradient, based on the data {X^i_{k,t}} sampled i.i.d. from π_i. After T local SGD steps, the agents send their new local updates x^i_{k,T} to the coordinator. However, Byzantine agents may send arbitrary values to disrupt the learning process. The coordinator implements the CE filter (in steps 2(a) and 2(b) of Algorithm 1) to dilute the impact of "bad" values sent by the Byzantine agents. The main idea of this filter is to discard the f values (or estimates) that are farthest from the current global estimate x_k. Finally, the coordinator

Algorithm 1: Local SGD with CE Filter

1. Initialization: The coordinator initializes x_0 ∈ R^d. Agent i initializes the step sizes {α_k} and a positive integer T.
2. Iterations: For k = 0, 1, 2, . . .
   1) Agent i:
      a) Receives x_k sent by the server and sets x^i_{k,0} = x_k.
      b) For t = 0, 1, . . . , T − 1, implements

         x^i_{k,t+1} = x^i_{k,t} − α_k G_i(x^i_{k,t}).    (5)

   2) The coordinator receives x^i_{k,T} from each agent i and implements the CE filter as follows.
      a) Computes the distances of the x^i_{k,T} from its current value x_k, and sorts them in increasing order:

         ‖x_k − x^{i_1}_{k,T}‖ ≤ . . . ≤ ‖x_k − x^{i_N}_{k,T}‖.    (6)

      b) Discards the f largest distances, i.e., drops x^{i_{N−f+1}}_{k,T}, . . . , x^{i_N}_{k,T}. Let F_k = {i_1, . . . , i_{N−f}}.
      c) Updates its iterate as

         x_{k+1} = (1/|F_k|) ∑_{i ∈ F_k} x^i_{k,T}.    (7)

averages the N − f remaining estimates, as shown in (7), to compute the new global estimate. Note that without the CE filter (i.e., without steps 2(a) and 2(b), and with F_k = {1, . . . , N}), Algorithm 1 reduces to the traditional local GD method.
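The steps above can be sketched in a few lines of Python. This is a minimal re-implementation of ours, not the authors' code: `ce_aggregate` performs steps 2(a)–2(c) (sort received estimates by distance to the current global estimate, drop the f farthest, average the rest), and `local_round` simulates one iteration k with honest agents running the local steps (5) while Byzantine agents send arbitrary vectors. The quadratic test costs at the end are an illustrative choice.

```python
import numpy as np

# Sketch of one round of Algorithm 1; function names and test costs are
# our own illustrative choices, not from the paper's (unpublished) code.
def ce_aggregate(x_k, estimates, f):
    dists = np.linalg.norm(estimates - x_k, axis=1)   # distances in (6)
    kept = np.argsort(dists)[: len(estimates) - f]    # F_k in step 2(b)
    return estimates[kept].mean(axis=0)               # averaging step (7)

def local_round(x_k, grads, alpha, T, f, byzantine):
    # Honest agents run T local GD steps (5) from x_k; Byzantine agents
    # send an arbitrary vector instead.
    sent = []
    for i, grad in enumerate(grads):
        if i in byzantine:
            sent.append(x_k + 100.0)                  # arbitrary faulty value
        else:
            x = x_k.copy()
            for _ in range(T):
                x = x - alpha * grad(x)
            sent.append(x)
    return ce_aggregate(x_k, np.array(sent), f)

# Toy run: 4 honest agents share the cost 0.5*||x - x_star||^2 (so
# 2f-redundancy holds trivially) and 1 Byzantine agent is filtered out.
x_star = np.ones(10)
grads = [lambda x: x - x_star] * 5
x = np.zeros(10)
for _ in range(200):
    x = local_round(x, grads, alpha=0.1, T=2, f=1, byzantine={4})
```

Because the honest updates stay within a gradient step of x_k while the faulty vector is far away, the filter discards the Byzantine estimate in every round, and the iterates converge to x⋆.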

III. MAIN RESULTS

In this section, we present the main results of this paper, where we characterize the convergence of Algorithm 1 for solving problem (3). We consider two cases, namely, the deterministic setting (when G_i(·) = ∇q_i(·)) and the stochastic setting (when G_i(·) = ∇Q_i(·, X_i)).¹ In both cases, our theoretical results are derived when the non-faulty agents' cost functions are smooth. Moreover, we also assume the average non-faulty cost function, denoted by q_H(x), to be strongly convex. Specifically,

    q_H(x) = (1/|H|) ∑_{i ∈ H} q_i(x).    (8)

These assumptions are formally stated as follows.

Assumption 1 (Lipschitz smoothness). The non-faulty agents' cost functions have Lipschitz continuous gradients, i.e., there exists a positive constant L < ∞ such that, ∀i ∈ H,

    ‖∇q_i(x) − ∇q_i(y)‖ ≤ L ‖x − y‖, ∀x, y ∈ R^d.

Assumption 2 (Strong convexity). q_H is strongly convex, i.e., there exists a positive constant µ < ∞ such that

    (x − y)^T (∇q_H(x) − ∇q_H(y)) ≥ µ ‖x − y‖², ∀x, y ∈ R^d,

where (·)^T denotes the transpose.

¹Proofs of all the theorems presented in this section are deferred to the appendix attached after the list of references.


Throughout the rest of the paper, we assume that these assumptions and the 2f-redundancy property hold true. In addition, without loss of generality, we consider |H| = N − f. Finally, note that Assumptions 1 and 2 hold true simultaneously only if µ ≤ L.

Remark 1. Assumption 2 implies that there exists a unique solution x⋆_H of problem (3). However, this assumption does not imply that each local function q_i is strongly convex. Indeed, each q_i can have more than one minimizer. Under the 2f-redundancy property, one can show that

    x⋆_H ∈ ⋂_{i ∈ H} arg min_{x ∈ R^d} q_i(x).    (9)

Thus, one can view Algorithm 1 as trying to find a point in the intersection of the minimizer sets of the local functions q_i. However, we do not assume that we can compute these sets, since this task is intractable in general. Finally, our analysis given later will rely on (9), whose proof can be found in [6, Appendix B].

A. Deterministic Settings

In this section, we consider the deterministic setting of Algorithm 1, i.e., G_i(·) = ∇q_i(·). For convenience, we first study the convergence of Algorithm 1 when T = 1 in Section III-A1, and generalize to the case T > 1 in Section III-A2.

1) The case of T = 1: When T = 1, Algorithm 1 is equivalent to the popular distributed (stochastic) gradient method. Indeed, by (5) we have, for any i ∈ H,

    x^i_{k,1} = x^i_{k,0} − α_k ∇q_i(x^i_{k,0}) = x_k − α_k ∇q_i(x_k).    (10)

We denote by B the set of Byzantine agents, i.e., N = |B| + |H| and |B| ≤ f. Without loss of generality, we assume that |B| = f. Similarly, let B_k be the set of Byzantine agents in F_k and H_k be the set of non-faulty agents in F_k. Then we have |B_k| = |F_k \ H_k| ≤ f, for any k ≥ 0.

Theorem 1. Let {x_k} be generated by Algorithm 1 with T = 1. We assume that the following condition holds:

    f/(N − f) ≤ µ/(3L).    (11)

Let α_k be chosen as

    α_k = α ≤ µ/(4L²).    (12)

Then we have

    ‖x_k − x⋆_H‖² ≤ (1 − µα/6)^k ‖x_0 − x⋆_H‖².    (13)

Remark 2. In Theorem 1 we show that, under 2f-redundancy, Algorithm 1 returns an exact solution x⋆_H of problem (3) even in the presence of at most f Byzantine agents. Moreover, the convergence is linear, which is the same as what we expect in the fault-free case (no Byzantine agents).
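The contraction in (13) can be checked numerically on a toy instance of our own construction: quadratic costs q_i(x) = ½‖x − x⋆‖² give µ = L = 1, so condition (11) reads f/(N − f) ≤ 1/3 and (12) allows α ≤ 1/4. The sketch below runs Algorithm 1 with T = 1 against f = 2 Byzantine agents and asserts that the squared error never exceeds the bound (13); the attack vector and constants are illustrative assumptions, not from the paper's experiments.

```python
import numpy as np

# Toy check of Theorem 1: quadratic costs with mu = L = 1, N = 9, f = 2,
# so (11) holds (2/7 <= 1/3) and alpha = 1/4 satisfies (12).  Byzantine
# agents send large arbitrary vectors; the CE filter drops them, and the
# squared error contracts at least at the rate (1 - mu*alpha/6)^k of (13).
N, f, d, alpha = 9, 2, 5, 0.25
x_star = np.ones(d)
x = np.zeros(d)
bound = np.linalg.norm(x - x_star) ** 2
for k in range(100):
    honest = [x - alpha * (x - x_star) for _ in range(N - f)]  # eq. (10)
    sent = np.array(honest + [x + 50.0, x - 50.0])             # f = 2 faulty
    kept = np.argsort(np.linalg.norm(sent - x, axis=1))[: N - f]
    x = sent[kept].mean(axis=0)                                # CE + (7)
    bound *= 1 - alpha / 6                                     # rate in (13)
    assert np.linalg.norm(x - x_star) ** 2 <= bound + 1e-12
```

With these costs the honest updates contract the error by the factor (1 − α) per round, which is well inside the guaranteed rate, so the assertion holds at every iteration.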

2) The case of T > 1: We now generalize Theorem 1 to the case T > 1, i.e., each agent implements more than one local GD step. This is indeed a common practice in federated optimization. When T > 1, by (5) we have, ∀i ∈ H and t ∈ [0, T),

    x^i_{k,t+1} = x_k − α_k ∑_{ℓ=0}^{t} ∇q_i(x^i_{k,ℓ}).    (14)
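The unrolled form (14) is an identity obtained by telescoping the recursion (5); a quick numeric sanity check, with a hypothetical quadratic cost of our choosing, is:

```python
import numpy as np

# Numeric check of the unrolled update (14): iterating the local step (5)
# with G_i = grad(q_i) equals x_k minus alpha_k times the sum of the
# gradients along the local trajectory.  The quadratic cost (SPD Hessian A)
# is an illustrative choice of ours, not from the paper.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
A = A @ A.T + np.eye(4)              # symmetric positive-definite Hessian
grad = lambda x: A @ x               # gradient of q_i(x) = 0.5 * x^T A x

alpha, T = 0.05, 3
x_k = rng.normal(size=4)

x, trajectory_grads = x_k.copy(), []
for _ in range(T):                   # left-hand side: iterate (5)
    g = grad(x)
    trajectory_grads.append(g)
    x = x - alpha * g

rhs = x_k - alpha * np.sum(trajectory_grads, axis=0)   # right-hand side of (14)
assert np.allclose(x, rhs)
```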

Theorem 2. Assume that (11) holds, and let α_k satisfy

    α_k = α ≤ µ/(16TL²).    (15)

Then we have

    ‖x_k − x⋆_H‖² ≤ (1 − µTα/6)^k ‖x_0 − x⋆_H‖².    (16)

B. Stochastic Settings

We next consider the setting where each agent only has access to samples of its gradient, i.e., G_i(·) = ∇Q_i(·, X_i), where X_i is a sequence of random variables sampled i.i.d. from π_i. In the sequel, we denote by

    P_{k,t} = ∪_{i ∈ H} {x_0, . . . , x_k, x^i_{k,1}, . . . , x^i_{k,t}}

the filtration containing all the history generated by Algorithm 1 up to time k + t. To study the convergence of Algorithm 1, we consider the following assumption, which is often adopted in the literature on stochastic federated optimization [1].

Assumption 3. The random variables X^i_{k,t}, for all i, k, and t, are i.i.d., and there exists a positive constant σ such that

    E[∇Q_i(x, X^i_{k,t}) | P_{k,t}] = ∇q_i(x), ∀x ∈ R^d,
    E[‖∇Q_i(x, X^i_{k,t}) − ∇q_i(x)‖² | P_{k,t}] ≤ σ², ∀x ∈ R^d.
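A toy oracle satisfying Assumption 3, built from a construction of our own (the cost, noise model, and names are illustrative assumptions): perturb the true gradient with zero-mean Gaussian noise whose total variance is σ², then verify both conditions empirically.

```python
import numpy as np

# Toy stochastic gradient oracle satisfying Assumption 3: the estimate
# grad(x) + Z with Z ~ N(0, (sigma^2/d) I) is unbiased, and the expected
# squared deviation E||stoch_grad(x) - grad(x)||^2 equals sigma^2.
rng = np.random.default_rng(2)
d, sigma = 4, 0.5
grad = lambda x: 2.0 * x                       # gradient of q_i(x) = ||x||^2

def stoch_grad(x):
    return grad(x) + rng.normal(scale=sigma / np.sqrt(d), size=d)

x = rng.normal(size=d)
samples = np.array([stoch_grad(x) for _ in range(200_000)])
bias = samples.mean(axis=0) - grad(x)
var = np.mean(np.sum((samples - grad(x)) ** 2, axis=1))
assert np.linalg.norm(bias) < 0.02             # empirically unbiased
assert abs(var - sigma ** 2) < 0.02            # second moment close to 0.25
```

In practice, sampling a mini-batch of m data points and averaging their gradients yields the same kind of oracle with the variance reduced to σ²/m, which is the mechanism invoked in Remark 3.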

Recall that |B_k| + |H_k| = |F_k| = |H|, for any k ≥ 0. Finally, for convenience we denote by

    ∇Q_H(x; X) = (1/|H|) ∑_{i ∈ H} ∇Q_i(x; X^i),

where X = (X^1, . . . , X^{|H|})^T.

1) The case of T = 1:

Theorem 3. Suppose that Assumption 3 and condition (11) hold. Moreover, let α_k be chosen as

    α_k = α ≤ µ/(12L²).    (17)

Then we have

    E[‖x_k − x⋆_H‖²] ≤ (1 − µα/6)^k E[‖x_0 − x⋆_H‖²] + 14σ²α/µ + 2σ²f/(µL|H|).    (18)

Remark 3. Note that in Theorem 3, due to the constant step size, the mean square error converges linearly only to a ball centered at the origin. The size of this ball depends on two terms: 1) one depends on the step size α, as is often seen in the convergence of local GD with non-faulty agents, and 2) the other depends on the level of the gradient noise (or σ). The latter is due to the impact of the Byzantine agents and the stochastic gradient samples. Indeed, our comparative filter is designed to remove the potentially bad values sent by the Byzantine agents, but not the variance of their stochastic samples. A potential solution for this issue is to let each agent sample a mini-batch of size m, in which case σ² in (21) is replaced by σ²/m. Thus, one can choose m large enough so that the mean square error gets arbitrarily close to zero. Finally, when α_k ∼ 1/k, we can show that the convergence rate is sublinear, O(1/k).

Fig. 1. The plots show the error ‖x_k − x⋆‖² in iterations k = 1, . . . , 120 of local GD (cf. Algorithm 1) with five different aggregation schemes: averaging, CE, multi-KRUM, CWTM, and coordinate-wise median. The benchmark corresponds to the fault-free execution of local GD. Solid lines show the mean performances of the schemes, and the shadows show the variance of their performances, observed over 100 runs. In clockwise order, f = 8, 12, 16 and 20. We observe that, expectedly, all schemes obtain improved accuracy in the presence of fewer Byzantine agents. We also observe that the performance of the CE filter is consistently better than that of the other schemes.

Fig. 2. The leftmost plot is for f = 20 and T = 2; the case with T = 1 is shown in Figure 1. The middle and the rightmost plots are for (f = 24, T = 1) and (f = 24, T = 2), respectively. We observe that the accuracy of the local GD method with the CE filter improves considerably as the number of local GD steps T is increased from 1 to 2. The same cannot be observed for the other schemes.

2) The case of T > 1: We next generalize Theorem 3 to the case when each agent implements more than one local SGD step, i.e., T > 1. By (5), for all i ∈ H and t ∈ [0, T),

    x^i_{k,t+1} = x_k − α_k ∑_{ℓ=0}^{t} ∇Q_i(x^i_{k,ℓ}; X^i_{k,ℓ}).    (19)

Theorem 4. Suppose that Assumption 3 and condition (11) hold true. Moreover, let α_k be chosen as

    α_k = α ≤ µ/(144TL²).    (20)

Then we have

    E[‖x_k − x⋆_H‖²] ≤ (1 − µTα/18)^k E[‖x_0 − x⋆_H‖²] + 162σ²Tα/µ + 54σ²f/(µL|H|).    (21)

IV. EXPERIMENTS

To evaluate the efficacy of our proposed scheme, we simulate the problem of robust mean estimation in the federated framework. This problem serves as a test case to empirically compare our scheme with others of similar computational costs, namely multi-KRUM [13], CWTM [19], [39], and coordinate-wise median [19], [40]. For our experiments, we consider N = 50 agents and a varying number of Byzantine faulty agents. Each non-faulty agent i has 100 noisy observations of a 10-dimensional vector x⋆ with all elements of unit value. In particular, the sample set X_i comprises 100 identically distributed samples, with each sample X_i = x⋆ + Z_i where Z_i ∼ N(0, I_d), and Q_i(x; X_i) = (1/2)‖x − X_i‖². In this case, x⋆ is the unique solution to problem (3) for any set of honest agents H. In our experimental settings, a Byzantine faulty agent j behaves just like an honest agent with 100 identically distributed samples, except that each of its samples X_j = 2x⋆ + Z_j where Z_j ∼ N(0, I_d). That is, honest agents send information corresponding to Gaussian noisy observations of x⋆, and Byzantine agents send information corresponding to Gaussian noisy observations (with identical variance) of 2x⋆.

We simulate the stochastic setting of local GD (cf. Algorithm 1) with different numbers of faulty agents f ∈ {8, 12, 16, 20, 24}, different values of T ∈ {1, 2}, and different aggregation schemes in Step 2: CE, multi-KRUM, CWTM, coordinate-wise median, and simple averaging. The step size α_k = 0.1 for all k. Each setting is run 100 times, and the observed errors ‖x_k − x⋆‖² for k = 1, . . . , 120 are shown in Figures 1 and 2.
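A sketch re-implementation of one run of this experiment (our own code, following the description above; the paper's experiment scripts are not reproduced here). Since Q_i(x; X) = ½‖x − X‖², the stochastic gradient is simply x − X, and here the Byzantine agents mount the described attack by training on their shifted samples rather than sending arbitrary vectors.

```python
import numpy as np

# One run of the robust mean-estimation experiment for f = 8, T = 1:
# honest agents hold samples x_star + Z, Byzantine agents 2*x_star + Z,
# with Z ~ N(0, I) and stochastic gradient x - X.
rng = np.random.default_rng(3)
N, f, d, T, alpha = 50, 8, 10, 1, 0.1
x_star = np.ones(d)
data = [x_star + rng.normal(size=(100, d)) for _ in range(N - f)]
data += [2 * x_star + rng.normal(size=(100, d)) for _ in range(f)]

x = np.zeros(d)
for k in range(120):
    sent = []
    for i in range(N):                       # every agent runs T local steps
        xi = x.copy()
        for _ in range(T):
            X = data[i][rng.integers(100)]   # draw one local sample
            xi -= alpha * (xi - X)           # local SGD step (5)
        sent.append(xi)
    sent = np.array(sent)
    kept = np.argsort(np.linalg.norm(sent - x, axis=1))[: N - f]
    x = sent[kept].mean(axis=0)              # CE filter + averaging (7)
print(float(np.linalg.norm(x - x_star) ** 2))
```

Averaging this residual error over repeated runs, and sweeping f and T, reproduces the kind of comparison reported in Figures 1 and 2.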

Conclusion: As suggested by our theoretical results, the final error upon using the CE aggregation scheme decreases with the fraction of Byzantine faulty agents. We observe that the CE aggregation scheme performs consistently better than multi-KRUM, CWTM, and coordinate-wise median. Moreover, we also observe that increasing the number of local gradient-descent steps T improves the fault-tolerance of the CE aggregation scheme. However, the same cannot be said for the other schemes.

SUMMARY

In this paper, we have considered the problem of Byzantine fault-tolerance in the federated local stochastic gradient-descent method. We have proposed a new aggregation scheme, named comparative elimination (CE), and studied its fault-tolerance properties in both deterministic and stochastic settings. In the deterministic setting, we have shown that the CE filter guarantees exact fault-tolerance against a bounded fraction of Byzantine agents f/N, provided the non-faulty agents' costs satisfy the necessary condition of 2f-redundancy. In the stochastic setting, we have shown that the CE filter obtains approximate fault-tolerance, where the approximation error is proportional to the variance of the agents' stochastic gradients and the fraction of Byzantine agents.

REFERENCES

[1] P. Kairouz and H. B. McMahan, "Advances and open problems in federated learning," Foundations and Trends® in Machine Learning, vol. 14, no. 1, 2021.

[2] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated learning: Challenges, methods, and future directions," IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.

[3] L. Lamport, R. Shostak, and M. Pease, "The Byzantine generals problem," ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 4, no. 3, pp. 382–401, 1982.

[4] C. Xie, K. Huang, P.-Y. Chen, and B. Li, "DBA: Distributed backdoor attacks against federated learning," in International Conference on Learning Representations, 2019.

[5] L. Su and N. H. Vaidya, "Fault-tolerant multi-agent optimization: optimal iterative distributed algorithms," in Proceedings of the 2016 ACM Symposium on Principles of Distributed Computing. ACM, 2016, pp. 425–434.

[6] N. Gupta and N. H. Vaidya, "Fault-tolerance in distributed optimization: The case of redundancy," in The 39th Symposium on Principles of Distributed Computing, 2020, pp. 365–374.

[7] ——, "Resilience in collaborative optimization: Redundant and independent cost functions," arXiv preprint arXiv:2003.09675, 2020.

[8] M. S. Chong, M. Wakaiki, and J. P. Hespanha, "Observability of linear systems under adversarial attacks," in American Control Conference. IEEE, 2015, pp. 2439–2444.

[9] N. Gupta and N. H. Vaidya, "Byzantine fault tolerant distributed linear regression," arXiv preprint arXiv:1903.08752, 2019.

[10] S. Mishra, Y. Shoukry, N. Karamchandani, S. N. Diggavi, and P. Tabuada, "Secure state estimation against sensor attacks in the presence of noise," IEEE Transactions on Control of Network Systems, vol. 4, no. 1, pp. 49–59, 2016.

[11] L. Su and S. Shahrampour, "Finite-time guarantees for Byzantine-resilient distributed state estimation with noisy measurements," IEEE Transactions on Automatic Control, vol. 65, no. 9, pp. 3758–3771, 2019.

[12] D. Alistarh, Z. Allen-Zhu, and J. Li, "Byzantine stochastic gradient descent," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 4618–4628.

[13] P. Blanchard, R. Guerraoui, et al., "Machine learning with adversaries: Byzantine tolerant gradient descent," in Advances in Neural Information Processing Systems, 2017, pp. 119–129.

[14] M. Charikar, J. Steinhardt, and G. Valiant, "Learning from untrusted data," in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, 2017, pp. 47–60.

[15] R. Guerraoui, S. Rouault, et al., "The hidden vulnerability of distributed learning in Byzantium," in International Conference on Machine Learning. PMLR, 2018, pp. 3521–3530.

[16] S. Liu, N. Gupta, and N. H. Vaidya, "Approximate Byzantine fault-tolerance in distributed optimization," arXiv preprint arXiv:2101.09337, 2021.

[17] K. Kuwaranancharoen, L. Xin, and S. Sundaram, "Byzantine-resilient distributed optimization of multi-dimensional functions," in 2020 American Control Conference (ACC). IEEE, 2020, pp. 4399–4404.

[18] C. Xie, O. Koyejo, and I. Gupta, "Fall of empires: Breaking Byzantine-tolerant SGD by inner product manipulation," in Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, ser. Proceedings of Machine Learning Research, R. P. Adams and V. Gogate, Eds., vol. 115. PMLR, 22–25 Jul 2020, pp. 261–270. [Online]. Available: https://proceedings.mlr.press/v115/xie20a.html

[19] D. Yin, Y. Chen, K. Ramchandran, and P. Bartlett, "Byzantine-robust distributed learning: Towards optimal statistical rates," in International Conference on Machine Learning, 2018, pp. 5636–5645.

[20] Z. Yang and W. U. Bajwa, "ByRDiE: Byzantine-resilient distributed coordinate descent for decentralized learning," IEEE Transactions on Signal and Information Processing over Networks, vol. 5, no. 4, pp. 611–627, 2019.

[21] Y. Chen, L. Su, and J. Xu, "Distributed statistical machine learning in adversarial settings: Byzantine gradient descent," Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 1, no. 2, pp. 1–25, 2017.

[22] S. Sundaram and B. Gharesifard, "Distributed optimization under adversarial nodes," IEEE Transactions on Automatic Control, 2018.

[23] C. Xie, O. Koyejo, and I. Gupta, "Phocas: dimensional Byzantine-resilient stochastic gradient descent," CoRR, vol. abs/1805.09682, 2018. [Online]. Available: http://arxiv.org/abs/1805.09682

[24] L. Li, W. Xu, T. Chen, G. B. Giannakis, and Q. Ling, "RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 1544–1551.

[25] J.-y. Sohn, D.-J. Han, B. Choi, and J. Moon, "Election coding for distributed learning: Protecting SignSGD against Byzantine attacks," Advances in Neural Information Processing Systems, vol. 33, 2020.

[26] I. Diakonikolas, G. Kamath, D. Kane, J. Li, J. Steinhardt, and A. Stewart, "Sever: A robust meta-algorithm for stochastic optimization," in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 09–15 Jun 2019, pp. 1596–1606. [Online]. Available: http://proceedings.mlr.press/v97/diakonikolas19a.html

[27] A. Prasad, A. S. Suggala, S. Balakrishnan, and P. Ravikumar, "Robust estimation via robust gradient estimation," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 82, no. 3, pp. 601–627, 2020.

[28] L. Su and N. H. Vaidya, "Byzantine-resilient multi-agent optimization," IEEE Transactions on Automatic Control, 2020.

[29] Z. Yang and W. U. Bajwa, "ByRDiE: Byzantine-resilient distributed coordinate descent for decentralized learning," 2017.

[30] M. Fang, X. Cao, J. Jia, and N. Gong, "Local model poisoning attacks to Byzantine-robust federated learning," in 29th USENIX Security Symposium (USENIX Security 20), 2020, pp. 1605–1622.

[31] L. Munoz-Gonzalez, K. T. Co, and E. C. Lupu, "Byzantine-robust federated machine learning through adaptive model averaging," arXiv preprint arXiv:1909.05125, 2019.

[32] Z. Wu, Q. Ling, T. Chen, and G. B. Giannakis, "Federated variance-reduced stochastic gradient descent with robustness to Byzantine attacks," IEEE Transactions on Signal Processing, vol. 68, pp. 4583–4596, 2020.

[33] C. Bajaj, "The algebraic degree of geometric optimization problems," Discrete & Computational Geometry, vol. 3, no. 2, pp. 177–191, 1988.

[34] M. B. Cohen, Y. T. Lee, G. Miller, J. Pachocki, and A. Sidford, "Geometric median in nearly linear time," in Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, 2016, pp. 9–21.

[35] J. So, B. Guler, and A. S. Avestimehr, "Byzantine-resilient secure federated learning," IEEE Journal on Selected Areas in Communications, 2020.

[36] X. Cao, M. Fang, J. Liu, and N. Z. Gong, "FLTrust: Byzantine-robust federated learning via trust bootstrapping," arXiv preprint arXiv:2012.13995, 2020.

[37] E. M. E. Mhamdi, R. Guerraoui, and S. Rouault, "Distributed momentum for Byzantine-resilient stochastic gradient descent," in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=H8UHdhWG6A3

[38] S. P. Karimireddy, L. He, and M. Jaggi, "Learning from history for Byzantine robust optimization," CoRR, vol. abs/2012.10333, 2020. [Online]. Available: https://arxiv.org/abs/2012.10333

[39] L. Su and S. Shahrampour, "Finite-time guarantees for Byzantine-resilient distributed state estimation with noisy measurements," arXiv preprint arXiv:1810.10086, 2018.

[40] C. Xie, O. Koyejo, and I. Gupta, "Generalized Byzantine-tolerant SGD," arXiv preprint arXiv:1802.10116, 2018.

APPENDIX

A. Proof of Theorem 1

Proof. Using (7) and (10) we have
\[
\begin{aligned}
x_{k+1} &= \frac{1}{|F_k|}\sum_{i\in F_k} x^i_{k,1}
= \frac{1}{|H|}\Big[\sum_{i\in H} x^i_{k,1} + \sum_{i\in B_k} x^i_{k,1} - \sum_{i\in H\setminus H_k} x^i_{k,1}\Big] \\
&\overset{(10)}{=} x_k - \frac{\alpha_k}{|H|}\sum_{i\in H}\nabla q_i(x_k) + \frac{1}{|H|}\Big[\sum_{i\in B_k} x^i_{k,1} - \sum_{i\in H\setminus H_k} x^i_{k,1}\Big] \\
&= x_k - \alpha_k \nabla q_H(x_k) + \frac{1}{|H|}\Big[\sum_{i\in B_k}(x^i_{k,1}-x_k) - \sum_{i\in H\setminus H_k}(x^i_{k,1}-x_k)\Big], \qquad (22)
\end{aligned}
\]
where the last equality is due to $|B_k| = |H\setminus H_k|$. Using the preceding relation we consider
\[
\begin{aligned}
\|x_{k+1}-x^\star_H\|^2 &= \|x_k - x^\star_H - \alpha_k\nabla q_H(x_k)\|^2
+ \Big\|\frac{1}{|H|}\sum_{i\in B_k}(x^i_{k,1}-x_k) - \frac{1}{|H|}\sum_{i\in H\setminus H_k}(x^i_{k,1}-x_k)\Big\|^2 \\
&\quad + \frac{2}{|H|}\sum_{i\in B_k}\big(x_k - x^\star_H - \alpha_k\nabla q_H(x_k)\big)^T(x^i_{k,1}-x_k) \\
&\quad - \frac{2}{|H|}\sum_{i\in H\setminus H_k}\big(x_k - x^\star_H - \alpha_k\nabla q_H(x_k)\big)^T(x^i_{k,1}-x_k). \qquad (23)
\end{aligned}
\]
We next analyze each term on the right-hand side of (23). First, using (10) we have for all $i\in H$
\[
x^i_{k,1}-x_k = -\alpha_k\nabla q_i(x_k).
\]
In addition, under $2f$-redundancy we have $x^\star_H\in X^\star_i$, where $X^\star_i$ is the set of minimizers of $q_i$. This implies that $\nabla q_i(x^\star_H)=0$. Moreover, Assumption 1 implies that $\nabla q_H$ is also $L$-Lipschitz continuous. Recall that $|B_k| = |H\setminus H_k|$. Then, by Assumption 1, the last term on the right-hand side of (23) satisfies
\[
\begin{aligned}
&-\frac{2}{|H|}\sum_{i\in H\setminus H_k}\big(x_k-x^\star_H-\alpha_k\nabla q_H(x_k)\big)^T(x^i_{k,1}-x_k) \\
&= \frac{2\alpha_k}{|H|}\sum_{i\in H\setminus H_k}\big(x_k-x^\star_H-\alpha_k\nabla q_H(x_k)\big)^T\nabla q_i(x_k) \\
&= \frac{2\alpha_k}{|H|}\sum_{i\in H\setminus H_k}\big(x_k-x^\star_H-\alpha_k(\nabla q_H(x_k)-\nabla q_H(x^\star_H))\big)^T\big(\nabla q_i(x_k)-\nabla q_i(x^\star_H)\big) \\
&\le \frac{2\alpha_k}{|H|}\sum_{i\in H\setminus H_k}\big(\|x_k-x^\star_H\| + \alpha_k L\|x_k-x^\star_H\|\big)L\|x_k-x^\star_H\| \\
&= \Big(\frac{2L|B_k|\alpha_k}{|H|} + \frac{2L^2|B_k|\alpha_k^2}{|H|}\Big)\|x_k-x^\star_H\|^2. \qquad (24)
\end{aligned}
\]
Second, using our CE filter (i.e., (6)) and (10), there exists $j\in H$ such that for all $i\in B_k$ we have
\[
\|x^i_{k,1}-x_k\| \le \|x^j_{k,1}-x_k\| = \alpha_k\|\nabla q_j(x_k)\| = \alpha_k\|\nabla q_j(x_k)-\nabla q_j(x^\star_H)\| \le L\alpha_k\|x_k-x^\star_H\|, \qquad (25)
\]
which implies that
\[
\begin{aligned}
&\frac{2}{|H|}\sum_{i\in B_k}\big(x_k-x^\star_H-\alpha_k\nabla q_H(x_k)\big)^T(x^i_{k,1}-x_k) \\
&\le \frac{2}{|H|}\sum_{i\in B_k}\big(\|x_k-x^\star_H\| + \alpha_k\|\nabla q_H(x_k)\|\big)\|x^i_{k,1}-x_k\| \\
&= \frac{2}{|H|}\sum_{i\in B_k}\big(\|x_k-x^\star_H\| + \alpha_k\|\nabla q_H(x_k)-\nabla q_H(x^\star_H)\|\big)\|x^i_{k,1}-x_k\| \\
&\le \Big(\frac{2L|B_k|\alpha_k}{|H|} + \frac{2L^2|B_k|\alpha_k^2}{|H|}\Big)\|x_k-x^\star_H\|^2. \qquad (26)
\end{aligned}
\]
Next, using (10) and (25) we obtain
\[
\begin{aligned}
&\Big\|\frac{1}{|H|}\sum_{i\in B_k}(x^i_{k,1}-x_k) - \frac{1}{|H|}\sum_{i\in H\setminus H_k}(x^i_{k,1}-x_k)\Big\|^2 \\
&\le \frac{2|B_k|}{|H|^2}\Big(\sum_{i\in B_k}\|x^i_{k,1}-x_k\|^2 + \sum_{i\in H\setminus H_k}\|x^i_{k,1}-x_k\|^2\Big)
\le \frac{4L^2|B_k|^2}{|H|^2}\alpha_k^2\|x_k-x^\star_H\|^2. \qquad (27)
\end{aligned}
\]
Finally, using Assumption 2 we obtain
\[
\begin{aligned}
\|x_k-x^\star_H-\alpha_k\nabla q_H(x_k)\|^2 &= \|x_k-x^\star_H\|^2 - 2\alpha_k\nabla q_H(x_k)^T(x_k-x^\star_H) + \alpha_k^2\|\nabla q_H(x_k)-\nabla q_H(x^\star_H)\|^2 \\
&\le (1-2\mu\alpha_k+L^2\alpha_k^2)\|x_k-x^\star_H\|^2. \qquad (28)
\end{aligned}
\]
Substituting (24)–(28) into (23) we obtain
\[
\begin{aligned}
\|x_{k+1}-x^\star_H\|^2 &\le (1-2\mu\alpha_k+L^2\alpha_k^2)\|x_k-x^\star_H\|^2 + \frac{4L^2|B_k|^2}{|H|^2}\alpha_k^2\|x_k-x^\star_H\|^2 \\
&\quad + 2\Big(\frac{2L|B_k|\alpha_k}{|H|} + \frac{2L^2|B_k|\alpha_k^2}{|H|}\Big)\|x_k-x^\star_H\|^2 \\
&\le (1-2\mu\alpha_k+L^2\alpha_k^2)\|x_k-x^\star_H\|^2 + \frac{4L^2|B_k|}{|H|}\Big(1+\frac{|B_k|}{|H|}\Big)\alpha_k^2\|x_k-x^\star_H\|^2 + \frac{4L|B_k|\alpha_k}{|H|}\|x_k-x^\star_H\|^2 \\
&\le \Big(1-2\Big(\mu-\frac{2Lf}{|H|}\Big)\alpha_k\Big)\|x_k-x^\star_H\|^2 + \Big(1+\frac{4f}{|H|}+\frac{4f^2}{|H|^2}\Big)L^2\alpha_k^2\|x_k-x^\star_H\|^2, \qquad (29)
\end{aligned}
\]
where in the last inequality we use $|B_k|\le f$. Using (11) gives
\[
\mu - \frac{2Lf}{|H|} \ge \mu - \frac{2\mu}{3} = \frac{\mu}{3},
\]
which when substituted into (29), together with (12), gives (13):
\[
\begin{aligned}
\|x_{k+1}-x^\star_H\|^2 &\le \Big(1-\frac{2\mu}{3}\alpha_k\Big)\|x_k-x^\star_H\|^2 + \Big(1+\frac{4f}{|H|}+\frac{4f^2}{|H|^2}\Big)L^2\alpha_k^2\|x_k-x^\star_H\|^2 \\
&\le \Big(1-\frac{2\mu}{3}\alpha_k\Big)\|x_k-x^\star_H\|^2 + 3L^2\alpha_k^2\|x_k-x^\star_H\|^2 \\
&\le \Big(1-\frac{\mu\alpha}{6}\Big)\|x_k-x^\star_H\|^2 \le \Big(1-\frac{\mu\alpha}{6}\Big)^{k+1}\|x_0-x^\star_H\|^2,
\end{aligned}
\]
where the second inequality is due to $f/|H|\le 1/3$ and the third inequality is due to $\alpha_k=\alpha\le\mu/(4L^2)$.
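The geometric decay established above can be checked numerically. The following sketch (our construction, not code from the paper) runs the T = 1 deterministic setting with a CE-style filter on cost functions q_i(x) = ½‖x − x*‖² that share a common minimizer, so 2f-redundancy holds trivially, while a crude Byzantine agent sends a far-away vector:

```python
import numpy as np

def ce_local_gd_step(xk, x_star, n_honest, f, alpha):
    """One round: honest agents take one gradient step on 0.5*||x - x*||^2,
    Byzantine agents send an arbitrary far-away vector; the coordinator keeps
    the N - f received vectors closest to xk and averages them."""
    honest = [xk - alpha * (xk - x_star) for _ in range(n_honest)]
    byz = [xk + 100.0 * np.ones_like(xk) for _ in range(f)]  # adversarial update
    received = np.asarray(honest + byz)
    dists = np.linalg.norm(received - xk, axis=1)
    keep = np.argsort(dists)[: len(received) - f]
    return received[keep].mean(axis=0)

x_star = np.array([1.0, -2.0])
x = np.zeros(2)
errs = []
for _ in range(50):
    x = ce_local_gd_step(x, x_star, n_honest=8, f=2, alpha=0.1)
    errs.append(float(np.linalg.norm(x - x_star)))
# here the error shrinks by exactly the factor (1 - alpha) = 0.9 every round
```

A smarter adversary would mimic honest updates rather than send obvious outliers; 2f-redundancy is precisely what limits the damage such an adversary can do.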

B. Proof of Theorem 2

Proof. Since $T>1$, the proof of Theorem 2 differs from that of Theorem 1 in the way we quantify $\|x^i_{k,t}-x_k\|$ for $t\in[0,T]$. Thus, to show (16) we first provide an upper bound on this quantity. Note that we do not assume the gradients to be bounded. By (15) we have
\[
LT\alpha_k = LT\alpha \le \frac{\mu}{16L} \le \frac{1}{16} \le \ln(2).
\]
By (5), Assumption 1, and $\nabla q_i(x^\star_H)=0$, we have for all $t\in[0,T-1]$ and $i\in H$
\[
\|x^i_{k,t+1}-x^\star_H\| - \|x^\star_H-x^i_{k,t}\| \le \|x^i_{k,t+1}-x^i_{k,t}\| = \alpha_k\|\nabla q_i(x^i_{k,t})\| = \alpha_k\|\nabla q_i(x^i_{k,t})-\nabla q_i(x^\star_H)\| \le L\alpha_k\|x^i_{k,t}-x^\star_H\|,
\]
which using $1+x\le\exp(x)$ for $x>0$ implies
\[
\|x^i_{k,t+1}-x^\star_H\| \le (1+L\alpha_k)\|x^i_{k,t}-x^\star_H\| \le \exp(L\alpha_k)\|x^i_{k,t}-x^\star_H\| \le \exp(Lt\alpha_k)\|x^i_{k,0}-x^\star_H\| \le 2\|x_k-x^\star_H\|,
\]
where in the last inequality we use $LT\alpha_k\le\ln(2)$ and $x^i_{k,0}=x_k$. The preceding two relations give, for all $t\in[0,T-1]$,
\[
\|x^i_{k,t+1}-x^i_{k,t}\| \le L\alpha_k\|x^i_{k,t}-x^\star_H\| \le 2L\alpha_k\|x_k-x^\star_H\|.
\]
Using this relation and $x^i_{k,0}=x_k$ gives, for all $t\in[0,T-1]$,
\[
\|x^i_{k,t+1}-x_k\| = \Big\|\sum_{\ell=0}^{t}(x^i_{k,\ell+1}-x^i_{k,\ell})\Big\| \le \sum_{\ell=0}^{t}\big\|x^i_{k,\ell+1}-x^i_{k,\ell}\big\| \le 2LT\alpha_k\|x_k-x^\star_H\|. \qquad (30)
\]
Next, we consider
\[
\begin{aligned}
x_{k+1} &= \frac{1}{|F_k|}\sum_{i\in F_k}x^i_{k,T}
= \frac{1}{|H|}\Big[\sum_{i\in H}x^i_{k,T} + \sum_{i\in B_k}x^i_{k,T} - \sum_{i\in H\setminus H_k}x^i_{k,T}\Big] \\
&\overset{(14)}{=} x_k - \frac{\alpha_k}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\nabla q_i(x^i_{k,t}) + \frac{1}{|H|}\Big[\sum_{i\in B_k}x^i_{k,T} - \sum_{i\in H\setminus H_k}x^i_{k,T}\Big] \\
&= x_k - T\alpha_k\nabla q_H(x_k) - \frac{\alpha_k}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\big(\nabla q_i(x^i_{k,t})-\nabla q_i(x_k)\big) \\
&\quad + \frac{1}{|H|}\Big[\sum_{i\in B_k}(x^i_{k,T}-x_k) - \sum_{i\in H\setminus H_k}(x^i_{k,T}-x_k)\Big], \qquad (31)
\end{aligned}
\]
where the last equality is due to $|B_k|=|H\setminus H_k|$. For convenience, we denote
\[
V^x_k = \frac{1}{|H|}\Big[\sum_{i\in B_k}(x^i_{k,T}-x_k) - \sum_{i\in H\setminus H_k}(x^i_{k,T}-x_k)\Big].
\]
Using (31) gives
\[
\begin{aligned}
\|x_{k+1}-x^\star_H\|^2 &= \|x_k-x^\star_H-T\alpha_k\nabla q_H(x_k)\|^2 + \|V^x_k\|^2
+ \Big\|\frac{\alpha_k}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\big(\nabla q_i(x^i_{k,t})-\nabla q_i(x_k)\big)\Big\|^2 \\
&\quad - \frac{2\alpha_k}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}(x_k-x^\star_H)^T\big(\nabla q_i(x^i_{k,t})-\nabla q_i(x_k)\big) \\
&\quad + \frac{2T\alpha_k^2}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\nabla q_H(x_k)^T\big(\nabla q_i(x^i_{k,t})-\nabla q_i(x_k)\big) \\
&\quad + 2\big(x_k-x^\star_H-T\alpha_k\nabla q_H(x_k)\big)^T V^x_k. \qquad (32)
\end{aligned}
\]
We next consider each term on the right-hand side of (32). First, using Assumptions 1 and 2, and $\nabla q_H(x^\star_H)=0$, we have
\[
\begin{aligned}
\|x_k-x^\star_H-T\alpha_k\nabla q_H(x_k)\|^2 &= \|x_k-x^\star_H\|^2 - 2T\alpha_k(x_k-x^\star_H)^T\nabla q_H(x_k) + T^2\alpha_k^2\|\nabla q_H(x_k)-\nabla q_H(x^\star_H)\|^2 \\
&\le (1-2\mu T\alpha_k + T^2L^2\alpha_k^2)\|x_k-x^\star_H\|^2. \qquad (33)
\end{aligned}
\]
Second, by (6) there exists $j\in H\setminus H_k$ such that $\|x^i_{k,T}-x_k\|\le\|x^j_{k,T}-x_k\|$ for all $i\in B_k$. Then we have
\[
\begin{aligned}
\|V^x_k\|^2 &\le \frac{2|B_k|}{|H|^2}\sum_{i\in B_k}\|x^i_{k,T}-x_k\|^2 + \frac{2|B_k|}{|H|^2}\sum_{i\in H\setminus H_k}\|x^i_{k,T}-x_k\|^2 \\
&\le \frac{2|B_k|^2}{|H|^2}\|x^j_{k,T}-x_k\|^2 + \frac{2|B_k|}{|H|^2}\sum_{i\in H\setminus H_k}\|x^i_{k,T}-x_k\|^2
\overset{(30)}{\le} \frac{16T^2L^2|B_k|^2}{|H|^2}\alpha_k^2\|x_k-x^\star_H\|^2. \qquad (34)
\end{aligned}
\]
Third, using (30) yields
\[
\Big\|\frac{\alpha_k}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\big(\nabla q_i(x^i_{k,t})-\nabla q_i(x_k)\big)\Big\|^2
\le \frac{L^2T\alpha_k^2}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\|x^i_{k,t}-x_k\|^2
\le 4L^4T^4\alpha_k^4\|x_k-x^\star_H\|^2. \qquad (35)
\]
Fourth, using Assumption 1 we consider
\[
-\frac{2\alpha_k}{|H|}(x_k-x^\star_H)^T\sum_{i\in H}\sum_{t=0}^{T-1}\big(\nabla q_i(x^i_{k,t})-\nabla q_i(x_k)\big)
\le \frac{2L\alpha_k}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\|x_k-x^\star_H\|\,\|x^i_{k,t}-x_k\|
\overset{(30)}{\le} 4T^2L^2\alpha_k^2\|x_k-x^\star_H\|^2. \qquad (36)
\]
Fifth, using Assumption 1 and $\nabla q_H(x^\star_H)=0$, we have
\[
\begin{aligned}
\frac{2T\alpha_k^2}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\nabla q_H(x_k)^T\big(\nabla q_i(x^i_{k,t})-\nabla q_i(x_k)\big)
&\le \frac{2LT\alpha_k^2}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\|\nabla q_H(x_k)\|\,\|x^i_{k,t}-x_k\| \\
&\le 4L^2T^3\alpha_k^3\|\nabla q_H(x_k)-\nabla q_H(x^\star_H)\|\,\|x_k-x^\star_H\|
\le 4L^3T^3\alpha_k^3\|x_k-x^\star_H\|^2. \qquad (37)
\end{aligned}
\]
Next, by (6) and similarly to (34), we have
\[
\|V^x_k\| \le \frac{1}{|H|}\sum_{i\in B_k}\|x^i_{k,T}-x_k\| + \frac{1}{|H|}\sum_{i\in H\setminus H_k}\|x^i_{k,T}-x_k\|
\overset{(30)}{\le} \frac{2TL|B_k|}{|H|}\alpha_k\|x_k-x^\star_H\|,
\]
which using $\nabla q_H(x^\star_H)=0$ gives
\[
\begin{aligned}
2\big(x_k-x^\star_H-T\alpha_k\nabla q_H(x_k)\big)^TV^x_k &\le 2\big(\|x_k-x^\star_H\| + T\alpha_k\|\nabla q_H(x_k)-\nabla q_H(x^\star_H)\|\big)\|V^x_k\| \\
&\le \frac{4TL|B_k|}{|H|}(1+LT\alpha_k)\alpha_k\|x_k-x^\star_H\|^2. \qquad (38)
\end{aligned}
\]
Substituting (33)–(38) into (32) and using $|B_k|\le f$ yields
\[
\begin{aligned}
\|x_{k+1}-x^\star_H\|^2 &\le (1-2\mu T\alpha_k+T^2L^2\alpha_k^2)\|x_k-x^\star_H\|^2 + \frac{16T^2L^2f^2}{|H|^2}\alpha_k^2\|x_k-x^\star_H\|^2 \\
&\quad + 4L^4T^4\alpha_k^4\|x_k-x^\star_H\|^2 + 4L^2T^2\alpha_k^2\|x_k-x^\star_H\|^2 + 4L^3T^3\alpha_k^3\|x_k-x^\star_H\|^2 \\
&\quad + \frac{4TLf}{|H|}(1+LT\alpha_k)\alpha_k\|x_k-x^\star_H\|^2 \\
&= \Big(1-2T\Big(\mu-\frac{2Lf}{|H|}\Big)\alpha_k\Big)\|x_k-x^\star_H\|^2 + 4L^4T^4\alpha_k^4\|x_k-x^\star_H\|^2 + 4L^3T^3\alpha_k^3\|x_k-x^\star_H\|^2 \\
&\quad + \Big(5+\frac{4f}{|H|}+\frac{16f^2}{|H|^2}\Big)T^2L^2\alpha_k^2\|x_k-x^\star_H\|^2. \qquad (39)
\end{aligned}
\]
By (11) we have
\[
\mu - \frac{2Lf}{|H|} \ge \frac{\mu}{3}.
\]
Moreover, using $f/|H|\le 1/3$ and $\alpha_k=\alpha\le 1/(16TL)$ (since $\mu/L\le 1$) yields
\[
\frac{16f^2}{|H|^2} + 4L^2T^2\alpha_k^2 + 5 + 4LT\alpha_k + \frac{4f}{|H|} \le 8.
\]
Substituting the preceding two relations into (39) gives (16):
\[
\begin{aligned}
\|x_{k+1}-x^\star_H\|^2 &\le \Big(1-\frac{2\mu T}{3}\alpha\Big)\|x_k-x^\star_H\|^2 + 8L^2T^2\alpha^2\|x_k-x^\star_H\|^2 \\
&\le \Big(1-\frac{\mu T\alpha}{6}\Big)\|x_k-x^\star_H\|^2 \le \Big(1-\frac{\mu T\alpha}{6}\Big)^{k+1}\|x_0-x^\star_H\|^2,
\end{aligned}
\]
where in the second inequality we use $\alpha_k=\alpha\le\mu/(16TL^2)$.
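The role of T in the per-round contraction can also be observed numerically. In the sketch below (our construction; the shared quadratic costs and the crude Byzantine behavior are illustrative assumptions), the per-round error ratio equals (1 − α)^T, so more local steps yield a faster per-round rate, consistent with the factor (1 − μTα/6) above:

```python
import numpy as np

def ce_local_gd_round(xk, x_star, n_honest, f, alpha, T):
    """One round of local GD: each honest agent takes T local gradient steps
    on 0.5*||x - x*||^2, a Byzantine agent sends a far-away vector, and the
    coordinator keeps the N - f vectors closest to xk and averages them."""
    honest = []
    for _ in range(n_honest):
        x = xk.copy()
        for _ in range(T):
            x = x - alpha * (x - x_star)  # gradient of 0.5*||x - x*||^2
        honest.append(x)
    byz = [xk + 50.0 * np.ones_like(xk) for _ in range(f)]
    received = np.asarray(honest + byz)
    dists = np.linalg.norm(received - xk, axis=1)
    keep = np.argsort(dists)[: len(received) - f]
    return received[keep].mean(axis=0)

x_star = np.array([1.0, 1.0])
ratios = {}
for T in (1, 2, 4):
    x0 = np.zeros(2)
    x1 = ce_local_gd_round(x0, x_star, n_honest=8, f=2, alpha=0.1, T=T)
    ratios[T] = float(np.linalg.norm(x1 - x_star) / np.linalg.norm(x0 - x_star))
# ratios[T] equals (1 - 0.1)**T in this setting
```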

C. Proof of Theorem 3

Proof. When $T=1$, by (5) with $G_i(\cdot)=\nabla Q_i(\cdot,X_i)$, we have for all $i\in H$
\[
x^i_{k,1} = x^i_{k,0} - \alpha_k\nabla Q_i(x^i_{k,0};X^i_{k,0}) = x_k - \alpha_k\nabla Q_i(x_k;X^i_{k,0}), \qquad (40)
\]
which by (7) gives
\[
\begin{aligned}
x_{k+1} &= \frac{1}{|F_k|}\sum_{i\in F_k}x^i_{k,1}
= \frac{1}{|H|}\Big[\sum_{i\in H}x^i_{k,1} + \sum_{i\in B_k}x^i_{k,1} - \sum_{i\in H\setminus H_k}x^i_{k,1}\Big] \\
&\overset{(40)}{=} x_k - \alpha_k\nabla Q(x_k;X_{k,0}) + \frac{1}{|H|}\Big[\sum_{i\in B_k}(x^i_{k,1}-x_k) - \sum_{i\in H\setminus H_k}(x^i_{k,1}-x_k)\Big],
\end{aligned}
\]
where the last equality is due to $|B_k|=|H\setminus H_k|$. Using the preceding relation, we consider
\[
\begin{aligned}
\|x_{k+1}-x^\star_H\|^2 &= \|x_k-x^\star_H-\alpha_k\nabla Q(x_k;X_{k,0})\|^2
+ \frac{1}{|H|^2}\Big\|\sum_{i\in B_k}(x^i_{k,1}-x_k) - \sum_{i\in H\setminus H_k}(x^i_{k,1}-x_k)\Big\|^2 \\
&\quad + \frac{2}{|H|}\sum_{i\in B_k}(x_k-x^\star_H)^T(x^i_{k,1}-x_k)
- \frac{2\alpha_k}{|H|}\sum_{i\in B_k}\nabla Q(x_k;X_{k,0})^T(x^i_{k,1}-x_k) \\
&\quad - \frac{2}{|H|}\sum_{i\in H\setminus H_k}(x_k-x^\star_H)^T(x^i_{k,1}-x_k)
+ \frac{2\alpha_k}{|H|}\sum_{i\in H\setminus H_k}\nabla Q(x_k;X_{k,0})^T(x^i_{k,1}-x_k). \qquad (41)
\end{aligned}
\]
We next analyze each term on the right-hand side of (41). First, using Assumption 3 we have
\[
\begin{aligned}
\mathbb{E}\big[\|x_k-x^\star_H-\alpha_k\nabla Q(x_k;X_{k,0})\|^2 \mid P_{k,0}\big]
&= \|x_k-x^\star_H-\alpha_k\nabla q(x_k)\|^2 + \mathbb{E}\big[\|\alpha_k(\nabla Q(x_k;X_{k,0})-\nabla q(x_k))\|^2 \mid P_{k,0}\big] \\
&\le \|x_k-x^\star_H-\alpha_k\nabla q(x_k)\|^2 + \sigma^2\alpha_k^2,
\end{aligned}
\]
which by using Assumption 2 and $\nabla q(x^\star_H)=0$ gives
\[
\begin{aligned}
\mathbb{E}\big[\|x_k-x^\star_H-\alpha_k\nabla Q(x_k;X_{k,0})\|^2 \mid P_{k,0}\big]
&\le \|x_k-x^\star_H\|^2 - 2\alpha_k\big(\nabla q(x_k)-\nabla q(x^\star_H)\big)^T(x_k-x^\star_H) \\
&\quad + \alpha_k^2\|\nabla q(x_k)-\nabla q(x^\star_H)\|^2 + \sigma^2\alpha_k^2 \\
&\le (1-2\mu\alpha_k+L^2\alpha_k^2)\|x_k-x^\star_H\|^2 + \sigma^2\alpha_k^2. \qquad (42)
\end{aligned}
\]
Second, by (9) we have $\nabla q_i(x^\star_H)=0$. Then, by Assumption 1 and (40) we have
\[
\begin{aligned}
-\frac{2}{|H|}\mathbb{E}\Big[\sum_{i\in H\setminus H_k}(x_k-x^\star_H)^T(x^i_{k,1}-x_k) \mid P_{k,0}\Big]
&= \frac{2\alpha_k}{|H|}\sum_{i\in H\setminus H_k}(x_k-x^\star_H)^T\,\mathbb{E}\big[\nabla Q_i(x_k;X^i_{k,0}) \mid P_{k,0}\big] \\
&= \frac{2\alpha_k}{|H|}\sum_{i\in H\setminus H_k}(x_k-x^\star_H)^T\big(\nabla q_i(x_k)-\nabla q_i(x^\star_H)\big) \\
&\le \frac{2L|B_k|\alpha_k}{|H|}\|x_k-x^\star_H\|^2, \qquad (43)
\end{aligned}
\]
where in the last inequality we also use $|H\setminus H_k|=|B_k|$. Next, using Assumption 3 we consider, for any $i\in H$,
\[
\begin{aligned}
\mathbb{E}\big[\nabla Q(x_k;X_{k,0})^T\nabla Q_i(x_k;X^i_{k,0}) \mid P_{k,0}\big]
&= \frac{1}{|H|}\mathbb{E}\Big[\sum_{j\in H}\nabla Q_j(x_k;X^j_{k,0})^T\nabla Q_i(x_k;X^i_{k,0}) \mid P_{k,0}\Big] \\
&= \frac{1}{|H|}\mathbb{E}\big[\|\nabla Q_i(x_k;X^i_{k,0})\|^2 \mid P_{k,0}\big] + \frac{1}{|H|}\sum_{j\in H,\,j\ne i}\nabla q_j(x_k)^T\nabla q_i(x_k) \\
&= \frac{1}{|H|}\|\nabla q_i(x_k)\|^2 + \frac{1}{|H|}\sum_{j\in H,\,j\ne i}\nabla q_j(x_k)^T\nabla q_i(x_k) \\
&\quad + \frac{1}{|H|}\mathbb{E}\big[\|\nabla Q_i(x_k;X^i_{k,0})-\nabla q_i(x_k)\|^2 \mid P_{k,0}\big] \\
&\le \frac{\sigma^2}{|H|} + \frac{L^2}{|H|}\sum_{j\in H}\|x_k-x^\star_H\|^2 \le \frac{\sigma^2}{|H|} + L^2\|x_k-x^\star_H\|^2,
\end{aligned}
\]
where in the first inequality we use $\nabla q_i(x^\star_H)=0$ for all $i$ and Assumption 1. Using the preceding relation we have
\[
\begin{aligned}
-\frac{2\alpha_k}{|H|}\mathbb{E}\Big[\sum_{i\in B_k}\nabla Q(x_k;X_{k,0})^T(x^i_{k,1}-x_k) \mid P_{k,0}\Big]
&= \frac{2\alpha_k^2}{|H|}\mathbb{E}\Big[\sum_{i\in B_k}\nabla Q(x_k;X_{k,0})^T\nabla Q_i(x_k;X^i_{k,0}) \mid P_{k,0}\Big] \\
&\le \frac{2\sigma^2|B_k|\alpha_k^2}{|H|^2} + \frac{2L^2|B_k|\alpha_k^2}{|H|}\|x_k-x^\star_H\|^2. \qquad (44)
\end{aligned}
\]
Using the same argument, we have
\[
\frac{2\alpha_k}{|H|}\mathbb{E}\Big[\sum_{i\in H\setminus H_k}\nabla Q(x_k;X_{k,0})^T(x^i_{k,1}-x_k) \mid P_{k,0}\Big]
\le \frac{2\sigma^2|B_k|\alpha_k^2}{|H|^2} + \frac{2L^2|B_k|\alpha_k^2}{|H|}\|x_k-x^\star_H\|^2, \qquad (45)
\]
where we use the fact that $|B_k|=|H\setminus H_k|$. Fifth, using (40) and our CE filter (6), there exists $j\in H$ such that for all $i\in B_k$ we have
\[
\|x^i_{k,1}-x_k\| \le \|x^j_{k,1}-x_k\| = \alpha_k\|\nabla Q_j(x_k;X^j_{k,0})\|
\le \alpha_k\|\nabla q_j(x_k)\| + \alpha_k\|\nabla Q_j(x_k;X^j_{k,0})-\nabla q_j(x_k)\|,
\]
which by Assumption 3 and Jensen's inequality gives
\[
\mathbb{E}\big[\|x^i_{k,1}-x_k\| \mid P_{k,0}\big] \le L\alpha_k\|x_k-x^\star_H\| + \sigma\alpha_k.
\]
Similarly, for all $i\in B_k$ there exists $j\in H$ such that
\[
\mathbb{E}\big[\|x^i_{k,1}-x_k\|^2 \mid P_{k,0}\big]
\le \alpha_k^2\|\nabla q_j(x_k)\|^2 + \alpha_k^2\,\mathbb{E}\big[\|\nabla Q_j(x_k;X^j_{k,0})-\nabla q_j(x_k)\|^2 \mid P_{k,0}\big]
\le L^2\alpha_k^2\|x_k-x^\star_H\|^2 + \sigma^2\alpha_k^2.
\]
Using the preceding relation we consider
\[
\begin{aligned}
\frac{2}{|H|}\mathbb{E}\Big[\sum_{i\in B_k}(x_k-x^\star_H)^T(x^i_{k,1}-x_k) \mid P_{k,0}\Big]
&\le \frac{2L\alpha_k}{|H|}\sum_{i\in B_k}\|x_k-x^\star_H\|^2 + \frac{2\sigma\alpha_k}{|H|}\sum_{i\in B_k}\|x_k-x^\star_H\| \\
&\le \frac{3L|B_k|\alpha_k}{|H|}\|x_k-x^\star_H\|^2 + \frac{\sigma^2|B_k|\alpha_k}{L|H|}, \qquad (46)
\end{aligned}
\]
where in the last inequality we use Young's inequality $2xy\le\eta x^2+y^2/\eta$ (with $\eta=L$) for any $x,y\in\mathbb{R}$. Finally, we have
\[
\begin{aligned}
\frac{1}{|H|^2}\mathbb{E}\Big[\Big\|\sum_{i\in B_k}(x^i_{k,1}-x_k) - \sum_{i\in H\setminus H_k}(x^i_{k,1}-x_k)\Big\|^2 \mid P_{k,0}\Big]
&\le \frac{2|B_k|}{|H|^2}\sum_{i\in B_k}\mathbb{E}\big[\|x^i_{k,1}-x_k\|^2 \mid P_{k,0}\big]
+ \frac{2|B_k|}{|H|^2}\sum_{i\in H\setminus H_k}\mathbb{E}\big[\|x^i_{k,1}-x_k\|^2 \mid P_{k,0}\big] \\
&\le \frac{2|B_k|^2}{|H|^2}\big(\sigma^2\alpha_k^2 + 2L^2\alpha_k^2\|x_k-x^\star_H\|^2\big). \qquad (47)
\end{aligned}
\]
Taking the expectation on both sides of (41) and using (42)–(47) we obtain
\[
\begin{aligned}
\mathbb{E}\big[\|x_{k+1}-x^\star_H\|^2\big]
&\le (1-2\mu\alpha_k+L^2\alpha_k^2)\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + \sigma^2\alpha_k^2
+ \frac{2L|B_k|\alpha_k}{|H|}\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] \\
&\quad + \frac{2\sigma^2|B_k|\alpha_k^2}{|H|^2} + \frac{2L^2|B_k|\alpha_k^2}{|H|}\mathbb{E}\big[\|x_k-x^\star_H\|^2\big]
+ \frac{2\sigma^2|B_k|\alpha_k^2}{|H|^2} + \frac{2L^2|B_k|\alpha_k^2}{|H|}\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] \\
&\quad + \frac{3L|B_k|\alpha_k}{|H|}\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + \frac{\sigma^2|B_k|\alpha_k}{L|H|}
+ \frac{2\sigma^2|B_k|^2\alpha_k^2}{|H|^2} + \frac{4L^2|B_k|^2\alpha_k^2}{|H|^2}\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] \\
&\le \Big(1-2\Big(\mu-\frac{5Lf}{2|H|}\Big)\alpha_k\Big)\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + 7\sigma^2\alpha_k^2
+ L^2\alpha_k^2\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] \\
&\quad + \frac{4L^2f\alpha_k^2}{|H|}\mathbb{E}\big[\|x_k-x^\star_H\|^2\big]
+ \frac{4L^2f^2\alpha_k^2}{|H|^2}\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + \frac{\sigma^2f}{L|H|}\alpha_k,
\end{aligned}
\]
where in the last inequality we use $|B_k|\le f\le|H|$. Note that by (11) we have $3Lf/|H|\le\mu$ and $f/|H|\le 1/3$, which when substituted into the preceding relation gives
\[
\begin{aligned}
\mathbb{E}\big[\|x_{k+1}-x^\star_H\|^2\big]
&\le \Big(1-\frac{\mu}{3}\alpha_k\Big)\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + 7\sigma^2\alpha_k^2 + \frac{\sigma^2f}{L|H|}\alpha_k
+ 4L^2\alpha_k^2\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] \\
&\le \Big(1-\frac{\mu}{6}\alpha\Big)\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + 7\sigma^2\alpha^2 + \frac{\sigma^2f}{L|H|}\alpha,
\end{aligned}
\]
where in the last inequality we use $\alpha_k=\alpha\le\mu/(12L^2)$. Applying the preceding inequality recursively immediately gives (21).
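The approximate fault-tolerance established above, i.e., the non-vanishing error floor in (21), can be observed numerically. The sketch below (our construction, with illustrative parameter choices) repeats the T = 1 stochastic round with a CE-style filter on the mean-estimation task; the squared error decays at first but then hovers at a level proportional to σ²α rather than converging to zero:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_honest, f, alpha, sigma = 5, 12, 3, 0.1, 1.0
x_star = 2.0 * np.ones(d)

x = np.zeros(d)
errs = []
for k in range(400):
    # honest agents: one stochastic gradient step on q_i(x) = 0.5*E||x - X_i||^2,
    # whose stochastic gradient at x is (x - x_star) + sigma * Z, Z ~ N(0, I)
    honest = [x - alpha * ((x - x_star) + sigma * rng.standard_normal(d))
              for _ in range(n_honest)]
    # Byzantine agents send arbitrary far-away vectors
    byz = [x + 10.0 * rng.standard_normal(d) for _ in range(f)]
    received = np.asarray(honest + byz)
    dists = np.linalg.norm(received - x, axis=1)
    keep = np.argsort(dists)[: len(received) - f]   # CE-style filter
    x = received[keep].mean(axis=0)
    errs.append(float(np.linalg.norm(x - x_star) ** 2))
# errs starts large and settles at a small, non-zero floor driven by sigma
```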

D. Proof of Theorem 4

Proof. Using (5) we have, for all $t\in[0,T-1]$ and $i\in H$,
\[
\|x^i_{k,t+1}-x^\star_H\| - \|x^\star_H-x^i_{k,t}\| \le \|x^i_{k,t+1}-x^i_{k,t}\| = \alpha_k\|\nabla Q_i(x^i_{k,t};X^i_{k,t})\|
\le \alpha_k\|\nabla q_i(x^i_{k,t})\| + \alpha_k\|\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x^i_{k,t})\|,
\]
which by using Assumptions 1 and 3, and $\nabla q_i(x^\star_H)=0$, yields
\[
\begin{aligned}
\mathbb{E}\big[\|x^i_{k,t+1}-x^\star_H\| - \|x^\star_H-x^i_{k,t}\| \mid P_{k,t}\big]
&\le \mathbb{E}\big[\|x^i_{k,t+1}-x^i_{k,t}\| \mid P_{k,t}\big] \\
&\le \alpha_k\|\nabla q_i(x^i_{k,t})-\nabla q_i(x^\star_H)\| + \alpha_k\,\mathbb{E}\big[\|\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x^i_{k,t})\| \mid P_{k,t}\big] \\
&\le L\alpha_k\|x^i_{k,t}-x^\star_H\| + \sigma\alpha_k.
\end{aligned}
\]
Using $1+x\le\exp(x)$ for $x>0$, the preceding relation gives
\[
\begin{aligned}
\mathbb{E}\big[\|x^i_{k,t+1}-x^\star_H\|\big] &\le (1+L\alpha_k)\,\mathbb{E}\big[\|x^i_{k,t}-x^\star_H\|\big] + \sigma\alpha_k \\
&\le (1+L\alpha_k)^t\,\mathbb{E}\big[\|x^i_{k,0}-x^\star_H\|\big] + \sigma\alpha_k\sum_{\ell=0}^{t-1}(1+L\alpha_k)^{t-1-\ell} \\
&\le \exp(Lt\alpha_k)\,\mathbb{E}\big[\|x^i_{k,0}-x^\star_H\|\big] + \frac{\sigma\alpha_k\exp(Lt\alpha_k)}{L\alpha_k}
\le 2\,\mathbb{E}\big[\|x_k-x^\star_H\|\big] + \frac{2\sigma}{L},
\end{aligned}
\]
where in the last inequality we use $LT\alpha_k\le\ln(2)$ and $x^i_{k,0}=x_k$. Thus, we obtain for all $t\in[0,T-1]$
\[
\mathbb{E}\big[\|x^i_{k,t+1}-x^i_{k,t}\|\big] \le L\alpha_k\,\mathbb{E}\big[\|x^i_{k,t}-x^\star_H\|\big] + \sigma\alpha_k
\le 2L\alpha_k\,\mathbb{E}\big[\|x_k-x^\star_H\|\big] + 3\sigma\alpha_k,
\]
which by using $x^i_{k,0}=x_k$ gives, for all $t\in[0,T-1]$ and $i\in H$,
\[
\mathbb{E}\big[\|x^i_{k,t+1}-x_k\|\big] = \mathbb{E}\Big[\Big\|\sum_{\ell=0}^{t}(x^i_{k,\ell+1}-x^i_{k,\ell})\Big\|\Big]
\le \sum_{\ell=0}^{t}\mathbb{E}\big[\|x^i_{k,\ell+1}-x^i_{k,\ell}\|\big]
\le 2LT\alpha_k\,\mathbb{E}\big[\|x_k-x^\star_H\|\big] + 3\sigma T\alpha_k. \qquad (48)
\]
Similarly, we consider $i\in H$:
\[
\begin{aligned}
\mathbb{E}\big[\|x^i_{k,t+1}-x^\star_H\|^2\big] &= \mathbb{E}\big[\|x^i_{k,t}-x^\star_H-\alpha_k\nabla Q_i(x^i_{k,t};X^i_{k,t})\|^2\big] \\
&= \mathbb{E}\big[\|x^i_{k,t}-x^\star_H\|^2\big] - 2\alpha_k\,\mathbb{E}\big[(x^i_{k,t}-x^\star_H)^T\nabla q_i(x^i_{k,t})\big] \\
&\quad + \alpha_k^2\,\mathbb{E}\big[\|\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x^i_{k,t})\|^2\big]
+ \alpha_k^2\,\mathbb{E}\big[\|\nabla q_i(x^i_{k,t})-\nabla q_i(x^\star_H)\|^2\big] \\
&\le (1+L^2\alpha_k^2)\,\mathbb{E}\big[\|x^i_{k,t}-x^\star_H\|^2\big] + \sigma^2\alpha_k^2,
\end{aligned}
\]
where the last inequality uses the convexity of $q_i$ and Assumptions 1 and 3. Using $1+x\le\exp(x)$ for $x>0$ and $x^i_{k,0}=x_k$, the preceding relation gives for all $t\in[0,T-1]$
\[
\begin{aligned}
\mathbb{E}\big[\|x^i_{k,t+1}-x^\star_H\|^2\big]
&\le (1+L^2\alpha_k^2)^t\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + \sigma^2\alpha_k^2\sum_{\ell=0}^{t-1}(1+L^2\alpha_k^2)^{t-\ell-1} \\
&\le \exp(L^2T\alpha_k^2)\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + \frac{\sigma^2(1+L^2\alpha_k^2)}{L^2}
\le 2\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + \frac{2\sigma^2}{L^2},
\end{aligned}
\]
where the last inequality is due to $\exp(L^2T^2\alpha_k^2)\le 2$. Using the relation above, we now have for all $t\in[0,T-1]$ and $i\in H$
\[
\begin{aligned}
\mathbb{E}\big[\|x^i_{k,t+1}-x_k\|^2\big] &= \mathbb{E}\Big[\Big\|\sum_{\ell=0}^{t}(x^i_{k,\ell+1}-x^i_{k,\ell})\Big\|^2\Big]
\le t\sum_{\ell=0}^{t}\mathbb{E}\big[\|x^i_{k,\ell+1}-x^i_{k,\ell}\|^2\big]
= t\alpha_k^2\sum_{\ell=0}^{t}\mathbb{E}\big[\|\nabla Q_i(x^i_{k,\ell};X^i_{k,\ell})\|^2\big] \\
&= t\alpha_k^2\sum_{\ell=0}^{t}\mathbb{E}\big[\|\nabla Q_i(x^i_{k,\ell};X^i_{k,\ell})-\nabla q_i(x^i_{k,\ell})\|^2\big]
+ t\alpha_k^2\sum_{\ell=0}^{t}\mathbb{E}\big[\|\nabla q_i(x^i_{k,\ell})-\nabla q_i(x^\star_H)\|^2\big] \\
&\le T^2\sigma^2\alpha_k^2 + L^2t\alpha_k^2\sum_{\ell=0}^{t}\mathbb{E}\big[\|x^i_{k,\ell}-x^\star_H\|^2\big]
\le 2L^2T^2\alpha_k^2\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + 3T^2\sigma^2\alpha_k^2. \qquad (49)
\end{aligned}
\]
Note that we have $|F_k|=|H|=N-f$. We then consider
\[
\begin{aligned}
x_{k+1} &= \frac{1}{|F_k|}\sum_{i\in F_k}x^i_{k,T}
= \frac{1}{|H|}\Big[\sum_{i\in H}x^i_{k,T} + \sum_{i\in B_k}x^i_{k,T} - \sum_{i\in H\setminus H_k}x^i_{k,T}\Big] \\
&\overset{(19)}{=} x_k - \frac{\alpha_k}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\nabla Q_i(x^i_{k,t};X^i_{k,t})
+ \frac{1}{|H|}\Big[\sum_{i\in B_k}x^i_{k,T} - \sum_{i\in H\setminus H_k}x^i_{k,T}\Big] \\
&= x_k - T\alpha_k\nabla q_H(x_k) - \frac{\alpha_k}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\big(\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x_k)\big) \\
&\quad + \frac{1}{|H|}\Big[\sum_{i\in B_k}(x^i_{k,T}-x_k) - \sum_{i\in H\setminus H_k}(x^i_{k,T}-x_k)\Big], \qquad (50)
\end{aligned}
\]
where the last equality is due to $|B_k|=|H\setminus H_k|$. For convenience, we denote
\[
V^x_k = \frac{1}{|H|}\Big[\sum_{i\in B_k}(x^i_{k,T}-x_k) - \sum_{i\in H\setminus H_k}(x^i_{k,T}-x_k)\Big].
\]
Using (50), we consider
\[
\begin{aligned}
\|x_{k+1}-x^\star_H\|^2 &= \|x_k-x^\star_H-T\alpha_k\nabla q_H(x_k)\|^2 + \|V^x_k\|^2
+ \Big\|\frac{\alpha_k}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\big(\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x_k)\big)\Big\|^2 \\
&\quad - \frac{2\alpha_k}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}(x_k-x^\star_H)^T\big(\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x_k)\big) \\
&\quad + \frac{2T\alpha_k^2}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\nabla q_H(x_k)^T\big(\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x_k)\big) \\
&\quad + 2\big(x_k-x^\star_H-T\alpha_k\nabla q_H(x_k)\big)^TV^x_k. \qquad (51)
\end{aligned}
\]
We next analyze each term on the right-hand side of (51). First, using Assumptions 1 and 2 and $\nabla q_H(x^\star_H)=0$, we have
\[
\begin{aligned}
\|x_k-x^\star_H-T\alpha_k\nabla q_H(x_k)\|^2
&= \|x_k-x^\star_H\|^2 - 2T\alpha_k(x_k-x^\star_H)^T\nabla q_H(x_k) + T^2\alpha_k^2\|\nabla q_H(x_k)-\nabla q_H(x^\star_H)\|^2 \\
&\le (1-2\mu T\alpha_k+T^2L^2\alpha_k^2)\|x_k-x^\star_H\|^2. \qquad (52)
\end{aligned}
\]
Second, by (6) there exists $j\in H\setminus H_k$ such that $\|x^i_{k,T}-x_k\|\le\|x^j_{k,T}-x_k\|$ for all $i\in B_k$. Then we have
\[
\begin{aligned}
\mathbb{E}\big[\|V^x_k\|^2\big]
&\le \frac{2|B_k|}{|H|^2}\sum_{i\in B_k}\mathbb{E}\big[\|x^i_{k,T}-x_k\|^2\big]
+ \frac{2|B_k|}{|H|^2}\sum_{i\in H\setminus H_k}\mathbb{E}\big[\|x^i_{k,T}-x_k\|^2\big] \\
&\le \frac{2|B_k|^2}{|H|^2}\mathbb{E}\big[\|x^j_{k,T}-x_k\|^2\big]
+ \frac{2|B_k|}{|H|^2}\sum_{i\in H\setminus H_k}\mathbb{E}\big[\|x^i_{k,T}-x_k\|^2\big] \\
&\overset{(49)}{\le} \frac{8T^2L^2|B_k|^2}{|H|^2}\alpha_k^2\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + \frac{12T^2\sigma^2|B_k|^2}{|H|^2}\alpha_k^2. \qquad (53)
\end{aligned}
\]
Third, using Assumptions 1 and 3, and (49), yields
\[
\begin{aligned}
&\mathbb{E}\Big[\Big\|\frac{\alpha_k}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\big(\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x_k)\big)\Big\|^2\Big] \\
&= \mathbb{E}\Big[\Big\|\frac{\alpha_k}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\big(\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x^i_{k,t})\big)\Big\|^2\Big]
+ \mathbb{E}\Big[\Big\|\frac{\alpha_k}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\big(\nabla q_i(x^i_{k,t})-\nabla q_i(x_k)\big)\Big\|^2\Big] \\
&\le \frac{T\alpha_k^2}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x^i_{k,t})\|^2\big]
+ \frac{T\alpha_k^2}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\mathbb{E}\big[\|\nabla q_i(x^i_{k,t})-\nabla q_i(x_k)\|^2\big] \\
&\le T^2\sigma^2\alpha_k^2 + \frac{TL^2\alpha_k^2}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\mathbb{E}\big[\|x^i_{k,t}-x_k\|^2\big]
\overset{(49)}{\le} 2L^4T^4\alpha_k^4\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + 3L^2\sigma^2T^4\alpha_k^4 + \sigma^2T^2\alpha_k^2. \qquad (54)
\end{aligned}
\]
Similarly, we consider
\[
\begin{aligned}
-2\,\mathbb{E}\big[(x_k-x^\star_H)^T\big(\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x_k)\big) \mid P_{k,t}\big]
&= -2(x_k-x^\star_H)^T\big(\nabla q_i(x^i_{k,t})-\nabla q_i(x_k)\big) \\
&\le 2L\|x_k-x^\star_H\|\,\|x^i_{k,t}-x_k\|
\le L^2T\alpha_k\|x_k-x^\star_H\|^2 + \frac{1}{T\alpha_k}\|x^i_{k,t}-x_k\|^2,
\end{aligned}
\]
where the last inequality is due to Young's inequality $2xy\le\eta x^2+y^2/\eta$ for any $\eta>0$ and $x,y\in\mathbb{R}$. Taking the expectation on both sides of the preceding relation and using (49) we obtain
\[
-2\,\mathbb{E}\big[(x_k-x^\star_H)^T\big(\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x_k)\big)\big]
\le 3L^2T\alpha_k\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + 3\sigma^2T\alpha_k.
\]
Using the preceding relation we obtain an upper bound for the fourth term on the right-hand side of (51):
\[
-\mathbb{E}\Big[\frac{2\alpha_k}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}(x_k-x^\star_H)^T\big(\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x_k)\big)\Big]
\le 3L^2T^2\alpha_k^2\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + 3\sigma^2T^2\alpha_k^2. \qquad (55)
\]
Similarly, we provide an upper bound for the fifth term on the right-hand side of (51). Indeed, using Assumption 1 and $\nabla q_H(x^\star_H)=0$ gives
\[
\begin{aligned}
2\,\mathbb{E}\big[\nabla q_H(x_k)^T\big(\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x_k)\big) \mid P_{k,t}\big]
&= 2\big(\nabla q_H(x_k)-\nabla q_H(x^\star_H)\big)^T\big(\nabla q_i(x^i_{k,t})-\nabla q_i(x_k)\big) \\
&\le 2L^2\|x_k-x^\star_H\|\,\|x^i_{k,t}-x_k\|
\le L^3T\alpha_k\|x_k-x^\star_H\|^2 + \frac{L}{T\alpha_k}\|x^i_{k,t}-x_k\|^2,
\end{aligned}
\]
where the last inequality is again due to Young's inequality. Thus, taking the expectation on both sides and using (49) gives
\[
2\,\mathbb{E}\big[\nabla q_H(x_k)^T\big(\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x_k)\big)\big]
\le 3L^3T\alpha_k\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + 3L\sigma^2T\alpha_k.
\]
Using the above in the fifth term on the right-hand side of (51) yields
\[
\mathbb{E}\Big[\frac{2T\alpha_k^2}{|H|}\sum_{i\in H}\sum_{t=0}^{T-1}\nabla q_H(x_k)^T\big(\nabla Q_i(x^i_{k,t};X^i_{k,t})-\nabla q_i(x_k)\big)\Big]
\le 3L^3T^3\alpha_k^3\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + 3L\sigma^2T^3\alpha_k^3. \qquad (56)
\]
Finally, we analyze the last term on the right-hand side of (51). Using $\nabla q_H(x^\star_H)=0$, (6), (53), and the relation $2\langle x,y\rangle\le\eta\|x\|^2+\|y\|^2/\eta$ for any $\eta>0$, we have
\[
\begin{aligned}
2\,\mathbb{E}\big[\big(x_k-x^\star_H-T\alpha_k\nabla q_H(x_k)\big)^TV^x_k\big]
&\le \frac{3TL|B_k|\alpha_k}{|H|}\mathbb{E}\big[\|x_k-x^\star_H-T\alpha_k\nabla q_H(x_k)\|^2\big]
+ \frac{|H|}{3TL|B_k|\alpha_k}\mathbb{E}\big[\|V^x_k\|^2\big] \\
&\overset{(53)}{\le} \frac{3TL|B_k|\alpha_k}{|H|}\mathbb{E}\big[\|x_k-x^\star_H-T\alpha_k\nabla q_H(x_k)\|^2\big]
+ \frac{8TL|B_k|\alpha_k}{3|H|}\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + \frac{4T\sigma^2|B_k|}{L|H|}\alpha_k \\
&= \frac{17TL|B_k|\alpha_k}{3|H|}\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + \frac{4T\sigma^2|B_k|}{L|H|}\alpha_k
+ \frac{3T^3L|B_k|\alpha_k^3}{|H|}\mathbb{E}\big[\|\nabla q_H(x_k)\|^2\big] \\
&\quad - \frac{6T^2L|B_k|\alpha_k^2}{|H|}\mathbb{E}\big[(x_k-x^\star_H)^T\nabla q_H(x_k)\big] \\
&\le \frac{17TL|B_k|\alpha_k}{3|H|}\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + \frac{4T\sigma^2|B_k|}{L|H|}\alpha_k
+ \frac{3T^3L^3|B_k|\alpha_k^3}{|H|}\mathbb{E}\big[\|x_k-x^\star_H\|^2\big]. \qquad (57)
\end{aligned}
\]
Substituting (52)–(57) into (51), and using $LT\alpha_k\le\ln(2)\le 1$ and $|B_k|\le f$, we obtain
\[
\begin{aligned}
\mathbb{E}\big[\|x_{k+1}-x^\star_H\|^2\big]
&\le (1-2\mu T\alpha_k+T^2L^2\alpha_k^2)\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big]
+ \frac{8T^2L^2f^2}{|H|^2}\alpha_k^2\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + \frac{12T^2\sigma^2f^2}{|H|^2}\alpha_k^2 \\
&\quad + 2L^4T^4\alpha_k^4\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + 3L^2\sigma^2T^4\alpha_k^4 + \sigma^2T^2\alpha_k^2
+ 3T^2L^2\alpha_k^2\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + 3\sigma^2T^2\alpha_k^2 \\
&\quad + 3L^3T^3\alpha_k^3\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + 3L\sigma^2T^3\alpha_k^3
+ \frac{17TLf\alpha_k}{3|H|}\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + \frac{4T\sigma^2f}{L|H|}\alpha_k \\
&\quad + \frac{3T^3L^3f\alpha_k^3}{|H|}\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] \\
&\le \Big(1-2\Big(\mu-\frac{17Lf}{6|H|}\Big)T\alpha_k\Big)\mathbb{E}\big[\|x_k-x^\star_H\|^2\big]
+ \frac{8T^2L^2f^2}{|H|^2}\alpha_k^2\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + \frac{12T^2\sigma^2f^2}{|H|^2}\alpha_k^2 \\
&\quad + 6T^2L^2\alpha_k^2\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + 7\sigma^2T^2\alpha_k^2
+ \frac{4T\sigma^2f}{L|H|}\alpha_k + \frac{3T^2L^2f\alpha_k^2}{|H|}\mathbb{E}\big[\|x_k-x^\star_H\|^2\big]. \qquad (58)
\end{aligned}
\]
Using (11) gives
\[
\mu - \frac{17Lf}{6|H|} \ge \frac{\mu}{18}.
\]
Thus, since $f/|H|\le 1/3$, we obtain from (58)
\[
\begin{aligned}
\mathbb{E}\big[\|x_{k+1}-x^\star_H\|^2\big]
&\le \Big(1-\frac{\mu T\alpha_k}{9}\Big)\mathbb{E}\big[\|x_k-x^\star_H\|^2\big]
+ \Big(6+\frac{3f}{|H|}+\frac{8f^2}{|H|^2}\Big)T^2L^2\alpha_k^2\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] \\
&\quad + \frac{12T^2\sigma^2f^2}{|H|^2}\alpha_k^2 + 7\sigma^2T^2\alpha_k^2 + \frac{3T\sigma^2f}{L|H|}\alpha_k \\
&\le \Big(1-\frac{\mu T\alpha_k}{9}\Big)\mathbb{E}\big[\|x_k-x^\star_H\|^2\big]
+ 8T^2L^2\alpha_k^2\,\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + 9\sigma^2T^2\alpha_k^2 + \frac{3T\sigma^2f}{L|H|}\alpha_k.
\end{aligned}
\]
Since $\alpha_k=\alpha\le\mu/(144TL^2)$, the preceding relation yields
\[
\begin{aligned}
\mathbb{E}\big[\|x_{k+1}-x^\star_H\|^2\big]
&\le \Big(1-\Big(\frac{\mu}{9}-8TL^2\alpha\Big)T\alpha\Big)\mathbb{E}\big[\|x_k-x^\star_H\|^2\big]
+ 9\sigma^2T^2\alpha^2 + \frac{3T\sigma^2f}{L|H|}\alpha \\
&\le \Big(1-\frac{\mu T\alpha}{18}\Big)\mathbb{E}\big[\|x_k-x^\star_H\|^2\big] + 9\sigma^2T^2\alpha^2 + \frac{3T\sigma^2f}{L|H|}\alpha \\
&\le \Big(1-\frac{\mu T\alpha}{18}\Big)^{k}\mathbb{E}\big[\|x_0-x^\star_H\|^2\big] + \frac{162\sigma^2T}{\mu}\alpha + \frac{54\sigma^2f}{\mu L|H|},
\end{aligned}
\]
which concludes the proof.