
Nonmonotone Globalization for Anderson Acceleration Using Adaptive Regularization

Wenqing Ouyang, Jiong Tao
University of Science and Technology of China, Hefei, China
{wq8809,taojiong}@mail.ustc.edu.cn

Andre Milzarek∗
The Chinese University of Hong Kong, Shenzhen, China
Shenzhen Research Institute of Big Data, Shenzhen, China
[email protected]

Bailin Deng†
Cardiff University, Cardiff, UK
[email protected]

Abstract

Anderson acceleration (AA) is a popular method for accelerating fixed-point iterations, but may suffer from instability and stagnation. We propose a globalization method for AA to improve its stability and achieve global and local convergence. Unlike existing AA globalization approaches that often rely on safeguarding operations and might hinder fast local convergence, we adopt a nonmonotone trust-region framework and introduce an adaptive quadratic regularization together with a tailored acceptance mechanism. We prove global convergence and show that our algorithm attains the same local convergence as AA under appropriate assumptions. The effectiveness of our method is demonstrated in several numerical experiments.

1 Introduction

In applied mathematics, many problems can be reduced to solving a nonlinear fixed-point equation g(x) = x, where x ∈ R^n and g : R^n → R^n is a given function. If g is a contractive mapping, i.e.,

    ‖g(x) − g(y)‖ ≤ κ‖x − y‖   ∀ x, y ∈ R^n,

where κ < 1, then the iteration

    x^{k+1} = g(x^k)

is ensured to converge to the fixed-point of g by Banach's fixed-point theorem. Anderson acceleration (AA) [2, 45, 3] is a technique for accelerating the convergence of such an iterative process. Instead of using x^{k+1} = g(x^k), it generates x^{k+1} as an affine combination of the latest m + 1 iterates. Specifically, it first solves the optimization problem

    min_α ‖f(x^k) + Σ_{i=1}^m α_i (f(x^{k−i}) − f(x^k))‖²,   (1)

where f(x) := g(x) − x denotes the residual function and α = (α_1, ..., α_m)^T ∈ R^m are the combination coefficients to be optimized. Then, using a solution α* to (1), AA generates x^{k+1} via:

    x^{k+1} = β_k ( g(x^k) + Σ_{i=1}^m α*_i (g(x^{k−i}) − g(x^k)) ) + (1 − β_k) ( x^k + Σ_{i=1}^m α*_i (x^{k−i} − x^k) ).   (2)

∗A. Milzarek is partly supported by the Fundamental Research Fund – Shenzhen Research Institute of Big Data (SRIBD) Startup Fund JCYJ-AM20190601.
†Corresponding author.

Preprint. Under review.
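To make the update (1)–(2) concrete, the following minimal NumPy sketch computes the coefficients by solving the linear least-squares problem (1) and then forms the accelerated iterate from the combination of the g-values, which is the form the paper works with in Section 2.1. The function name and the history layout are illustrative choices of ours, not the authors' implementation.

```python
import numpy as np

def aa_step(x_hist, g_hist):
    """One Anderson acceleration step from the last m+1 iterates x^{k-m},...,x^k
    and their images g(x^{k-m}),...,g(x^k) (most recent entry last).
    Assumes at least two stored iterates."""
    f_hist = [g - x for x, g in zip(x_hist, g_hist)]       # residuals f(x) = g(x) - x
    fk = f_hist[-1]
    # Columns of the least-squares matrix: f(x^{k-i}) - f(x^k), i = 1..m.
    F = np.column_stack([f_hist[-1 - i] - fk for i in range(1, len(f_hist))])
    # Solve (1): min_alpha || f(x^k) + F alpha ||^2.
    alpha, *_ = np.linalg.lstsq(F, -fk, rcond=None)
    gk = g_hist[-1]
    # Accelerated iterate: g(x^k) + sum_i alpha_i (g(x^{k-i}) - g(x^k)).
    return gk + sum(a * (g_hist[-1 - i] - gk) for i, a in enumerate(alpha, start=1))
```

This plain variant has no safeguards; the regularized and globalized version proposed in this paper is sketched after Algorithm 1 below.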


The parameter β_k is typically set to one, so that only the combination of the g-values in (2) is used, and we will assume this conventional setting in this paper. AA was initially proposed for solving integral equations [2] and has gained popularity in recent years for accelerating fixed-point iterations [45]. Examples include tensor decomposition [38], linear system solving [31], and reinforcement learning [16], among many others [24, 6, 1, 47, 27, 26, 29, 22, 49].

On the theoretical side, AA has been shown to be equivalent to a quasi-Newton method for finding a root of the residual function [12, 14, 34]. A local convergence analysis of AA for general nonlinear problems was first given in [42, 41], although the provided convergence rate is no faster than that of the original iteration. More recent analysis in [11] shows that AA can indeed accelerate the local linear convergence of a fixed-point iteration.

AA can suffer from instability and stagnation [30, 35]. Different techniques have been proposed to improve its stability or to achieve convergence. For example, safeguarding checks were introduced in [27, 49] to only accept an AA step if it meets certain criteria, but without a theoretical guarantee of convergence. Another direction is to introduce regularization into the problem (1) for computing the combination coefficients. In [15], a quadratic regularization is used together with a safeguarding step to achieve global convergence of AA on Douglas-Rachford splitting, but there is no guarantee that the local convergence is faster than that of the original solver. In [35, 36], a similar quadratic regularization is introduced to achieve local convergence, although no global convergence proof is provided.

As far as we are aware, none of the existing approaches can guarantee both global convergence and accelerated local convergence of AA. In this paper, we propose a novel AA globalization scheme that achieves these two goals simultaneously. Specifically, we apply a quadratic regularization whose weight is adjusted automatically according to the effectiveness of the AA step. We adapt the nonmonotone trust-region framework of [43] to update the weight and to determine the acceptance of the AA step. Our approach not only achieves global convergence, but also attains the same local convergence rate as the one given in [11] for AA without regularization. Furthermore, our local results also cover applications where the mapping g is nonsmooth, since differentiability is only required at a target fixed-point of g. To the best of our knowledge, this is the first globalization technique for AA that achieves the same local convergence rate as the original AA scheme. Numerical experiments, including both smooth and nonsmooth problems, verify the effectiveness and efficiency of our method.

Notations Throughout this paper, we restrict our discussion to the n-dimensional Euclidean space R^n. For a vector x, ‖x‖ denotes its Euclidean norm, and B_ε(x) := {y : ‖y − x‖ ≤ ε} denotes the Euclidean ball centered at x with radius ε. For a matrix A, ‖A‖ is the operator norm with respect to the Euclidean norm. We use I to denote both the identity mapping, i.e., I(x) = x, and the identity matrix. For a function h : R^n → R^ℓ, the mapping h′ : R^n → R^{ℓ×n} represents its Fréchet derivative. h is called L-smooth if it is differentiable and ‖h′(x) − h′(y)‖ ≤ L‖x − y‖ for all x, y ∈ R^n. An operator h : R^n → R^n is called nonexpansive if for all x, y ∈ R^n we have ‖h(x) − h(y)‖ ≤ ‖x − y‖. We say that the operator h is ρ-averaged for some ρ ∈ (0, 1) if there exists a nonexpansive operator R : R^n → R^n such that h = (1 − ρ)I + ρR. The set of fixed-points of the mapping h is defined via Fix(h) := {x : h(x) = x}. The reader is referred to [4] for further details on operator theory.

For a given sequence {x^k} and a set of m + 1 distinct (not necessarily sorted) integers Π := {k_0, k_1, . . . , k_m} ⊂ N, k_i ≠ k_j for i, j ∈ {0, . . . , m} with i ≠ j, and α ∈ R^m, we introduce the notation:

    x_[Π](α) := x^{k_0} + Σ_{i=1}^m α_i (x^{k_i} − x^{k_0}),
    g^k := g(x^k),   g_[Π](α) := g^{k_0} + Σ_{i=1}^m α_i (g^{k_i} − g^{k_0}),
    f^k := f(x^k),   f_[Π](α) := g_[Π](α) − x_[Π](α).
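These affine combinations are used throughout the analysis; a small helper sketch (with hypothetical names of ours) makes the definitions concrete.

```python
import numpy as np

def affine_combo(v0, v_rest, alpha):
    """v_[Pi](alpha) = v0 + sum_i alpha_i (v_i - v0) for vectors v_i in v_rest."""
    return v0 + sum(a * (v - v0) for a, v in zip(alpha, v_rest))

def f_Pi(x0, x_rest, g0, g_rest, alpha):
    """f_[Pi](alpha) = g_[Pi](alpha) - x_[Pi](alpha)."""
    return affine_combo(g0, g_rest, alpha) - affine_combo(x0, x_rest, alpha)
```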

2 Algorithmic Framework and Convergence Analysis

2.1 Adaptive Regularization for Anderson Acceleration

We first note that the accelerated iterate x^{k+1} computed via (1) and (2) is invariant under permutations of the indices for {f^j} and {g^j}. Concretely, let Π_k := {k_0, k_1, . . . , k_m} be any permutation of the index set {k, k − 1, . . . , k − m}. Then the value x^{k+1} computed in (2) with β_k = 1 also satisfies

    x^{k+1} = g^{k_0} + Σ_{i=1}^m α^k_i (g^{k_i} − g^{k_0}) = g_[Π_k](α^k),


where the coefficients α^k = (α^k_1, . . . , α^k_m) ∈ R^m are computed via

    α^k ∈ argmin_α ‖f^{k_0} + Σ_{i=1}^m α_i (f^{k_i} − f^{k_0})‖² = argmin_α ‖f_[Π_k](α)‖²,

which amounts to solving a linear system. In the following, we assume a particular choice of the permutation where k_0 := max{j : j ∈ argmin_{i∈{k,k−1,...,k−m}} ‖f^i‖}, i.e., k_0 is chosen to be the largest index that attains the minimum residual norm among ‖f^{k−m}‖, ‖f^{k−m+1}‖, . . . , ‖f^k‖. From now on, we will denote x_k(α) = x_[Π_k](α), g_k(α) = g_[Π_k](α), and f_k(α) = f_[Π_k](α) for simplicity.

One potential cause of instability in AA is the (near) linear dependence of the vectors {f^{k_i} − f^{k_0} : i = 1, . . . , m}, which can cause (near) rank deficiency of the linear system matrix. To address this issue, we introduce a quadratic regularization for the variables α^k and compute them via:

    min_α ‖f_k(α)‖² + λ_k‖α‖²,   (3)

where λ_k is the regularization weight. Such a regularization approach has been suggested in a recent paper by Anderson [3] and has also been applied in [15, 36]. A major difference between our method and those in [15, 36] is the choice of λ_k: in [15] it is set in a heuristic manner, whereas in [36] it is either fixed or specified via grid search. In contrast, we update λ_k adaptively based on the effectiveness of the latest AA step. Specifically, we observe that a larger value of λ_k can make the solution more stable, but may also hinder the fast convergence of AA if it is already effective in reducing the residual without regularization. Thus we adjust λ_k based on the reduction of the residual achieved by the current step. We further note that the regularized quadratic problem (3), together with the automatic update of the regularization weight, is similar to the Levenberg-Marquardt (LM) algorithm [18, 23], a popular approach for stabilizing the Gauss-Newton method for nonlinear least-squares problems. The LM algorithm determines the acceptance of a trial step and adjusts the regularization weight by comparing the predicted reduction of the target function with the actual reduction. Following this idea, we define two functions pred_k and ared_k that measure, respectively, the predicted and actual reduction of the residuals resulting from the solution α^k to (3):

    pred_k(α^k) := Σ_{i=0}^m γ_{k_i} ‖f^{k_i}‖² − c² ‖f_k(α^k)‖²,   ared_k(α^k) := Σ_{i=0}^m γ_{k_i} ‖f^{k_i}‖² − ‖f(g_k(α^k))‖²,   (4)

where c ∈ (0, 1) and {γ_{k_i} : i = 0, 1, . . . , m} are convex combination coefficients that satisfy

    γ_{k_i} ≥ γ > 0   ∀ i ∈ {0, 1, . . . , m}   and   Σ_{i=0}^m γ_{k_i} = 1.   (5)

In this paper, we choose γ_{k_0} = 1 − mγ and γ_{k_1} = γ_{k_2} = . . . = γ_{k_m} = γ for simplicity, although other choices are also applicable. The ratio ρ_k = ared_k(α^k)/pred_k(α^k) indicates the effectiveness of the trial step x^{k+1} = g_k(α^k) in reducing the residual. In particular, if this ratio is sufficiently close to or larger than 1, then the trial step satisfies ‖f(x^{k+1})‖² ≲ c²‖f_k(α^k)‖². Thus, we accept the trial step if ρ_k ≥ p_1 for a threshold p_1 ∈ (0, 1), and we say the iteration is successful in this case. Otherwise, we discard the trial step and choose x^{k+1} = g^{k_0} = g(x^{k_0}), which, based on our choice of the permutation, is the fixed-point iteration step with the smallest residual among the latest m + 1 iterates. For the regularization weight, we set

    λ_k = µ_k min{‖f^{k_0}‖^δ, C_1},

where δ ≥ 2, C_1 > 0, and the factor µ_k > 0 is automatically adjusted based on the ratio ρ_k. The value ‖f^{k_0}‖^δ is inspired by [13], which relates the LM regularization weight to the residual norm. The constant C_1 is used to prevent excessively large regularization weights in the initial iterations due to large residuals. The parameter µ_k is adjusted as follows. If ρ_k < p_1, then we consider the decrease of the residual to be insufficient and increase the factor in the next iteration via µ_{k+1} = η_0 µ_k with η_0 > 1. On the other hand, if ρ_k > p_2 for a threshold p_2 ∈ (p_1, 1), then we consider the decrease to be high enough and reduce the factor via µ_{k+1} = max{η_1 µ_k, µ_min} with 0 < η_1 < 1 and µ_min ≥ 0. Otherwise, in the case ρ_k ∈ [p_1, p_2], the factor remains the same in the next iteration. Algorithm 1 summarizes the full method. Note that our acceptance strategy allows the residual of x^{k+1} to be larger than the minimum residual ‖f^{k_0}‖ among the past m + 1 iterates. Therefore, our scheme is nonmonotone and follows the procedure investigated in [43].


Algorithm 1 Adaptive regularization for Anderson acceleration

Initialization: Choose 0 < p_1 < p_2 < 1, η_0 > 1 > η_1 > 0, µ_min ≥ 0, γ > 0, δ ≥ 2, 0 < c < 1, C_1 ≥ 0, and ε_f > 0. Initialize x^0 ∈ R^n and µ_0 ≥ µ_min. Select m̄ ∈ N_+ and set k = 0.
1: while ‖f^k‖ > ε_f do
2:   For m = min{m̄, k}, construct a permutation Π_k = {k_0, k_1, ..., k_m} of the index set {k, k − 1, . . . , k − m}, with k_0 = max{j : j ∈ argmin_{i∈{k,k−1,...,k−m}} ‖f^i‖}.
3:   Set λ_k = µ_k min{‖f^{k_0}‖^δ, C_1} and solve (3) to obtain α^k.
4:   Choose γ_{k_0}, . . . , γ_{k_m} satisfying (5) and compute the reduction ratio ρ_k := ared_k(α^k) / pred_k(α^k).
5:   Update the parameter µ_{k+1} via:
         µ_{k+1} = η_0 µ_k               if ρ_k < p_1,
         µ_{k+1} = µ_k                   if ρ_k ∈ [p_1, p_2],
         µ_{k+1} = max{η_1 µ_k, µ_min}   if ρ_k > p_2.
6:   If ρ_k ≥ p_1, accept the step x^{k+1} = g_k(α^k). Otherwise, reject it and set x^{k+1} = g^{k_0}.
7:   Set k ← k + 1.
8: end while
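A compact Python rendering of Algorithm 1 is given below. It is a minimal single-threaded sketch under the parameter choices stated above (γ_{k_0} = 1 − mγ, γ_{k_i} = γ for i ≥ 1) and solves the regularized subproblem (3) by forming the normal equations densely; variable names, default values not fixed by the paper (e.g., c, µ_0, C_1), and the dense solve are illustrative rather than the authors' implementation.

```python
import numpy as np

def lm_aa(g, x0, m=10, mu0=1.0, mu_min=0.0, p1=0.01, p2=0.25,
          eta0=2.0, eta1=0.25, delta=2.0, C1=np.inf, c=0.5,
          gamma=1e-4, eps_f=1e-10, max_iter=1000):
    """Sketch of Algorithm 1 for a fixed-point map g: R^n -> R^n."""
    xs = [np.asarray(x0, dtype=float)]       # iterate history
    fs = [g(xs[0]) - xs[0]]                  # residual history f(x) = g(x) - x
    mu = mu0
    for k in range(max_iter):
        if np.linalg.norm(fs[-1]) <= eps_f:
            break
        mk = min(m, k)
        # k0 attains the smallest residual norm in the window (ties -> largest index).
        window = list(range(len(xs) - 1 - mk, len(xs)))
        norms = [np.linalg.norm(fs[j]) for j in window]
        i0 = max(i for i, nv in enumerate(norms) if nv == min(norms))
        k0 = window[i0]
        rest = [j for j in window if j != k0]

        f0 = fs[k0]
        g0 = xs[k0] + fs[k0]                 # g(x^{k0}) = x^{k0} + f^{k0}
        lam = mu * min(np.linalg.norm(f0) ** delta, C1)

        if rest:
            J = np.column_stack([fs[j] - f0 for j in rest])      # f^{ki} - f^{k0}
            A = J.T @ J + lam * np.eye(J.shape[1])
            alpha = np.linalg.solve(A, -J.T @ f0)                # subproblem (3)
            f_alpha = f0 + J @ alpha
            x_trial = g0 + sum(a * ((xs[j] + fs[j]) - g0) for a, j in zip(alpha, rest))
        else:
            f_alpha, x_trial = f0, g0

        f_trial = g(x_trial) - x_trial

        # Convex weights (5): gamma_{k0} = 1 - m*gamma, gamma_{ki} = gamma otherwise.
        wsum = (1.0 - mk * gamma) * np.linalg.norm(f0) ** 2 \
             + gamma * sum(np.linalg.norm(fs[j]) ** 2 for j in rest)
        pred = wsum - c ** 2 * np.linalg.norm(f_alpha) ** 2      # (4)
        ared = wsum - np.linalg.norm(f_trial) ** 2               # (4)
        rho = ared / pred

        if rho < p1:                         # unsuccessful: fall back to g^{k0}
            mu = eta0 * mu
            x_next = g0
            f_next = g(x_next) - x_next
        else:                                # successful: accept the AA trial step
            if rho > p2:
                mu = max(eta1 * mu, mu_min)
            x_next, f_next = x_trial, f_trial

        xs.append(x_next)
        fs.append(f_next)
    return xs[-1]
```

For simplicity this sketch re-forms the matrix J in every iteration; the Gram-matrix bookkeeping discussed below removes that O(m²n) cost.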

In the next subsections, we will see that this nonmonotone strategy allows us to establish fast local convergence that achieves the acceleration effect of the original AA scheme.

The main computational overhead of our method lies in the optimization problem (3), which amounts to constructing and solving an m × m linear system (J^T J + λ_k I)α^k = −J^T f^{k_0}, where J = [f^{k_1} − f^{k_0}, f^{k_2} − f^{k_0}, . . . , f^{k_m} − f^{k_0}] ∈ R^{n×m}. A naïve implementation that re-computes the whole matrix J in each iteration results in O(m²n) time for setting up the linear system, whereas the system itself can be solved in O(m³) time. In practice, we often have m ≪ n, and we would therefore like to reduce the cost of constructing the linear system. To this end, we maintain a matrix Q = E^T E ∈ R^{(m+1)×(m+1)}, where E ∈ R^{n×(m+1)} contains f^{k−m}, f^{k−m+1}, . . . , f^k in its columns, so that the entries of Q are inner products between the latest residual vectors. The matrix Q can be updated in O(mn) time in each iteration, since we only need to compute the inner products between f^k and its preceding m residuals, while the other entries of Q have been determined in previous iterations. Note that each entry of J^T J can be written as (f^{k_i} − f^{k_0})^T (f^{k_j} − f^{k_0}) = (f^{k_i})^T f^{k_j} + (f^{k_0})^T f^{k_0} − (f^{k_i})^T f^{k_0} − (f^{k_j})^T f^{k_0}. Thus we evaluate each entry in O(1) time by directly retrieving the inner products from the matrix Q, with a total time of O(m²) for all entries. Similarly, the entries of J^T f^{k_0} are evaluated in O(m) total time. In this way, the linear system setup only requires O(mn) computational time.
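The O(mn) setup described above can be sketched as follows: a Gram matrix Q = E^T E of the stored residuals is extended by one row/column per iteration, and the entries of J^T J and J^T f^{k_0} are then assembled from Q in O(m²) time. The class below is a hypothetical illustration of this bookkeeping, not the authors' code.

```python
import numpy as np

class ResidualGram:
    """Maintains Q[i, j] = <f_i, f_j> for the stored residual vectors."""

    def __init__(self):
        self.F = []                      # stored residuals f^{k-m}, ..., f^k
        self.Q = np.zeros((0, 0))

    def push(self, f_new, max_len):
        # O(m n): inner products of the new residual with the stored ones.
        new_col = np.array([Fi @ f_new for Fi in self.F] + [f_new @ f_new])
        m_old = self.Q.shape[0]
        Q = np.zeros((m_old + 1, m_old + 1))
        Q[:m_old, :m_old] = self.Q
        Q[:, -1] = new_col
        Q[-1, :] = new_col
        self.F.append(f_new)
        self.Q = Q
        if len(self.F) > max_len:        # drop the oldest residual
            self.F.pop(0)
            self.Q = self.Q[1:, 1:]

    def normal_equations(self, i0):
        """Assemble J^T J and J^T f^{k0} (k0 = index i0) from stored inner
        products only, in O(m^2) time."""
        idx = [i for i in range(len(self.F)) if i != i0]
        q00 = self.Q[i0, i0]
        JTJ = np.array([[self.Q[i, j] - self.Q[i, i0] - self.Q[j, i0] + q00
                         for j in idx] for i in idx])
        JTf0 = np.array([self.Q[i, i0] - q00 for i in idx])
        return JTJ, JTf0
```

Each iteration then forms and solves the m × m system (J^T J + λ_k I)α^k = −J^T f^{k_0}, keeping the setup cost at O(mn) and the solve at O(m³).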

2.2 Global Convergence Analysis

We now present our main assumptions on g and f that allow us to establish global convergence of Algorithm 1. Our conditions are mainly based on a monotonicity property and on point-wise convergence of the iterated functions g^{(k)} : R^n → R^n, g^{(k)}(x) := (g ◦ · · · ◦ g)(x) (k times), for k ∈ N.

Assumption 2.1. We work with the following conditions:

(A.1) It holds that ‖f(g(x))‖ ≤ ‖f(x)‖ for all x ∈ R^n.
(A.2) We have ‖f(g^{(k)}(x))‖ → 0 as k → ∞ for all x ∈ R^n.

It is easy to see that Assumption 2.1 holds for any contractive g. In the following, we verify that it is also satisfied for ρ-averaged operators, which define a broader class of mappings.

Proposition 2.2. Assume that g : R^n → R^n is a ρ-averaged operator with ρ ∈ (0, 1) and suppose Fix(g) ≠ ∅. Then Assumption 2.1 is satisfied for g.

Proof. By definition, the ρ-averaged operator g is also nonexpansive and hence (A.1) holds for g. To prove (A.2), let us fix x* ∈ Fix(g). Due to [4, Proposition 4.35(iii)], we obtain

    ‖g(x) − x*‖² + ((1 − ρ)/ρ) ‖f(x)‖² ≤ ‖x − x*‖²   ∀ x ∈ R^n,


where we have utilized the fact that f(x*) = 0. Setting x ≡ g^{(k−1)}(x), we can deduce that

    ‖g^{(k)}(x) − x*‖² + ((1 − ρ)/ρ) ‖f(g^{(k−1)}(x))‖² ≤ ‖g^{(k−1)}(x) − x*‖²   ∀ k ≥ 1, x ∈ R^n.

Summing this inequality over all k ∈ N yields:

    ((1 − ρ)/ρ) Σ_{k=0}^∞ ‖f(g^{(k)}(x))‖² ≤ ‖x − x*‖²   ∀ x ∈ R^n,

which implies ‖f(g^{(k)}(x))‖ → 0 as k → ∞.

Remark 2.3. The condition Fix(g) ≠ ∅ is necessary in Proposition 2.2: as a counterexample, g(x) = x + 1 is a ρ-averaged operator for all ρ ∈ (0, 1), has no fixed-point, and does not satisfy (A.2). Furthermore, as mentioned in the proof of Proposition 2.2, (A.1) is always satisfied if g is a nonexpansive operator. However, nonexpansiveness is not a necessary condition for (A.1). In fact, we can construct an operator g that is not nonexpansive but satisfies (A.1) and (A.2), e.g.,

    g : R → R,   g(x) := 0.5x if x ∈ [0, 1],   g(x) := 0 otherwise.

For any x ∈ [0, 1], it follows that g^{(k)}(x) = 2^{−k}x and f(g^{(k)}(x)) = −2^{−(k+1)}x, and it is easy to verify (A.1) and (A.2). For any x ∉ [0, 1] we have f(g(x)) = f(0) = 0, thus (A.1) and (A.2) also hold in this situation. However, since g is discontinuous, it cannot be nonexpansive.

Example 2.4. Let us consider the nonsmooth optimization problem:

    min_{x∈R^n} r(x) + ϕ(x),   (6)

where both r, ϕ : R^n → (−∞, ∞] are proper, closed, and convex functions, and r is L-smooth. It is well known that x* is a solution to this problem if and only if it satisfies the nonsmooth equation:

    x* − G_µ(x*) = 0,   G_µ(x) := prox_{µϕ}(x − µ∇r(x)),   (7)

where prox_{µϕ}(x) := argmin_y ϕ(y) + (1/(2µ))‖x − y‖², µ > 0, is the proximity operator of ϕ; see also [4, Corollary 26.3]. G_µ is also known as the forward-backward splitting operator, and it is a ρ-averaged operator for all µ ∈ (0, 2/L); see, e.g., [7]. Hence, Assumption 2.1 is satisfied and our theory can be used to study the global convergence of Algorithm 1 applied to (7).

Remark 2.5. For the problem (6), it is easy to show that Douglas-Rachford splitting, as well as its equivalent form of ADMM, can be written as a ρ-averaged operator with ρ ∈ (0, 1) (see, e.g., [19]). Therefore, the problems considered in [15] are also covered by Assumption 2.1.
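As a concrete instance of (7), the sketch below implements G_µ for the lasso-type choice r(x) = ½‖Ax − b‖² and ϕ(x) = τ‖x‖₁ (an illustrative example of ours, not one from the paper). The resulting map is ρ-averaged for µ ∈ (0, 2/L) with L = ‖A‖², so Assumption 2.1 applies and the map can be fed directly to a fixed-point accelerator such as Algorithm 1.

```python
import numpy as np

def make_forward_backward(A, b, tau, mu):
    """Forward-backward map G_mu(x) = prox_{mu*phi}(x - mu * grad r(x))
    for r(x) = 0.5 * ||A x - b||^2 and phi(x) = tau * ||x||_1."""
    def soft_threshold(z, t):
        # prox of t*||.||_1: componentwise shrinkage towards zero.
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def G(x):
        grad = A.T @ (A @ x - b)                 # gradient of r
        return soft_threshold(x - mu * grad, mu * tau)
    return G

# Usage sketch: mu = 1.0 / np.linalg.norm(A, 2) ** 2 lies in (0, 2/L);
# the fixed-point residual is f(x) = G(x) - x.
```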

We can now show the global convergence of Algorithm 1.

Theorem 2.6. Suppose that Assumption 2.1 is satisfied and let {x^k} be generated by Algorithm 1. Then

    lim_{k→∞} ‖f^k‖ = 0.

Proof. To simplify the notation, we introduce the function P : N → N,

    P(k) := max{j : j ∈ argmin_{i∈{k,k−1,...,k−m}} ‖f^i‖},

and recall that an iteration k is called successful if ρ_k ≥ p_1; we collect these iterations in the set S := {k ∈ N : ρ_k ≥ p_1}.

If Algorithm 1 terminates after finitely many steps, the conclusion simply follows from the stopping criterion. So, in the following, we assume that a sequence of iterates of infinite length is generated. We consider two different cases.

Case 1: |S| < ∞. Let k̄ denote the index of the last successful iteration in S (we set k̄ = 0 if S = ∅). Then every iteration k ≥ k̄ + 1 is unsuccessful and hence x^{k+1} = g(x^{P(k)}) for all k ≥ k̄ + 1. We first show that P(k) = k for all k ≥ k̄ + 2. Since k̄ + 1 ∉ S, we have x^{k̄+2} = g(x^{P(k̄+1)}) and, by (A.1), this implies ‖f(x^{k̄+2})‖ ≤ ‖f(x^{P(k̄+1)})‖. From the definition of P, we have ‖f(x^{P(k̄+1)})‖ ≤ ‖f^{k̄+1−i}‖ for every 0 ≤ i ≤ min{m, k̄ + 1} and hence P(k̄ + 2) = k̄ + 2. An inductive argument then yields P(k) = k for all k ≥ k̄ + 2. Consequently, for any k ≥ k̄ + 2 we have x^{k+1} = g(x^{P(k)}) = g(x^k), and therefore x^k = g^{(k−k̄−1)}(x^{P(k̄+1)}). Utilizing (A.2), it follows that ‖f^k‖ = ‖f(g^{(k−k̄−1)}(x^{P(k̄+1)}))‖ → 0 as k → ∞.


Case 2: |S| = ∞. Let us set W_k := max_{k−m≤i≤k} ‖f^i‖². If k ∈ S, then we have

    p_1 ≤ ρ_k = ared_k(α^k) / pred_k(α^k).

By taking α = 0 in (3), we obtain ‖f_k(α^k)‖ ≤ ‖f^{P(k)}‖ = min_{k−m≤i≤k} ‖f^i‖. Due to (5), this yields

    pred_k(α^k) ≥ ‖f^{P(k)}‖² − c²‖f(x^{P(k)})‖² = (1 − c²)‖f(x^{P(k)})‖² > 0.

Since p_1 is positive, it also holds that ared_k(α^k) > 0. Hence, if c²‖f^{P(k)}‖² ≤ ‖f^{k+1}‖², we can deduce:

    p_1 ≤ ared_k(α^k)/pred_k(α^k) ≤ (Σ_{i=0}^m γ_{k_i}‖f^{k_i}‖² − ‖f^{k+1}‖²) / (Σ_{i=0}^m γ_{k_i}‖f^{k_i}‖² − c²‖f^{P(k)}‖²) ≤ (W_k − ‖f^{k+1}‖²) / (W_k − c²‖f^{P(k)}‖²) ≤ (W_k − ‖f^{k+1}‖²) / ((1 − c²)W_k),

where the first inequality uses ‖f_k(α^k)‖ ≤ ‖f^{P(k)}‖, and the second one uses Σ_{i=0}^m γ_{k_i}‖f^{k_i}‖² ≤ W_k together with the case assumption c²‖f^{P(k)}‖² ≤ ‖f^{k+1}‖², which makes the quotient non-decreasing in its common leading term. This implies

    ‖f^{k+1}‖² ≤ c_p W_k,   c_p := 1 − (1 − c²)p_1.

Otherwise, if c²‖f^{P(k)}‖² > ‖f^{k+1}‖², we readily obtain ‖f^{k+1}‖² ≤ c²W_k ≤ c_p W_k. Moreover, in the case k ∉ S, we have x^{k+1} = g^{P(k)}, and by assumption (A.1) it then follows that

    ‖f^{k+1}‖² ≤ ‖f^{P(k)}‖² ≤ W_k.

Combining these estimates, we can infer that {W_k} is non-increasing. Next, we show W_{k+m+1} ≤ c_p W_k for all k ∈ S. It suffices to prove that for any k + 1 ≤ i ≤ k + m + 1 we have ‖f^i‖² ≤ c_p W_k. Since we consider a successful iteration k ∈ S, our previous discussion has already shown ‖f^{k+1}‖² ≤ c_p W_k. We now assume ‖f^i‖² ≤ c_p W_k for some k + 1 ≤ i ≤ k + m. If i ∈ S, we obtain ‖f^{i+1}‖² ≤ c_p W_i ≤ c_p W_k. Otherwise, it follows that ‖f^{i+1}‖² ≤ ‖f^{P(i)}‖² ≤ ‖f^i‖² ≤ c_p W_k. Hence, by induction, we have W_{k+m+1} ≤ c_p W_k for all k ∈ S. Since {W_k} is non-increasing and we assumed |S| = ∞, this establishes W_k → 0 and ‖f^k‖ → 0.

2.3 Local Convergence Analysis

Next, we analyze the local convergence of our proposed approach, starting with some assumptions.

Assumption 2.7. The function g : Rn → Rn satisfies the following conditions:

(B.1) g is Lipschitz continuous with constant κ < 1.
(B.2) g is differentiable at x*, where x* is the fixed-point of g, and there exist constants L, ε > 0 such that for any x ∈ B_ε(x*) it holds that

    ‖g(x) − g(x*) − g′(x*)(x − x*)‖ ≤ L‖x − x*‖².

Remark 2.8. (B.1) is a standard assumption widely used in the local convergence analysis of AA [42, 11, 35, 36]. The existing analyses typically rely on the smoothness of g. In contrast, (B.2) allows g to be nonsmooth and only requires it to be differentiable at the fixed-point x*, which lets our assumptions cover a wider variety of methods such as forward-backward splitting and Douglas-Rachford splitting under appropriate conditions. We note that AA for nonsmooth g has also been investigated in [48, 15], but without a local convergence analysis. In Section 3, we will verify the conditions (B.1) and (B.2) for the numerical examples, with a more detailed discussion in Appendix A.

Remark 2.9. (B.1) implies that the function g is contractive, which is a sufficient condition for (A.1) and (A.2). Thus, a function g satisfying Assumption 2.7 also fulfills Assumption 2.1.

Similar to the local convergence analyses in [42, 11], we also work with the following condition:

Assumption 2.10. Let ᾱ^k denote a solution to the problem

    min_α ‖f^{k_0} + Σ_{i=1}^m α_i (f^{k_i} − f^{k_0})‖² = min_α ‖f_[Π_k](α)‖².   (8)

We assume that there exists M > 0 such that ‖ᾱ^k‖_∞ ≤ M for all k sufficiently large.

Remark 2.11. The assumptions given in [42, 11] are formulated without permuting the last m + 1 indices. We further note that we do not need the solution ᾱ^k to be unique.


The acceleration effect of the original AA scheme has only been studied very recently in [11, Theorem 4.4], based on slightly stronger assumptions. In particular, their result can be stated as

    ‖f(g_k(ᾱ^k))‖ ≤ κ θ_k ‖f^{k_0}‖ + Σ_{i=0}^m O(‖f^{k−i}‖²),   (9)

where

    θ_k := ‖f_k(ᾱ^k)‖ / ‖f^{k_0}‖   (10)

is an acceleration factor. Since ᾱ^k is a solution to the problem (8), we have ‖f_k(ᾱ^k)‖ ≤ ‖f_k(0)‖ = ‖f^{k_0}‖, so that θ_k ∈ [0, 1]. Then (9) implies that, for a fixed-point iteration that converges linearly with contraction constant κ, AA can improve the convergence rate locally. In the following, we show that our globalized AA method possesses similar characteristics under weaker assumptions.

2.3.1 Anderson Acceleration Improves the Convergence Rate

We first show that AA can enhance the convergence of the underlying fixed-point iteration under Assumptions 2.7 and 2.10. We begin with an estimate of the term g(x_k(α)) − g_k(α).

Proposition 2.12. Suppose that Assumption 2.7 is satisfied. For any k ∈ N with x^k, . . . , x^{k−m}, x_k(α) ∈ B_ε(x*) and α ∈ R^m with ‖α‖_∞ ≤ M, we have:

    ‖g(x_k(α)) − g_k(α)‖ ≤ C(M) Σ_{i=0}^m ‖f^{k−i}‖²,   (11)

where C(M) := L(Mm + 1)(Mm² + M(m + 1) + 2)/(1 − κ)².

Proof. Define ν = (ν_0, ..., ν_m) ∈ R^{m+1} with ν_0 = 1 − Σ_{i=1}^m α_i and ν_j = α_j for 1 ≤ j ≤ m. Then we obtain

    g_k(α) = Σ_{i=0}^m ν_i g(x^{k_i}),   x_k(α) = Σ_{i=0}^m ν_i x^{k_i},   and   ‖ν‖_∞ ≤ 1 + mM.

By assumption (B.2) and using Σ_{i=0}^m ν_i = 1 and ‖Σ_{i=0}^m y_i‖² ≤ (m + 1) Σ_{i=0}^m ‖y_i‖², we have:

    ‖g(x_k(α)) − g_k(α)‖ = ‖g(x_k(α)) − g(x*) + g(x*) − g_k(α)‖
      ≤ ‖g′(x*)(x_k(α) − x*) + g(x*) − g_k(α)‖ + L‖x_k(α) − x*‖²
      ≤ ‖g′(x*)(x_k(α) − x*) + Σ_{i=0}^m ν_i (g(x*) − g(x^{k_i}))‖ + L‖Σ_{i=0}^m ν_i (x^{k_i} − x*)‖²
      ≤ ‖Σ_{i=0}^m ν_i [g′(x*)(x^{k_i} − x*) + g(x*) − g(x^{k_i})]‖ + L(m + 1)‖ν‖²_∞ Σ_{i=0}^m ‖x^{k_i} − x*‖²
      ≤ L(Mm + 1)(1 + (m + 1)‖ν‖_∞) Σ_{i=0}^m ‖x^{k_i} − x*‖².   (12)

Notice that the contraction property in (B.1) implies ‖f(x^{k_i})‖ ≥ (1 − κ)‖x^{k_i} − x*‖, which together with (12) yields (11).

Next, we establish the convergence rate given in (9) under Assumptions 2.7 and 2.10.

Corollary 2.13. Suppose that the conditions stated in Proposition 2.12 and Assumption 2.10 are satisfied and choose α = ᾱ^k. Then it holds that

    ‖f(g_k(ᾱ^k))‖ ≤ κ θ_k ‖f^{k_0}‖ + C(M) Σ_{i=0}^m ‖f^{k−i}‖².

Proof. Using the definitions of f_k and g_k, we have:

    ‖f(g_k(ᾱ^k))‖ = ‖g(g_k(ᾱ^k)) − g_k(ᾱ^k)‖ ≤ ‖g(g_k(ᾱ^k)) − g(x_k(ᾱ^k))‖ + ‖g(x_k(ᾱ^k)) − g_k(ᾱ^k)‖
      ≤ κ‖f_k(ᾱ^k)‖ + ‖g(x_k(ᾱ^k)) − g_k(ᾱ^k)‖.

The result then follows from Proposition 2.12 and (10).


2.3.2 Transition to Pure Regularized Anderson Acceleration

Next, we verify that our approach has favorable global-to-local properties, i.e., after a finite number of iterations every trial step x^{k+1} = g_k(α^k) will be accepted, so that the method eventually reduces to a pure regularized AA scheme.

Proposition 2.14. Suppose that Assumptions 2.7 and 2.10 are satisfied, let k ∈ N be given with x^k, . . . , x^{k−m}, x_k(α^k) ∈ B_ε(x*) and ‖α^k‖_∞ ≤ M, and consider y := g_k(α^k). Then we have:

    ‖f(y)‖² ≤ κ²‖f_k(α^k)‖² + 2C(M)‖f^{k_0}‖ Σ_{i=0}^m ‖f^{k−i}‖² + C(M)²(m + 1) Σ_{i=0}^m ‖f^{k−i}‖⁴.

Proof. First, we obtain:

    ‖f(y)‖² = ‖g(y) − y‖² ≤ ‖g(y) − g(x_k(α^k))‖² + ‖g(x_k(α^k)) − y‖² + 2‖g(y) − g(x_k(α^k))‖ ‖g(x_k(α^k)) − y‖.   (13)

By condition (B.1) of Assumption 2.7 and using the definition of α^k in (3), we have:

    ‖g(x_k(α^k)) − g(y)‖ ≤ κ‖x_k(α^k) − g_k(α^k)‖ = κ‖f_k(α^k)‖ ≤ κ‖f^{k_0}‖.

Hence, applying Proposition 2.12, it follows that

    2‖g(y) − g(x_k(α^k))‖ ‖g(x_k(α^k)) − y‖ ≤ 2C(M)‖f^{k_0}‖ Σ_{i=0}^m ‖f^{k−i}‖².   (14)

The statement of Proposition 2.14 is then a consequence of (13), (14), and Proposition 2.12.

We now present one of the main results of this work.

Theorem 2.15. Suppose that Assumptions 2.7 and 2.10 hold and that the constant c in (4) is chosen such that c ≥ κ. Then there exists some k′ ∈ N such that ρ_k ≥ p_2 for all k ≥ k′. In particular, every iteration k ≥ k′ is successful with x^{k+1} = g_k(α^k).

Proof. We will prove that there exists some k′ such that k ∈ S for all k ≥ k′. Due to the algorithmic construction and the definitions of α^k and ᾱ^k, it follows that

    ‖f_k(α^k)‖² + λ_k‖α^k‖² ≤ ‖f_k(ᾱ^k)‖² + λ_k‖ᾱ^k‖² ≤ ‖f_k(α^k)‖² + λ_k‖ᾱ^k‖²,   (15)

which implies ‖α^k‖_∞ ≤ ‖α^k‖ ≤ ‖ᾱ^k‖ ≤ √m M for all k. Next, we set ψ_k := max_{k−m≤i≤k} ‖f^i‖ and C_g := C(√m M). Theorem 2.6 guarantees convergence in the sense that lim_{k→∞} ‖f^k‖ = 0 and thus we can choose k′ sufficiently large such that:

    2C_g(m + 1)ψ_k³ + [C_g(m + 1)]²ψ_k⁴ < (1 − p_2) min{γ, 1 − c²} ψ_k²   ∀ k ≥ k′.

Due to the contraction property (B.1), we have ‖f(x^k)‖ ≥ (1 − κ)‖x^k − x*‖ for all k, which yields x^k → x* and g^k → x*. Hence, by increasing k′ (if necessary) we can assume g^k ∈ B_{ε̄}(x*) for all k ≥ k′ − m, where ε̄ satisfies

    ε̄ ≤ (1 + 2M m^{1/2})^{−1} ε.

Then, for all k ≥ k′, we have:

    ‖x_k(α^k) − x*‖ ≤ ‖g^{k_0} − x*‖ + ‖α^k‖_∞ Σ_{i=1}^m ‖g^{k_0} − g^{k_i}‖ ≤ ε̄ + 2M m^{1/2} ε̄ ≤ ε.

As a consequence, Proposition 2.14 is applicable for y = g_k(α^k) and we obtain:

    ‖f(g_k(α^k))‖² ≤ c²‖f_k(α^k)‖² + 2C_g‖f^{k_0}‖ Σ_{i=0}^m ‖f^{k−i}‖² + C_g²(m + 1) Σ_{i=0}^m ‖f^{k−i}‖⁴
                  ≤ c²‖f_k(α^k)‖² + 2C_g(m + 1)ψ_k³ + [C_g(m + 1)]²ψ_k⁴,


where we used c ≥ κ. Hence, setting Q_k := 2C_g(m + 1)ψ_k³ + [C_g(m + 1)]²ψ_k⁴, it follows that

    ared_k(α^k) = Σ_{i=0}^m γ_{k_i}‖f^{k_i}‖² − ‖f(g_k(α^k))‖² ≥ Σ_{i=0}^m γ_{k_i}‖f^{k_i}‖² − c²‖f_k(α^k)‖² − Q_k
                ≥ pred_k(α^k) − (1 − p_2) min{γ, 1 − c²} ψ_k².

Similarly, for the predicted reduction pred_k(α^k), using γ_{k_i}‖f^{k_i}‖² ≥ γ‖f^{k_i}‖² + (γ_{k_i} − γ)‖f^{k_0}‖², we can show:

    pred_k(α^k) = Σ_{i=0}^m γ_{k_i}‖f^{k_i}‖² − c²‖f_k(α^k)‖² ≥ γψ_k² + (1 − γ)‖f^{k_0}‖² − c²‖f^{k_0}‖²
                ≥ min{γ, 1 − c²} ψ_k².

Combining the last estimates, we can finally deduce

    ared_k(α^k)/pred_k(α^k) ≥ (pred_k(α^k) − (1 − p_2) min{γ, 1 − c²}ψ_k²) / pred_k(α^k) ≥ p_2,

which completes the proof.

Note that the proof of this theorem further illustrates the importance of the nonmonotone acceptance mechanism, as it allows us to balance the additional quadratic error terms caused by an AA step.

2.3.3 Acceleration Guarantee

We can finally show that our proposed approach has essentially the same local convergence rate as the original AA method as given in [11].

Theorem 2.16. Suppose that Assumptions 2.7 and 2.10 hold and that the parameters c, µ_min in Algorithm 1 satisfy c ≥ κ and µ_min = 0. Then, for all sufficiently large k, it holds that:

    ‖f^{k+1}‖ ≤ κ θ_k ‖f^{k_0}‖ + o(‖f^{k_0}‖^{δ/2}) + Σ_{i=0}^m O(‖f^{k−i}‖²),

where θ_k = ‖f_k(ᾱ^k)‖ / ‖f^{k_0}‖ is the acceleration factor from (10).

Proof. Theorem 2.15 implies ρ_k ≥ p_2 for all k sufficiently large and hence, by the update rule of Algorithm 1, it follows that µ_k = max{η_1 µ_{k−1}, µ_min} → 0. Then we can infer λ_k = o(‖f^{k_0}‖^δ). Using (15) and Assumption 2.10, this shows

    ‖f_k(α^k)‖ ≤ ‖f_k(ᾱ^k)‖ + √λ_k ‖ᾱ^k‖ = ‖f_k(ᾱ^k)‖ + o(‖f^{k_0}‖^{δ/2}).

Moreover, mimicking the proof of Corollary 2.13, we have:

    ‖f^{k+1}‖ = ‖g(x^{k+1}) − x^{k+1}‖ = ‖g(x^{k+1}) − g_k(α^k)‖ ≤ κ‖f_k(α^k)‖ + ‖g(x_k(α^k)) − g_k(α^k)‖.

Following the proof of Theorem 2.15, it holds that ‖α^k‖_∞ ≤ √m M and x^k, . . . , x^{k−m}, x_k(α^k) ∈ B_ε(x*), and thus Proposition 2.12 yields:

    ‖f^{k+1}‖ ≤ κ‖f_k(α^k)‖ + Σ_{i=0}^m O(‖f^{k_i}‖²)
             ≤ κ‖f_k(ᾱ^k)‖ + o(‖f^{k_0}‖^{δ/2}) + Σ_{i=0}^m O(‖f^{k_i}‖²)
             = κ θ_k ‖f^{k_0}‖ + o(‖f^{k_0}‖^{δ/2}) + Σ_{i=0}^m O(‖f^{k_i}‖²),

as desired.

Remark 2.17. Theorem 2.16 shows that if δ ≥ 4, then we have the same local convergence rate as in [11]. If δ ≥ 2, then the convergence rate is the same as in [11] asymptotically.


Figure 1: Comparison between gradient descent, RNA, and our method using the logistic regression problem (16) on the covtype and sido0 datasets, with a different choice of the parameter τ in each row. Each panel plots the normalized objective (F − F*)/F* against the iteration count or the computation time (s) for gradient descent, LM-AA (m = 10), RNA (k = 5), RNA (k = 10), and RNA (k = 20).

3 Numerical Experiments

In this section, we verify the effectiveness of our method with numerical experiments and compare its performance with other globalization approaches. All experiments are carried out on a Windows laptop with a Core i7-9750H CPU at 2.6GHz and 16GB of RAM. For all examples, we choose p_1 = 0.01, p_2 = 0.25, η_0 = 2, η_1 = 0.25, δ = 2, C_1 = 0, and γ = 10^{−4} in Algorithm 1.

Following [36], we first consider gradient descent for a logistic regression problem with the target function

    F(x) = (1/N) Σ_{i=1}^N log(1 + exp(−b_i a_i^T x)) + (τ/2)‖x‖²,   (16)

for the variable parameters x ∈ R^n, where a_i ∈ R^n and b_i ∈ {−1, 1} are the attributes and label of data point i, respectively. We consider fixed step-size gradient descent x^{k+1} = g(x^k) with g(x) = x − (2/(L_F + τ))∇F(x), where L_F = τ + ‖A‖²₂/(4N) is the Lipschitz constant of ∇F and A = [a_1, ..., a_N]^T ∈ R^{N×n}. Then g is Lipschitz continuous with modulus κ = (L_F − τ)/(L_F + τ) < 1 and locally Lipschitz differentiable, which satisfies Assumption 2.7. This method has also been used in [36] to evaluate their regularized nonlinear acceleration (RNA) scheme. Therefore, we apply our approach (denoted by "LM-AA") and RNA to compare their performance on two datasets: covtype (54 features, 581012 points) and sido0 (4932 features, 12678 points). For each dataset, we normalize the attributes and solve the problem with τ = L_F/10⁵ and τ = L_F/10⁹, respectively. We modify the source code released by the authors of [36]³ to implement both RNA and our method. We set c = κ, µ_0 = 100, and m = 10 for our method. RNA performs an acceleration step every k iterations, and we test k = 5, 10, 20, respectively. All other RNA parameters are set to their default values as provided in the source code (in particular, with grid-search adaptive regularization weight and line search enabled). Fig. 1 plots for each method the normalized target function value (F(x^k) − F*)/F* with respect to the iteration count and the computation time, where F* is the ground-truth global minimum. Overall, our method outperforms all three variants of RNA. One exception is the sido0 dataset with τ = L_F · 10^{−5}, where RNA with k = 5 takes fewer iterations to converge than our method. However, our method still outperforms RNA in actual timing thanks to its low computational overhead.
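For reference, the fixed-point map used in this experiment can be set up as in the following sketch; the data loading is omitted and the helper name is ours. The step size 2/(L_F + τ) and the contraction factor κ = (L_F − τ)/(L_F + τ) follow the formulas above.

```python
import numpy as np

def make_logreg_gd(A, b, tau):
    """Gradient-descent map g(x) = x - 2/(L_F + tau) * grad F(x) for (16)."""
    N = A.shape[0]
    LF = tau + np.linalg.norm(A, 2) ** 2 / (4 * N)   # Lipschitz constant of grad F
    step = 2.0 / (LF + tau)
    kappa = (LF - tau) / (LF + tau)                  # contraction factor of g

    def grad_F(x):
        z = -b * (A @ x)                             # margins -b_i a_i^T x
        sigma = 1.0 / (1.0 + np.exp(-z))             # logistic derivative term
        return A.T @ (-b * sigma) / N + tau * x

    return (lambda x: x - step * grad_F(x)), kappa
```

The returned map and contraction factor can be passed to a globalized AA routine (e.g., the lm_aa sketch given after Algorithm 1) with c = κ, mirroring the setting used in this experiment.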

Next, we consider a nonsmooth problem of total variation (TV) based image deconvolution [46]:

    min_{w,u} Σ_{i=1}^{N²} ‖w_i‖₂ + (β/2) Σ_{i=1}^{N²} ‖w_i − D_i u‖²₂ + (µ/2) ‖Ku − s‖²₂,   (17)

where s ∈ [0, 1]^{N²} is an N × N input image, u ∈ R^{N²} is the output image to optimize, K ∈ R^{N²×N²} is a convolution operator, D_i ∈ R^{2×N²} represents the discrete gradient operator at pixel i,

3 https://github.com/windows7lover/RegularizedNonlinearAcceleration


Figure 2: We apply our globalized AA scheme to an alternating minimization solver for the image deconvolution problem (17), which achieves a notable increase in performance. The panels plot the residual norm ‖f(w)‖ against the iteration count and the computation time (s) for alternating minimization and LM-AA (m = 1, 3, 5); the input image is also shown.

w = (w_1^T, ..., w_{N²}^T)^T ∈ R^{2N²} collects auxiliary variables for the image gradients, µ > 0 is a fidelity weight, and β > 0 is a penalty parameter. The method in [46] performs alternating minimization (AM) between u and w. When β and µ are fixed, this can be treated as a fixed-point iteration w^{k+1} = g(w^k), and it satisfies Assumption 2.7 (see Appendix A.2 for a detailed derivation of g and a verification of Assumption 2.7). We test our method with K = I and µ = 4 on a 1024 × 1024 image with Gaussian noise (mean 0, σ = 0.05), using the source code released by the authors of [46]⁴ for the implementation. In this case, condition (B.1) is satisfied with κ = 1 − (1 + 4β/µ)^{−1}, and we apply Algorithm 1 with c = κ, µ_0 = 1, and m = 1, 3, 5, respectively. We choose smaller values of m than in the logistic regression example because the AM solver has a low computational cost per iteration, so a large m can actually increase the computation time due to the larger relative overhead. Fig. 2 plots the residual norm ‖f(w)‖ for AM and our accelerated solvers with β = 100 and β = 1000, showing a notable reduction of iteration counts for our accelerated solvers. Although the solvers with m = 3 and m = 5 can initially take longer than AM to reach the same residual norm, due to AM's low per-iteration cost, all accelerated solvers eventually outperform AM in computation time as they converge to the solution.

Finally, we consider a nonnegative least-squares (NNLS) problem min_{z≥0} ‖Hz − t‖²₂, where z ∈ R^q, H ∈ R^{p×q}, and t ∈ R^p. In [15], this problem is reformulated as

    min_x ψ(x) + ϕ(x),   (18)

where x = (x_1, x_2) ∈ R^{2q}, ψ(x) = ‖Hx_1 − t‖²₂ + I_{x_2≥0}(x_2), and ϕ(x) = I_{x_1=x_2}(x), with I_S denoting the indicator function of the set S. The Douglas-Rachford splitting (DRS) solver for the reformulated problem can be written as

    v^{k+1} = g(v^k) := (1/2)((2prox_{βϕ} − I)(2prox_{βψ} − I) + I) v^k,   (19)

where v^k = (v^k_1, v^k_2) ∈ R^{2q} is an auxiliary variable for DRS and β is the penalty parameter. In [15], the authors apply their regularized AA method (A2DR) to accelerate the DRS solver. To apply our method, we verify in Appendix A.3 that if H has full column rank, then g satisfies condition (B.1) with κ = √(3 + c₁²)/2 < 1, where c₁ = max{(βσ₁ − 1)/(βσ₁ + 1), (1 − βσ₀)/(1 + βσ₀)} and σ₀, σ₁ are the minimal and maximal eigenvalues of H^T H, respectively; g also satisfies (B.2) under a mild condition. We compare our method (µ_0 = 1, c = κ, m = 10) with A2DR (m = 10) on a 600 × 300 sparse random matrix H with 1% nonzero entries and a random vector t, using the source code released by the authors of [15]⁵ for the implementation. All parameters for A2DR are set to their default values. While A2DR and DRS are implemented with parallel evaluation of the proximity operators in the released code, we implement our method as a single-threaded application for simplicity. The solvers are run in a Linux virtual machine with four cores and 8GB of RAM on the above-mentioned laptop. Fig. 3 plots the residual norm ‖f(v)‖ for DRS and the two acceleration methods. It also plots the norm of the overall residual r = (r_prim, r_dual) used in [15] for measuring convergence, where r_prim and r_dual are the primal and dual residuals as defined in [15, Equations (7) & (8)]. For both residual measures, the single-threaded implementation of our method outperforms the parallel A2DR in terms of both iteration count and computation time.
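A single-threaded sketch of the DRS map (19) for the reformulated NNLS problem is shown below, using the closed-form proximity operators derived in Appendix A.3; the factorization caching and the variable names are illustrative rather than taken from the A2DR code.

```python
import numpy as np

def make_nnls_drs(H, t, beta):
    """DRS map (19): g(v) = 0.5*((2 prox_{beta*phi} - I)(2 prox_{beta*psi} - I) + I) v
    for the splitting (18) of min_{z>=0} ||Hz - t||_2^2."""
    q = H.shape[1]
    # prox of beta*psi1 with psi1(v1) = ||H v1 - t||^2 (Appendix A.3):
    # (H^T H + (2 beta)^{-1} I)^{-1} (H^T t + v1/(2 beta)); cache a factorization.
    M = H.T @ H + np.eye(q) / (2 * beta)
    L = np.linalg.cholesky(M)
    HTt = H.T @ t

    def prox_psi(v):
        v1, v2 = v[:q], v[q:]
        p1 = np.linalg.solve(L.T, np.linalg.solve(L, HTt + v1 / (2 * beta)))
        p2 = np.maximum(v2, 0.0)                  # projection onto [0, inf)^q
        return np.concatenate([p1, p2])

    def reflect_phi(v):
        # 2 prox_{beta*phi}(v) - v = (v2, v1): swap the two blocks (Appendix A.3).
        return np.concatenate([v[q:], v[:q]])

    def g(v):
        w = 2 * prox_psi(v) - v
        return 0.5 * (reflect_phi(w) + v)
    return g
```

Since g is Lipschitz with modulus √(3 + c₁²)/2 when H has full column rank (Proposition A.2), this map can be passed to Algorithm 1 with c = κ, matching the setting used in this experiment.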

4 https://www.caam.rice.edu/~optimization/L1/ftvd/v4.1/
5 https://github.com/cvxgrp/a2dr


Figure 3: Comparison between different solvers for the NNLS problem (19) on a 600 × 300 sparse random matrix H and a random vector t. The panels plot the DRS residual ‖f(v)‖ and the normalized total residual against the iteration count and the computation time (s) for DRS, LM-AA (m = 10), and A2DR (m = 10).

4 Conclusions

In this paper, we propose a globalization technique for Anderson acceleration that combines an adaptive quadratic regularization with a nonmonotone acceptance strategy. We prove its global convergence under weaker assumptions than existing AA globalization methods. We also show that it has the same local convergence rate as the original AA scheme, so that the globalization does not hinder the acceleration. This is the first AA globalization method that achieves global convergence and fast local convergence simultaneously. Thanks to the weak assumptions required for convergence, our method can be applied to a variety of numerical solvers to improve their efficiency.

References

[1] H. An, X. Jia, and H. F. Walker, Anderson acceleration and application to the three-temperature energy equations, J. Comput. Phys., 347 (2017), pp. 1–19.
[2] D. G. Anderson, Iterative procedures for nonlinear integral equations, J. ACM, 12 (1965), pp. 547–560.
[3] D. G. M. Anderson, Comments on "Anderson acceleration, mixing and extrapolation", Numer. Algorithms, 80 (2019), pp. 135–234.
[4] H. H. Bauschke and P. L. Combettes, Convex analysis and monotone operator theory in Hilbert spaces, CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC, Springer, New York, 2011.
[5] A. Beck, Introduction to nonlinear optimization, vol. 19 of MOS-SIAM Series on Optimization, Society for Industrial and Applied Mathematics; Mathematical Optimization Society, Philadelphia, 2014.
[6] J. W. Both, K. Kumar, J. M. Nordbotten, and F. A. Radu, Anderson accelerated fixed-stress splitting schemes for consolidation of unsaturated porous media, Comput. Math. Appl., 77 (2019), pp. 1479–1502.
[7] C. Byrne, An elementary proof of convergence of the forward-backward splitting algorithm, J. Nonlinear Convex Anal., 15 (2014), pp. 681–691.
[8] F. H. Clarke, Optimization and nonsmooth analysis, vol. 5 of Classics in Applied Mathematics, Society for Industrial and Applied Mathematics (SIAM), Philadelphia, second ed., 1990.
[9] C. Ding, D. Sun, J. Sun, and K.-C. Toh, Spectral operators of matrices: Semismoothness and characterizations of the generalized Jacobian, SIAM J. Optim., 30 (2020), pp. 630–659.
[10] A. L. Dontchev and R. T. Rockafellar, Implicit functions and solution mappings, Springer Series in Operations Research and Financial Engineering, Springer, New York, second ed., 2014.
[11] C. Evans, S. Pollock, L. G. Rebholz, and M. Xiao, A proof that Anderson acceleration improves the convergence rate in linearly converging fixed-point methods (but not in those converging quadratically), SIAM J. Numer. Anal., 58 (2020), pp. 788–810.
[12] V. Eyert, A comparative study on methods for convergence acceleration of iterative vector sequences, J. Comput. Phys., 124 (1996), pp. 271–285.
[13] J.-Y. Fan, A modified Levenberg-Marquardt algorithm for singular system of nonlinear equations, J. Comput. Math., 21 (2003), pp. 625–636.
[14] H.-R. Fang and Y. Saad, Two classes of multisecant methods for nonlinear acceleration, Numer. Linear Algebr. Appl., 16 (2009), pp. 197–221.
[15] A. Fu, J. Zhang, and S. Boyd, Anderson accelerated Douglas-Rachford splitting, arXiv preprint arXiv:1908.11482, (2019).
[16] M. Geist and B. Scherrer, Anderson acceleration for reinforcement learning, arXiv preprint arXiv:1809.09501, (2018).
[17] P. Giselsson and S. Boyd, Linear convergence and metric selection for Douglas-Rachford splitting and ADMM, IEEE Transactions on Automatic Control, 62 (2016), pp. 532–544.
[18] K. Levenberg, A method for the solution of certain non-linear problems in least squares, Quart. Appl. Math., 2 (1944), pp. 164–168.
[19] J. Liang, J. Fadili, and G. Peyré, Local convergence properties of Douglas-Rachford and ADMM, arXiv preprint arXiv:1606.02116, (2016).
[20] J. Liang, J. Fadili, and G. Peyré, Activity identification and local linear convergence of forward-backward-type methods, SIAM J. Optim., 27 (2017), pp. 408–437.
[21] V. V. Mai and M. Johansson, Anderson acceleration of proximal gradient methods, arXiv preprint arXiv:1910.08590, (2019).
[22] V. V. Mai and M. Johansson, Nonlinear acceleration of constrained optimization algorithms, in ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 4903–4907.
[23] D. W. Marquardt, An algorithm for least-squares estimation of nonlinear parameters, J. Soc. Indust. Appl. Math., 11 (1963), pp. 431–441.
[24] S. Matveev, V. Stadnichuk, E. Tyrtyshnikov, A. Smirnov, N. Ampilogova, and N. V. Brilliantov, Anderson acceleration method of finding steady-state particle size distribution for a wide class of aggregation–fragmentation models, Comput. Phys. Commun., 224 (2018), pp. 154–163.
[25] A. Milzarek, Numerical methods and second order theory for nonsmooth problems, PhD thesis, Technische Universität München, 2016.
[26] A. L. Pavlov, G. W. Ovchinnikov, D. Y. Derbyshev, D. Tsetserukou, and I. V. Oseledets, AA-ICP: Iterative closest point with Anderson acceleration, in 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2018, pp. 1–6.
[27] Y. Peng, B. Deng, J. Zhang, F. Geng, W. Qin, and L. Liu, Anderson acceleration for geometry optimization and physics simulation, ACM Trans. Graph., 37 (2018), p. 42.
[28] R. A. Poliquin and R. T. Rockafellar, Generalized Hessian properties of regularized nonsmooth functions, SIAM J. Optim., 6 (1996), pp. 1121–1137.
[29] S. Pollock, L. G. Rebholz, and M. Xiao, Anderson-accelerated convergence of Picard iterations for incompressible Navier–Stokes equations, SIAM J. Numer. Anal., 57 (2019), pp. 615–637.
[30] F. A. Potra and H. Engler, A characterization of the behavior of the Anderson acceleration on linear problems, Linear Alg. Appl., 438 (2013), pp. 1002–1011.
[31] P. P. Pratapa, P. Suryanarayana, and J. E. Pask, Anderson acceleration of the Jacobi iterative method: An efficient alternative to Krylov methods for large, sparse linear systems, Journal of Computational Physics, 306 (2016), pp. 43–54.
[32] L. Q. Qi and J. Sun, A nonsmooth version of Newton's method, Math. Program., 58 (1993), pp. 353–367.
[33] R. T. Rockafellar and R. J.-B. Wets, Variational analysis, vol. 317, Springer Science & Business Media, 2009.
[34] T. Rohwedder and R. Schneider, An analysis for the DIIS acceleration method used in quantum chemistry calculations, J. Math. Chem., 49 (2011), pp. 1889–1914.
[35] D. Scieur, A. d'Aspremont, and F. Bach, Regularized nonlinear acceleration, in Advances in Neural Information Processing Systems, 2016, pp. 712–720.
[36] D. Scieur, A. d'Aspremont, and F. Bach, Regularized nonlinear acceleration, Math. Program., 179 (2020), pp. 47–83.
[37] L. Stella, A. Themelis, and P. Patrinos, Forward-backward quasi-Newton methods for nonsmooth optimization problems, Comput. Optim. Appl., 67 (2017), pp. 443–487.
[38] H. D. Sterck, A nonlinear GMRES optimization algorithm for canonical tensor decomposition, SIAM Journal on Scientific Computing, 34 (2012), pp. A1351–A1379.
[39] D. Sun and J. Sun, Strong semismoothness of eigenvalues of symmetric matrices and its application to inverse eigenvalue problems, SIAM J. Numer. Anal., 40 (2002), pp. 2352–2367 (2003).
[40] D. Sun and J. Sun, Strong semismoothness of the Fischer-Burmeister SDC and SOC complementarity functions, Math. Program., 103 (2005), pp. 575–581.
[41] A. Toth, J. A. Ellis, T. Evans, S. Hamilton, C. Kelley, R. Pawlowski, and S. Slattery, Local improvement results for Anderson acceleration with inaccurate function evaluations, SIAM J. Sci. Comput., 39 (2017), pp. S47–S65.
[42] A. Toth and C. Kelley, Convergence analysis for Anderson acceleration, SIAM J. Numer. Anal., 53 (2015), pp. 805–819.
[43] M. Ulbrich, Nonmonotone trust-region methods for bound-constrained semismooth equations with applications to nonlinear mixed complementarity problems, SIAM J. Optim., 11 (2001), pp. 889–917.
[44] M. Ulbrich, Semismooth Newton methods for variational inequalities and constrained optimization problems in function spaces, vol. 11 of MOS-SIAM Series on Optimization, Society for Industrial and Applied Mathematics (SIAM); Mathematical Optimization Society, Philadelphia, 2011.
[45] H. F. Walker and P. Ni, Anderson acceleration for fixed-point iterations, SIAM J. Numer. Anal., 49 (2011), pp. 1715–1735.
[46] Y. Wang, J. Yang, W. Yin, and Y. Zhang, A new alternating minimization algorithm for total variation image reconstruction, SIAM J. Imaging Sci., 1 (2008), pp. 248–272.
[47] J. Willert, H. Park, and W. Taitano, Using Anderson acceleration to accelerate the convergence of neutron transport calculations with anisotropic scattering, Nucl. Sci. Eng., 181 (2015), pp. 342–350.
[48] J. Zhang, B. O'Donoghue, and S. Boyd, Globally convergent type-I Anderson acceleration for non-smooth fixed-point iterations, arXiv preprint arXiv:1808.03971, (2018).
[49] J. Zhang, Y. Peng, W. Ouyang, and B. Deng, Accelerating ADMM for efficient simulation and optimization, ACM Trans. Graph., 38 (2019).

A Verification of Local Convergence Assumptions

In this section, we briefly discuss different situations that allow us to verify and establish the local conditions stated in Assumption 2.7.


A.1 The Smooth Case: Lipschitz Continuity of g′

As we have already seen in Section 3, if the mapping g is continuously differentiable in a neighborhood of its associated fixed-point x*, then (B.2) holds if the Fréchet derivative g′ is locally Lipschitz continuous around x*, i.e., for any x, y ∈ B_ε(x*) we have

    ‖g′(x) − g′(y)‖ ≤ L‖x − y‖.   (20)

Assumption (B.2) can then be shown by utilizing a Taylor expansion. Let us note that for (B.2) it is enough to fix y = x* in (20); such a condition is known as outer Lipschitz continuity at x*. Furthermore, assumption (B.1) holds if sup_{x∈R^n} ‖g′(x)‖ < 1; see, e.g., [5, Theorem 4.20].

A.2 TV-Based Image Deconvolution

The alternating minimization solver for the image deconvolution problem (17) can be written as a fixed-point iteration

    w^{k+1} = g(w^k) := (Φ ◦ h)(w^k),   (21)

with

    Φ(w) := (s_β(w_1)^T, ..., s_β(w_{N²})^T)^T ∈ R^{2N²},   s_β(x) := max{‖x‖ − 1/β, 0} · x/‖x‖,

and

    h(w) := D M^{−1}(D^T w + (µ/β) K^T s),   D := (D_1^T, ..., D_{N²}^T)^T,   M := D^T D + (µ/β) K^T K.

Let us note that the mapping h and the fixed-point iteration (21) are well-defined when the null spaces of K and D have no intersection; see, e.g., [46, Assumption 1].

Next, we verify Assumption 2.7 for the mapping g in (21).

Proposition A.1. Suppose that the operator K^T K is invertible. Then the spectral radius ρ(T) of T := D M^{−1} D^T fulfills ρ(T) < 1 and condition (B.1) is satisfied.

Proof. Utilizing the Sherman-Morrison-Woodbury formula and the invertibility of K^T K, we obtain

    (I + ξ D (K^T K)^{−1} D^T)^{−1} = I − ξ D (K^T K)^{−1} (I + ξ D^T D (K^T K)^{−1})^{−1} D^T = I − T,   (22)

where ξ := β/µ. Due to λ_min(I + ξ D (K^T K)^{−1} D^T) ≥ 1 and λ_max(I + ξ D (K^T K)^{−1} D^T) ≤ 1 + ξ‖D (K^T K)^{−1} D^T‖, it then follows that

    σ((I + ξ D (K^T K)^{−1} D^T)^{−1}) ⊂ [ (1 + ξ‖D (K^T K)^{−1} D^T‖)^{−1}, 1 ],

where σ(·) denotes the spectral set of a matrix. Combining this observation with (22), we can infer

    σ(T) ⊂ [ 0, 1 − (1 + ξ‖D (K^T K)^{−1} D^T‖)^{−1} ]   ⟹   ρ(T) < 1.

Furthermore, following the proof of [46, Theorem 3.6], it holds that

    ‖g(w) − g(v)‖ ≤ ρ(T)‖w − v‖   ∀ w, v ∈ R^{2N²},

and hence assumption (B.1) is satisfied.

Concerning assumption (B.2), it can be shown that g is twice continuously differentiable on the set W = {w : ‖[h(w)]_i‖ ≠ 1/β ∀ i = 1, ..., N²} (in this case the max-operation in the shrinkage operator s_β is not active). Moreover, since h is continuous, the set W is open. Consequently, for every point w ∈ W, we can find a bounded open neighborhood N(w) of w such that N(w) ⊂ W. Hence, if the mapping g has a fixed-point w* satisfying w* ∈ W, then we can infer that g is locally Lipschitz differentiable on N(w*) and assumption (B.2) has to hold at w*. Finally, if K is the identity, then the finite difference matrix D satisfies ‖D‖ ≤ 2, so we can infer that ρ(T) ≤ 1 − (1 + 4β/µ)^{−1}, which justifies the choice of c in our algorithm.
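Under the definitions above, the fixed-point map (21) can be sketched as follows. The sketch stores the two gradient components as stacked blocks and uses a cached sparse factorization of M; both are simplifications of ours for illustration (the released implementation of [46] exploits additional structure, e.g., fast transforms, to solve with M).

```python
import numpy as np
import scipy.sparse.linalg as spla

def make_tv_am_map(D, K, s, beta, mu):
    """Fixed-point map (21): g(w) = Phi(h(w)) with
    h(w) = D M^{-1} (D^T w + (mu/beta) K^T s),  M = D^T D + (mu/beta) K^T K.
    D: (2*N2) x N2 sparse discrete gradient, K: N2 x N2 operator, s: observed image."""
    N2 = K.shape[1]
    M = (D.T @ D + (mu / beta) * (K.T @ K)).tocsc()
    solve_M = spla.factorized(M)                 # cache a sparse LU factorization
    rhs_const = (mu / beta) * (K.T @ s)

    def shrink(v):
        # 2-D shrinkage s_beta applied per pixel; the two gradient components of
        # pixel i are stored at positions i and i + N2 in this layout.
        gx, gy = v[:N2], v[N2:]
        norm = np.sqrt(gx ** 2 + gy ** 2)
        scale = np.maximum(norm - 1.0 / beta, 0.0) / np.maximum(norm, 1e-15)
        return np.concatenate([scale * gx, scale * gy])

    def g(w):
        u = solve_M(D.T @ w + rhs_const)         # u-update of the AM solver
        return shrink(D @ u)                     # Phi(h(w)): w-update via shrinkage
    return g
```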


A.3 Nonnegative Least Squares

We first note that, given the specific form of ϕ, we can calculate the proximity operator of ϕ explicitly as

    prox_{βϕ}(v) = (1/2)((v_1 + v_2)^T, (v_1 + v_2)^T)^T,

where v = (v_1^T, v_2^T)^T. Consequently, we obtain [2prox_{βϕ} − I](v) = (v_2^T, v_1^T)^T. Similarly, setting ψ_1(v_1) := ‖Hv_1 − t‖²₂ and ψ_2(v_2) := I_{v_2≥0}(v_2), we have

    prox_{βψ}(v) = (prox_{βψ_1}(v_1); prox_{βψ_2}(v_2)),   prox_{βψ_1}(v_1) = (H^T H + (2β)^{−1} I)^{−1}(H^T t + v_1/(2β)),

and prox_{βψ_2}(v_2) = P_{[0,∞)^q}(v_2), where P_{[0,∞)^q} denotes the Euclidean projection onto the nonnegative orthant [0, ∞)^q. In the next proposition, we give a condition that establishes (B.1) for g.

Proposition A.2. Let σ_0 and σ_1 denote the minimum and maximum eigenvalues of H^T H, respectively, and suppose that σ_0 > 0. Then the mapping g is Lipschitz continuous with modulus √(3 + c₁²)/2, where c₁ = max{(βσ_1 − 1)/(βσ_1 + 1), (1 − βσ_0)/(1 + βσ_0)} < 1.

Proof. We can calculate g explicitly as

    g(v) = (1/2)((2prox_{βϕ} − I)(2prox_{βψ} − I) + I)v = (1/2)(R²_β(v_2) + v_1; R¹_β(v_1) + v_2),   (23)

where R^i_β = 2prox_{βψ_i} − I for i = 1, 2. By [4, Proposition 4.2], the reflected operators R¹_β and R²_β are nonexpansive. Moreover, due to σ_0 > 0, ψ_1 is strongly convex with modulus σ_0 and σ_1-smooth. Then, by [17, Theorem 1], R¹_β is Lipschitz continuous with modulus c₁. Next, let v, v̄ ∈ R^{2q} be arbitrary; we have

    ‖g(v) − g(v̄)‖ = (1/2) √( ‖R²_β(v_2) − R²_β(v̄_2) + v_1 − v̄_1‖² + ‖R¹_β(v_1) − R¹_β(v̄_1) + v_2 − v̄_2‖² )
      ≤ (1/2) √( (c₁² + 1)‖v_1 − v̄_1‖² + 2(c₁ + 1)‖v_1 − v̄_1‖‖v_2 − v̄_2‖ + 2‖v_2 − v̄_2‖² )
      ≤ (1/2) √( (c₁² + 1)‖v_1 − v̄_1‖² + ((c₁ + 1)²/(c₁² + 1))‖v_1 − v̄_1‖² + (c₁² + 1)‖v_2 − v̄_2‖² + 2‖v_2 − v̄_2‖² )
      ≤ (1/2) √( (c₁² + 3)‖v_1 − v̄_1‖² + (c₁² + 3)‖v_2 − v̄_2‖² )
      ≤ (√(3 + c₁²)/2) ‖v − v̄‖,

where we have used Cauchy's inequality, the nonexpansiveness of R²_β, and the Lipschitz continuity of R¹_β in the first inequality. The second estimate follows from Young's inequality.

Hence, assumption (B.1) is satisfied if H has full column rank. Using the special form of the mapping g in (23), we see that g is twice continuously differentiable at v = (v_1^T, v_2^T)^T if and only if v ∈ V := R^q × Π_{i=1}^q R\{0}, i.e., if none of the components of v_2 are zero. As before, we can then infer that assumption (B.2) has to hold at every fixed-point v* of g satisfying v* ∈ V.

A.4 Further Extensions

We now formulate a possible extension of the conditions presented in Section A.1 to the nonsmooth setting. Assume that g is locally Lipschitz continuous and consider the following properties:

• The function g is differentiable at x*.
• The mapping g is strongly (or 1-order) semismooth at x* [32], i.e., we have

    sup_{M∈∂g(x)} ‖g(x) − g(x*) − M(x − x*)‖ = O(‖x − x*‖²)   as x → x*.

  Here, the multifunction ∂g : R^n ⇒ R^{n×n} denotes Clarke's subdifferential of the locally Lipschitz continuous (and possibly nonsmooth) function g; see, e.g., [8, 32].


• There exists an outer Lipschitz continuous selection M* : R^n → R^{n×n} of ∂g in a neighborhood of x*, i.e., for all x sufficiently close to x* it follows that

    M*(x) ∈ ∂g(x)   and   ‖M*(x) − M*(x*)‖ ≤ L_M‖x − x*‖

  for some constant L_M > 0.

Then the mapping g satisfies condition (B.2) at x*.

Proof. The combination of differentiability and semismoothness implies that g is strictly Fréchet differentiable at x* and, as a consequence, Clarke's subdifferential at x* reduces to the singleton ∂g(x*) = {g′(x*)}; we refer to [32, 33, 25] for further details. Thus, we can infer M*(x*) = g′(x*) and we obtain

    ‖g(x) − g(x*) − g′(x*)(x − x*)‖ ≤ ‖g(x) − g(x*) − M*(x)(x − x*)‖ + ‖[M*(x) − M*(x*)](x − x*)‖ ≤ O(‖x − x*‖²) + L_M‖x − x*‖²

for x → x*, where we used the strong semismoothness and the outer Lipschitz continuity in the last step. This establishes (B.2).

The class of strongly semismooth functions is rather rich and includes, e.g., piecewise twice continuously differentiable (PC²) functions [44], eigenvalue and singular value functions [39, 40], and certain spectral operators of matrices [9]. Let us further note that the stated selection property is always satisfied when g is a piecewise linear mapping: in this case, the sets ∂g(x) are polyhedral and outer Lipschitz continuity follows from [10, Theorem 3D.1]. If the mapping g has more structure and is connected to an underlying optimization problem, as in forward-backward and Douglas-Rachford splitting, the nonsmoothness of g typically results from the proximity or projection operators. In such a case, further theoretical tools are available, and for certain function classes it is possible to fully characterize the differentiability of g at x* via a so-called strict complementarity condition. (In fact, the condition w* ∈ W from Section A.2 is equivalent to such a strict complementarity condition.) We refer the interested reader to [28, 25, 20, 37].

Finally, let us mention that a local convergence result for AA has been presented in [21] for potentially nonsmooth g under the assumption that g is strongly semismooth (in a neighborhood of a fixed-point x*). In the proof of their main result, the authors infer that the mapping g then has to satisfy a condition that is equivalent to assumption (B.2); see display (24) in [21]. Unfortunately, the discussion above demonstrates that strong semismoothness alone is not enough to establish such an implication.
