
Modeling, Control and Optimization for Big Data Systems

(MoCOBiDS)

Contract Nr. 176/2015

Unitatea Executiva pentru Finantarea Invatamantului Superior, a

Cercetarii, Dezvoltarii si Inovarii (UEFISCDI)

Human Resources 2015-2017

Scientific Report 2016

Modeling and control of big data systems

Ion Necoara, Andrei Patrascu, Valentin Nedelcu, Dragos Clipici

November 2016


Abstract

In this report we present the main scientific results obtained in the second phase of the project MoCOBiDS. The main objectives of this phase were to develop efficient modeling and control techniques for big data systems. The main tasks were:

T1: Modeling techniques for big data systems

T2: Control techniques for big data systems

T3: Structural analysis of optimization problems coming from modeling and control applications

Modeling techniques for big data systems: In this task we develop new modeling techniques for big data systems. In particular, we propose new model reduction algorithms for big network systems (see paper [P1]). Using moment matching techniques and Sylvester equations, we introduce a framework to compute families of parametrized reduced order models that achieve moment matching and preserve the structure (topology) of the to-be-reduced model of the network. Then, using balanced truncation techniques we also reduce the number of subsystems in the network. The result is a low order approximation of the linear network system with a reduced number of subsystems that exhibits properties similar to the given big network system. We also analyze several modeling strategies for power systems that allow us to solve efficiently some important problems in this field, such as the optimal voltage control for loss minimization (see paper [P2]) or the direct current optimal power flow problem (see paper [P3]). The main results can be found in the following papers:

P1: T. Ionescu, I. Necoara, A scale-free moment matching-based model reduction technique of linear networks, submitted to IFAC World Congress, 2017.

P2: I. Necoara, V. Nedelcu, D. Clipici, L. Toma, C. Bulac, Optimal voltage control for loss minimization based on sequential convex programming, In Proceedings of IEEE Conference Innovative Smart Grid Technologies Europe, 2016.

P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma, On fully distributed dual first order methods for convex network optimization, submitted to IFAC World Congress, 2017.

Control techniques for big data systems: In this task we develop new optimization-based control techniques for big data systems. In paper [P4] we analyze a family of general random block coordinate descent iterative hard thresholding methods for the minimization of l0 regularized optimization problems, i.e. the objective function is composed of a smooth convex function and the l0 regularization. This type of optimization problem arises e.g. in packetized predictive control for networked control systems with unreliable (or rate-limited) communications. In paper [P5] we analyze the convergence of inexact projection primal first order methods for convex minimization. We show that these algorithms can be used efficiently for solving model predictive control problems arising in embedded applications. We prove that we can still achieve convergence rates for these inexact projection first order algorithms similar to those given in the exact projection setting, provided that the approximate projection is sufficiently accurate. Our convergence analysis allows us to derive explicitly the accuracy of the inexact projection and the number of iterations we need to perform in order to obtain an approximate solution of our convex problem. Finally, we also present in paper [P6] a constructive solution to the inverse optimality problem for the class of continuous piecewise affine functions. The main idea is based on the convex lifting concept. Regarding linear optimal control, we show that any continuous piecewise affine control law can be obtained via a linear optimal control problem with the control horizon at most equal to 2 prediction steps. The main results can be found in the following papers:

P4: I. Necoara, A. Patrascu, Iteration complexity analysis of coordinate descent methods for l0 regularized convex problems, in preparation, 2016.

P5: A. Patrascu, I. Necoara, On the convergence of inexact projection first order methods for convex minimization, submitted to IEEE Transactions on Automatic Control, November 2016.


P6: N. Nguyen, S. Olaru, P. Rodriguez-Ayerbe, M. Hovd, I. Necoara, Constructive solution of inverse parametric linear/quadratic programming problems, Journal of Optimization Theory and Applications, DOI 10.1007/s10957-016-0968-0, 2016.

Structural analysis of optimization problems: Finally, in this task we analyze the structural properties of optimization problems, coming e.g. from modeling and control applications, and derive efficient optimization algorithms that take into account the specific structure arising in these problems. For example, in paper [P7] we derive linear convergence rates of several first order methods for solving smooth non-strongly convex constrained optimization problems, i.e. involving an objective function with a Lipschitz continuous gradient that satisfies some relaxed strong convexity condition. In particular, in the case of smooth constrained convex optimization, we provide several relaxations of the strong convexity conditions and prove that they are sufficient for obtaining linear convergence for several first order methods such as projected gradient, fast gradient and feasible descent methods. We also provide examples of functional classes that satisfy our proposed relaxations of the strong convexity conditions. Finally, we show that the proposed relaxed strong convexity conditions cover important applications ranging from solving linear systems and linear programming to dual formulations of linearly constrained convex problems arising in model predictive control. In paper [P8] we employ a parallel version of a randomized (block) coordinate descent method for minimizing the sum of a partially separable smooth convex function and a fully separable non-smooth convex function. Under the assumption of Lipschitz continuity of the gradient of the smooth function, this method has a sublinear convergence rate. A linear convergence rate is obtained for the newly introduced class of generalized error bound functions. We prove that the new class of generalized error bound functions encompasses both global/local error bound functions and smooth strongly convex functions. We also show that the theoretical estimates on the convergence rate depend on the number of blocks chosen randomly and on a natural measure of separability of the smooth component of the objective function. The main results can be found in the following papers:

P7: I. Necoara, Yu. Nesterov and F. Glineur, Linear convergence of first order methods for non-strongly convex optimization, submitted to Mathematical Programming, July 2016.

P8: I. Necoara, D. Clipici, Parallel random coordinate descent methods for composite minimization: convergence analysis and error bounds, SIAM Journal on Optimization, 26(1): 197-226, 2016.

Other journal papers related to the previous tasks and already published are (see Section 4 of this report for a complete list):

P9: I. Necoara, A. Patrascu, Iteration complexity analysis of dual first order methods for conic convex programming, Optimization Methods and Software, 31(3): 645-678, 2016 (related to Task 3).

P10: A. Patrascu, I. Necoara, Q. Tran-Dinh, Adaptive inexact fast augmented Lagrangian methods for constrained convex optimization, Optimization Letters, DOI 10.1007/s11590-016-1024-6, 1-18, 2016 (related to Task 3).

P11: I. Necoara, Yu. Nesterov, F. Glineur, Random block coordinate descent for linearly-constrained optimization over networks, Journal of Optimization Theory and Applications, to appear, 2016 (related to Task 2).

Due to the length of this report, we do not present them here, but they are available on the journals' websites. Several other journal papers are under review or in progress, see Section 4.2 for more details.


Contents

1 Modeling techniques for big data systems
  1.1 A scale-free moment matching-based model reduction technique of large-scale linear network systems
      1.1.1 Moments and moment matching
      1.1.2 Linear network systems
      1.1.3 Model reduction of linear network systems preserving network structure
      1.1.4 Reduction of the number of subsystems of linear network systems
      1.1.5 Illustrative example
  1.2 Optimal voltage control for loss minimization based on sequential convex programming
      1.2.1 Modeling of optimal voltage control problem
      1.2.2 Sequential convex programming (SCP) framework
      1.2.3 Local convergence of the SCP method
      1.2.4 Extension of SCP framework
      1.2.5 Illustrative example
  1.3 Distributed DC optimal power flow based on dual fast gradient methods
      1.3.1 Modeling of DC optimal power flow problem
      1.3.2 Dual Lagrangian framework for network utility maximization problems
      1.3.3 Distributed dual fast gradient using an average primal sequence
      1.3.4 Distributed hybrid dual fast gradient algorithm using the last iterate
      1.3.5 Distributed implementation
      1.3.6 Illustrative example

2 Control techniques for big data systems
  2.1 Random coordinate descent methods for ℓ0 regularized convex problems: application to sparse control
      2.1.1 Notations and preliminaries
      2.1.2 Characterization of local minima
      2.1.3 Strong local minimizers
      2.1.4 Random coordinate descent type methods
      2.1.5 Global convergence analysis
      2.1.6 Random data experiments on sparse learning
      2.1.7 Sparse packetized predictive control
  2.2 Inexact projection primal first order methods for convex minimization: application to embedded MPC
      2.2.1 Motivation: MPC problem for linear systems
      2.2.2 General problem formulation
      2.2.3 Dual approach in convex constrained optimization
      2.2.4 Primal approach in convex constrained optimization
      2.2.5 Primal Gradient Method with inexact projections
      2.2.6 Primal Fast Gradient Method with inexact projections
      2.2.7 Algorithms for inexact projection onto intersection of convex sets
      2.2.8 Illustrative example

3 Structural analysis of optimization problems
  3.1 Linear convergence of first order methods for non-strongly convex optimization
      3.1.1 Problem formulation
      3.1.2 Non-strongly convex conditions for a function
      3.1.3 Functional classes in $qS_{L_f,\kappa_f}(X)$, $G_{L_f,\kappa_f}(X)$ and $F_{L_f,\kappa_f}(X)$
      3.1.4 Linear convergence of projected gradient method (GM)
      3.1.5 Linear convergence of fast gradient method (FGM)
      3.1.6 Linear convergence of feasible descent methods (FDM)
      3.1.7 Applications
      3.1.8 Numerical simulations
  3.2 Parallel random coordinate descent for composite minimization: convergence analysis and error bounds
      3.2.1 Problem formulation
      3.2.2 Motivating practical applications
      3.2.3 Parallel random coordinate descent method
      3.2.4 Sublinear convergence for smooth convex minimization
      3.2.5 Linear convergence for error bound convex minimization
      3.2.6 Conditions for generalized error bound functions
      3.2.7 Numerical simulations

4 Papers acknowledging the MoCOBiDS project
  4.1 Papers published in ISI journals
  4.2 Journal papers under review/in progress
  4.3 Books
  4.4 Book chapters
  4.5 Papers accepted/submitted in conferences


1 Modeling techniques for big data systems

In Task 1 we develop new modeling techniques for big data systems. In particular, we propose new model reduction algorithms for big data network systems (see paper [P1]). Using moment matching techniques and Sylvester equations, we introduce a framework to compute families of parametrized reduced order models that achieve moment matching and preserve the structure (topology) of the to-be-reduced model of the network. Then, using balanced truncation techniques we also reduce the number of subsystems in the network. The result is a low order approximation of the linear network system with a reduced number of subsystems that exhibits properties similar to the given big data network system. We also analyze several modeling strategies for power systems that allow us to solve efficiently some important problems in this field, such as the optimal voltage control for loss minimization (see paper [P2]) or the direct current optimal power flow problem (see paper [P3]).

1.1 A scale-free moment matching-based model reduction technique of large-scale linear network systems

We live and operate in a networked world. We get electricity through power networks, drive to work on networks of roads, and communicate with each other using an elaborate set of devices, such as phones and computers, that connect wirelessly and through the internet. Traditional networks include utilities (electricity, gas) and transportation (road, rail). Recent examples of the increasing impact of networks include information technology networks (internet), social networks (communities), and biological and genetic networks (brain connectivity, regulatory networks) [11]. Complex network systems consist of multiple interacting dynamical subsystems, interconnected through a graph which enables the subsystems to share information, coordinate their activities and have self-control mechanisms. However, the corresponding models of network systems are too complex and difficult to analyse, so that it is almost impossible to develop operating and/or open/closed-loop control algorithms in a systematic way for this type of dynamical systems. Therefore, we need approximation models of the original network system, useful for analysis, simulation and control. In particular, the computational complexity of synthesizing a controller for a large-scale network system can be broken down by designing a scale-free model, where neighboring subsystems in the original network are merged using model order reduction techniques.

The problem of model order reduction of interconnected subsystems with special structure has long been studied in the literature in various model reduction frameworks. Most of the earlier results are in the framework of stability preserving balanced truncation, i.e. posing the problem as a frequency weighted $L_\infty$ approximation problem, see, e.g., [69, 70]. Variations of the problem and the results were introduced in, e.g., [1, 71], where it has been shown that closed-loop balanced truncation with preservation of the closed-loop properties is equivalent to frequency weighted balanced truncation with certain selections of the weighting systems. In [74], the solution of the balancing-related frequency-weighted model and controller reduction problems has been considered using accuracy enhanced numerical algorithms. New stability-enforcing choices of the frequency-weighted Gramians, which can guarantee the stability of reduced models for two-sided frequency weights, have been obtained. A different approach based on interpolation has been taken recently in, e.g., [31, 59, 73], where the problem of model reduction of interconnected systems has been studied. Both Krylov and Gramian-based model reduction techniques that preserve the structure of the interconnections between different subsystems have been presented. Another type of model reduction problem for block structured systems has been studied in [9, 37] with respect to the class of second order systems.

In the following we revisit the problem of model reduction of linear network systems under the recent


moment matching-based framework developed in [3, 24] for linear systems, see also [20, 22]. It is well known that the approximation resulting from applying moment matching-based reduction to a given system per se does not preserve any block structure or topology of the system. Hence, we adapt the results from [3] to the case of linear network systems having a general block structure, with the scope of either preserving the given block structure or even reducing the number of subsystems (blocks) in the network. We first refer to the model reduction of each subsystem in the network, while the topology of the network remains unaltered. The resulting family of low-order approximations consists of approximating subsystems interconnected in the prescribed unaltered topology. Further, we extend this approach to networks where we reduce even the number of subsystems (blocks) of the original system through balanced truncation techniques. In that case, super-subsystems are created and represented by aggregated variables, but the block structure may be lost. This leads to a scale-free modeling algorithm for large-scale networks, using specific features of the network, such as the dynamical interactions between subsystems, and concepts from the model reduction field. Consequently, observing only aggregated variables of the network system allows us to cut the complexity of synthesizing a controller, analyzing or simulating the network system.

1.1.1 Moments and moment matching

Let us first recall the idea of moment matching for linear, single-input, single-output (SISO) systems from a time domain point of view, as presented in [3, 24]. Note that the results are directly applicable to the multiple-input multiple-output (MIMO) case, see [24], hence generality is not lost. Consider the SISO, minimal, linear, time-invariant system:

$$\dot{x} = Ax + Bu, \quad y = Cx, \qquad (1)$$

where $x(t) \in \mathbb{R}^n$ denotes the state of the system, $u(t) \in \mathbb{R}$ denotes the input and $y(t) \in \mathbb{R}$ denotes the output, respectively. We assume that the previous representation is a minimal realization of the transfer function:

$$W(s) = C(sI - A)^{-1}B.$$

The moments of the linear system (1) at a point $s_1 \in \mathbb{C}$ in the complex plane are defined as follows:

Definition 1.1. [2, 3] The 0-moment of linear system (1) at $s_1 \in \mathbb{C}$ is the complex number $\eta_0(s_1) = W(s_1)$. For an integer $k \geq 1$, the $k$-moment of system (1) at $s_1$ is the complex number defined as
$$\eta_k(s_1) = \frac{(-1)^k}{k!} \cdot \left. \frac{d^k W(s)}{ds^k} \right|_{s = s_1}.$$

Let the matrices $S \in \mathbb{R}^{\nu \times \nu}$ and $L \in \mathbb{R}^{1 \times \nu}$, where $\nu \leq n$, be such that the pair $(L, S)$ is observable. Since the system is assumed minimal, the following Sylvester equation:
$$A\Pi + BL = \Pi S, \qquad (2)$$
in the unknown $\Pi \in \mathbb{C}^{n \times \nu}$, has a unique solution $\Pi$ with $\operatorname{rank} \Pi = \nu$ provided that $\sigma(A) \cap \sigma(S) = \emptyset$, see e.g. [15]. According to the recent results from [22], restudied in [3], from a systems theoretic perspective the moments of a system can be characterized as follows:

Theorem 1.2. [3, 22] At the interpolation points given by the eigenvalues of $S$, that is $\{s_1, s_2, \ldots, s_\nu\} = \sigma(S)$, the moments $\eta_0(s_i)$, with $i = 1, \ldots, \nu$, of system (1) are characterized by the elements of the matrix $C\Pi \in \mathbb{R}^{1 \times \nu}$, where $\Pi$ is the unique solution of the Sylvester equation (2).
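As a quick numerical illustration of Theorem 1.2 (a sketch of ours, not taken from [3, 22]), one can solve (2) with SciPy and check that the entries of $C\Pi$ equal the transfer function values $W(s_i)$ when $S$ is diagonal and $L$ is a row of ones; all data below are hypothetical random matrices.

```python
import numpy as np
from scipy.linalg import solve_sylvester

# Sketch of Theorem 1.2 on hypothetical random data: the solution Pi of the
# Sylvester equation (2) characterizes the 0-moments, i.e. C Pi = [W(s_1),...,W(s_nu)]
# when S = diag(s_1,...,s_nu) and L is a row of ones (an observable pair).
rng = np.random.default_rng(0)
n, nu = 6, 3
A = rng.random((n, n)) - 5.0 * np.eye(n)        # sigma(A) disjoint from sigma(S)
B, C = rng.random((n, 1)), rng.random((1, n))
s = np.array([1.0, 2.0, 3.0])                   # interpolation points
S, L = np.diag(s), np.ones((1, nu))
# (2): A Pi + B L = Pi S  rewritten as  A Pi + Pi (-S) = -B L  for SciPy:
Pi = solve_sylvester(A, -S, -B @ L)
W = lambda z: (C @ np.linalg.solve(z * np.eye(n) - A, B)).item()
assert np.allclose(C @ Pi, [W(si) for si in s])
```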


Remark 1.3. For the sake of clear exposition, we work only with 0-order moments. The arguments are similar for higher order moments, see [3, 24].

Now, we are ready to present the main result for moment matching model reduction:

Theorem 1.4. [3] A linear system of the form:
$$\dot{\xi} = F\xi + Gu, \quad \psi = H\xi, \qquad (3)$$
with $\xi(t) \in \mathbb{R}^\nu$ and $\nu \leq n$, matches the moments of the original system (1) at $\sigma(S)$ if and only if:
$$HP = C\Pi, \qquad (4)$$
where the invertible matrix $P \in \mathbb{C}^{\nu \times \nu}$ is the unique solution of the Sylvester equation:
$$FP + GL = PS. \qquad (5)$$

Applying the moment matching condition (4) to (3), and noting that the invertible matrix $P$ is a coordinate transformation, a family of parametrized models of order $\nu \leq n$ that achieve moment matching at the interpolation points $\sigma(S)$ is computed by taking $P = I_\nu$.

Theorem 1.5. [3] Let the pair $(L, S)$ be observable and assume $\sigma(A) \cap \sigma(S) = \emptyset$. Let $\xi(t) \in \mathbb{R}^\nu$, with $\nu \leq n$, and consider the linear system:
$$\dot{\xi} = (S - GL)\xi + Gu, \quad \psi = C\Pi\xi, \qquad (6)$$
parametrized in $G \in \mathbb{C}^\nu$, where $\Pi$ is the unique solution of (2). Assume $\sigma(S - GL) \cap \sigma(S) = \emptyset$. Then (6) describes a family of reduced order models of (1), parametrized in $G$, that achieve moment matching $C\Pi$ at the points $\sigma(S)$.

1.1.2 Linear network systems

Our goal in this work is to perform model reduction for linear network systems which are comprised of $N$ interconnected subsystems, whose dynamics are defined by the following linear state space equations:

$$\dot{x}_i = \sum_{j \in N_i} A_{ij} x_j + B_i u, \quad \forall i = 1 : N, \qquad (7)$$
where $x_i(t) \in \mathbb{R}^{n_i}$ represents the state of the $i$th subsystem, $u(t) \in \mathbb{R}$ is the common input, $A_{ii} \in \mathbb{R}^{n_i \times n_i}$, $B_i \in \mathbb{R}^{n_i}$ and $A_{ij} \in \mathbb{R}^{n_i \times n_j}$. The index set $N_i \subseteq [N]$ contains the index $i$ and all the indices of the subsystems which interact with subsystem $i$. Thus, in (7) we consider that each subsystem is influenced through the states of the neighboring subsystems. Consider for example the network system in Figure 1, where the arrows indicate the interactions between the subsystems $\Sigma_1, \Sigma_2, \Sigma_3$ and $\Sigma_4$. If we consider the fourth subsystem $\Sigma_4$, we have $N_4 = \{3, 4\}$ and therefore:
$$\dot{x}_4 = A_{44} x_4 + A_{43} x_3 + B_4 u, \quad A_{41} = A_{42} = 0.$$

For the analysis, we also express the dynamics of the entire network system in the compact form (1):

$$\dot{x} = Ax + Bu, \quad y = Cx,$$


Figure 1: An example of a network system.

where $x(t) = [x_1^T(t) \cdots x_N^T(t)]^T \in \mathbb{R}^n$, with $n = \sum_{i=1}^N n_i$, denotes the state of the entire network, $u(t) \in \mathbb{R}$ and $y(t) \in \mathbb{R}$. Moreover, the system matrices $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^n$ and $C \in \mathbb{R}^{1 \times n}$ are given by:
$$\begin{bmatrix} A & B \\ C & 0 \end{bmatrix} = \begin{bmatrix} A_{11} & A_{12} & \ldots & A_{1N} & B_1 \\ A_{21} & A_{22} & \ldots & A_{2N} & B_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ A_{N1} & A_{N2} & \ldots & A_{NN} & B_N \\ C_1 & C_2 & \ldots & C_N & 0 \end{bmatrix}, \qquad (8)$$
where the block $(i, j)$ satisfies:
$$A_{ij} = 0 \quad \text{if } j \notin N_i.$$

Note that the dimension $n$ of the entire network system is usually very large, so that it is almost impossible to develop open or closed-loop control algorithms in a systematic way for the whole system. Therefore, we need to obtain approximation models of the original network system (7), useful for analysis, simulation and control.

1.1.3 Model reduction of linear network systems preserving network structure

In this section, we consider the component model reduction problem for the linear network system (7) with the network structure given in (8). Recall that $A_{ij} = 0$ in (8) if $j \notin N_i$ and the dimension of the entire network system is $n = \sum_{i=1}^N n_i$. The goal is to perform model order reduction such that the block structure is preserved, i.e. to compute reduced order models for each subsystem of the form:

$$\dot{\xi}_i = \sum_{j \in N_i} F_{ij} \xi_j + G_i u, \quad \forall i = 1 : N, \qquad (9)$$
where $\xi_i(t) \in \mathbb{R}^{\nu_i}$, with $\nu_i \leq n_i$, represents the reduced state of the $i$th subsystem, $u(t) \in \mathbb{R}$ is the input, $F_{ij} \in \mathbb{C}^{\nu_i \times \nu_j}$ and $G_i \in \mathbb{C}^{\nu_i}$. Note that the dimension of the whole reduced model is $\nu = \sum_{i=1}^N \nu_i$. Moreover, we need to enforce the network structure, that is $F_{ij} = 0$ if $j \notin N_i$. We also define the output of the reduced network:

$$\psi = H\xi,$$

where $\xi(t) = [\xi_1^T(t) \cdots \xi_N^T(t)]^T \in \mathbb{R}^\nu$ and $H \in \mathbb{C}^{1 \times \nu}$. The reduced matrices of the network system (9) are written in compact form as:

$$\begin{bmatrix} F & G \\ H & 0 \end{bmatrix} = \begin{bmatrix} F_{11} & F_{12} & \ldots & F_{1N} & G_1 \\ F_{21} & F_{22} & \ldots & F_{2N} & G_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ F_{N1} & F_{N2} & \ldots & F_{NN} & G_N \\ H_1 & H_2 & \ldots & H_N & 0 \end{bmatrix}, \qquad (10)$$


where $H_j \in \mathbb{C}^{1 \times \nu_j}$ and $F_{ij} = 0$ if $j \notin N_i$. In the rest of this section we investigate different procedures for computing a reduced model of dimension $\nu$ of the network system, which preserves the topology of the network.

Krylov projections-based network reduction
The reduced model (9) can be computed using Krylov projection-based methods, as in e.g. [19, 59, 73]. That is, there exists a procedure to compute a reduced order model for (7) with low-order components projected onto a well-defined block structured Krylov subspace. To this end, let $V_i \in \mathbb{C}^{n_i \times \nu_i}$ be the Krylov projection built for the set of interpolation points $\{s_{i1}, \ldots, s_{i\nu_i}\} \subset \mathbb{C} \setminus (\sigma(A) \cup \sigma(A_{ii}))$ and $W_i \in \mathbb{C}^{n_i \times \nu_i}$, such that $W_i^* V_i = I_{\nu_i}$ for all $i = 1 : N$. Let us further define $V = \operatorname{diag}\{V_1, \ldots, V_N\}$ and $W = \operatorname{diag}\{W_1, \ldots, W_N\}$. Then, a family of reduced order models for the network system (7), that match the moments $C_i V_i$ at $\{s_{i1}, \ldots, s_{i\nu_i}\}$, for all $i = 1 : N$, and preserve the block structure of (8), is given by (9), with the blocks described by equations of the form:
$$F_{ii} = W_i^* A_{ii} V_i, \quad F_{ij} = W_i^* A_{ij} V_j, \quad G_i = W_i^* B_i, \quad H_j = C_j V_j \quad \forall i, j = 1 : N. \qquad (11)$$

Note that if $A_{ij} = 0$, then from the definition $F_{ij} = W_i^* A_{ij} V_j$ it follows that $F_{ij} = 0$. Therefore, the employed Krylov projections allow for the preservation of the topology of the network in the reduced model (9). Furthermore, the block Krylov projections $V_i$ are essentially the well-defined solutions of Sylvester equations of the form:
$$A_{ii} \Pi_i + B_i L_i = \Pi_i S_i \quad \forall i = 1 : N, \qquad (12)$$
with $S_i \in \mathbb{C}^{\nu_i \times \nu_i}$ satisfying $\sigma(S_i) = \{s_{i1}, \ldots, s_{i\nu_i}\}$ and $L_i \in \mathbb{C}^{1 \times \nu_i}$, such that the pairs $(L_i, S_i)$ are observable for each $i = 1 : N$. Note that since $\sigma(A_{ii}) \cap \sigma(S_i) = \emptyset$, the Sylvester equation (12) has a unique solution $\Pi_i$. The connection between $\Pi_i$ and $V_i$ is given by the following lemma:

Lemma 1.6. Consider the matrices $\Pi_i$, solutions of the Sylvester equations (12), and the projectors $V_i$. Then, there exist square, non-singular matrices $T_i \in \mathbb{C}^{\nu_i \times \nu_i}$ such that $\Pi_i = V_i T_i$ for all $i = 1 : N$.

Proof: The result is an immediate consequence of [4, Lemma 3], see also [22].
Note that, in general, the computation of the matrices $W_i$ that enforce additional properties on the blocks $F_{ii}$, or on the entire reduced model of the network, may be difficult. Furthermore, enforcing additional constraints requires a specific choice of interpolation points, which is, in general, difficult and conservative. Hence, we seek families of parametrized reduced order models, where the free parameter is easier to find such that additional requirements are met, e.g., stability, passivity, etc. (see also [3] for more details).
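For illustration only, a minimal sketch of the structure-preserving projection (11)-(12) could look as follows; the function name and the block-list interface (Ablocks, Bblocks, Cblocks, Sblocks, Lblocks) are our own conventions, with zero matrices standing for absent interconnections and $W_i$ chosen here simply via a pseudoinverse so that $W_i^* V_i = I$:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def structured_krylov_reduction(Ablocks, Bblocks, Cblocks, Sblocks, Lblocks):
    """Sketch of (11)-(12): per-subsystem projectors that keep the topology.
    Ablocks[i][j] is A_ij (a zero matrix when subsystems i and j do not
    interact), Bblocks[i] = B_i, Cblocks[j] = C_j, and (Lblocks[i], Sblocks[i])
    are observable pairs fixing the interpolation points of subsystem i."""
    N = len(Ablocks)
    V, W = [], []
    for i in range(N):
        # Sylvester equation (12): A_ii Pi_i + B_i L_i = Pi_i S_i
        Pi = solve_sylvester(Ablocks[i][i], -Sblocks[i], -Bblocks[i] @ Lblocks[i])
        Vi = Pi                              # Lemma 1.6 with T_i = I
        Wi = np.linalg.pinv(Vi).conj().T     # simplest choice with W_i^* V_i = I
        V.append(Vi); W.append(Wi)
    # Blocks (11); F[i][j] = 0 whenever A_ij = 0, so the topology is preserved
    F = [[W[i].conj().T @ Ablocks[i][j] @ V[j] for j in range(N)] for i in range(N)]
    G = [W[i].conj().T @ Bblocks[i] for i in range(N)]
    H = [Cblocks[j] @ V[j] for j in range(N)]
    return F, G, H
```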

Sylvester equations-based network reduction
In the sequel we construct a family of reduced order models (9) based on Sylvester equations. Our construction also provides sufficient conditions that allow us to identify the reduced order models from this family which preserve the given topology of the network. Consider the matrix $S = [S_{ij}]_{i,j=1:N} \in \mathbb{C}^{\nu \times \nu}$, with blocks $S_{ij} \in \mathbb{R}^{\nu_i \times \nu_j}$, and the matrix $L = [L_i]_{i=1:N}$, with blocks $L_i \in \mathbb{C}^{1 \times \nu_i}$, such that the pair $(L, S)$ is observable. Assume further that:
$$\sigma(S) \cap \sigma(A) = \emptyset.$$


Furthermore, let $\Pi = [\Pi_{ij}]_{i,j=1:N} \in \mathbb{C}^{n \times \nu}$, with blocks $\Pi_{ij} \in \mathbb{C}^{n_i \times \nu_j}$, be the solution of the Sylvester equation (2). Based on our assumptions, it follows that the matrix $\Pi$ is unique. We construct a family of block-structured systems (9), using the results from Section 1.1.1. Note that the moments of the original network system (8) at $\sigma(S)$ are described by the blocks of the matrix $C\Pi = \left[\sum_{j=1}^N C_j \Pi_{ji}\right]_{i=1:N}$.

Theorem 1.7. A family of reduced order models (10), parametrized in $G_i \in \mathbb{C}^{\nu_i}$, that match the moments $C\Pi$ of (8) at $\sigma(S)$ is given by the following block matrices:
$$\begin{bmatrix} S_{11} - G_1 L_1 & S_{12} - G_1 L_2 & \ldots & S_{1N} - G_1 L_N & G_1 \\ S_{21} - G_2 L_1 & S_{22} - G_2 L_2 & \ldots & S_{2N} - G_2 L_N & G_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ S_{N1} - G_N L_1 & S_{N2} - G_N L_2 & \ldots & S_{NN} - G_N L_N & G_N \\ \sum_j C_j \Pi_{j1} & \sum_j C_j \Pi_{j2} & \ldots & \sum_j C_j \Pi_{jN} & 0 \end{bmatrix} \qquad (13)$$
Furthermore, if (13) preserves the topology of the given system, then necessarily there exist matrices $G_i$ such that $S_{ij} = G_i L_j$ for $j \notin N_i$, with $i = 1 : N$.

Proof: Imposing the moment matching condition (4) and taking the matrix $P = I_\nu$ in the Sylvester equation (5), that is in $FP + GL = PS$, we get:
$$F_{ij} + G_i L_j = S_{ij} \quad \forall i, j = 1 : N,$$
and the claim follows. Moreover, the blocks $F_{ij}$ satisfy $F_{ij} = 0$ for $j \notin N_i$ provided that there exist matrices $G_i$ such that $S_{ij} = G_i L_j$.
The free parameters $G_i$ may be used to enforce additional properties or structure on the block structured approximation (13). Note that the results from Theorem 1.7 are generally applicable, reduction taking place for all the subsystems, i.e. each subsystem is reduced by interpolation at a prescribed set of interpolation points, separately. These results take into account the peculiarities of the system, such as, e.g., the reduction of a prescribed number of subsystems, as is often the case in practice.
Now, we give a sufficient condition such that the reduced model (13) from Theorem 1.7 preserves the structure of the given network (8). In particular, we provide a construction for the matrices $S$, $G$ and $L$ such that the network topology is preserved and the reduced model still matches the moments $C\Pi$ of the network system (8) at $\sigma(S)$:

Theorem 1.8. Assume that the interpolation points are chosen such that the $S_i$ from (12) are invertible for all $i = 1 : N$. A reduced order model (13), matching the moments $C\Pi$ of (8) at $\sigma(S) = \cup_i \sigma(S_i)$, preserves the topology of the given network (8) provided that:
$$F_{ij} = S_{ij} - G_{ij} B_j \bar{L}_j, \qquad (14)$$
where the blocks of the matrices $S$, $L$ and $G_{ij} \in \mathbb{C}^{\nu_i \times n_j}$ are defined as:
$$S_{ij} = W_i^* A_{ij} A_{jj} V_j S_j^{-1}, \quad G_{ij} = -W_i^* A_{ij}, \quad \bar{L}_j = L_j S_j^{-1}, \qquad (15)$$
with $V_i$ and $W_i$ such that (11) holds, for all $i, j = 1 : N$. In particular, the diagonal blocks become:
$$F_{ii} = S_i - W_i^* B_i L_i.$$


Proof: By Lemma 1.6, we may write $A_{ii} V_i + B_i L_i = V_i S_i$, since $S_i$ may be chosen in canonical form, e.g. diagonal. By (11), we have $F_{ij} = W_i^* A_{ij} V_j$, with $W_i$ such that $W_i^* V_i = I_{\nu_i}$. Postmultiplying $F_{ij}$ by $S_j$ yields:
$$F_{ij} S_j = W_i^* A_{ij} V_j S_j = W_i^* A_{ij} (A_{jj} V_j + B_j L_j) = W_i^* A_{ij} A_{jj} V_j + W_i^* A_{ij} B_j L_j.$$
Since $S_j$ is assumed invertible, we then have:
$$F_{ij} = W_i^* A_{ij} A_{jj} V_j S_j^{-1} - (-W_i^* A_{ij}) B_j L_j S_j^{-1} = S_{ij} - G_{ij} B_j \bar{L}_j.$$
Finally, noting that if some $A_{ij} = 0$, then $F_{ij} = 0$, the claim follows.

Since $\sigma(S) = \cup_i \sigma(S_i)$, the moments $C\Pi$ contain the same information as the moments given by $C \operatorname{diag}\{V_1, \ldots, V_N\}$. Furthermore, we conjecture that the previous construction of the matrix $S = [S_{ij}]_{i,j=1:N}$ given in (15) can meet the spectrum requirement by imposing additional constraints on the free matrices $W_i$ (to be investigated in future work). Note that the selection of invertible matrices $S_j$ causes no loss of generality in the procedure, only a slight restriction on the choice of interpolation points, e.g., we do not interpolate at zero. In conclusion, we have provided conditions for obtaining families of parametrized reduced order models that achieve moment matching and preserve the topology of the to-be-reduced model of the network.
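The identity (14)-(15) can also be checked numerically on random (hypothetical) data; the sketch below builds $V_i$, $V_j$ from the Sylvester equations (12), takes $W_i$ via a pseudoinverse, and verifies that the construction (15) reproduces the Krylov block $F_{ij}$ of (11):

```python
import numpy as np
from scipy.linalg import solve_sylvester

# Numerical check of Theorem 1.8 on hypothetical random data: the blocks (15)
# reproduce F_ij = W_i^* A_ij V_j from (11) whenever S_j is invertible.
rng = np.random.default_rng(0)
ni, nj, vi, vj = 5, 4, 2, 2
Aii = -3.0 * np.eye(ni) - rng.random((ni, ni))
Ajj = -3.0 * np.eye(nj) - rng.random((nj, nj))
Aij = rng.random((ni, nj))
Bi, Bj = rng.random((ni, 1)), rng.random((nj, 1))
Li, Lj = rng.random((1, vi)), rng.random((1, vj))
Si, Sj = np.diag([1.0, 2.0]), np.diag([1.5, 2.5])   # invertible, diagonal
Vi = solve_sylvester(Aii, -Si, -Bi @ Li)            # (12), Lemma 1.6 with T_i = I
Vj = solve_sylvester(Ajj, -Sj, -Bj @ Lj)
Wi = np.linalg.pinv(Vi).T                           # real data: W_i^T V_i = I
Fij = Wi.T @ Aij @ Vj                               # Krylov block (11)
Sij = Wi.T @ Aij @ Ajj @ Vj @ np.linalg.inv(Sj)     # construction (15)
Gij = -Wi.T @ Aij
Lbar_j = Lj @ np.linalg.inv(Sj)
assert np.allclose(Fij, Sij - Gij @ Bj @ Lbar_j)    # identity (14)
```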

1.1.4 Reduction of the number of subsystems of linear network systems

In practice, it is often required that the given network system of $N$ interconnected subsystems of dimension $n_i$ be approximated by a network of $\mu < N$ subsystems of low dimension, say $\nu_i \leq n_i$ for all $i$. The reduced model (13) from the previous section does not answer the question of reducing the number of subsystems $N$, which in the context of large-scale network systems can be very large. Since in moment matching techniques there is no structured way of computing a number $\mu < N$ of subsystems of the network, we use the balanced truncation method [2]. For this we assume, based on the results from Section 1.1.3, that we have obtained component-wise reduced subsystems (13) of dimension $\nu_i = 1$. Then, we propose the following procedure for also reducing the number of subsystems $N$:

Algorithm 1.9. Let $S = [s_{ij}]_{i,j=1:N} \in \mathbb{C}^{N \times N}$ and $L = [l_i]_{i=1:N} \in \mathbb{C}^{1 \times N}$ be such that $\sigma(S) = \{s_1, \ldots, s_N\}$.

Step 1: Compute, based on the results from Section 1.1.3, the family of models (13) of order $N$ and dimension $\nu_i = 1$ with the structure:

$$\begin{bmatrix} s_{11} - g_1 l_1 & s_{12} - g_1 l_2 & \ldots & s_{1N} - g_1 l_N & g_1 \\ s_{21} - g_2 l_1 & s_{22} - g_2 l_2 & \ldots & s_{2N} - g_2 l_N & g_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ s_{N1} - g_N l_1 & s_{N2} - g_N l_2 & \ldots & s_{NN} - g_N l_N & g_N \\ \sum_j C_j \Pi_{j1} & \sum_j C_j \Pi_{j2} & \ldots & \sum_j C_j \Pi_{jN} & 0 \end{bmatrix}, \qquad (16)$$

where $g_i$ are free parameters.


Step 2: Compute a balanced realization of the model (16). Assuming that there exists $\mu$ such that $\sigma_\mu > \sigma_{\mu+1}$, where $\sigma_i$ are the Hankel singular values of (16), we perform truncation at order $\mu$ (see the sketch after this algorithm). Note that this procedure, in general, does not preserve the given topology structure of (8), i.e. the given sparsity is lost. Hence, a network with $\mu < N$ fully interconnected subsystems may be obtained. Furthermore, the error of the balanced truncation approximation of (16) has an upper bound of $2(\sigma_{\mu+1} + \cdots + \sigma_N)$, see [2] for more details.

Step 3: Since $\mu$ is assumed small, there exists a coordinate transformation that brings the balanced truncation model closer to the desired form.

In conclusion, we have proposed a scale-free modeling algorithm for large-scale networks. With this algorithm, super-subsystems are created and represented by aggregated variables, but the block structure may be lost. However, with some coordinate transformation we can find a reduced model closer to the desired topology. As a consequence, observing only aggregated variables of the network system allows us to cut the complexity of synthesizing a controller, analyzing the network or simulating it.
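A minimal sketch of the balanced truncation used in Step 2 (our own illustration via the standard square-root method, not code from the report; it assumes a stable, minimal realization so that the Gramians admit Cholesky factors):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, cholesky, svd

def balanced_truncation(A, B, C, mu):
    """Square-root balanced truncation of a stable, minimal system (A, B, C)
    to order mu; returns the reduced model and the Hankel singular values."""
    # Gramians: A P + P A^T = -B B^T  and  A^T Q + Q A = -C^T C
    P = solve_continuous_lyapunov(A, -B @ B.T)
    Q = solve_continuous_lyapunov(A.T, -C.T @ C)
    Lp = cholesky(P, lower=True)
    Lq = cholesky(Q, lower=True)
    U, hsv, Vt = svd(Lq.T @ Lp)                 # hsv = Hankel singular values
    D = np.diag(hsv[:mu] ** -0.5)
    T = Lp @ Vt[:mu].T @ D                      # right projector
    Ti = D @ U[:, :mu].T @ Lq.T                 # left projector, Ti @ T = I_mu
    return Ti @ A @ T, Ti @ B, C @ T, hsv
```

The returned Hankel singular values give both the truncation order $\mu$ (chosen at the first large gap) and the error bound $2(\sigma_{\mu+1} + \cdots + \sigma_N)$ quoted in Step 2.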

1.1.5 Illustrative example

We consider a linear network system, consisting of a chain of $N$ interconnected subsystems, described by a system (8) with the particular block bidiagonal structure:

$$Z = \begin{bmatrix} A - \Phi & \Phi & 0 & \ldots & 0 & 0 \\ 0 & A - \Phi & \ddots & & \vdots & \vdots \\ \vdots & & \ddots & \Phi & 0 & 0 \\ 0 & \ldots & 0 & A - \Phi & \Phi & 0 \\ 0 & \ldots & 0 & 0 & A - \Phi & I_n \\ I_n & \ldots & I_n & I_n & I_n & 0 \end{bmatrix}, \qquad (17)$$
where $\Phi = \frac{N}{h} I_n$ and $\dim Z = Nn$, with $h > 0$ a parameter. Usually, such a model is used for the rational approximation of distributed delays [77] or for modeling leader-follower systems [42]. The aim is to determine a low number $\mu$ of reduced order subsystems that can approximate (17) for large order $N$. A family of models (13) parametrized in $G$, of order $N\nu < Nn$, that match the moments and preserve the structure of (17) is given by:

$$\bar{Z} = \begin{bmatrix} S - GL & -GL & 0 & \ldots & 0 & 0 \\ 0 & S - GL & -GL & \ddots & \vdots & \vdots \\ \vdots & & \ddots & -GL & 0 & 0 \\ 0 & \ldots & 0 & S - GL & -GL & 0 \\ 0 & \ldots & 0 & 0 & S - GL & G \\ \Pi_1 & \Pi_2 & \ldots & \Pi_{N-1} & \Pi_N & 0 \end{bmatrix}. \qquad (18)$$

Yet, the reduced model (18) does not answer the question of reducing the number $N$ of subsystems in (17). Hence, we apply Algorithm 1.9 to reduce $N$ to $\mu < N$. Let $S = s_0 \in \mathbb{C}$ and $L = 1$, i.e. we interpolate at one point in the complex plane. Then, we get a procedure with the following steps:


Step 1: Compute a family of reduced models (18) of order $N$ with scalar subsystems:
$$\begin{bmatrix} s_0 - g & g & 0 & \ldots & 0 & 0 \\ 0 & s_0 - g & g & \ldots & 0 & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots & \vdots \\ 0 & \ldots & 0 & s_0 - g & g & 0 \\ 0 & \ldots & 0 & 0 & s_0 - g & g \\ \Pi_1 & \ldots & \Pi_{N-2} & \Pi_{N-1} & \Pi_N & 0 \end{bmatrix} \qquad (19)$$

Note that $g$, being a free parameter, may be chosen in connection with $h$ and $N$.

Step 2: Compute a balanced realization of this model. Assuming that there exists $\mu$ such that $\sigma_\mu > \sigma_{\mu+1}$, we perform truncation at order $\mu$.

Step 3: Since $\mu$ is assumed small, there exists a coordinate transformation that brings the balanced truncation model closer to the desired bidiagonal form.

The final result is a network with $\mu < N$ interconnected subsystems. The computation of an additional coordinate transformation at Step 3 is not desirable, since it may be hard to compute both analytically and numerically, and it is actually not required. We may easily apply a moment matching algorithm once again, avoiding the reduction step. Since the input-output behavior of both realizations is identical, they are equivalent via a coordinate transformation. To this end, let $S$ be a Jordan-like block of order $\mu$, i.e., $S_{ii} = s^*$ for all $i = 1 : \mu$, $S_{j,j+1} = g$ for all $j = 1 : \mu - 1$ and zero for the rest of the entries, and $L = \begin{bmatrix} l_1 & 0 & \ldots & 0 & l_\mu \end{bmatrix} \in \mathbb{C}^{1 \times \mu}$. Note that in this particular form the pair $(L, S)$ is observable. Hence a model of order $\mu < N$ that has approximately the bidiagonal form of the original network system (17) is described by equations with the following parameters:

$$\begin{bmatrix} s^* & g & 0 & \ldots & 0 & 0 \\ 0 & s^* & g & \ldots & 0 & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots & \vdots \\ 0 & \ldots & 0 & s^* & g & 0 \\ -g l_1 & \ldots & 0 & 0 & s^* - g l_\mu & g \\ \eta_1 & \ldots & \eta_{\mu-2} & \eta_{\mu-1} & \eta_\mu & 0 \end{bmatrix} \qquad (20)$$

This model has approximately the structure of the given original system (17), with order $\mu < N$. The free parameters $s^*$ and $g$, as well as $l_1$ and $l_\mu$, can be tuned in order to achieve the best desired performance. Furthermore, for very small values of $g$ the approximation resembles the structure of the given original system.
For simulation we start with $N = 20$, $h = 1$ and the matrix $A = -1$ in $Z$, as in [77]. By Step 2 of the algorithm described above, we find $\mu = 8$. Furthermore, the balanced truncation step allows us to compute a bound on the approximation error of $Z$; in the infinity norm the upper bound is $1.8657 \times 10^{-4}$. Employing Step 3, we approximately recreate a family of chain-structured models described by (20), of order 8, parametrized in $s^*$ and $g$. On a trial and error basis we get good results for $s^* = -1$ and $g = 1$, see Figure 2. Searching for better results, we get significant improvements for $s^* = -3.9021$ and $g = 8$. Note that $s^*$ and $g$ have been obtained by trial and error. For future work we intend to find more systematic ways of determining these very important parameters.
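For reference, the simulation setup can be reconstructed as in the following sketch (our own reading of (17): scalar subsystems $n = 1$, the input entering the last subsystem and the output summing all states); the printed bound can then be compared with the value reported above:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Chain (17) with n = 1 and A = -1: Phi = N/h, the input enters the last
# subsystem and the output sums all states.
N, h, a = 20, 1.0, -1.0
Phi = N / h
A = (a - Phi) * np.eye(N) + Phi * np.eye(N, k=1)
B = np.zeros((N, 1)); B[-1, 0] = 1.0
C = np.ones((1, N))
P = solve_continuous_lyapunov(A, -B @ B.T)      # Gramians of the chain
Q = solve_continuous_lyapunov(A.T, -C.T @ C)
hsv = np.sort(np.sqrt(np.abs(np.linalg.eigvals(P @ Q))))[::-1]
mu = 8
print("balanced truncation error bound:", 2 * hsv[mu:].sum())
```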


Figure 2: Bode diagram of $Z(s)$ for $N = 20$ (blue solid line), approximated with a model of $\mu = 8$ scalar subsystems with $s^* = -1$ and $g = 1$ (red dash-dotted line), and a model of $\mu = 8$ scalar subsystems with $s^* = -3.9021$ and $g = 8$ (dashed black line).

1.2 Optimal voltage control for loss minimization based on sequential convex programming

In this section, a new method is presented that overcomes some of the limitations of the widely-used methods for solving the optimal voltage control for loss minimization problem appearing in the optimal power flow (OPF) problem. Although the main application we consider in this work is optimal voltage control in a power system, the numerical optimization algorithm we propose can also be used in other optimization-based applications. We consider that the corresponding optimization problem has the following structure: a convex cost function, simple convex constraints and nonlinear equality constraints. This structure arises for example in the optimal voltage control problem for loss minimization in power systems, where the nonlinear constraints describe the power balance equations. The algorithm we propose to address the numerical solution of this optimization problem is based on the exploitation of the convex problem structure using a sequential convex programming framework that linearizes the nonlinear equality constraints at each iteration. We show that by combining sequential convex programming with efficient algorithms for solving the inner problem, we can obtain a new optimization algorithm that generalizes the results from [40, 41] corresponding to the convex case. Moreover, we show that the sequential convex programming method converges locally linearly to a local minimizer and, thereby, the optimization algorithm we propose has the same property. The newly developed algorithm can run over large power grids, achieving a (local) economic optimum at system level, as we show in several numerical simulations using the classical IEEE bus test cases.


1.2.1 Modeling of optimal voltage control problem

Let us discuss the active optimal power flow problem for a power system [76]. We assume that the network consists of $M$ buses and $\tilde{M}$ branches. We consider a power system whose structure is characterized by a directed bipartite graph $G = (V_1, V_2, Y)$, where $V_1 = \{i \mid i = 1, \ldots, M\}$ denotes the set of buses, $V_2 = \{l = (i, j) \mid i, j \in V_1, \, l = 1, \ldots, \tilde{M}\} \subseteq V_1 \times V_1$ represents the set of transmission lines (branches) between two buses, and the matrix $Y$ denotes the bus admittance matrix.

Bus representation. Four state quantities are associated with each bus $i$ of the network: the active power $P$, the reactive power $Q$, the voltage magnitude $V$ and the voltage angle $\theta$. The network buses are classified in terms of their characteristics as shown in Table 1. The known quantities can be scheduled by various technical means, and the unknown variables usually result depending on the goal of the engineering problem considered and may take any value between the technical limits. A peculiar case is the generator bus, where the reactive power can be controlled by means of the excitation system, but is a consequence of the scheduled voltage and the power system needs.

Table 1: Characteristics of network buses: "sch" denotes the scheduled quantities.

Bus type    Set   Known quantities   Control variables
Generator   G     P                  V, Q, θ
Load        L     P, Q               V, θ
Transit     L     P = 0, Q = 0       V, θ
Slack       S     θ = 0              P, Q, V

There is only one slack bus, chosen to balance the system powers, which is usually a generator bus. Any difference between the total generation and total load in the power system is balanced at the slack bus. Power losses are also balanced by the slack bus when there is no other scheduled generator to provide additional power, based on estimations, to cover the losses. Note that no constraint is usually imposed on the active and reactive powers, $P$ and $Q$, at the slack bus.
Figure 3 shows a network bus. The local active and reactive powers are presented on the left side. They can be scheduled by various means, and thus they are denoted by $P^{\rm sch}_i$ and $Q^{\rm sch}_i$. While the load power cannot be controlled, a generator can control both the active and reactive powers. However, in our application, only the reactive power of the generators, which is required to ensure the scheduled voltage, will be controlled. On the right side, the branches between bus $i$ and other buses in the network are represented, together with the exchanged active and reactive powers, $P_{ik}$ and $Q_{ik}$. The sums of these exchanged powers are called nodal powers and are denoted by $P^{\rm exch}_i$ and $Q^{\rm exch}_i$. According to Kirchhoff's first law, at any instant of time the powers entering a bus equal the powers going out of the bus, that is $P^{\rm sch}_i = P^{\rm exch}_i$ and $Q^{\rm sch}_i = Q^{\rm exch}_i$. Any change in the scheduled powers will result in variations of the exchanged powers. In Figure 3, $P_{gi}$ is the generated active power at bus $i$ and $Q_{gi}$ is the generated reactive power at bus $i$; $P_{li}$ is the active load at bus $i$ and $Q_{li}$ is the reactive load at bus $i$; $V_i$ is the voltage magnitude at bus $i$ and $\theta_i$ is the voltage angle at bus $i$. Therefore, for generator buses the control vector is given by $x_{gi} = [V_i \; \theta_i \; Q_{gi}]$, where $i \in G$; for load and transit buses the control vector is $x_{li} = [V_i \; \theta_i]$, where $i \in L$; whereas for the slack bus the control vector is $x_i = [V_i \; P_{gi} \; Q_{gi}]$, where $i = sl$. Finally, the decision vector can be represented as:
$$x = [(x_{gi})_{i \in G} \; (x_{li})_{i \in L} \; x_{sl}]^T.$$

Branch representation. A network branch is mainly either an electrical line or a transformer.


Figure 3: Representation of a network bus.

Any of them can be simply represented by a π-quadripole (see Figure 4), where
$$\underline{y}_{ik} = g_{ik} - j b_{ik} = \frac{1}{r_{ik} + j x_{ik}}$$
is the series admittance, $r_{ik}$ is the resistance and $x_{ik}$ is the reactance, while $\underline{y}_{ik0}$ is the shunt admittance and $N_{ik}$ is the transformation ratio. For transformers, $N_{ik}$ is around 1 when the calculation is performed in per unit, and the shunt admittance can be neglected without affecting the accuracy, because it is small compared to the shunt admittance of electrical lines. For electrical lines, $N_{ik}$ is always equal to 1. The bus admittance matrix $Y$ is calculated based on the following rules:

Figure 4: Representation of a network branch.

The diagonal terms $(i, i)$ are the sums of the admittances incident to the corresponding bus $i$; the off-diagonal terms $(i, k)$ are given by the series admittances taken with a "-" sign. The fill-in pattern of the admittance matrix is the same as that of the incidence matrix. A term of the bus admittance matrix can be written as $\underline{Y}_{ik} = G_{ik} + j B_{ik}$.
Relational constraints. The nodal active and reactive power balance states that the sum of powers entering a bus equals the sum of powers going out of that bus, for all $i \in [M]$:
$$P_{gi} - P_{li} = \sum_{j=1}^{M} V_i V_j \left( G_{ij} \cos(\theta_i - \theta_j) + B_{ij} \sin(\theta_i - \theta_j) \right),$$
$$Q_{gi} - Q_{li} = \sum_{j=1}^{M} V_i V_j \left( G_{ij} \sin(\theta_i - \theta_j) - B_{ij} \cos(\theta_i - \theta_j) \right). \qquad (21)$$
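For illustration, the stacked residual of (21), which plays the role of $h(x)$ in problem (23) below, can be evaluated as in the following sketch (the function name and argument layout are our own; $G$ and $B$ denote the real and imaginary parts of the bus admittance matrix $Y$):

```python
import numpy as np

def power_balance_residual(V, theta, G, B, Pg, Pl, Qg, Ql):
    """Stacked residual of the nodal balance equations (21); it vanishes at a
    feasible operating point and plays the role of h(x) in problem (23)."""
    dth = theta[:, None] - theta[None, :]           # theta_i - theta_j
    P = V * ((G * np.cos(dth) + B * np.sin(dth)) @ V)
    Q = V * ((G * np.sin(dth) - B * np.cos(dth)) @ V)
    return np.concatenate([Pg - Pl - P, Qg - Ql - Q])
```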

The reactive power produced by any generator at bus $i$, where $i \in G$, is subject to capability limits:
$$Q^{\min}_{gi} \leq Q_{gi} \leq Q^{\max}_{gi}.$$


For network security reasons, under normal operating conditions, the voltage at any bus $i \in [M]$ should not exceed the admissible limits, that is:
$$V^{\min}_i \leq V_i \leq V^{\max}_i.$$

The power flow on any branch should not exceed the transmission limits, that is:
$$-S^{\max}_{ij} \leq S_{ij} \leq S^{\max}_{ij}, \qquad (22)$$

where $S_{ij}$ is the apparent power flow. In real interconnected power systems, the slack bus is chosen outside the analyzed system such that it does not affect the power flows within the system. On the other hand, since power losses cannot be determined in advance, they result as the balance at the slack bus. The smaller the active power at the slack bus, the smaller the active power losses. Thereby, the objective function is:
$$f(x) = P_{gi}^2, \quad \text{with } i = sl.$$

Other choices of the objective function can be considered, e.g. in [38] a quadratic strongly convex objective function is used. Then, in compact form we can write the corresponding optimization problem for the power system as:
$$\min_{x \in [l, u]} f(x) \quad \text{s.t.} \quad h(x) = 0, \qquad (23)$$
where $h : \mathbb{R}^n \to \mathbb{R}^{n_h}$ is obtained by stacking all the balance constraints (21), $g : \mathbb{R}^n \to \mathbb{R}^{n_g}$ denotes the power flow constraints (22), $f$ is a quadratic convex function and $[l, u]$ is a box imposing physical limits on the decision variables.

1.2.2 Sequential convex programming (SCP) framework

Many network problems lead to solving the following structured nonconvex optimization problem (see also the previous section for an optimal voltage control application):
$$\min_{x \in X} f(x) \quad \text{s.t.} \quad h(x) = 0, \qquad (24)$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is a convex function, $h : \mathbb{R}^n \to \mathbb{R}^{n_h}$ is a hard nonconvex function and $X \subseteq \mathbb{R}^n$ is a simple convex set (i.e. the projection onto this set can be done efficiently, e.g. a box $X = [l, u]$). We consider the set $X$ as being described by the intersection of the sublevel sets of convex functions:
$$X = \{x : c_j(x) \leq 0, \; j = 1, \ldots, n_c\}.$$

Notice that the nonconvexity in problem (24) is concentrated in the equality constraints $h(x) = 0$. For simplicity of exposition we assume that all the functions $f$, $h$ and $c_j$ are twice continuously differentiable. Our approach can also deal with nonconvex inequality constraints $g(x) \leq 0$, as we will see in Section 1.2.4. However, for simplicity of exposition, in this section we drop the inequality constraints in the formulation (24).


The underlying idea of sequential convex programming (SCP) is that we can solve the nonconvex optimization problem (24) by iteratively solving a convex approximation of the original problem, see also [46]. Starting from an initial guess $x_0 \in X$, the SCP algorithm calculates a sequence $\{x_k\}_{k \geq 0}$ by solving the convex approximation of (24):
$$x_{k+1} = \arg\min_{x \in X} f(x) \quad \text{s.t.} \quad h(x_k) + \nabla h(x_k)^T (x - x_k) = 0, \qquad (25)$$

where $\nabla h(x)$ denotes the Jacobian of $h$. Note that we have implicitly assumed that the function $h$ is differentiable. If the solution of this convex subproblem is unique, then we can define the solution map from $x_k$ to $x_{k+1}$ by $\Pi_{\rm SCP}$, i.e. $\Pi_{\rm SCP}(x_k) = x_{k+1}$ (conditions guaranteeing uniqueness of the subproblem solution are given in Section 1.2.3). The optimization subproblem we have to solve at each iteration is convex and structured, and therefore is suitable for efficient solution with existing algorithms. The remainder of this section establishes local convergence properties of the SCP method. For this, we consider a local minimum $x^*$ of the nonconvex problem (24) and investigate under which conditions the SCP method converges to this solution. We will make a standard technical assumption regarding regularity of the solution $x^*$.
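Before proceeding, a minimal sketch of the SCP iteration (25) is given below (our own illustration, not the report's implementation; the generic SLSQP solver merely stands in for a structure-exploiting convex solver of the subproblem, and the interface is hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

def scp(f, h, jac_h, x0, bounds, max_iter=50, tol=1e-8):
    """Basic SCP loop (25): linearize the equalities h at x_k and solve the
    resulting convex subproblem over the box X described by `bounds`."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        xk, hk, Jk = x.copy(), h(x), jac_h(x)
        # Linearized equality constraint: h(x_k) + grad h(x_k)^T (z - x_k) = 0
        cons = [{'type': 'eq', 'fun': lambda z: hk + Jk @ (z - xk)}]
        x = minimize(f, xk, bounds=bounds, constraints=cons,
                     method='SLSQP').x
        if np.linalg.norm(x - xk) < tol:
            break
    return x
```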

Assumption 1.10. At the local solution $x^*$ of problem (24) the linear independence constraint qualification (LICQ) holds.

Then, a basic result in the nonconvex optimization field states the following first order necessary conditions (FONC):

Lemma 1.11. Under Assumption 1.10 there exist unique multipliers $\lambda^* \in \mathbb{R}^{n_h}$ and $\nu^* \in \mathbb{R}^{n_c}$ such that the following nonlinear equalities (KKT conditions) hold:
$$\begin{bmatrix} \nabla f(x^*) + \nabla h(x^*)^T \lambda^* + \nabla c(x^*)^T \nu^* \\ h(x^*) \\ \max(c(x^*), -\nu^*) \end{bmatrix} = 0. \qquad (26)$$
In the last condition, which is non-smooth and a short reformulation of the complementarity conditions $\nu^* \geq 0$, $c(x^*) \leq 0$, $c(x^*)^T \nu^* = 0$, the maximum is taken componentwise, so that it has $n_c$ components.

Assumption 1.12. The primal-dual solution $(x^*, \lambda^*, \nu^*)$ satisfies strict complementarity, i.e. $\nu^* - c(x^*) > 0$.

Strict complementarity allows us to divide the inequalities and their multipliers into two disjoint sets of "active" and "inactive" components, with $c_a(x^*) = 0$ and $\nu^*_i = 0$:
$$c(x) = \begin{bmatrix} c_a(x) \\ c_i(x) \end{bmatrix} \quad \text{and} \quad \nu = \begin{bmatrix} \nu_a \\ \nu_i \end{bmatrix}. \qquad (27)$$

In the neighborhood of the solution $(x^*, \lambda^*, \nu^*)$, the first order necessary conditions for optimality can now compactly be written as the differentiable nonlinear system:
$$\begin{bmatrix} \nabla f(x^*) + \nabla h(x^*)^T \lambda^* + \nabla c(x^*)^T \nu^* \\ h(x^*) \\ c_a(x^*) \\ -\nu^*_i \end{bmatrix} = 0. \qquad (28)$$


1.2.3 Local convergence of the SCP method

In this section we show that the SCP method, when started sufficiently close to the solution $(x^*, \lambda^*, \nu^*)$, converges linearly under the additional assumptions that the Jacobian of the equation system (28) is invertible, which would follow from second order sufficient conditions for optimality, and that the spectral radius of the Jacobian of the map $\Pi_{\rm SCP}$ at $x^*$ is smaller than one. To this aim we formulate a particular variation of the above smooth nonlinear conditions, and start by defining a map $F : \mathbb{R}^{n + n_h + n_c + n} \to \mathbb{R}^{n + n_h + n_c}$ as follows:
$$F(x, \lambda, \nu, \bar{x}) = \begin{bmatrix} \nabla f(x) + \nabla h(\bar{x})^T \lambda + \nabla c(x)^T \nu \\ h(\bar{x}) + \nabla h(\bar{x})^T (x - \bar{x}) \\ c_a(x) \\ -\nu_i \end{bmatrix}. \qquad (29)$$

Note that the argument x enters the nonlinear equation only via the linearization of h, and that thesystem F (x∗, λ∗, ν∗, x∗) = 0 would be equivalent to eq. (28), the FONC of the original problem (24).More importantly, the nonlinear residual F helps us to define the relation between the SCP iterates. Wedenote the Jacobian matrix of F with respect to its first three components by:

J(x, λ, ν, x) =∂F

∂(x, λ, ν)(x, λ, ν, x).

Assumption 1.13. The matrix J(x∗, λ∗, ν∗, x∗) is invertible.

It is important to note that the top-left block of the matrix J(x, λ, ν, x) is given by ∇2f(x) +∑ncj=1 νi∇2ci(x), which is a positive semi-definite matrix.

Lemma 1.14. Under Assumptions 1-3 and when ∥xk − x∗∥ is sufficiently small, the next SCP iteratexk+1 generated by solution of (25) is unique and there exist unique multipliers λk+1 ∈ Rnh andνk+1 ∈ Rnc so that:

F (xk+1, λk+1, νk+1, xk) = 0. (30)

Proof. Due to Assumption 3 and the implicit function theorem, for xk sufficiently close to x∗, thenonlinear system:

F (x, λ, ν, xk) = 0 (31)

in unknowns (x, λ, ν) admits a solution in a neighborhood of (x∗, λ∗, ν∗). If the distance ∥xk − x∗∥is sufficiently small, this solution – that we already call (xk+1, λk+1, νk+1) – still satisfies, due toAssumption 1.12 and continuity of all involved functions, the strict complementarity condition νk+1 −c(xk+1) > 0, and J(xk+1, λk+1, νk+1, xk) remains invertible. It remains to be shown that this solution isthe unique solution of the convex subproblem (25). This follows from the fact that eq. (30) is equivalentto the FONC of the subproblem, and that invertibility of J implies uniqueness of the solution.

Note that if the objective function is strongly convex, then the solution of the convex subproblem xk+1 isunique and we do not need Assumption 3 in this case. Thus, we have provided conditions when the SCPiteration is well defined in the neighborhood of the solution. Now, we can analyze contractivity of themap ΠSCP(x), which will allow us to prove the main result of this section regarding linear convergenceof SCP method.

20

Page 21: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Theorem 1.15 (Local Convergence of SCP). Under Assumptions 1-3, the SCP iteration mappingΠSCP(x) is differentiable in a neighborhood of x∗, and its derivative at x∗ is given by the matrixM∗ = ∂ΠSCP

∂x (x∗) having the expression:

M∗=−[I 0 0

]J(x∗,λ∗,ν∗,x∗)−1

nh∑j=1

λ∗j∇2hj(x∗)

00

(32)

If the spectral radius of M∗ is smaller than one, i.e.

ρ(M∗) < 1, (33)

then the SCP iteration is locally linear convergent with asymptotic contraction rate ρ(M∗).

Proof. In view of the previous lemma, only the formula (32) needs to be shown, and the rest followsfrom standard stability results from nonlinear discrete time systems theory. To prove the formula, recallthat the map ΠSCP(x) is defined via the implicit equation F (ΠSCP(x), λ(x), ν(x), x) = 0 which can bedifferentiated w.r.t. to x to yield:

0 = J(·)

∂ΠSCP(x)∂x∗∗

+∂F

∂x(·) (34)

which evaluated at x = x∗ gives by invertibility of J and projected by[I 0 0

]the result (32).

Note that if one eigenvalue ofM∗ has a modulus larger than one, then the (unregularized) SCP iterationdoes not converge. Moreover, the technical Assumptions 1-3 are in practice not restrictive. However, thelast condition, ρ(M∗) < 1 may be conservative. It is satisfied, however, whenever the nonlinear equalityconstraints are only weakly nonlinear, i.e. have only small second derivatives, or when the correspondingmultipliers, λ∗, are small. If condition (33) is not satisfied, a possible remedy would be to add aquadratic regularization term α∥x − xk∥22 to the objective function of the convex subproblem (25),which by a more detailed analysis can be shown to render the SCP method convergent. Note that inthis case the regularized convex subproblem (25) will have a strongly convex objective function.

1.2.4 Extension of SCP framework

The SCP framework can be extended to more general nonconvex structured problems that incorporatealso nonconvex inequalities:

minx∈X

f(x)

s.t. h(x) = 0, g(x) ≤ 0,(35)

where f : Rn → R is a convex function, h : Rn → Rnh and g : Rn → Rng are nonconvex functionsand X ⊆ Rn is a simple convex set. In these settings, starting from an initial guess x0 ∈ X, the SCPalgorithm calculates a sequence {xk}k≥0 by solving the convex approximation of (35):

xk+1 = argminx∈X

f(x)

s.t. h(xk) +∇h(xk)T (x− xk) = 0, u(x;xk) ≤ 0,(36)

21

Page 22: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

where the convex function u(x;xk) is a tight upper bound on g(x). For example, when each componentgi of g has bounded hessian for all x ∈ X with constant Li > 0 we can consider for the components ofu quadratic upper bounds of the form:

ui(x, xk)=gi(xk)+⟨∇gi(xk), x− xk⟩+Li

2∥x− xk∥2 ∀i = 1 : ng.

Under similar regularity assumptions as in previous section, the resulting algorithm is guaranteed toconverge, see e.g. [60].We need also to investigate possible algorithms for solving the the convex subproblem (25). Note thatthis convex subproblem (25) can be written in a more compact form as:

f∗ =minx∈X

f(x)

s.t. Ax = b,(37)

where A = ∇h(xk)T and b = ∇h(xk)Txk − h(xk) and recall that we assume X to be a simple set (i.e.we can project easily onto this set). If X is a box and the objective function f is linear or quadraticwe can apply efficient algorithms for linear or quadratic programming for solving (37). Our optimalpower flow application (23) fits into this scenario. In conclusion, we can use standard linear or quadraticsolvers for solving (37)

1.2.5 Illustrative example

In order to emphasize the advantages of the proposed optimization method, we perform simulationson several test networks: IEEE14, IEEE30, IEEE57, IEEE118 and Case2736 (from Matpower [76]).For these test cases the number of control variables ranges from few tens to several thousands. Theobjective of the optimization problem, applied to these test cases, is to reduce the network power lossesby achieving an optimal voltage profile.Voltage control relies mainly on the reactive power capability of generators. Changing the voltageset-point at the generator terminals requires a change in the reactive power. Thereby, the freedom ofvoltage control depends on the total reactive power reserve available in the system and the position ofthe generators. Because in some test networks the generator’s initial voltage set-point may be muchgreater than the usual maximum limit of 1.05 p.u., in all simulations the minimum and maximum voltagelimits were set to 0.95 p.u. and 1.09 p.u. For the test networks with voltage set-points greater than1.09 p.u., this reduction in the upper bound limit may reduce the effectiveness of loss minimization. Itis know that the higher the operating voltage the lower the current and thus the active power losses aresmaller.

Table 2: Simulation results.

Network Time[s] P initg,sl[MW] P opt

g,sl[MW ]

IEEE14 1.42 232.48 231.87

IEEE30 3.73 260.89 258.61

IEEE57 9.93 423.22 421.65

IEEE118 48.23 516.4 497.74

Case2736 153.11 643.13 622.45

22

Page 23: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

While the simulation time depend on the method effectiveness and the network size, the objectivefunction is strictly related to the generator’s performances. From Table 2 we see that the improvements,decrease in the slack bus active power, compared to steady-state values P init

g,sl, ranges from 0.3 up to3.6%, which are reasonably accepted in practice. Due to space limitations, we leave for the future tovalidate experimentally the effectiveness of this approach in more detailed simulations.

1.3 Distributed DC optimal power flow based on dual fast gradient methods

In comparison with the existing approaches available in the literature, we propose in this section anovel remodeling of the direct current optimal power flow (DC-OPF) problem, which takes into accountthe fact that it can be included in the wider class of network utility maximization problems. Thismodeling approach allows an optimal dispatch for each bus to be done independently and in parallelwhile still achieving the global economical optimum of the whole electric energy system. Thus, it isnot necessary to set up a common control center, but it is sufficient to interchange a small amountof information among the involved buses. This is possible by means of the Lagrangian relaxationdecomposition. In particular, the power balance constraints and the limits on the line flows are movedinto the cost using the Lagrange multipliers, which have an interesting economic interpretation as theoptimal energy trading prices [12] at the buses of the network. We propose novel distributed dual fastgradient methods generating approximate primal feasible and optimal solutions: a dual fast gradientmethod with convergence rate O(1/k2) in an average primal sequence; an hybrid dual fast gradientmethod with convergence rate O(1/k3/2) in the last primal iterate. For both methods we providecomplete estimates on primal/dual suboptimality and feasibility violation of the generated approximatesolutions. In particular, the estimates on primal suboptimality and feasibility violation for our distributedalgorithms are with an order of magnitude better that the ones of algorithms given e.g. in [6,12,36,40].Moreover, in comparison with other approaches (e.g. [12,36,41]) we do not require a centralized step sizeand thus we derive a fully distributed implementation for our algorithms, using parallel computations,and test them on several numerical simulations using the classical IEEE bus test cases.

1.3.1 Modeling of DC optimal power flow problem

Let us discuss the direct current optimal power flow (DC-OPF) problem for a power system [76]. DC-OPF can be modeled using the tools presented in Section 1.2.1 on top of which we put a bipartite graph.More precisely, we consider a power system whose structure is characterized by a directed bipartite graphG = (V1, V2, E), where V1 = {i | i = 1, . . . ,M} denotes the set of buses, V2 = {l = (i, j) | i, j ∈V1, l = 1, . . . , M} ⊆ V1×V1 represent the sets of transmission lines (branches) between two buses andthe matrix E denotes its incidence matrix. In these settings we define two sets:

Si={l∈V2 | Eli = 0}={l∈V2 | ∃j∈V1 s.t. (i, j) ∨ (j, i) = l}

denoting the set of all transmission lines from or to bus i and

Nl = {i ∈ V1 | Eli = 0} = {i, j ∈ V1 | (i, j) ∨ (j, i) = l}

which denotes the set comprised of buses i and j which define the branch l. We also introduce:

Si =∪l∈V2

{j ∈ V1 | Eli = 0 ∧ Elj = 0}

= {j ∈ V1 | ∃l ∈ V2 s.t. (i, j) ∨ (j, i) = l}

23

Page 24: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

which denotes the sets of all buses directly linked with bus i. It is straightforward to notice that theset Si can be obtained from the sets Nl and Si. We define further the diagonal matrix R ∈ RM×M ,whose diagonal elements Rll represent the reactance of the lth transmission line between two busses iand j ∈ V1. For each bus i we denote the phase angle of the voltage and the generated power, if thebus i is directly connected to a generator, by

θi ∈ Θi = [θi, θi] and Pgi ∈ Pi = [P g

i , Pgi ].

Under this model, the active power flow from a bus i to a bus j is given by

Fl =1

Rll(θi − θj) , (38)

where l = (i, j) and we recall that Rll represent the reactance of the transmission line connecting buses

i and j. We impose lower and upper line flows limits F = [F 1 · · ·F M ]T and F =[F 1 · · ·F M

]T,

respectively. We also assume that each bus i is characterized by a local load P di and we denote by

P d =[P d1 · · ·P d

M

]Tthe overall vector of loads. We introduce further the notations:

θ = [θ1 · · · θM ]T and P g =[P g1 · · ·P g

Mg

]T,

whereMg denotes the number of generators. We also define the matrix Ag ∈ [0, 1]M×Mg having Agij = 1

if P gj is directly linked with the bus i and the rest of its entries equal to zero. Note that if we consider

that each bus i is directly coupled with a generator unit, than M = Mg and Ag = IM . Using furtherthese notations, the DC nodal power balance can be derived from (21) using the assumption that thevariations in the angles is small; so that it can be written in the following form [76]:

ETREθ = AgP g − P d, (39)

where the matrix ETRE denotes the weighted Laplacian and its entries have the following expressions:

[ETRE]ij =

s∈SiRll, l = (i, s) ∨ (s, i) if i = j

−Rll, l = (i, j) ∨ (j, i) if i = j0 otherwise.

We can observe that the structure of the Laplacian matrix is given by the structure of the incidencematrix E through the sets Si, which, at its turn depend on the sets Si and Nl for all i ∈ V1 and l ∈ V2.Using further the relation between the the power flow and the phase angle of the voltages, we can writethe lower and upper limits imposed on the line flows in the following matrix form:

F ≤ REθ ≤ F. (40)

We also define reference values θrefi for the phase angle of the voltage of each bus and P g,refi for the

generated power of each generator. Further, for each bus i we define a local decision variable xi asfollows:

xi =

[θiP gi

]if the bus i is connected to a generator,

θi otherwise

24

Page 25: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

and the corresponding reference values xrefi . In comparison with the approach made in [76], where theauthors consider the lower and upper limits of the form θrefi ≤ θi ≤ θrefi , in our approach we do notimpose such constraints but use instead a weighted quadratic cost, which, depending on the value ofthe parameter qi, requires the solution to be close to the reference value θrefi . The main motivationbehind this approach consist in the fact that constraints of this form usually induce numerical problemsdue to the fact that the optimization problem which has to be solved is ill conditioned (for example, theSlater constraint qualification does not hold in this case). Therefore, for each bus i directly connectedto a generator unit we impose a local cost of the form:

fi(xi) = 0.5∥xi − xrefi ∥2Qi− γi log(βi + P g

i ), (41)

where the diagonal matrix Qi =

[qi 00 pi

]∈ R2×2 and the positive scalar γi are used in order to weight

the local cost. Also, the positive scalar βi is used to avoid numerical instability when P gi is closed to 0.

In comparison with the existing approaches for (DC-OPF) problems we add to the classic quadratic terma weighted logarithmic term, which is used in many resource allocation problems (see e.g. [75]) in orderto reduce the absolute risk aversion. The logarithmic utility function also exhibit diminishing returnswith the rate of resources, in our case the generated power, that is, as rate increases the incrementalutility grows by smaller amounts. For the buses that are not connected to a generator unit we imposea simple quadratic local cost of the form:

fi(xi) = 0.5qi

(xi − θrefi

)2, (42)

where in this case qi is a positive scalar. Note that for these choices the local costs fi are stronglyconvex for both cases. In conclusion, the (DC-OPF) problem can be cast as the following large-scaleseparable convex optimization problem:

f∗ = minθi∈Θi,P

gi ∈Pi

∑i1

fi1(θi1) +∑i2

fi2(θi2 , Pgi2) (43)

s.t.: ETREθ −AgP g = −P d, F ≤ REθ ≤ F ,

which is a particular case of the network utility maximization problem [13].

1.3.2 Dual Lagrangian framework for network utility maximization problems

Many network problems, as the DC-OPF (43), can be recast as a linearly constrained separable convexoptimization of the form (known also as the network utility maximization problem) [13]:

f∗ = minxi∈Xi

f(x)

(=

M∑i=1

fi(xi)

)(44)

s.t.: Ax = b, Cx ≤ c,

where the utility functions fi are convex, Xi ⊆ Rni are simple convex sets (i.e. the projection onto

these sets can be easily computed), the decision variables x =[xT1 · · ·xTM

]Tdenote the rates of the

sources, A ∈ Rp×n, C ∈ Rq×n, b ∈ Rp and c ∈ Rq. We also use X = X1 × · · · × XM . As toDC-OPF problem, we associate to problem (44) a communication bipartite graph G = (V1, V2, E),

25

Page 26: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

where V1 = {1, . . . ,M} denotes the set of some resources, V2 ={1, . . . , M

}the set of links shared by

the resources and E ∈ {0, 1,−1}(M)×M is an incidence matrix. We also introduce the index sets Si

(denotes the set of all links utilized by source i) and Nl (the set of all sources which utilize the link l)as in the previous section. Therefore, the local information structure imposed by the graph G shouldbe considered as part of the problem formulation. We assume that A and C are block matrices with

the blocks Ali ∈ Rpl×ni and Cli ∈ Rql×ni , where∑M

i=1 ni = n,∑M

l=1 pl = p and∑M

l=1 ql = q. Wealso assume that if Eli = 0, then both blocks Ali and Cli are zero. In these settings we allow a blockAli or Cli to be zero even if Eli = 0. For network problems the inequalities Cx ≤ c usually denotethe constraints imposed on the flow of each link, while Ax = b denotes a certain conservation law. Wemake the following assumptions on problem (44):

Assumption 1.16. (a) The functions fi are σi-strongly convex w.r.t. Euclidean norm ∥ · ∥ (see [53]).(b) The feasible set of problem (44) is nonempty and there exists x such that Ax = b and Cx < c.

Note that these assumptions are standard for the dual settings, as we will also see below. Assumption1.16 (b) implies that strong duality holds for optimization problem (44) and the set of optimal Lagrangemultipliers is bounded [21]. In particular, we have:

f∗ = maxν∈Rp,µ∈Rq

+

d(ν, µ), (45)

where d(ν, µ) denote the dual function of (44):

d(ν, µ) = minx∈X

L(x, ν, µ), (46)

with the Lagrangian function

L(x, ν, µ) = f(x) + ⟨ν,Ax− b⟩+ ⟨µ,Cx− c⟩.

In the context of network problems the multipliers (ν, µ) have the interpretation of congestion/relabilityprices. For example, in the optimal power flow problems ν multipliers are associated to the powerbalance equation and they have the economic interpretation as the optimal energy trading prices at thebuses of the network. For simplicity we introduce further the notations:

G =[AT CT

]Tand g =

[bT cT

]T. (47)

Assumption 1.16 (a), stating that each fi is strongly convex function, implies that f is also stronglyconvex w.r.t. Euclidian norm ∥ · ∥ with convexity parameter σf = mini=1,...,M σi. Furthermore, the dualfunction d is differentiable and its gradient has the following expression [41]:

∇d(ν, µ) = Gx(ν, µ)− g,

where x(ν, µ) denotes the unique optimal solution of the inner problem (46), that is:

x(ν, µ) = argminx∈X

L(x, ν, µ). (48)

Moreover, the gradient ∇d of the dual function is Lipschitz continuous w.r.t. Euclidean norm ∥ · ∥, withconstant [41]:

Ld = ∥G∥2 /σf . (49)

26

Page 27: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

If we denote νSi = [νl]l∈Siand µSi = [µl]l∈Si

, then the dual function can be written in the separableform:

d(ν, µ) =

M∑i=1

di(νSi , µSi)− ⟨ν, b⟩ − ⟨µ, c⟩,

with di having the following expressions:

di(νSi , µSi)= minxi∈Xi

fi(xi) +∑l∈Si

⟨AT

liνl+CTliµl, xi

⟩. (50)

In these settings, we have that the gradient ∇di is:

∇di(νSi , µSi) =

[[Ali]l∈Si

[Cli]l∈Si

]xi(νSi , µSi),

where xi(νSi , µSi) denotes the unique optimal solution in (50). Note that ∇di is Lipschitz continuousw.r.t. Euclidean norm ∥ · ∥, with constant:

Ldi =

∥∥∥∥[ [Ali]l∈Si

[Cli]l∈Si

]∥∥∥∥2 /σi. (51)

For simplicity, we will consider further the notations:

λ =[νTµT

]Tand λl =

[νTl µTl

]T ∀l ∈ V2,

and we will also denote the effective domain of the dual function by D = Rp × Rq+. The following

result, which is a distributed version of the descent lemma, is central in our derivations of distributedalgorithms and in the proofs:

Lemma 1.17. [6,45] Let Assumption 1.16 hold. Then, the following inequality is valid for all λ, λ ∈ D:

d(λ) ≥ d(λ) +⟨∇d(λ), λ− λ

⟩− 1

2∥λ− λ∥2W , (52)

where W=diag(Wν ,Wµ), Wν=diag(∑

i∈NlLdiIpl ; l∈V2

)and Wµ = diag

(∑i∈Nl

LdiIql ; l ∈ V2

).

Since f is strongly convex function, then the following relation, characterizing the distance betweena primal estimate and the primal optimal solution x∗ of our optimization problem (44), can be easilyderived, see e.g. [43]:

σf2∥x(λ)− x∗∥2 ≤ f∗ − d(λ) ∀λ ∈ D, (53)

where x(λ) = argminx∈X

L(x, λ). We denote further by Λ∗ the set of optimal solutions of dual problem

(93). According to Gauvin’s theorem [21], if Assumption 1.16 holds for our original problem (44),then Λ∗ is nonempty and bounded. Since the set of optimal Lagrange multipliers is bounded, for anyλ0 ∈ Rp+q we define the finite quantity:

R(λ0) = maxλ∗∈Λ∗

∥λ∗ − λ0∥W . (54)

Our goal in this work is to propose distributed dual first order methods for which we can derive estimatesfor primal and dual suboptimality and also for primal feasibility violation, i.e. finding a primal-dual pair(x, λ

)such that given accuracy ϵ we have:

x ∈ X, ∥ [Gx− g]D ∥W−1 ≤ O(ϵ), ∥x− x∗∥2 ≤ O(ϵ),

−O(ϵ)≤f(x)−f∗≤O(ϵ) and f∗ − d(λ) ≤ O(ϵ). (55)

27

Page 28: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

1.3.3 Distributed dual fast gradient using an average primal sequence

In this section we propose a fully distributed dual fast (also called accelerated) gradient scheme (DFG)for solving the dual problem (93). A similar centralized variant of the algorithm was proposed byNesterov in [53] and applied further in [40] for solving dual smoothed problems and in [41] for the casewhen the dual updates use inexact information and the step size is a fixed scalar. The scheme defines

two sequences(λk, λk

)k≥0

for the dual variables:

Algorithm (DFG)

Initialization: λ0 = 0. For k ≥ 0 compute:

1. xk = argminx∈X

L(x, λk)

2. λk =[λk +W−1∇d(λk)

]D

3. λk+1 = k+1k+3λ

k+ 2k+3

[W−1

∑ks=0

s+12 ∇d(λs)

]D.

For simplicity of the exposition we restrict our analysis to the case λ0 = 0. The main differencebetween our Algorithm (DFG) and the algorithms proposed in [36, 40, 41, 53] consists in the way weupdate the sequence λk. Instead of using a classical projected gradient step with a scalar step size asin [36, 40, 41, 53], we update λk using a projected weighted gradient step which allows us to obtain adistributed scheme (see Section V). Further, we analyze the convergence properties of Algorithm (DFG):

Theorem 1.18. [41,43] Let Assumption 1.16 hold and the sequences(xk, λk, λk

)k≥0

be generated by

algorithm (DFG). Also, let the primal average sequence be xk =∑k

s=02(s+1)

(k+1)(k+2)xs and R = R(0) =

maxλ∗∈Λ∗

∥λ∗∥W . Then, the following convergence estimates on dual suboptimality, distance to the optimal

solution of (44), primal feasibility violation and suboptimality for Algorithm (DFG) hold:

f∗ − d(λk) ≤ 2R2

(k + 1)2, ∥xk − x∗∥ ≤ 4R

√σf(k + 1)∥∥∥[Gxk− g

]D

∥∥∥W−1

≤ 8R(k + 1)2

, − 8R2

(k + 1)2≤ f(xk)− f∗ ≤ 0.

1.3.4 Distributed hybrid dual fast gradient algorithm using the last iterate

Note that for Algorithm (DFG) the primal sequence{xk}k≥0

for which we are able to recover primal

suboptimality and infeasibility is given by a weighted average of the iterates{xk}k≥0

. However, in

simulations we observe also a good behaviour of the last iterate xk. In this section we propose a hybriddistributed dual fast gradient algorithm for which we can ensure estimates for both primal suboptimalityand feasibility violation of the last iterate xk, which supports our findings from simulations. Thealgorithm is characterized by two phases: in the first phase we perform k steps of Algorithm (DFG)while in the second phase another k steps of a dual weighted gradient algorithm are performed:

28

Page 29: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Algorithm (H-DFG)

Initialization: λ0 = 0.Phase 1: For j = 0, . . . , k compute:

1. xj = argminx∈X

L(x, λj)

2. λj =[λj +W−1∇d(λj)

]D

3. λj+1 = j+1j+3λ

j+ 2j+3

[W−1

∑js=0

s+12 ∇d(λs)

]D.

Phase 2: Set λk = λk. For j = k, . . . , 2k compute:

1. xj = arg minx∈Rn

L(x, λj)

2. λj+1 =[λj +W−1∇d(λj)

]D .

We introduce further the following notation:

k∗ = arg minj∈[k,2k]

∥λj − λj+1∥2W . (56)

Note that the quantity λj−λj+1 denotes the constrained gradient direction (see [53]), which representsan indicator for the suboptimality level of the estimate λj . We can also observe that λj is an optimalsolution of (93) if and only if λj−λj+1 = 0 and thus we want ∥λj−λj+1∥2W to be small. The followingtheorem provides estimates on the dual suboptimality, primal feasibility violation and suboptimality andon the distance to the optimal solution of problem (44) for the Algorithm (H-DFG), in the last iteratesλk

∗and xk

∗:

Theorem 1.19. Let Assumption 1.16 hold, the sequences{λj , λj , xj

}j≥0

be generated by the Algorithm

(H-DFG) and k∗ be given by (56). In addition, let f be Lipschitz continuous with constant Lf, i.e.|f(x)−f(y)| ≤ Lf∥x−y∥. Then, the following estimates for dual suboptimality, distance to the optimalsolution of (44), primal feasibility violation and primal suboptimality for (H-DFG) hold:

(i) f∗ − d(λk∗) ≤ 2R2

(k + 1)2, ∥xk∗ − x∗∥ ≤ 2R

√σf(k + 1)

(ii)∥∥∥[Gxk∗ − g

]D

∥∥∥W−1

≤ 2R(k + 1)

√(k + 1)

(iii) − 2R2

(k+1)√

(k+1)≤f(xk∗)−f∗≤ 2LfR√

σf(k+1).

Proof: (i) From Theorem 1.18 and the initialization in Phase 2 of Algorithm (H-DFG) we have:

2∥λ∗∥2W(k + 1)2

≥ f∗ − d(λk) = f∗ − d(λk).

29

Page 30: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Further, we recall the following result [43]:

d(λj+1) ≥ d(λj) +1

2∥λj − λj+1∥2W ∀j = k, . . . , 2k. (57)

Combining the previous two inequalities, we obtain:

d(λk) = d(λk) ≤ d(λk+1) ≤ · · · ≤ d(λ2k+1), (58)

from which, together with the previous inequality and the fact that k∗ ∈ [k, 2k] we obtain the result.Further, we show the estimate for the distance to the optimal solution. Using (53) with λ = λk

∗we

have:

∥xk∗ − x∗∥ ≤√

2

σf

√f∗ − d(λk∗) ≤ 2∥λ∗∥W√

σf(k + 1).

(ii) Using Theorem 1.18 and (57) we can write:

2∥λ∗∥2W(k + 1)2

≥ f∗− d(λk) ≥ f∗−d(λ2k+1)+1

2

2k∑j=k

∥λj−λj+1∥2W

≥ (k + 1)

2∥λk∗−λk∗+1∥2W ,

where in the third inequality we used (57) recursively and in the last inequality we used (56). Using theprevious inequality we obtain:

∥λk∗−λk∗+1∥2W ≤4∥λ∗∥2W(k + 1)3

. (59)

Further, we will show that∥∥[∇d(λk∗)]D∥∥2W−1 ≤ ∥λk∗−λk∗+1∥2W . We will prove this inequality compo-

nentwise. First, we recall that D = Rp × Rq+. Thus, for all i = 1, . . . , p we have:∣∣∣[∇id(λ

k∗)]R

∣∣∣2W−1

ii

=∣∣∣∇id(λ

k∗)∣∣∣2W−1

ii

(60)

=∣∣∣λk∗i − λk

∗i −W−1

ii ∇id(λk∗)∣∣∣2Wii

=∣∣∣λk∗i − λk

∗+1i

∣∣∣2Wii

,

where in the last inequality we used the definition of λk∗+1. We introduce now the following disjoint

sets: I− ={i ∈ [p+ 1, p+ q] : ∇id(λ

k∗) < 0}

and also I+ ={i ∈ [p+ 1, p+ q] : ∇id(λ

k∗) ≥ 0}.

Using these notations and the definition of D, we can write for all i ∈ I−:∣∣∣[∇id(λk∗)]R+

∣∣∣2W−1

ii

= 0 ≤∣∣∣λk∗i − λk

∗+1i

∣∣∣2Wii

. (61)

On the other hand, for all i ∈ I+ we have:∣∣∣[∇id(λk∗)]R+

∣∣∣2W−1

ii

=∣∣∣∇id(λ

k∗)∣∣∣2W−1

ii

(62)

=∣∣∣[W−1

ii ∇id(λk∗)]R+

∣∣∣2Wii

=∣∣∣λk∗i − λk

∗+1i

∣∣∣2Wii

.

30

Page 31: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Summing up the relations (60),(61) and (62) for all i = 1, . . . , p+ q and combine the result with (59)we obtain: ∥∥∥[∇d(λk∗)]

D

∥∥∥2W−1

≤∥∥∥λk∗ − λk

∗+1∥∥∥2W

≤4∥λ∗∥2W(k + 1)3

.

Taking now into account that[∇d(λk∗)

]D =

[Gxk

∗ − g]D and using the definition of R we conclude

the result.(iii) In order to prove the left-hand side inequality for primal suboptimality we can write:

f∗ = d(λ∗) = minx∈X

f(x) + ⟨λ∗, Gx− g⟩ (63)

≤ f(xk) + ⟨λ∗,[Gxk − g

]D⟩

≤ f(xk) + ∥λ∗∥W∥∥∥[Gxk − g

]D

∥∥∥W−1

,

where the second inequality follows from the fact that λ∗ ∈ D and the last one from Cauchy-Schwartzinequality. Combining now with the estimate on feasibility we obtain the result. For proving the righthand-side inequality for primal suboptimality, we use the estimate for the distance to the optimal solutionand the Lipschitz property of f :

f(xk∗)− f∗ ≤ Lf∥xk

∗ − x∗∥ ≤ 2Lf∥λ∗∥W√σf(k + 1)

,

which concludes the statement.

1.3.5 Distributed implementation

In this section we analyze the distributed implementation of Algorithms (DFG) and (H-DFG) in thecontext of networks. Recall that for network problems the multipliers λ = (ν, µ) have the interpretationof congestion/relability prices. Note that compared to other dual gradient methods from literature[36, 40, 41], our proposed algorithms are fully distributed. Thus, in order to update the sources ratesxi and the congestion/relability prices λl, it is not necessary to set up a common control center, but itis sufficient to interchange a small amount of information among the involved utilities. We look firstat step 1 of the Algorithm (DFG). Note that this step is similar with the steps 1 of phases 1 and 2 ofAlgorithm (H-DFG) and therefore their analysis follows in a similar way. From (50), in order to updatethe rate of each source i ∈ V1 we have:

xki = arg minxi∈Xi

fi(xi) +∑l∈Si

([AT

liCTli

]λkl

)Txi. (64)

Thus, in order to compute the rates xki , our algorithms require only local information, more specifically{Ali, Cli, λ

kl

}l∈Si

. In many practical applications, as in e.g. optimal power flow problems, xki can be

computed in closed form. We discuss further the update of the congestion/reliabiliatily prices λl. Usingthe definitions of W and ∇d, step 2 in Algorithm (DFG) can be written in the following form for eachlink l ∈ V2:

λkl =

[λkl +

[W−1

νll

∑i∈Nl

Alixki

W−1µll

∑i∈Nl

Clixki

]]Rpl×Rql

+

,

where Wνll and Wµlldenote the lth block-diagonal element of matrix Wν and Wµ, respectively. Taking

into account the definitions of Wνll and Wµllwe can conclude that in order to update the dual variable

31

Page 32: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Case/Alg. DFG CFG H-DFG H-CFG DG9 buses 3655 4434 372 541 56005

14 buses 2101 3242 349 1283 54755

30 buses 1368 2013 503 1356 27026

39 buses 1756 6343 1316 4835 69961

57 buses 4876 21123 2003 15507 ∗118 buses 6273 32679 4631 29118 ∗300 buses 7352 33241 8978 37148 ∗2383 buses 10607 51234 7374 47832 ∗

Table 3: Number of iterations for ϵ accuracy.

λkl in step 2 of Algorithm (DFG) we require only local information{Ldi , Ali, Cli, x

ki

}i∈Nl

. Note that

the analysis of step 3 in the Algorithm (DFG) can be derived in a similar way as for step 2. Also, step2 in phases 1 and 2 and step 3 in phase 1 of the Algorithm (H-DFG) follows similarly.

1.3.6 Illustrative example

In this section we present some preliminary numerical results on the direct current optimal power flowproblem. We test the performances of Algorithms (DFG), (H-DFG) and distributed dual gradientAlgorithm (DG) for solving the (DC-OPF) problem (43) for different IEEE bus test cases. We recallAlgorithm (DG):

(DG) : λk+1 =[λk +W−1∇d(λk)

]D.

We also consider the centralized versions of (DFG) and (H-DFG): (CFG) and (H-CFG), where insteadof the step size given by matrixW we use LdIp+q as in (49). The numerical simulation are performed ondifferent power systems, representing classical test cases [76], with the number of busesM ranging from9 to 2383, of generators from 3 to 327 and of interconnecting lines from 18 to 2896: case9 from [76], 39bus New England system, IEEE 14, 30, 57, 118 and 300 bus test cases and Polish system (case 2383wpfrom [76]). For each power system we generate the local constraints sets imposed on the generatedpower of each bus i, Pg

i , while θi is free to vary as indicated above. The local loads P di and the matrices

E, R and Ag have been extracted from MATPOWER [76].Note that in the context of (DC-OPF) problem, ν multipliers associated to the power balance equationhave the economic interpretation as the optimal energy trading prices at the buses of the network.Therefore, our algorithms are able to identify also the optimal energy pricing rates for the energy tradedthrough the interconnections in a distributed fashion. Thus, it is not necessary to set up a commoncontrol center, but it is sufficient to interchange a small amount of information among the involvedbuses.In Table 3 we show, for each test case, the number of iterations performed by the algorithms in or-der to find a suboptimal primal solution xk which satisfies the following stopping criteria for primalsuboptimality and infeasibility:

|f(xk)− f∗|/f∗ ≤ ϵ and∥∥∥[Gxk − g

]D

∥∥∥W−1

≤ ϵ. (65)

In our simulation we consider an accuracy ϵ = 0.01. For each test case, we use CVX in order to computethe optimal value f∗. In the case when the imposed accuracy has not been attained after 3·105 iterations,

32

Page 33: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

we stoped the algorithm and reported ∗. Some remarks are worth to be mentioned. First, we canobserve from Table 3 that both the proposed Algorithms (DFG) and (H-DFG) clearly outperform theclassical dual gradient Algorithm (DG). Thus, the practical behaviour observed in simulations certifiesthe theoretical results derived in the previous sections, where we have proved that the rate of convergenceof the proposed algorithms improves the well known rate of convergence of order O( 1k ) for the Algorithm(DG). This behaviour is also valid for the centralized cases. Another important aspect consists in thefact that for all algorithms, when the dimension of the problem increases, the distributed version becomesmore efficient than the centralized one. This is a consequence of the fact that when the number ofbusses increases, the level of sparsity of the matrices A and C, characterized in terms of the indicessets Si, Si and Nl, is high and therefore the Lipschitz constants Ldi are small in comparison with theoverall Lipschitz constant Ld. These differences in Ldi and Ld lead to a grater step size in the case ofdistributed algorithms compared to centralized ones, thus the first ones perform faster.

33

Page 34: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

2 Control techniques for big data systems

In Task 2 we develop new optimization based-control techniques for big data systems. In paper [P4] weanalyze a family of general random block coordinate descent iterative hard thresholding based methodsfor the minimization of l0 regularized optimization problems, i.e. the objective function is composed ofa smooth convex function and the l0 regularization. This type of optimization problems arise e.g. inpacketized predictive control for networked control systems with unreliable (or rate-limited) communi-cations. In paper [P5] we analyze the convergence of inexact projection primal first order methods forconvex minimization. We show that these algorithms can be used efficiently for solving model predictivecontrol problems arising in embedded applications. We prove that we can still achieve similar conver-gence rates for these inexact projection first order algorithms with those given in the exact projectionsettings, provided that the approximate projection is sufficiently accurate. Our convergence analysisallows to derive explicitly the accuracy of the inexact projection and the number of iterations we need toperform in order to obtain an approximate solution for our convex problem. Further, we also present inpaper [P6] a constructive solution for the inverse optimality problem for the class of continuous piecewiseaffine functions. The main idea is based on the convex lifting concept. Regarding linear optimal control,we show that any continuous piecewise affine control law can be obtained via a linear optimal controlproblem with the control horizon at most equal to 2 prediction steps. However, due to space limitationsin this report we present only the results given in papers [P4] and [P5], paper [P6] can be found on thejournal website.

2.1 Random coordinate descent methods for ℓ0 regularized convex problems: appli-cation to sparse control

Nowadays, there are increasingly numerous engineering applications which promote interest in sparseoptimization problems. In recent practical applications (e.g. sparse control problems [39], sparse robustidentification of hybrid/switched systems [55], state estimation under corrupted measurements [29],compressed sensing [10], sparse principal component analysis [27]) we aim at finding a sparse (fewnonzero components) minimizer of a given convex objective (cost) function. For example, in the sparselinear system identification area [55], estimators with a few non-negligible parameters are sought throughminimization of ℓ2 loss of prediction error. Also, in the sparse control problem [39], a sparse minimizerof a least-squares type objective function is sought, representing the sparse packetized (control) inputof a given plant in presence of packets dropouts. We give further more details on these systems andcontrol applications.

Sparse control. Given the settings of networked control systems with unreliable (or rate-limited) com-munications, an efficient control strategy is packetized predictive control (PPC). In [39] the (PPC)strategy is analyzed under the presence of packets-dropouts, and in order to obtain a more compressiblecontroller, a sparse optimization problem is addressed. Given the system state x ∈ Rn and denotingwith u ∈ Rm the input sequence, then the following problem is considered:

minu

∥u∥0 s.t. ∥Gu−Hx∥22 ≤ xTWx,

where the matrices G,H are coming from the dynamics and performance cost of the system. Note thatthis sparse (PPC) formulation is a particular case of sparse optimization model (66) and can be alsorewritten in the Lagrangian form. Typically, in the Lagrangian formulation, the penalty λ is empiricallychosen such that an appropriate sparsity level is reached.

34

Page 35: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Sparse system identification. The problem consists of finding the sparsest (time-varying) parametervector θ ∈ Rn among the solutions of the linear regression problem Aθ + w = y, where A ∈ Rm×n isthe regression matrix and y are the measurements corrupted by some noise w with bounded magnitude.A widespread sparse constrained formulation is:

minθ∈Rn

∥θ∥0 s.t. ∥Aθ − y∥2 ≤ ϵ. (66)

In [55], a slightly more general sparse model is analyzed and some greedy algorithms are proposedfor identifying a suitable set of affine models along with a switching sequence that can explain themeasurements. The sparsity of the solution implies a minimal number of switches or subsystems. Notethat for a given value ϵ, there exists λ > 0 such that the following regularized (Lagrangian) form:

minθ∈Rn

∥Aθ − y∥22 + λ∥θ∥0

has the same global minimum as (66) (see e.g. [10]).

Note that, the additional requirement of sparsity immediately turns the original least-square minimizationproblem into a very hard combinatorial problem. In order to overcome this issue, many authors haveconsidered the convex relaxation of ℓ0 norm to ℓ1 norm. It is worth to note that there are someresults on this topic given in [10], where it is shown that the ℓ1 convex relaxation is exact under somefavorable conditions, such as θ is sufficiently sparse and the matrix A obeys the so-called restrictedisometry property (RIP). However, such conditions are not usually satisfied in applications such assystem identification or control, due to the strong correlation of the columns of regression matrix A (seee.g. [68]). Therefore, the direct approach of ℓ0 regularized formulation is more appropriate for theseapplications.In this work we analyze a family of general random block coordinate descent iterative hard thresholdingbased methods for the minimization of ℓ0 regularized optimization problems, i.e. the objective functionis composed of a smooth convex function and the ℓ0 regularization. The family of the algorithms weconsider takes a very general form, consisting in the minimization of a certain approximate version ofthe objective function one block variable at a time, while fixing the rest of the block variables. Suchtype of methods are particularly suited for solving nonsmooth ℓ0 regularized problems since they solvean easy low dimensional problem at each iteration, often in closed form. Our family of methods coversparticular cases such as random block coordinate gradient descent and random proximal coordinatedescent methods. We analyze necessary optimality conditions for this nonconvex ℓ0 regularized problemand devise a procedure for the separation of the set of local minima into restricted classes based onapproximation versions of the objective function. We provide a unified analysis of the almost sureconvergence for this family of random block coordinate descent algorithms and prove that, for eachapproximation version, the limit points are local minima from the corresponding restricted class of localminimizers. We also provide numerical experiments which show the superior behavior of our methodsin comparison with the usual iterative hard thresholding algorithm.

2.1.1 Notations and preliminaries

We consider the space Rn composed by column vectors. For x, y ∈ Rn denote the scalar product by⟨x, y⟩ = xT y and the Euclidean norm by ∥x∥ =

√xTx. We use the same notation ⟨·, ·⟩ (∥ · ∥) for scalar

product (norm) in spaces of different dimensions. For any matrix A ∈ Rm×n we use σmin(A) for theminimal eigenvalue of matrix A. We use the notation [n] = {1, 2, . . . , n} and e = [1 · · · 1]T ∈ Rn. In

35

Page 36: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

the sequel, we consider the following decompositions of the variable dimension and of the n×n identitymatrix:

n =

N∑i=1

ni, In = [U1 . . . UN ] , In =[U(1) . . . U(n)

],

where Ui ∈ Rn×ni and U(j) ∈ Rn for all i ∈ [N ] and j ∈ [n]. If the index set corresponding to block iis given by Si, then |Si| = ni. Given x ∈ Rn, then for any i ∈ [N ] and j ∈ [n], we denote:

xi = UTi x ∈ Rni , ∇if(x) = UT

i ∇f(x) ∈ Rni ,

x(j) = UT(j)x ∈ R, ∇(j)f(x) = UT

(j)∇f(x) ∈ R.

In this work we analyze the properties of local minima and devise a family of random block coordinatedescent methods for the following ℓ0 regularized optimization problem:

minx∈Rn

F (x) (= f(x) + ∥x∥0,λ) , (67)

where function f is smooth and convex and the quasinorm of x is defined as:

∥x∥0,λ =N∑i=1

λi∥xi∥0,

where ∥xi∥0 is the quasinorm which counts the number of nonzero components in the vector xi ∈ Rni ,which is the ith block component of x, and λi ≥ 0 for all i = 1, . . . , N . Note that in this formulationwe do not impose sparsity on all block components of x, but only on those ith blocks for which thecorresponding penalty parameter λi > 0. However, in order to avoid the convex case, intensively studiedin the literature, we assume that there is at least one i such that λi > 0.For any vector x ∈ Rn, the support of x is given by supp(x), which denotes the set of indices cor-responding to the nonzero components of x. We denote x = max

j∈supp(x)|x(j)| and x = min

j∈supp(x)|x(j)|.

Additionally, we introduce the following set of indices:

I(x) = supp(x) ∪ {j ∈ [n] : j ∈ Si, λi = 0}

and Ic(x) = [n]\I(x). Given two scalars p ≥ 1, r > 0 and x ∈ Rn, the p−ball of radius r and centeredin x is denoted by Bp(x, r) = {y ∈ Rn : ∥y − x∥p < r}. Let I ⊆ [n] and denote the subspaceof all vectors x ∈ Rn satisfying I(x) ⊆ I with SI , i.e. SI = {x ∈ Rn : xi = 0 ∀i /∈ I}. Wedenote with f∗ the optimal value of the convex problem f∗ = minx∈Rn f(x) and its optimal set withX∗

f = {x ∈ Rn : ∇f(x) = 0}. In this section we consider the following assumption on function f :

Assumption 2.1. The function f has (block) coordinatewise Lipschitz continuous gradient with con-stants Li > 0 for all i ∈ [N ], i.e. the convex function f satisfies the following inequality for all i ∈ [N ]:

∥∇if(x+ Uihi)−∇if(x)∥ ≤ Li∥hi∥ ∀x ∈ Rn, hi ∈ Rni .

An immediate consequence of Assumption 2.1 is the following relation [51]:

f(x+ Uihi) ≤ f(x) + ⟨∇if(x), hi⟩+Li

2∥hi∥2 ∀x ∈ Rn, hi ∈ Rni . (68)

We denote with λ = [λ1 · · ·λN ]T ∈ RN , L = [L1 · · ·LN ]T and Lf the global Lipschitz constant of thegradient ∇f(x). In the Euclidean settings, under Assumption 2.1 a tight upper bound of the globalLipschitz constant is Lf ≤

∑Ni=1 Li (see [51, Lemma 2]). Note that a global inequality based on

Lf , similar to (68), can be also derived. Moreover, we should remark that Assumption 2.1 has beenfrequently considered in coordinate descent settings (see e.g. [47–49,51,62]).

36

Page 37: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

2.1.2 Characterization of local minima

In this section we present the necessary optimality conditions for problem (67) and provide a detaileddescription of local minimizers. First, we establish necessary optimality conditions satisfied by any localminimum. Then, we separate the set of local minima into restricted classes around the set of globalminimizers. The next theorem provides conditions for obtaining local minimizers of problem (67):

Theorem 2.2. If Assumption 2.1 holds, then any z ∈ Rn\{0} is a local minimizer of problem (67) on

the ball B∞(z, r), with r = min{z, λ

∥∇f(z)∥1

}, if and only if z is a global minimizer of convex problem

minx∈SI(z)

f(x). Moreover, 0 is a local minimizer of problem (67) on the ball B∞

(0,

mini∈[N ] λi

∥∇f(z)∥1

)provided

that 0 ∈ X∗f , otherwise is a global minimizer for (67).

Proof. For the first implication, we assume that z is a local minimizer of problem (67) on the open ballB∞(z, r), i.e. we have:

f(z) ≤ f(y) ∀y ∈ B∞(z, r) ∩ SI(z).

Based on Assumption 2.1 it follows that f has also global Lipschitz continuous gradient, with constantLf , and thus we have:

f(z) ≤ f(y) ≤ f(z) + ⟨∇f(z), y − z⟩+Lf

2∥y − z∥2 ∀y ∈ B∞(z, r) ∩ SI(z).

Taking α = min{ 1Lf, r

maxj∈I(z)

|∇(j)f(z)|} and y = z − α∇I(z)f(z), we obtain:

0 ≤(α2

2− α

Lf

)∥∇I(z)f(z)∥2 ≤ 0.

Therefore, we have ∇I(z)f(z) = 0, which means that:

z = arg minx∈SI(z)

f(x). (69)

For the second implication we first note that for any y, d ∈ Rn, with y = 0 and ∥d∥∞ < y, we have:

|y(i) + d(i)| ≥ |y(i)| − |d(i)| ≥ y − ∥d∥∞ > 0 ∀i ∈ supp(y). (70)

Clearly, for any d ∈ B∞(0, r)\SI(y), with r = y, we have:

∥y + d∥0,λ = ∥y∥0,λ +∑

i∈Ic(y)∩supp(d)

∥d(i)∥0,λ ≥ ∥y∥0,λ + λ.

Let d ∈ B∞(0, r)\SI(y), with r = min{y, λ

∥∇f(y)∥1

}. The convexity of function f and the Holder

inequality lead to:

F (y + d) ≥ f(y) + ⟨∇f(y), d⟩+ ∥y + d∥0,λ≥ F (y)− ∥∇f(y)∥1∥d∥∞ + λ ≥ F (y) ∀y ∈ Rn. (71)

37

Page 38: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

We now assume that z satisfies (69). For any x ∈ B∞(z, r) ∩ SI(z) we have ∥x − z∥∞ < z, which by(70) implies that |x(i)| > 0 whenever |z(i)| > 0. Therefore, we get:

F (x) = f(x) + ∥x∥0,λ ≥ f(z) + ∥z∥0,λ = F (z),

and combining with the inequality (71) leads to the second implication. Furthermore, if 0 ∈ X∗f ,

then ∇f(0) = 0. Assuming that mini∈[N ] λi > 0, then F (x) ≥ f(0) + ⟨∇f(0), x⟩ + ∥x∥0,λ ≥F (0)−∥∇f(0)∥1∥x∥∞+mini∈[N ] λi ≥ F (0) for all x ∈ B∞

(0,

mini∈[N ] λi

∥∇f(z)∥1

). If 0 ∈ X∗

f , then ∇f(0) = 0

and thus F (x) ≥ f(0) + ⟨∇f(0), z⟩+ ∥x∥0,λ ≥ F (0) for all x ∈ Rn.

From Theorem 2.2 we conclude that any vector z ∈ Rn is a local minimizer of problem (67) if and onlyif the following equality holds:

∇I(z)f(z) = 0.

We denote with Tf the set of all local minima of problem (67), i.e.

Tf ={z ∈ Rn : ∇I(z)f(z) = 0

},

and we call them basic local minimizers. It is not hard to see that when the function f is stronglyconvex, the number of basic local minima of problem (67) is finite, otherwise we might have an infinitenumber of basic local minimizers.

2.1.3 Strong local minimizers

In this section we introduce a family of strong local minimizers of problem (67) based on an approximationof the function f . It can be easily seen that finding a basic local minimizer is a trivial procedure e.g.: (a)if we choose some set of indices I ⊆ [n] such that {j ∈ [n] : j ∈ Si, λi = 0} ⊆ I, then from Theorem2.2 the minimizer of the convex problem minx∈SI

f(x) is a basic local minimizer for problem (67); (b)if we minimize the convex function f w.r.t. all blocks i satisfying λi = 0, then from Theorem 2.2 weobtain again some basic local minimizer for (67). This motivates us to introduce more restricted classesof local minimizers. Thus, we first define an approximation version of function f satisfying certainassumptions. In particular, given i ∈ [N ] and x ∈ Rn, the convex function ui : Rni → R is an upperbound of function f(x+ Ui(yi − xi)) if it satisfies:

f(x+ Ui(yi − xi)) ≤ ui(yi;x) ∀yi ∈ Rni . (72)

We additionally impose the following assumptions on each function ui.

Assumption 2.3. The approximation function ui satisfies the assumptions:(i) The function ui(yi;x) is strictly convex and differentiable in the first argument, is continuous in thesecond argument and satisfies ui(xi;x) = f(x) for all x ∈ Rn.(ii) Its gradient in the first argument satisfies ∇ui(xi;x) = ∇if(x) ∀x ∈ Rn.(iii) For any x ∈ Rn, the function ui(yi;x) has Lipschitz continuous gradient in the first argument withconstant Mi > Li, i.e. there exists Mi > Li such that:

∥∇ui(yi;x)−∇ui(zi;x)∥ ≤Mi∥yi − zi∥ ∀yi, zi ∈ Rni .

(iv) There exists µi such that 0 < µi ≤Mi − Li and

ui(yi;x) ≥ f(x+ Ui(yi − xi)) +µi2∥yi − xi∥2 ∀x ∈ Rn, yi ∈ Rni .

38

Page 39: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Note that a similar set of assumptions has been considered in [23], where the authors derived a generalframework for the block coordinate descent methods on composite convex problems. Clearly, Assumption2.3 (iv) implies the upper bound (72) and in [23] this inequality is replaced with the assumption ofstrong convexity of ui in the first argument.We now provide several examples of approximation versions of the objective function f which satisfyAssumption 2.3.

Example 2.4. We now provide three examples of approximation versions for the function f . The readercan easily find many other examples of approximations satisfying Assumption 2.3.1. Separable quadratic approximation: given M ∈ RN , such that Mi > Li for all i ∈ [N ], we definethe approximation version

uqi (yi;x,Mi) = f(x) + ⟨∇if(x), yi − xi⟩+Mi

2∥yi − xi∥2.

It satisfies Assumption 2.3, in particular condition (iv) holds for µi = Mi − Li. This type of approxi-mations was used by Nesterov for deriving the random coordinate gradient descent method for solvingsmooth convex problems [51] and further extended to the composite convex case in [48,62].

2. General quadratic approximation: given Hi ≽ 0, such that Hi ≻ LiIni for all i ∈ [N ], we define theapproximation version

uQi (yi;x,Hi) = f(x) + ⟨∇if(x), yi − xi⟩+1

2⟨yi − xi,Hi(yi − xi)⟩.

It satisfies Assumption 2.3, in particular condition (iv) holds for µi = σmin(Hi − LiIni) (the smallesteigenvalue). This type of approximations was used by Luo et al. in deriving the greedy coordinatedescent method based on the Gauss-Southwell rule for solving composite convex problems [72].

3. Exact approximation: given β ∈ RN , such that βi > 0 for all i ∈ [N ], we define the approximationversion

uei (yi;x, β) = f(x+ Ui(yi − xi)) +βi2∥yi − xi∥2.

It satisfies Assumption 2.3, in particular condition (iv) holds for µi = βi. This type of approximationfunctions was used especially in the nonconvex settings [23].

Based on each approximation function ui satisfying Assumption 2.3, we introduce a class of restrictedlocal minimizers for our nonconvex optimization problem (67).

Definition 2.5. For any set of approximation functions ui satisfying Assumption 2.3, a vector z is calledan u-strong local minimizer for problem (67) if it satisfies:

F (z) ≤ minyi∈Rni

ui(yi; z) + ∥z + Ui(yi − zi)∥0,λ ∀i ∈ [N ].

Moreover, we denote the set of strong local minima, corresponding to the approximation functions ui,with Lu.

It can be easily seen that

minyi∈Rni

ui(yi; z) + ∥z + Ui(yi − zi)∥0,λyi=zi≤ ui(zi; z) + ∥z∥0,λ = F (z)

39

Page 40: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

and thus an u-strong local minimizer z ∈ Lu, has the property that each block zi is a fixed point of theoperator defined by the minimizers of the function ui(yi; z) + λi∥yi∥0, i.e. we have for all i ∈ [N ]:

zi = arg minyi∈Rni

ui(yi; z) + λi∥yi∥0.

Theorem 2.6. Let the set of approximation functions ui satisfy Assumption 2.3, then any u−stronglocal minimizer is a local minimum of problem (67), i.e. the following inclusion holds:

Lu ⊆ Tf .

Proof. From Definition 2.5 and Assumption 2.3 we have:

F (z) ≤ minyi∈Rni

ui(yi; z) + ∥z + Ui(yi − zi)∥0,λ

≤ minyi∈Rni

ui(zi; z) + ⟨∇ui(zi; z), yi − zi⟩+Mi

2∥yi − zi∥2 + ∥z + Ui(yi − zi)∥0,λ

= minyi∈Rni

F (z) + ⟨∇if(z), yi − zi⟩+Mi

2∥yi − zi∥2 + λi(∥yi∥0 − ∥zi∥0)

≤ F (z) + ⟨∇if(z), hi⟩+Mi

2∥hi∥2 + λi(∥zi + hi∥0 − ∥zi∥0)

for all hi ∈ Rni and i ∈ [N ]. Choosing now hi as follows:

hi = − 1

MiU(j)∇(j)f(z) for some j ∈ I(z) ∩ Si,

we have from the definition of I(z) that

λi(∥zi + hi∥0 − ∥zi∥0) ≤ 0

and thus 0 ≤ − 12Mi

∥∇(j)f(z)∥2 or equivalently ∇(j)f(z) = 0. Since this holds for any j ∈ I(z) ∩ Si,it follows that z satisfies ∇I(z)f(z) = 0. Using now Theorem 2.2 we obtain our statement.

For the three approximation versions given in Example 2.4 we obtain explicit expressions for the cor-responding u-strong local minimizers. In particular, for some M ∈ RN

++ and i ∈ [N ], if we considerthe previous separable quadratic approximation uqi (yi;x,Mi), then any strong local minimizer z ∈ Luq

satisfies the following relations:

(i) ∇I(z)f(z) = 0 and additionally

(ii)

{|∇(j)f(z)| ≤

√2λiMi, if z(j) = 0

|z(j)| ≥√

2λiMi, if z(j) = 0, ∀i ∈ [N ] and j ∈ Si.

The relations given in (ii) can be derived based on the separable structure of the approximationuqi (yi;x,Mi) and of the quasinorm ∥ · ∥0 using similar arguments as in Lemma 3.2 from [34]. Forcompleteness, we present the main steps in the derivation. First, it is clear that any z ∈ Luq satisfies:

z(j) = arg miny(j)∈R

∇(j)f(z)(y(j) − z(j)) +Mi

2|y(j) − z(j)|2 + λi∥y(j)∥0 (73)

40

Page 41: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

for all j ∈ Si and i ∈ [N ]. On the other hand since the optimum point in the previous optimizationproblems can be 0 or different from 0, we have:

miny(j)∈R

∇(j)f(z)(y(j) − z(j)) +Mi

2|y(j) − z(j)|2 + λi∥y(j)∥0

= min

{Mi

2|z(j) −

1

Mi∇(j)f(z)|2 −

1

2Mi|∇(j)f(z)|2, λi −

1

2Mi|∇(j)f(z)|2

}.

If z_(j) = 0, then from the fixed-point relation of problem (73) and the expression for its optimal value we have

    (M_i/2)|z_(j) − (1/M_i)∇_(j)f(z)|² − (1/(2M_i))|∇_(j)f(z)|² ≤ λ_i − (1/(2M_i))|∇_(j)f(z)|²,

and thus |∇_(j)f(z)| ≤ √(2λ_i M_i). Otherwise, we have j ∈ I(z), so that from Theorem 2.2 we get ∇_(j)f(z) = 0; combining this with

    (M_i/2)|z_(j) − (1/M_i)∇_(j)f(z)|² − (1/(2M_i))|∇_(j)f(z)|² ≥ λ_i − (1/(2M_i))|∇_(j)f(z)|²

leads to |z_(j)| ≥ √(2λ_i/M_i). Similar derivations can be given for the general quadratic approximations u^Q_i(y_i; x, H_i), provided that H_i is a diagonal matrix. For general matrices H_i, the corresponding strong local minimizers are fixed points of small ℓ0 regularized quadratic problems of dimension n_i.
Finally, for some β ∈ R^N_{++} and i ∈ [N], considering the exact approximation u^e_i(y_i; x, β_i), we obtain that any corresponding strong local minimizer z ∈ L_{u^e} satisfies:

    z_i = arg min_{h_i ∈ R^{n_i}} F(z + U_i h_i) + (β_i/2)∥h_i∥²   ∀ i ∈ [N].
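The above characterizations are easy to test numerically. Below is a minimal sketch for the scalar case n_i = 1 (so S_i = {i} and I(z) = supp(z)) that checks conditions (i)-(ii); grad_f is a user-supplied gradient oracle and the function name is ours, not the report's:

```python
import numpy as np

def is_uq_strong_local_min(z, grad_f, lam, M, tol=1e-10):
    """Check conditions (i)-(ii) for a u^q-strong local minimizer
    in the scalar case (n_i = 1 for all blocks)."""
    g = grad_f(z)
    on = np.abs(z) > tol                      # support I(z)
    cond_i = np.all(np.abs(g[on]) <= tol)     # (i): gradient zero on I(z)
    cond_off = np.all(np.abs(g[~on]) <= np.sqrt(2 * lam[~on] * M[~on]) + tol)
    cond_on = np.all(np.abs(z[on]) >= np.sqrt(2 * lam[on] / M[on]) - tol)
    return bool(cond_i and cond_off and cond_on)
```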

Theorem 2.7. Let Assumption 2.1 hold and let u^1, u^2 be two approximation functions satisfying Assumption 2.3. Additionally, let

    u^1_i(y_i; x) ≤ u^2_i(y_i; x)   ∀ y_i ∈ R^{n_i}, x ∈ R^n, i ∈ [N].

Then the following inclusions are valid:

    X* ⊆ L_{u^1} ⊆ L_{u^2} ⊆ T_f.

Proof. Assume z ∈ X*, i.e., z is a global minimizer of our original nonconvex problem (67). Then we have:

    F(z) ≤ min_{y_i ∈ R^{n_i}} F(z + U_i(y_i − z_i))
         = min_{y_i ∈ R^{n_i}} f(z + U_i(y_i − z_i)) + λ_i∥y_i∥_0 + Σ_{j≠i} λ_j∥z_j∥_0
         ≤ min_{y_i ∈ R^{n_i}} u^1_i(y_i; z) + ∥z + U_i(y_i − z_i)∥_{0,λ}   ∀ i ∈ [N],

and thus z ∈ L_{u^1}, i.e., we proved that X* ⊆ L_{u^1}. Therefore, any class of u-strong local minimizers contains the global minima of problem (67). Further, let us take z ∈ L_{u^1}. Using Definition 2.5 and defining

    t_i = arg min_{y_i ∈ R^{n_i}} u^2_i(y_i; z) + ∥z + U_i(y_i − z_i)∥_{0,λ},


we get:

    F(z) ≤ min_{y_i ∈ R^{n_i}} u^1_i(y_i; z) + ∥z + U_i(y_i − z_i)∥_{0,λ}
         ≤ u^1_i(t_i; z) + ∥z + U_i(t_i − z_i)∥_{0,λ}
         ≤ u^2_i(t_i; z) + ∥z + U_i(t_i − z_i)∥_{0,λ}
         = min_{y_i ∈ R^{n_i}} u^2_i(y_i; z) + ∥z + U_i(y_i − z_i)∥_{0,λ}.

This shows that z ∈ L_{u^2}, and thus L_{u^1} ⊆ L_{u^2}.

Note that if the following inequalities hold

    (L_i + β_i) I_{n_i} ≼ H_i ≼ M_i I_{n_i}   ∀ i ∈ [N],

then, using the Lipschitz gradient relation (68), we obtain

    u^e_i(y_i; x, β_i) ≤ u^Q_i(y_i; x, H_i) ≤ u^q_i(y_i; x, M_i)   ∀ x ∈ R^n, y_i ∈ R^{n_i}.

Therefore, from Theorem 2.7 we observe that u^q- (u^Q-)strong local minimizers for problem (67) are included in the class of all basic local minimizers T_f; thus, designing an algorithm which converges to a local minimum from L_{u^q} (L_{u^Q}) is of interest. Moreover, u^e-strong local minimizers for problem (67) are included in the class of all u^q- (u^Q-)strong local minimizers; thus, designing an algorithm which converges to a local minimum from L_{u^e} is also of interest. To illustrate the relationships between the previously defined classes of restricted local minima, and to see how close they are to the global minima of (67), let us consider an example.

Example 2.8. We consider the least squares setting f(x) = ∥Ax − b∥², where A ∈ R^{m×n} and b ∈ R^m satisfy:

    A = [ 1  α_1  ⋯  α_1^n ]
        [ 1  α_2  ⋯  α_2^n ]  + [ p I_4   O_{4,n−4} ],      b = q e,
        [ 1  α_3  ⋯  α_3^n ]
        [ 1  α_4  ⋯  α_4^n ]

with e ∈ R^4 the all-ones vector. We choose the following parameter values: α = [1  1.1  1.2  1.3]^T, n = 7, p = 3.3, q = 25, λ = 1 and β_i = 0.0001 for all i ∈ [n]. We further consider the scalar case, i.e., n_i = 1 for all i; in this case u^q_i = u^Q_i, i.e., the separable and general quadratic approximation versions coincide. The results are given in Table 4. Out of 128 possible local minima, we found 19 local minimizers in L_{u^q} given by u^q_i(y_i; x, L_f), and only 6 local minimizers in L_{u^q} given by u^q_i(y_i; x, L_i). Moreover, the class of u^e-strong local minima L_{u^e} given by u^e_i(y_i; x, β_i) contains only one vector, which is also the global optimum of problem (67); i.e., in this case L_{u^e} = X*. From Table 4 we clearly see that the newly introduced classes of local minimizers are much more restricted (in the sense of having a small number of elements, close to that of the set of global minimizers) than the much larger class of basic local minimizers.


Table 4: Strong local minima distribution on a least squares example.

    Class of local minima     T_f    L_{u^q} with u^q_i(y_i;x,L_f)    L_{u^q} with u^q_i(y_i;x,L_i)    L_{u^e} with u^e_i(y_i;x,β_i)
    Number of local minima    128    19                               6                                1

2.1.4 Random coordinate descent type methods

We now present a family of random block coordinate descent methods suitable for solving the class of problems (67). The algorithms we consider take a very general form, consisting in the minimization of a certain approximate version of the objective function over one block of variables at a time, while fixing the remaining blocks. Thus, these algorithms combine an iterative hard thresholding scheme with a general random coordinate descent method, and they are particularly suited for solving nonsmooth ℓ0 regularized problems since they solve an easy low-dimensional problem at each iteration, often in closed form. Our family of methods covers as particular cases the random block coordinate gradient descent and random proximal coordinate descent methods. Let x ∈ R^n and i ∈ [N]. We introduce the following thresholding map for a given approximation version u satisfying Assumption 2.3:

    T^u_i(x) = arg min_{y_i ∈ R^{n_i}} u_i(y_i; x) + λ_i∥y_i∥_0.

In order to find a local minimizer of problem (67), we introduce the family of random block coordinate descent iterative hard thresholding (RCD-IHT) methods, whose iteration is described as follows:

Algorithm 2.9 (RCD-IHT).

1. Choose x^0 ∈ R^n and an approximation version u satisfying Assumption 2.3. For k ≥ 0 do:

2. Choose a (block) coordinate i_k ∈ [N] with uniform probability.

3. Set x^{k+1}_{i_k} = T^u_{i_k}(x^k) and x^{k+1}_i = x^k_i for all i ≠ i_k.

Note that our algorithm depends directly on the choice of approximation u; the computation of the operator T^u_i(x) is in general easy, sometimes even in closed form. For example, when u_i(y_i; x) = u^q_i(y_i; x, M_i) and ∇_{i_k}f(x^k) is available, we can easily compute the closed-form solution of T^u_{i_k}(x^k) as in the iterative hard thresholding schemes [34]. Indeed, if we define ∆_i(x) ∈ R^{n_i} as follows:

    (∆_i(x))_(j) = (M_i/2)|x_(j) − (1/M_i)∇_(j)f(x)|²,   (74)

then the iteration of the (RCD-IHT) method becomes:

    x^{k+1}_(j) = x^k_(j) − (1/M_{i_k})∇_(j)f(x^k),   if (∆_{i_k}(x^k))_(j) ≥ λ_{i_k},
    x^{k+1}_(j) = 0,                                  if (∆_{i_k}(x^k))_(j) ≤ λ_{i_k},

for all j ∈ S_{i_k}.
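As an illustration, here is a minimal sketch of this block update (our Python illustration with a hypothetical block-gradient oracle; the report's implementation is in Matlab):

```python
import numpy as np

def rcd_iht_step(x, grad_block, S, M, lam, rng=None):
    """One (RCD-IHT) iteration with the separable quadratic approximation u^q.
    grad_block(x, i) must return the block gradient nabla_i f(x);
    S[i] holds the coordinate indices of block i; M and lam are per-block."""
    rng = rng or np.random.default_rng()
    ik = rng.integers(len(S))                    # uniform block choice
    j = S[ik]
    step = x[j] - grad_block(x, ik) / M[ik]      # gradient step on block ik
    delta = 0.5 * M[ik] * step**2                # (Delta_{ik}(x))_(j), cf. (74)
    x_new = x.copy()
    x_new[j] = np.where(delta >= lam[ik], step, 0.0)   # hard threshold
    return x_new
```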

Note that if at some iteration λ_{i_k} = 0, then the iteration of algorithm (RCD-IHT) is identical to the iteration of the usual random block coordinate gradient descent method [48, 51]. Further, our algorithm has, in this case, similarities with the iterative hard thresholding algorithm (IHTA) analyzed in [34]. For completeness, we also present algorithm (IHTA).


Algorithm 2.10 (IHTA) [34].

1. Choose M_f > L_f. For k ≥ 0 do:

2. x^{k+1} = arg min_{y ∈ R^n} f(x^k) + ⟨∇f(x^k), y − x^k⟩ + (M_f/2)∥y − x^k∥² + ∥y∥_{0,λ},

or equivalently, for each component we have the update:

    x^{k+1}_(j) = x^k_(j) − (1/M_f)∇_(j)f(x^k),   if (M_f/2)|x^k_(j) − (1/M_f)∇_(j)f(x^k)|² ≥ λ_i,
    x^{k+1}_(j) = 0,                              if (M_f/2)|x^k_(j) − (1/M_f)∇_(j)f(x^k)|² ≤ λ_i,

for all j ∈ S_i and i ∈ [N]. Note that the arithmetic complexity of computing the next iterate x^{k+1} in (RCD-IHT), once ∇_{i_k}f(x^k) is known, is of order O(n_{i_k}); for N ≫ 1 this is much lower than the O(n) per-iteration complexity of (IHTA), which additionally requires the computation of the full gradient ∇f(x^k). Similar derivations hold for the general quadratic approximations u^Q_i(y_i; x, H_i), provided that H_i is a diagonal matrix. For general matrices H_i, the corresponding algorithm requires solving small ℓ0 regularized quadratic problems of dimension n_i.
Finally, in the particular case when we consider the exact approximation u_i(y_i; x) = u^e_i(y_i; x, β_i), at each iteration of our algorithm we need to perform an exact minimization of the objective function f w.r.t. one randomly chosen (block) coordinate. If λ_{i_k} = 0, the iteration of algorithm (RCD-IHT) requires solving a small-dimensional subproblem with a strongly convex objective function, as in the classical proximal block coordinate descent method [23]. When λ_{i_k} > 0 and n_i > 1, this subproblem is nonconvex and usually hard to solve. However, for certain particular cases of the function f and n_i = 1 (i.e., the scalar case n = N), we can easily compute the solution of the small-dimensional subproblem in algorithm (RCD-IHT). Indeed, for x ∈ R^n let us define:

    v_i(x) = x + U_i h_i(x),   where h_i(x) = arg min_{h_i ∈ R} f(x + U_i h_i) + (β_i/2)∥h_i∥²,

    ∆_i(x) = f(x − U_i x_i) + (β_i/2)∥x_i∥² − f(v_i(x)) − (β_i/2)∥(v_i(x))_i − x_i∥²   ∀ i ∈ [n].   (75)

Then, it can be seen that the iteration of (RCD-IHT) in the scalar case with the exact approximation u^e_i(y_i; x, β_i) has the following form:

    x^{k+1}_{i_k} = (v_{i_k}(x^k))_{i_k},   if ∆_{i_k}(x^k) ≥ λ_{i_k},
    x^{k+1}_{i_k} = 0,                      if ∆_{i_k}(x^k) ≤ λ_{i_k}.

In general, if the function f satisfies Assumption 2.1, computing v_{i_k}(x^k) at each iteration of (RCD-IHT) requires the minimization of a one-dimensional smooth convex function, which can be performed efficiently using one-dimensional search algorithms. Let us analyze the least squares setting in order to highlight the simplicity of the iteration of algorithm (RCD-IHT) in the scalar case with the approximation u^e_i(y_i; x, β_i).

Example 2.11. Let A ∈ R^{m×n}, b ∈ R^m and f(x) = (1/2)∥Ax − b∥². In this case (recall that we consider n_i = 1 for all i) we have the following expression for ∆_i(x):

    ∆_i(x) = (1/2)∥r − A_i x_i∥² + (β_i/2)∥x_i∥² − (1/2)∥(I_m − A_iA_i^T/(∥A_i∥² + β_i)) r∥² − (β_i/2)∥A_i^T r/(∥A_i∥² + β_i)∥²,

where r = Ax − b. Under these circumstances, the iteration of (RCD-IHT) has the following closed-form expression:

    x^{k+1}_{i_k} = x^k_{i_k} − A_{i_k}^T r^k/(∥A_{i_k}∥² + β_{i_k}),   if ∆_{i_k}(x^k) ≥ λ_{i_k},
    x^{k+1}_{i_k} = 0,                                                 if ∆_{i_k}(x^k) ≤ λ_{i_k}.   (76)

In the sequel we use the following notation for the entire history of index choices, the expected value of the objective function w.r.t. the entire history, and the support of the sequence x^k:

    ξ^k = {i_0, . . . , i_{k−1}},   f^k = E[f(x^k)],   I^k = I(x^k).

Due to the randomness of algorithm (RCD-IHT), at any iteration k with λ_{i_k} > 0, the sequence I^k changes if one of the following situations holds for some j ∈ S_{i_k}:

(i) x^k_(j) = 0 and (T^u_{i_k}(x^k))_(j) ≠ 0;

(ii) x^k_(j) ≠ 0 and (T^u_{i_k}(x^k))_(j) = 0.

In other terms, at a given iteration k with λ_{i_k} > 0, we expect no change in the sequence I^k of algorithm (RCD-IHT) if there is no index j ∈ S_{i_k} satisfying one of the relations (i) and (ii) above. We define the notion of change of I^k in expectation at iteration k for algorithm (RCD-IHT) as follows: let x^k be the sequence generated by (RCD-IHT); then the sequence I^k = I(x^k) changes in expectation if the following situation occurs:

    E[ |I^{k+1} \ I^k| + |I^k \ I^{k+1}| | x^k ] > 0,   (77)

which implies (recall that we consider uniform probabilities for the index selection):

    P( |I^{k+1} \ I^k| + |I^k \ I^{k+1}| > 0 | x^k ) ≥ 1/N.

Next, we show that algorithm (RCD-IHT) generates only a finite number of changes of I^k in expectation, and then we prove global convergence of this algorithm; in particular, we show that the limit points of the generated sequence are strong local minima from the class L_u.

2.1.5 Global convergence analysis

In this section we analyze the descent properties of the previously introduced family of coordinate descent algorithms under Assumptions 2.1 and 2.3. Based on these properties, we establish the nature of the limit points of the sequence generated by algorithm (RCD-IHT). In particular, we derive that any accumulation point of this sequence is almost surely a local minimum belonging to the class L_u. Note that the classical results for iterative algorithms applied to general nonconvex problems state global convergence to stationary points, while for the ℓ0 regularized nonconvex and NP-hard problem (67) we show that our family of algorithms generates sequences converging to strong local minima. In order to prove almost sure convergence results for our family of algorithms, we use the following supermartingale convergence lemma of Robbins and Siegmund (see e.g. [56]):


Lemma 2.12. Let v_k, u_k and α_k be three sequences of nonnegative random variables satisfying the following conditions:

    E[v_{k+1} | F_k] ≤ (1 + α_k)v_k − u_k   ∀ k ≥ 0 a.s.,   and   Σ_{k=0}^∞ α_k < ∞ a.s.,

where F_k denotes the collection v_0, . . . , v_k, u_0, . . . , u_k, α_0, . . . , α_k. Then we have lim_{k→∞} v_k = v for a random variable v ≥ 0 a.s., and Σ_{k=0}^∞ u_k < ∞ a.s.

Further, we analyze the convergence properties of algorithm (RCD-IHT). First, we derive a descent inequality for this algorithm.

Lemma 2.13. Let x^k be the sequence generated by algorithm (RCD-IHT). Under Assumptions 2.1 and 2.3 the following descent inequality holds:

    E[F(x^{k+1}) | x^k] ≤ F(x^k) − E[ (µ_{i_k}/2)∥x^{k+1} − x^k∥² | x^k ].   (78)

Proof. From Assumption 2.3 we have:

    F(x^{k+1}) + (µ_{i_k}/2)∥x^{k+1}_{i_k} − x^k_{i_k}∥² ≤ u_{i_k}(x^{k+1}_{i_k}; x^k) + ∥x^{k+1}∥_{0,λ}
        ≤ u_{i_k}(x^k_{i_k}; x^k) + ∥x^k∥_{0,λ}
        ≤ f(x^k) + ∥x^k∥_{0,λ} = F(x^k).

In conclusion, our family of algorithms belongs to the class of descent methods:

    F(x^{k+1}) ≤ F(x^k) − (µ_{i_k}/2)∥x^{k+1}_{i_k} − x^k_{i_k}∥².   (79)

Taking expectation w.r.t. i_k we get our descent inequality.

We now prove the global convergence of the sequence generated by algorithm (RCD-IHT) to local minima belonging to the restricted set of local minimizers L_u.

Theorem 2.14. Let x^k be the sequence generated by algorithm (RCD-IHT). Under Assumptions 2.1 and 2.3 the following statements hold:

(i) There exists a scalar F̄ such that:

    lim_{k→∞} F(x^k) = F̄ a.s.   and   lim_{k→∞} ∥x^{k+1} − x^k∥ = 0 a.s.

(ii) At each change of the sequence I^k in expectation we have:

    E[ (µ_{i_k}/2)∥x^{k+1} − x^k∥² | x^k ] ≥ δ,

where δ = (1/N) min{ min_{i∈[N]: λ_i>0} µ_iλ_i/M_i,  min_{i∈[N], j∈S_i∩supp(x^0)} (µ_i/2)|x^0_(j)|² } > 0.

(iii) The sequence I^k changes a finite number of times as k → ∞ almost surely. The sequence ∥x^k∥_0 converges to some ∥x*∥_0 almost surely. Furthermore, any limit point of the sequence x^k belongs to the class of strong local minimizers L_u almost surely.


Proof. (i) From the descent inequality of Lemma 2.13 and from Lemma 2.12, there exists a scalar F̄ such that lim_{k→∞} F(x^k) = F̄ almost surely. Consequently, we also have lim_{k→∞} F(x^k) − F(x^{k+1}) = 0 almost surely; since our method is of descent type, from (79) we get (µ_{i_k}/2)∥x^{k+1} − x^k∥² ≤ F(x^k) − F(x^{k+1}), which leads to lim_{k→∞} ∥x^{k+1} − x^k∥ = 0 almost surely.

(ii) For simplicity of notation we denote x^+ = x^{k+1}, x = x^k and i = i_k. First, we show that any nonzero component of the sequence generated by (RCD-IHT) is bounded below by a positive constant. Let x ∈ R^n and i ∈ [N]. For any j ∈ supp(T^u_i(x)), the j-th component of the minimizer T^u_i(x) of the function u_i(y_i; x) + λ_i∥y_i∥_0 is denoted (T^u_i(x))_(j). Let us define y^+ = x + U_i(T^u_i(x) − x_i). Then, for any j ∈ supp(T^u_i(x)) the following optimality condition holds:

    ∇_(j)u_i(y^+_i; x) = 0.   (80)

On the other hand, given j ∈ supp(T^u_i(x)), from the definition of T^u_i(x) we get:

    u_i(y^+_i; x) + λ_i∥y^+_i∥_0 ≤ u_i(y^+_i − U_(j)y^+_(j); x) + λ_i∥y^+_i − U_(j)y^+_(j)∥_0.

Subtracting λ_i∥y^+_i − U_(j)y^+_(j)∥_0 from both sides leads to:

    u_i(y^+_i; x) + λ_i ≤ u_i(y^+_i − U_(j)y^+_(j); x).   (81)

Further, applying the Lipschitz gradient relation from Assumption 2.3 (iii) to the right-hand side and using the optimality condition (80) for the unconstrained problem solved at each iteration, we get:

    u_i(y^+_i − U_(j)y^+_(j); x) ≤ u_i(y^+_i; x) − ⟨∇_(j)u_i(y^+_i; x), y^+_(j)⟩ + (M_i/2)|y^+_(j)|²
                                 =(80) u_i(y^+_i; x) + (M_i/2)|y^+_(j)|².

Combining this with the left-hand side of (81), we get:

    |(T^u_i(x))_(j)|² ≥ 2λ_i/M_i   ∀ j ∈ supp(T^u_i(x)).   (82)

Setting x = x^k for k ≥ 0, it can be easily seen that, for any j ∈ supp(x^k_i) and i ∈ [N], we have:

    |x^k_(j)|² ≥ 2λ_i/M_i,    if x^k_(j) ≠ 0 and i ∈ ξ^k,
    |x^k_(j)|² = |x^0_(j)|²,  if x^k_(j) ≠ 0 and i ∉ ξ^k.

Further, assume that at some iteration k > 0 a change of the sequence I^k in expectation occurs. Thus, there is an index j ∈ [n] (and a block i containing j) such that either (x^k_(j) = 0 and (T^u_i(x^k))_(j) ≠ 0) or (x^k_(j) ≠ 0 and (T^u_i(x^k))_(j) = 0). Analyzing these cases, we have:

    ∥T^u_i(x^k) − x^k_i∥² ≥ |(T^u_i(x^k))_(j) − x^k_(j)|²
        ≥ 2λ_i/M_i,    if x^k_(j) = 0,
        ≥ 2λ_i/M_i,    if x^k_(j) ≠ 0 and i ∈ ξ^k,
        = |x^0_(j)|²,  if x^k_(j) ≠ 0 and i ∉ ξ^k.


Observing that under uniform probabilities we have:

    E[ (µ_{i_k}/2)∥x^{k+1} − x^k∥² | x^k ] = (1/N) Σ_{i=1}^N (µ_i/2)∥T^u_i(x^k) − x^k_i∥²,

we conclude that at each change of the sequence I^k in expectation:

    E[ (µ_{i_k}/2)∥x^{k+1} − x^k∥² | x^k ] ≥ (1/N) min{ min_{i∈[N]: λ_i>0} µ_iλ_i/M_i,  min_{i∈[N], j∈S_i∩supp(x^0)} (µ_i/2)|x^0_(j)|² }.

(iii) From lim_{k→∞} ∥x^{k+1} − x^k∥ = 0 a.s. we have lim_{k→∞} E[∥x^{k+1} − x^k∥ | x^k] = 0 a.s. On the other hand, from part (ii) we have that if the sequence I^k changes in expectation, then E[∥x^{k+1} − x^k∥² | x^k] ≥ δ > 0. These facts imply that there is only a finite number of changes in expectation of the sequence I^k, i.e., there exists K > 0 such that for any k > K we have I^k = I^{k+1}.

Further, since the sequence I^k is constant for k > K, we have I^k = I* and ∥x^k∥_{0,λ} = ∥x*∥_{0,λ} for any vector x* satisfying I(x*) = I*. Also, for k > K algorithm (RCD-IHT) is equivalent to the classical random coordinate descent method [23] and thus shares its convergence properties; in particular, any limit point of the sequence x^k is a minimizer over the coordinates I* for min_{x ∈ S_{I*}} f(x). Therefore, if the sequence I^k is fixed, then for any k > K and i_k ∈ I^k:

    u_{i_k}(x^{k+1}_{i_k}; x^k) + ∥x^{k+1}∥_{0,λ} ≤ u_{i_k}(y_{i_k}; x^k) + ∥x^k + U_{i_k}(y_{i_k} − x^k_{i_k})∥_{0,λ}   ∀ y_{i_k} ∈ R^{n_{i_k}}.   (83)

On the other hand, denoting by x* an accumulation point of x^k, taking the limit in (83) and using that ∥x^k∥_{0,λ} = ∥x*∥_{0,λ} as k → ∞, we obtain the following relation:

    F(x*) ≤ min_{y_i ∈ R^{n_i}} u_i(y_i; x*) + ∥x* + U_i(y_i − x*_i)∥_{0,λ}   a.s.

for all i ∈ [N], and thus x* is a minimizer of the right-hand side expression. Using the definition of the set L_u, we conclude that any limit point x* of the sequence x^k belongs to this set, which proves our statement.

It is important to note that classical results for iterative algorithms applied to nonconvex problems usually state global convergence to stationary points, while for our algorithms we were able to prove global convergence to local minima of the nonconvex and NP-hard problem (67). Moreover, if λ_i = 0 for all i ∈ [N], then the optimization problem (67) becomes convex, and our convergence results also cover this setting.

2.1.6 Random data experiments on sparse learning

In this section we analyze the practical performance of our family of algorithms (RCD-IHT) and compare it with that of algorithm (IHTA) [34]. We perform several numerical tests on sparse learning problems with randomly generated data. All algorithms were implemented in Matlab, and the numerical simulations were performed on a PC with an Intel Xeon E5410 CPU and 8 GB of RAM. Sparse learning represents a collection of learning methods which seek a tradeoff between some goodness-of-fit measure and the sparsity of the result, the latter property allowing better interpretability. One of the models widely used in machine learning and statistics is the linear model (least squares setting). Thus, in the first set of tests we consider the sparse linear formulation:


    min_{x ∈ R^n} F(x)  ( = (1/2)∥Ax − b∥² + λ∥x∥_0 ),

where A ∈ R^{m×n} and λ > 0. We analyze the practical efficiency of our algorithms in terms of the probability of reaching a global optimum. Due to the difficulty of finding the global solution of this problem, we consider a small model with m = 6 and n = 12. For each penalty parameter λ, ranging from small values (0.01) to large values (2), we ran the family of algorithms (RCD-IHT) with the separable quadratic approximation (denoted (RCD-IHT-uq)) and with the exact approximation (denoted (RCD-IHT-ue)), as well as (IHTA) [34], from 100 randomly generated (with random support) initial vectors. The number of runs out of 100 in which each method found the global optimum is given in Table 5. We observe that for all values of λ our algorithms (RCD-IHT-uq) and (RCD-IHT-ue) identify the global optimum with a success rate superior to algorithm (IHTA), and for extreme values of λ our algorithms perform much better than (IHTA).

Table 5: Number of runs out of 100 in which algorithms (IHTA), (RCD-IHT-uq) and (RCD-IHT-ue) found the global optimum.

    λ       (IHTA)   (RCD-IHT-uq)   (RCD-IHT-ue)
    0.01      95          96            100
    0.07      92          92            100
    0.09      43          51             70
    0.15      41          47             66
    0.35      24          28             31
    0.8       36          43             44
    1.2       29          29             54
    1.8       76          81             91
    2         79          86             97

Table 6: Performance of algorithms (IHTA), (RCD-IHT-uq) and (RCD-IHT-ue).

                 (IHTA)                     (RCD-IHT-uq)                (RCD-IHT-ue)
    m\n          F*       ∥x*∥0   iter      F*       ∥x*∥0   full-iter  F*        ∥x*∥0   full-iter
    20\100       1.56     23      797       1.39     21      602        -0.67     15      12
    50\100       -95.88   31      4847      -95.85   31      4046       -449.99   89      12
    30\200       -14.11   35      2349      -14.30   33      1429       -92.95    139     12
    50\200       -0.88    26      3115      -0.98    25      2494       -13.28    83      19
    70\300       -12.07   70      5849      -11.94   71      5296       -80.90    186     19
    70\500       -20.60   157     6017      -19.95   163     5642       -69.10    250     16
    100\500      -0.55    16      4898      -0.52    16      5869       -47.12    233     14
    80\1000      13.01    197     9516      13.71    229     7073       -0.56     19      13
    80\1500      5.86     75      7825      6.06     77      7372       -0.22     24      14
    150\2000     26.43    418     21353     25.71    509     20093      -30.59    398     16
    150\2500     26.52    672     15000     27.09    767     15000      -55.26    603     17


In the second set of experiments we consider the ℓ2 regularized logistic loss model from machine learning [5]. In this model, the relation between the data, represented by a random vector a ∈ R^n, and its associated label, represented by a random binary variable y ∈ {0, 1}, is determined by the conditional probability:

    P{y | a; x} = e^{y⟨a,x⟩} / (1 + e^{⟨a,x⟩}),

where x denotes a parameter vector. Then, for a set of m independently drawn data samples {(a_i, y_i)}_{i=1}^m, the joint likelihood can be written as a function of x. To find the maximum likelihood estimate, one should maximize the likelihood function, or equivalently minimize the negative log-likelihood (the logistic loss):

    min_{x ∈ R^n} (1/m) Σ_{i=1}^m [ log(1 + e^{⟨a_i,x⟩}) − y_i⟨a_i, x⟩ ].

Under the assumption that n ≤ m and that A = [a_1, . . . , a_m] ∈ R^{n×m} has full rank, it is well known that f(·) is strictly convex. However, there are important applications (e.g., feature selection) where these assumptions are not satisfied and the problem is highly ill-posed. To compensate for this drawback, the logistic loss is regularized by some penalty term (e.g., the ℓ2 norm ∥x∥²_2, see [5]). The penalty term implicitly bounds the length of the minimizer, but it does not promote sparse solutions. Therefore, it is desirable to impose an additional sparsity regularizer, such as the ℓ0 quasinorm. In conclusion, the problem to be minimized is:

    min_{x ∈ R^n} F(x)  ( = (1/m) Σ_{i=1}^m [ log(1 + e^{⟨a_i,x⟩}) − y_i⟨a_i, x⟩ ] + (ν/2)∥x∥² + ∥x∥_{0,λ} ),

where now f is strongly convex with parameter ν. For the simulations, data were generated uniformly at random and we fixed the parameters ν = 0.5 and λ = 0.2. Once an instance of random data was generated, we ran our algorithms (RCD-IHT-uq) and (RCD-IHT-ue) and algorithm (IHTA) [34] 10 times, starting from 10 different initial points. We report in Table 6 the best results of each algorithm over all 10 trials, in terms of the best attained function value, with the associated sparsity and number of iterations. In order to report relevant information, we measured the performance of the coordinate descent methods (RCD-IHT-uq) and (RCD-IHT-ue) in full iterations, obtained by dividing the total number of iterations by the dimension n. The column F* denotes the final function value attained by the algorithms, ∥x*∥_0 represents the sparsity of the last generated point, and iter (full-iter) represents the number of iterations (full iterations). Note that our algorithms (RCD-IHT-uq) and (RCD-IHT-ue) show superior performance in comparison with algorithm (IHTA) on the reported instances. We observe that algorithm (RCD-IHT-ue) performs very few full iterations while attaining the best function value among all three algorithms. Moreover, the number of full iterations performed by algorithm (RCD-IHT-ue) scales very well with the dimension of the problem.
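For concreteness, here is a minimal sketch of the smooth part f of this objective and its gradient, as required by the thresholding operators above (our Python illustration; the report's experiments used Matlab):

```python
import numpy as np

def logistic_l2(x, A, y, nu):
    """Smooth part of the l0-regularized logistic problem:
    f(x) = (1/m) sum_i [log(1 + exp(<a_i,x>)) - y_i <a_i,x>] + (nu/2)||x||^2.
    A is m-by-n with rows a_i; y has entries in {0, 1}."""
    m = A.shape[0]
    z = A @ x
    f = np.mean(np.logaddexp(0.0, z) - y * z) + 0.5 * nu * (x @ x)
    sigma = 1.0 / (1.0 + np.exp(-z))          # sigmoid of <a_i, x>
    grad = A.T @ (sigma - y) / m + nu * x     # strongly convex with modulus nu
    return f, grad
```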

2.1.7 Sparse packetized predictive control

Packetized predictive control (PPC) is an efficient control strategy recently applied to networked systems with unreliable and rate-limited communication links, such as wireless networks and the internet [39]. In (PPC), the controller output is obtained by minimizing a finite-horizon cost function on-line, in a receding horizon manner. Each control packet contains a sequence of tentative plant inputs for a finite horizon of future time instants and is transmitted through a communication channel. Packets


which are successfully received at the plant actuator side are stored in a buffer, to be used whenever later packets are dropped. When there are no packet dropouts, (PPC) reduces to model predictive control. For (PPC) to give desirable closed-loop properties, the more unreliable the network is, the larger the horizon length N (and thus the number of tentative plant input values contained in each packet) needs to be chosen. In principle this would require increasing the network bandwidth (i.e., its bit-rate). In order to avoid this drawback, a sparse optimization problem is addressed. Consider the following unconstrained discrete-time linear time-invariant plant model with a scalar input:

    x(t + 1) = Ax(t) + Bu(t),   t ≥ 0,

where x(t) ∈ R^{n_x} and u(t) ∈ R. As in [39], we model the rate-limited architecture through a dropout sequence {d(t)}_{t≥0}, where d(t) = 1 if a packet dropout occurs and d(t) = 0 otherwise. At each time instant t, the controller calculates and sends a packet u(x(t)) = [u_0(x(t)) . . . u_{N−1}(x(t))]^T ∈ R^N to the plant input node. In order to be robust against packet dropouts, we use the buffering strategy from [39]. Typically, without sparsity constraints, the controller computes the command by solving the ℓ2 formulation:

    u(x) = arg min_{u ∈ R^N} ∥Gu − Hx∥² + r∥u∥²_2,   (84)

where x ∈ R^{n_x} is the present system state, r ≥ 0, and the matrices G, H are defined by the dynamics of the plant and the stage costs [39]. Furthermore, in the sparse setting (rate-limited channels), the authors of [39] provide two formulations of the sparse control problem. First, the ℓ1 regularization: given the system state x ∈ R^{n_x}, compute the control input

    u(x) = arg min_{u ∈ R^N} ∥Gu − Hx∥² + µ∥u∥_1.   (85)

Second, the ℓ0 approach requires solving

    u(x) = arg min_{u ∈ R^N} ∥u∥_0   s.t. ∥Gu − Hx∥²_2 ≤ x^T W x,

where W ≻ 0 is a weight matrix. Then, we reformulate the ℓ0 problem in Lagrangian form:

    u(x) = arg min_{u ∈ R^N} ∥Gu − Hx∥² + λ∥u∥_0.   (86)
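For illustration, a minimal sketch of the two control laws compared below, assuming G and H are given (the ℓ2 input is the exact closed form of (84); for (86) we sketch plain iterative hard thresholding, whereas the report's experiments use the (RPAM-IHT) algorithm):

```python
import numpy as np

def l2_control(G, H, x, r):
    """Exact closed form of the l2 formulation (84):
    u = (G'G + r I)^{-1} G' H x."""
    n = G.shape[1]
    return np.linalg.solve(G.T @ G + r * np.eye(n), G.T @ (H @ x))

def l0_control_iht(G, H, x, lam, iters=200):
    """Plain iterative hard thresholding on the l0 Lagrangian form (86),
    with f(u) = ||Gu - Hx||^2 (gradient 2 G'(Gu - Hx)).  A sketch only."""
    c = H @ x
    Mf = 2.02 * np.linalg.norm(G, 2) ** 2       # M_f > L_f = 2 ||G||^2
    u = np.zeros(G.shape[1])
    for _ in range(iters):
        step = u - 2.0 * (G.T @ (G @ u - c)) / Mf        # gradient step
        u = np.where(0.5 * Mf * step ** 2 >= lam, step, 0.0)  # threshold
    return u
```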

For the simulations we consider the same matrices (A, B) as in [39] and N = 10. We compare the performance of algorithm (RPAM-IHT) on problem (86) with the performance of algorithms solving the convex relaxations (85) and (84), in terms of the convergence of the closed-loop state to the equilibrium and the evolution of the support of the control input u(x). We run each algorithm 500 times with randomly generated packet dropouts and random initial vectors x(0), and we plot the averaged trajectories of the closed-loop state and the averaged sparsity of the control input u(x). We consider the same penalty values as in [39]: for the ℓ1 control (see (85)) we take µ = 3.3 (labeled ℓ1(i)) and µ = 10.7167 (labeled ℓ1(ii)); for the ℓ2 control (see (84)) we consider r = 0 (labeled ideal) and r = 4.1 (labeled ideal-r). For the ℓ0 control (see (86)), the penalty λ is chosen empirically at each simulation step according to the typical bound for the least squares setting: λ = ∥Hx∥²/constant. Note that the ℓ2 control inputs can be computed in closed form, while for the ℓ1 regularized formulation we used the optimization software CVX. From the top of Fig. 5 we observe that the ℓ0 Lagrangian controller given by algorithm (RPAM-IHT) has stabilizing properties similar to the ideal ℓ2 controller, while the ℓ1 controller failed to stabilize the system. Moreover, the plot at the bottom of Fig. 5 shows that the ℓ1 and ℓ0 control problems induce similar sparsity levels in the control input vectors u(x).


Figure 5: Averaged 2-norm of the closed-loop state ∥x(t)∥_2 (top) and averaged sparsity ∥u(t)∥_0 of the MPC control vector (bottom).

[Plots omitted: time axis t = 10, . . . , 60; curves labeled RPAM(-IHT), ideal, ideal-r, ℓ1(i), ℓ1(ii).]

2.2 Inexact projection primal first order methods for convex minimization: application to embedded MPC

Optimization is one of the main computational paradigms in systems and control theory [44, 65, 67]. Many systems and control tools require the solution of an optimization problem, so that the main computational burden in these strategies reduces to solving a constrained minimization, see e.g. [17, 44, 65, 67]. In particular, model predictive control (MPC) is an advanced optimization-based control methodology designed for handling linear and nonlinear systems, which has received much attention from the optimization community in the last decade [65]. Moreover, embedded MPC applications ask for appropriate hardware with sufficient computing power and memory, and for simple optimization schemes which keep the computational effort to a minimum, see [18, 26]. It is well known that primal first order algorithms achieve sublinear (linear) convergence for smooth convex (smooth strongly convex) constrained minimization. However, these methods encounter numerical difficulties when the primal feasible set is complicated, since they require an exact projection onto this set. Algorithmic alternatives for convex problems with complicated feasible sets are the dual first order methods. Dual methods easily handle complicated constraints, but they have convergence difficulties when the norm of the optimal Lagrange multiplier is large, since this norm appears linearly in their convergence estimates. Moreover, they typically have sublinear convergence rates in an average primal sequence, even when the primal problem has a smooth and strongly convex objective function. Motivated by these issues, in this work we analyze the convergence of primal first order methods with inexact projections for solving constrained convex problems with smooth, and then strongly convex, objective functions. In particular, we consider inexact variants of the Projected Gradient and Projected Fast Gradient methods, where instead of an exact projection onto the complicated primal feasible set, an approximate projection, not necessarily feasible, is used. We prove that we can still achieve convergence rates similar to those in the exact projection setting, provided that the approximate projection is sufficiently accurate. Our convergence analysis allows us to derive explicitly the accuracy of the inexact projection and


the number of iterations needed to obtain an approximate solution of our convex problem. Finally, practical performance on random QPs shows encouraging results.

2.2.1 Motivation: MPC problem for linear systems

In this section we present our problem of interest, the model predictive control (MPC) problem, and we derive sparse and condensed optimization formulations of it. We consider discrete-time systems defined by the following linear difference equation:

    z(t + 1) = Az(t) + Bv(t),   (87)

where z(t) ∈ R^{n_z} and v(t) ∈ R^{n_v} represent the state and the input of the system at time t, respectively. We also impose state and input constraints z(t) ∈ Z and v(t) ∈ V for all t ≥ 0. We assume that both sets Z and V are polyhedra, i.e., they are described by linear constraints. For the system (87) we also consider a general smooth convex stage cost ℓ_t(z(t), v(t)), usually of quadratic form. For a prediction horizon of length N, the conventional MPC problem for system (87), with initial state z ∈ R^{n_z}, state

trajectory z = [z(1)^T ⋯ z(N)^T]^T ∈ R^{Nn_z} and input trajectory v = [v(0)^T ⋯ v(N − 1)^T]^T ∈ R^{Nn_v}, can be formulated as [65]:

    V*(z) = min_{(v,z)} V_N(z, v)  ( = Σ_{t=1}^N ℓ_t(z(t), v(t)) )
      s.t.: z(t) ∈ Z, v(t) ∈ V,
            z(t + 1) = Az(t) + Bv(t)   ∀ t = 1, . . . , N − 1.

In order to present compact reformulations of the MPC problem as optimization problems, we define the stacked dynamics matrices A and B (so that z collects the predicted states z(1), . . . , z(N)) as:

    A = [ A   ]        B = [ B          0          ⋯   0 ]
        [ A²  ]            [ AB         B          ⋯   0 ]
        [ ⋮   ],           [ ⋮          ⋮          ⋱   ⋮ ].
        [ A^N ]            [ A^{N−1}B   A^{N−2}B   ⋯   B ]

Using these notations, we can express the dynamics of the system over the prediction horizon as:

    z = Az + Bv.
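A minimal sketch of this standard construction (our illustration; the function and variable names are ours):

```python
import numpy as np

def prediction_matrices(A, B, N):
    """Build stacked matrices A_cal, B_cal with z = A_cal z0 + B_cal v,
    where z = [z(1); ...; z(N)] and v = [v(0); ...; v(N-1)]."""
    nz, nv = B.shape
    A_cal = np.vstack([np.linalg.matrix_power(A, t) for t in range(1, N + 1)])
    B_cal = np.zeros((N * nz, N * nv))
    for t in range(1, N + 1):                # block row of z(t)
        for s in range(t):                   # contribution of input v(s)
            blk = np.linalg.matrix_power(A, t - 1 - s) @ B
            B_cal[(t - 1) * nz:t * nz, s * nv:(s + 1) * nv] = blk
    return A_cal, B_cal
```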

In the sparse formulation of linear MPC we consider both the states z and the inputs v over the prediction horizon as decision variables, and in this case we obtain the following optimization problem:

    V*(z) = min_{(v,z)} V_N(z, v)  ( = Σ_{t=1}^N ℓ_t(z(t), v(t)) )
      s.t.: Cz ≤ c, Dv ≤ d,   (88)
            z − Bv = Az,


where the linear inequality constraints come from the state and input constraints of the system. Note that in this formulation all the matrices are sparse and structured. However, in MPC one sometimes also uses a condensed form of the stage problem (88), obtained by eliminating the state vector using the dynamics z = Az + Bv. Therefore, after eliminating the states z, problem (88) can be rewritten in the following equivalent, simpler form:

    V*(z) = min_v V_N(Az + Bv, v)   (89)
      s.t.: Cv ≤ c,

where some components of c depend on the initial state z. When the stage costs ℓ_t(·) are quadratic, both MPC problems (88) and (89) are QPs. Moreover, we notice that both formulations (88) and (89) are particular cases of the general convex optimization model:

    f* = min_x f(x)   (90)
      s.t.: x ∈ X ( = ∩_{i=1}^p X_i ),

where the objective function is smooth (i.e., it has a Lipschitz continuous gradient) and convex, and the feasible set X is described as the intersection of a finite number of convex sets X_i. Moreover, each set X_i is simple, i.e., the projection onto it can be computed efficiently (e.g., boxes, halfspaces, hyperplanes). In MPC, at each time instant, given the initial state z, we need to solve approximately one of the optimization problems (88) or (89). Moreover, in the context of fast embedded systems we need to provide efficient numerical optimization algorithms for finding an approximate solution, with computational complexity certification and low memory requirements. In the next sections we show that first order methods can satisfy these requirements.

2.2.2 General problem formulation

We consider the following general constrained convex optimization problem, which contains the above MPC problem as a particular case:

    f* = min_{x ∈ R^n} f(x)   (91)
      s.t.: x ∈ X,

where the objective function f : R^n → R is convex and the feasible set X is a general closed convex set. We denote by X* the set of optimal solutions of problem (91). The main goal in this work is to compute an approximate optimal point of the optimization problem (91) for a given accuracy ϵ > 0. For this, we introduce the following definition:

Definition 2.15. The point x is an ϵ-optimal point of optimization problem (91) if it satisfies:

    |f(x) − f*| ≤ ϵ   and   dist_X(x) ≤ ϵ.

We make the following assumptions on the objective function f:

Assumption 2.16. The objective function f has Lipschitz continuous gradient with constant L_f, that is, there exists L_f > 0 such that:

    ∥∇f(x) − ∇f(y)∥ ≤ L_f∥x − y∥   ∀ x, y ∈ R^n.


Assumption 2.17. The objective function f is strongly convex with constant σ_f, that is, there exists σ_f > 0 such that:

    ⟨∇f(x) − ∇f(y), x − y⟩ ≥ σ_f∥x − y∥²   ∀ x, y ∈ R^n.

A useful consequence of Assumptions 2.16 and 2.17 is the following inequality, see e.g. [53]:

    ⟨∇f(x) − ∇f(y), x − y⟩ ≥ (1/(L_f + σ_f))∥∇f(x) − ∇f(y)∥² + (L_fσ_f/(L_f + σ_f))∥x − y∥²   (92)

for all x, y ∈ R^n. Notice that this inequality is also valid in the smooth case, i.e., when only Assumption 2.16 holds, by taking σ_f = 0 in the right-hand side of (92). There are classical results stating that, under Assumption 2.16 (or the additional Assumption 2.17), exact projection primal first order methods (such as Projected Gradient and Fast Gradient) converge sublinearly (respectively linearly) to the solution of the convex problem (91), see e.g. [53]. However, these primal algorithms require an exact projection onto the feasible set X, which in many cases is complicated, so that the projection may be too difficult to compute exactly. Therefore, previous work on this class of optimization problems with a complicated primal feasible set focused on dual first order methods, since they avoid the issue of exact projection, but they usually converge only sublinearly to the primal solution. For completeness, we briefly review these dual results.

2.2.3 Dual approach in convex constrained optimization

Let us now recall the iteration complexities of the dual gradient and fast gradient algorithms for solving the convex problem (91). In this section we assume that the feasible set is described by functional constraints, that is, X = {x ∈ R^n : h(x) ≤ 0}, where h : R^n → R^p is a convex function. We also assume that there exists C_h < ∞ such that ∥∇h(x)∥ ≤ C_h for all x ∈ R^n, and that the Slater condition holds for (91). We denote the dual problem associated to (91) by:

    max_{u ∈ R^p_+} d(u)  ( = min_{x ∈ R^n} L(x, u) ),   (93)

where the Lagrangian function is given by:

    L(x, u) = f(x) + Σ_{i=1}^p u_i h_i(x).

We denote the dual optimal set by U* = arg max_{u ∈ R^p_+} d(u). Note that the Slater condition guarantees that strong duality holds for (91). When the objective function f is only convex, that is, Assumption 2.17 does not hold, the dual function is not differentiable. In this case, dual first order methods need to be combined with other techniques in order to derive convergence estimates: for example, dual first order methods are combined with augmented Lagrangian smoothing [32], with proximal-type smoothing [40], or with nonsmooth analysis tools [50]. However, if the objective function is strongly convex, that is, Assumption 2.17 holds, then defining x(u) = arg min_{x ∈ R^n} L(x, u) we get that the dual function d is differentiable and its gradient is given by [64]:

    ∇d(u) = h(x(u))   ∀ u ∈ R^p_+.


Moreover, the dual function d has Lipschitz continuous gradient [41, 43]:

    ∥∇d(u) − ∇d(v)∥ ≤ (C²_h/σ_f)∥u − v∥   ∀ u, v ∈ R^p_+,   (94)

i.e., the Lipschitz constant is L_d = C²_h/σ_f. Several iteration complexity results for dual first order algorithms have been derived in the literature, see e.g. [41, 43] and the references therein. The main first order methods used for solving the dual problem (93) are the Dual Gradient and Dual Fast Gradient methods. In both methods, each iteration requires inner algorithms suited to optimization problems with a smooth and strongly convex objective function and a simple feasible set, see e.g. [53]. When applied to the inner Lagrangian problem, the fast gradient method converges linearly, provided that the strong convexity Assumption 2.17 holds. Note that a complete analysis of the overall iteration complexity of dual first order methods for solving problem (91), including the inner computational complexity, has been given e.g. in [41, 43]. To synthesize the most relevant results from these works in a simple way, let the initial dual point be u^0 = 0. Then, the primal convergence rate analysis for the dual (fast) gradient algorithms has been given for an average primal sequence (x̄^k)_{k≥0} in the following form:

    ∥[h(x̄^k)]_+∥ ≤ 2L_d∥u*∥/(k + 1)^p,   (95)

    −2L_d∥u*∥²/(k + 1)^p ≤ f(x̄^k) − f* ≤ 0,

where u* ∈ U*, and p = 1 for the Dual Gradient method and p = 2 for the Dual Fast Gradient method. Notice that there are also some recent results on the linear convergence of dual first order methods, but they require certain properties of the primal feasible set, such as a polyhedral description [45]. Since a practical implementation of a dual algorithm in embedded environments involves a theoretical estimation of the iteration complexity, the norm of the optimal Lagrange multiplier ∥u*∥ prevents this estimation, since we cannot approximate its value a priori and in general this norm is very large. Also, the convergence rates (95) hold for average primal points, which in some cases do not have a satisfactory practical behavior compared to the last primal iterate x^k. Thus, in general, the dual approach inherently brings some difficult aspects for numerical implementation in various embedded applications.

2.2.4 Primal approach in convex constrained optimization

On the other hand, any primal first order method for (91) requires an exact projection onto the feasible set X, which is usually described by complicated constraints [53]. Therefore, to avoid at least some of the issues induced by the dual approach or by the primal approach with exact projection, inexact projection gradient-based methods have been considered in the literature. For nonconvex optimization problems with a convex feasible set, inexact projection spectral gradient methods were analyzed in [7]. In this nonconvex setting, global asymptotic convergence results were obtained by means of nonmonotone line searches. For the convex setting, paper [66] tackled the issue of inexact computation of the proximal iterations for the composite convex optimization model:

    min_{x ∈ R^n} f(x) + h(x),

where f is a smooth convex function (possibly strongly convex) and h is a lower semicontinuous convex function (e.g., h can be the indicator function of a closed convex set). In that paper the authors


studied the Primal Gradient and Fast Gradient methods with inexact information and inexact proximal iterations, that is, at each iteration an approximation of the gradient ∇f and an approximation of the proximal step are used. Sublinear and linear convergence rates were provided for the smooth convex and smooth strongly convex cases, respectively. However, when h is an indicator function, that is, h = I_X, it is important to notice that both first order schemes in [66] require an inexact feasible projection onto the convex set X. That is, given some z, we need to find a δ-approximate solution of the projection subproblem:

    min_{x ∈ R^n} [ ∥x − z∥² + I_X(x) ]  ( ≡ min_{x ∈ X} ∥x − z∥² ).

For the previous projection problem, the inexact solution x is admissible in [66] for a fixed accuracy δ if it satisfies:

    ∥x − z∥² + I_X(x) ≤ min_x [ ∥x − z∥² + I_X(x) ] + δ = ∥[z]_X − z∥² + δ,   (96)

which imposes that x ∈ X. Therefore, criterion (96) requires feasibility of the inexact projection x. In conclusion, both first order schemes developed in [66] require at least the feasibility of their generated primal sequences. However, in many applications the complicated constraints prevent, or at least make very difficult, the computation of a feasible sequence, calling for a more relaxed criterion for the approximate projection. Thus, in this work we direct our efforts toward avoiding the feasibility requirement and imposing a more relaxed criterion for the inexact projection than (96). The reader should note that the proofs from [66] use certain properties and relations available only when the sequences generated by their methods are feasible. Thus, it is important to notice that the proof techniques from [66] cannot be extended to the new, more relaxed criterion adopted in the present work and detailed below.

2.2.5 Primal Gradient Method with inexact projections

In the following we analyze the convergence behavior of the Primal Gradient and Fast Gradient methods based on approximate projections with respect to a new criterion, for solving the general convex model (91). In many practical cases it is too difficult even to compute a feasible point of problem (91), which calls for first order algorithms with not necessarily feasible iterates. Given some z ∈ R^n, both algorithms that we propose below require a δ-approximate solution x of the following projection problem:

    min_{x ∈ X := ∩_i X_i} ∥x − z∥²,

also known in the literature as the best approximation problem [16], which satisfies:

    ∥x − [z]_X∥ ≤ δ.

Note that we do not require x to be a feasible point of X. Therefore, our criterion is less restrictive than criterion (96) used in [66], since we do not require x to be in X and since

    (1/2)∥x − [z]_X∥² ≤ ∥x − z∥² + I_X(x) − min_x [ ∥x − z∥² + I_X(x) ],

which says that any feasible point x ∈ X satisfying (96) also satisfies ∥x − [z]_X∥ ≤ √(2δ). Notice that in Section 2.2.7 we present several existing algorithms for approximately solving the previous projection problem; one such scheme is sketched below.
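One classical option is Dykstra's alternating projection scheme for the best approximation problem; truncated after finitely many sweeps, it returns exactly the kind of approximate, possibly infeasible, point required here. A minimal sketch for X given by a box intersected with a halfspace (our illustration only):

```python
import numpy as np

def dykstra_box_halfspace(z, lo, hi, a, c, sweeps=20):
    """Approximate [z]_X for X = {lo <= x <= hi} and {a'x <= c} by Dykstra's
    algorithm; stopping after `sweeps` cycles yields an inexact projection
    whose distance to [z]_X shrinks as sweeps grows."""
    x = z.copy()
    p = np.zeros_like(z)   # correction term for the box
    q = np.zeros_like(z)   # correction term for the halfspace
    for _ in range(sweeps):
        y = np.clip(x + p, lo, hi)            # project onto the box
        p = x + p - y
        viol = max(a @ (y + q) - c, 0.0)      # project onto the halfspace
        x = (y + q) - viol * a / (a @ a)
        q = y + q - x
    return x
```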


First, we introduce and analyze the Primal Gradient Method (PGM) with inexact projections for solving problem (91):

Algorithm PGM

Given x^0 and δ > 0, for k ≥ 1 do:

1. Compute z^k = x^k − (1/L_f)∇f(x^k).

2. Compute x^{k+1} such that ∥x^{k+1} − [z^k]_X∥ ≤ δ.
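A minimal sketch of algorithm PGM with a pluggable δ-approximate projection oracle, for instance the truncated Dykstra scheme sketched above (an illustration, not a certified embedded implementation):

```python
import numpy as np

def pgm_inexact(grad_f, Lf, x0, proj_approx, iters=1000):
    """Algorithm PGM: a gradient step with stepsize 1/Lf followed by a
    delta-approximate projection.  proj_approx(z) must return a point
    within delta of [z]_X; it need not be feasible."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        z = x - grad_f(x) / Lf          # step 1
        x = proj_approx(z)              # step 2
    return x
```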

Note that if δ = 0, then PGM recovers the classical primal projected gradient algorithm, whose convergence analysis is given e.g. in [53]. However, to the best of our knowledge, for δ > 0 the convergence behavior of algorithm PGM has not been analyzed before, and the convergence proofs require a new analysis. In the sequel we provide convergence rates for the inexact projection primal gradient method PGM. We start with a few auxiliary results useful in our convergence analysis. First, for some optimal solution x* ∈ X* of (91), we define the constant:

r0 = ∥x0 − x∗∥.

Then, we have the following lemma:

Lemma 2.18. Let Assumption 2.16 hold. Also, let z ∈ R^n and x satisfy ∥x − [z]_X∥ ≤ δ. Then the following holds:

    f([z]_X) ≥ f(x) − δ(∥∇f(x) − ∇f(y)∥ + ∥∇f(y)∥) ≥ f(x) − δ(L_f∥x − y∥ + ∥∇f(y)∥)   ∀ y ∈ R^n.

Proof. Using the convexity of f, the Cauchy-Schwarz inequality, the triangle inequality and the Lipschitz continuity of the gradient ∇f (see Assumption 2.16), we obtain:

    f([z]_X) ≥ f(x) + ⟨∇f(x), [z]_X − x⟩ ≥ f(x) − ∥∇f(x)∥δ
             ≥ f(x) − (∥∇f(x) − ∇f(y)∥ + ∥∇f(y)∥)δ ≥ f(x) − (L_f∥x − y∥ + ∥∇f(y)∥)δ   ∀ y ∈ R^n,

which confirms the stated inequality.

Further, we prove a sublinear convergence rate for the averaged gradient sequence {(1/k) Σ_{i=1}^k ∇f(x^i)}_{k≥0} toward ∇f(x*).

Lemma 2.19. Let Assumption 2.16 hold and let {x^k}_{k≥0} be the sequence generated by algorithm PGM. Given the inner accuracy δ > 0 of the inexact projection, for any k ≥ 1 the following relation holds:

    (1/k) Σ_{i=1}^k ∥∇f(x^i) − ∇f(x*)∥ ≤ ( L²_f r²_0/k + 2δL²_f r_0 + 2kL²_f δ² )^{1/2}.


Proof. Let r_k = ∥x^k − x*∥. From the triangle inequality and ∥x^{k+1} − [z^k]_X∥ ≤ δ, we have:

    r²_{k+1} ≤ (∥x^{k+1} − [z^k]_X∥ + ∥[z^k]_X − x*∥)² ≤ (∥[z^k]_X − x*∥ + δ)².   (97)

Since −δ ≤ r_{k+1} − δ ≤ ∥[z^k]_X − x*∥, we then obtain:

    (r_{k+1} − δ)² ≤ max{∥[z^k]_X − x*∥², δ²} ≤ ∥[z^k]_X − x*∥² + δ².   (98)

To further bound ∥[z^k]_X − x*∥, we use the first order optimality condition of (91), that is, x* = [x* − (1/L_f)∇f(x*)]_X, leading to ∥[z^k]_X − x*∥ = ∥[x^k − (1/L_f)∇f(x^k)]_X − [x* − (1/L_f)∇f(x*)]_X∥. From this observation and the nonexpansiveness of the projection operator, we further obtain:

    ∥[z^k]_X − x*∥² ≤ ∥x^k − x* + (1/L_f)(∇f(x*) − ∇f(x^k))∥²
        = r²_k + (1/L²_f)∥∇f(x^k) − ∇f(x*)∥² − (2/L_f)⟨∇f(x^k) − ∇f(x*), x^k − x*⟩
        ≤ r²_k − (1/L²_f)∥∇f(x^k) − ∇f(x*)∥²,   (99)

where in the last inequality we used (92) with σ_f = 0. A first consequence of (99) is that ∥[z^k]_X − x*∥ ≤ r_k, which combined with (97) yields the following upper bound on r_k:

    r_k ≤ r_0 + kδ   ∀ k ≥ 0.   (100)

The second consequence of (99) and (98) is:

    ∥∇f(x^k) − ∇f(x*)∥² ≤ L²_f [r²_k − (r_{k+1} − δ)²] + L²_f δ² = L²_f (r²_k − r²_{k+1}) + 2r_{k+1}L²_f δ.

Summing this last inequality over the history i = 0, . . . , k − 1 and using the upper bound (100) on r_k, we obtain:

    ( (1/k) Σ_{i=0}^{k−1} ∥∇f(x^i) − ∇f(x*)∥ )² ≤ (1/k) Σ_{i=0}^{k−1} ∥∇f(x^i) − ∇f(x*)∥²
        ≤ L²_f r²_0/k + (2δL²_f/k) Σ_{i=1}^k r_i ≤ L²_f r²_0/k + 2δL²_f r_0 + (k + 1)L²_f δ²,

where the last step uses (100). This confirms our statement.

which confirms our statement.

Usage of Jensen inequality in Lemma 2.19 yieldsO( 1√k) convergence rate for the sequence

{1k

k∑i=1

∇f(xi)}

k≥0

to ∇f(x∗), provided that δ = O( 1k ). Another useful result is given in the lemma below:

Lemma 2.20. Under the assumptions of Lemma 2.19, the following relation holds for all k ≥ 1:

    f( (1/k) Σ_{i=0}^{k−1} x^{i+1} ) − f* ≤ (1/k) Σ_{i=0}^{k−1} [f([z^i]_X) − f(x*)] + ∥∇f(x*)∥δ + δ( L²_f r²_0/k + 2δL²_f r_0 + kL²_f δ² )^{1/2}.


Proof. From Lemma 2.18 we have:

    f([z^k]_X) − f* ≥ f(x^{k+1}) − f* − (∥∇f(x^{k+1}) − ∇f(x*)∥ + ∥∇f(x*)∥)δ.

Summing over the history i = 0, . . . , k − 1, using the convexity of f and Lemma 2.19, we further derive:

    (1/k) Σ_{i=0}^{k−1} f([z^i]_X) − f* ≥ (1/k) Σ_{i=1}^k (f(x^i) − f*) − ∥∇f(x*)∥δ − (δ/k) Σ_{i=1}^k ∥∇f(x^i) − ∇f(x*)∥
        ≥ f( (1/k) Σ_{i=1}^k x^i ) − f* − ∥∇f(x*)∥δ − δ( L²_f r²_0/k + 2δL²_f r_0 + kL²_f δ² )^{1/2},

which confirms our statement.

which confirms our statement.

Now we are ready to analyze the rate of convergence of algorithm PGM under various assumptions on the objective function f. For simplicity of exposition, we assume below that L_f > 1, ϵ < 1 and that the initial point x^0 satisfies L_f∥x^0 − x*∥ ≥ ∥∇f(x*)∥. Notice that if any of these inequalities does not hold, only the constants in our convergence rates change slightly; the dependence on k remains the same.

Convergence analysis: the smooth case. We now derive the sublinear convergence rate of algorithm PGM for the primal average sequence x̂^k = (1/k) Σ_{i=1}^k x^i, under the Lipschitz gradient assumption on f.

Theorem 2.21. Under Assumption 2.16, let {x^k}_{k≥0} be the sequence generated by algorithm PGM. Given the inner accuracy δ > 0, the following sublinear estimates for primal feasibility and suboptimality in x̂^k = (1/k) Σ_{i=1}^k x^i hold:

(i) dist_X(x̂^k) ≤ δ,

(ii) −∥∇f(x*)∥δ ≤ f(x̂^k) − f* ≤ 2L_f r²_0/k + 5L_f r_0 δ + 5kL_f δ².

Proof. First, note that the definition of the approximate projection with accuracy δ, i.e., ∥x^{k+1} − [z^k]_X∥ ≤ δ, implies the following relations:

    dist_X(x^{k+1}) = ∥x^{k+1} − [x^{k+1}]_X∥ ≤ ∥x^{k+1} − [z^k]_X∥ ≤ δ.

Now, using that dist_X(·) is a convex function and the definition of the average sequence x̂^k, we get the feasibility estimate stated by the theorem. Further, the optimality conditions of problem (91) are:

    ⟨∇f(x*), x − x*⟩ ≥ 0   ∀ x ∈ X.   (101)

Using the convexity of the function f, the above optimality conditions and the Cauchy-Schwarz inequality, we obtain the left-hand side suboptimality inequality:

    f(x̂^k) − f* ≥ ⟨∇f(x*), x̂^k − x*⟩
        = ⟨∇f(x*), (1/k) Σ_{i=0}^{k−1} [z^i]_X − x*⟩ + ⟨∇f(x*), x̂^k − (1/k) Σ_{i=0}^{k−1} [z^i]_X⟩
        ≥ (1/k) Σ_{i=0}^{k−1} ⟨∇f(x*), x^{i+1} − [z^i]_X⟩ ≥ −∥∇f(x*)∥δ,   (102)

where the first term was dropped using (101) (each [z^i]_X belongs to X).

Using the same notation as in the proof of Lemma 2.19, recall that r_k ≤ r_0 + kδ; from relation (98) we have:

    (r_{k+1} − δ)² ≤ ∥[z^k]_X − x*∥² + δ².   (103)

To obtain our result, we refine the right-hand side of the previous bound as follows:

    ∥[z^k]_X − x*∥² = ∥[z^k]_X − x^k + x^k − x*∥²
        = r²_k + 2⟨x^k − x*, [z^k]_X − x^k⟩ + ∥[z^k]_X − x^k∥²
        = r²_k + 2⟨[z^k]_X − x*, [z^k]_X − x^k⟩ − ∥[z^k]_X − x^k∥²
        ≤ r²_k + (2/L_f)⟨∇f(x^k), x* − [z^k]_X⟩ − ∥[z^k]_X − x^k∥²
        = r²_k − (2/L_f)( ⟨∇f(x^k), [z^k]_X − x^k⟩ + (L_f/2)∥[z^k]_X − x^k∥² ) + (2/L_f)⟨∇f(x^k), x* − x^k⟩
        ≤ r²_k − (2/L_f)(f([z^k]_X) − f(x^k)) + (2/L_f)(f* − f(x^k))
        = r²_k − (2/L_f)(f([z^k]_X) − f*).   (104)

Combining (104) and (103) leads to:

    (2/L_f)(f([z^k]_X) − f*) ≤ r²_k − (r_{k+1} − δ)² + δ² = r²_k − r²_{k+1} + 2δr_{k+1}.

Summing this inequality over the history i = 0, . . . , k − 1 and using r_k ≤ r_0 + kδ, we further derive:

    (1/k) Σ_{i=0}^{k−1} (f([z^i]_X) − f*) ≤ L_f r²_0/k + 2δL_f r_0 + (k + 1)L_f δ².   (105)

To obtain a convergence rate for our sequence, we use Lemma 2.20 in combination with (105):

    f( (1/k) Σ_{i=1}^k x^i ) − f* ≤ L_f r²_0/k + 2δL_f r_0 + 2kL_f δ² + δ( L²_f r²_0/k + 2δL²_f r_0 + 2kL²_f δ² )^{1/2} + δ∥∇f(x*)∥
        ≤ 2L_f r²_0/k + δ(4L_f r_0 + ∥∇f(x*)∥) + L_f δ²(4k + 1).   (106)

Finally, the suboptimality estimates (102) and (106), together with the assumption ∥∇f(x*)∥ ≤ L_f r_0, lead to our statements.


We now provide conditions on the accuracy of the inexact projection and on the number of iterations required to obtain an ϵ-optimal point of (91). More precisely, from Theorem 2.21 it follows that if we choose:

    δ ≤ ϵ/(15L_f r_0),

then after at most

    K = ⌈ √(6L_f r²_0/ϵ) ⌉

iterations, the average primal iterate x̂^K is an ϵ-optimal point for (91), that is, it satisfies dist_X(x̂^K) ≤ ϵ and |f(x̂^K) − f*| ≤ ϵ.
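A small numeric instance of these two bounds (the constants below are made up for illustration, not taken from the report):

```python
import math

Lf, r0, eps = 10.0, 2.0, 1e-2                    # illustrative constants only
delta = eps / (15 * Lf * r0)                     # required projection accuracy
K = math.ceil(math.sqrt(6 * Lf * r0**2 / eps))   # iteration bound
print(delta, K)                                  # prints 3.33...e-05 and 155
```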

Convergence analysis: the smooth and strongly convex case. We further derive the linear convergence rate of algorithm PGM when the objective function is smooth and strongly convex.

Theorem 2.22. Under Assumptions 2.16 and 2.17, let {x^k}_{k≥0} be the sequence generated by algorithm PGM. Given the inner accuracy δ > 0, the following estimates on primal feasibility and suboptimality are valid:

(i) dist_X(x^k) ≤ δ,

(ii) ∥x^k − x*∥ ≤ (1 − σ_f/L_f)^k r_0 + (L_f/σ_f)δ.

Proof. We use the same notation as in Theorem 2.21. The upper bound (i) on the feasibility gap is obtained by a reasoning similar to that in the proof of Theorem 2.21. Using the optimality condition x* = [x* − (1/L_f)∇f(x*)]_X of (91) and the triangle inequality, we have:

    r²_{k+1} = ∥[z^k]_X − [x* − (1/L_f)∇f(x*)]_X + (x^{k+1} − [z^k]_X)∥²
             ≤ (∥[z^k]_X − [x* − (1/L_f)∇f(x*)]_X∥ + ∥x^{k+1} − [z^k]_X∥)².

By the nonexpansiveness of the projection operator and ∥x^{k+1} − [z^k]_X∥ ≤ δ, we further have:

    r²_{k+1} ≤ ( ∥x^k − x* + (1/L_f)(∇f(x*) − ∇f(x^k))∥ + δ )².   (107)

On the other hand, using (92), we get the following bound:

    ∥x^k − x* + (1/L_f)(∇f(x*) − ∇f(x^k))∥²
        = r²_k − (2/L_f)⟨∇f(x^k) − ∇f(x*), x^k − x*⟩ + (1/L²_f)∥∇f(x^k) − ∇f(x*)∥²
        ≤ r²_k − (2/L_f)( (1/(L_f + σ_f))∥∇f(x^k) − ∇f(x*)∥² + (L_fσ_f/(L_f + σ_f))r²_k ) + (1/L²_f)∥∇f(x^k) − ∇f(x*)∥²
        = (1 − 2σ_f/(L_f + σ_f))r²_k + (1/L_f)( 1/L_f − 2/(L_f + σ_f) )∥∇f(x^k) − ∇f(x*)∥²
        ≤ (1 − σ_f/L_f)² r²_k,   (108)

where the second step uses (92) and the last step uses Assumption 2.17.


By using (108) into (107), we obtain:

rk+1 ≤(1−

σfLf

)rk + δ ∀k ≥ 0.

which immediately implies the part (ii) of our stament.

An important consequence of Theorem 2.22 is that if we choose the accuracy of the inexact projectionsubproblem:

δ ≤σf ϵ

2Lf,

then after at most:

K =

⌈Lf

σflog

(2r20ϵ

)⌉iterations, the last primal iterate xK satisfies: distX(xK) ≤ ϵ and ∥xK − x∗∥ ≤ ϵ.Further we derive primal suboptimality estimates in terms of function values. For simplicity, recall that∥∇f(x∗)∥ ≤ Lfr0.

Corollary 2.23. Under the assumptions of Theorem 2.22, let {xk}k≥0 be the sequence generated byalgorithm PGM. Then, for given δ > 0 the following estimate on suboptimality holds:

− δ∥∇f(x∗)∥ ≤ f(xk)− f∗ ≤(1−

σfLf

)k

2Lfr20 +

L3f

σ2fδ2 +

3L2f

σfδr0.

Proof. Using a similar reasoning as in Lemma 2.18, we have:

f(x∗) ≥ f(xk)− (Lf∥xk − x∗∥+ ∥∇f(x∗)∥)∥xk − x∗∥.

This inequality together with Theorem 2.22 (ii) lead to:

f(xk)− f∗ ≤(1−

σfLf

)2k

Lfr20 +

L3f

σ2fδ2 + ∥∇f(x∗)∥δ +

(1−

σfLf

)k[2L2

f

σfδr0 + ∥∇f(x∗)∥r0

]

≤(1−

σfLf

)k

2Lfr20 +

L3f

σ2fδ2 +

3L2f

σfδr0,

which confirms our statement.

Again, for simplicity of the following estimates, consider r0 ≥ ϵ1/2

3L1/2f

. From Theorem 2.22 and Corrolary

2.23, it follows that if we choose the accuracy of the inexact projection as follows:

δ ≤σf ϵ

9L2fr0

,

then after at most:

K =

⌈Lf

σflog

(6Lfr

20

ϵ

)⌉iterations, the last primal iterate xK is an ϵ−optimal point for (91), that is it satisfies: distX(xK) ≤ ϵand |f(xK) − f∗| ≤ ϵ. It is also important to observe that if δ = 0, then from Theorems 2.21 and2.22 we recover the usual convergence rates for the projected gradient method in the smooth case andsmooth strongly convex case, respectively [53].

63

Page 64: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

2.2.6 Primal Fast Gradient Method with inexact projections

We now introduce and analyze the Primal Fast Gradient Method (PFGM) with inexact projections forsolving problem (91). Moreover, we assume in this section that the set X is bounded, i.e. DX =maxx,y∈X

∥x−y∥ <∞. Although we believe that this assumption can be removed, for simplicity we impose

it here and we leave for future work whether we can perform a convergence analysis without it.

Algorithm PFGM

Set y1 = x0, θ1

{= 1 if σf = 0

≥√

Lf

σfif σf > 0

, for k ≥ 1:

1) Compute zk = yk − 1Lf

∇f(yk)2) Compute xk+1 such that ∥xk+1 −

[zk]X∥ ≤ δ

3) Compute θk+1 from: θ2k+1 − θk+1 =(1− θk+1σf

Lf

)θ2k

4) Update yk+1 = xk+1 + βk+1

(xk+1 − xk

), where

βk =(θk−1)(Lf−θk+1σf )

θk+1(Lf−σf ).

Notice if δ = 0, then algorithm PFGM becomes the usual Projected Fast Gradient Method of Nesterovwhose convergence analysis is given e.g. in [53]. From our best knowledge, for δ > 0 we are not awareof any convergence results for algorithm PFGM. As we will see below the new convergence analysisrequires significant modifications of the classical one. Let us define:

vk = λkvk−1 + (1− λk)y

k−1 + θk(xk − yk−1)

= yk +Lf (θk − 1)

Lf − σfθk(yk − xk−1)

and λk = 1− θkσf

Lf. Also notice that βk ≤ γ, where

γ ≤

2 if σf = 0√Lf−

√σf√

Lf+√σf

if σf > 0,

which leads to βk ≤ 2. For simplicity of the exposition, we assume in the sequel that DX ≥max{δ, 1

Lf∥∇f(x∗)∥}. In order to analyze the convergence behaviour of algorithm PFGM, we first

derive some useful lemmas.

Lemma 2.24. Let Assumption 2.16 hold and the feasible set X be bounded. Also let {xk, yk}k≥0 bethe sequences generated by algorithm PFGM. Given the inner accuracy δ > 0 of the inexact projection,then ∥vk − x∗∥ satisfies the following recurrence:

(∥vk+1 − x∗∥ − θkδ)2+

2θ2kLf

(f([zk]X)−f∗) ≤ λk

[∥vk − x∗∥2+

2θ2k−1

Lf(f([zk−1]X)−f∗)

]+ 20λkθ

2k−1δDX + θ2kδ

2.

64

Page 65: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Proof. We denote an alterate variant of vk:

vk = λkvk−1 + (1− λk)y

k−1 + θk([zk−1]X − yk−1)

= xk−1 + θk

([zk−1]X − xk−1

).

By defining rk = ∥vk −x∗∥ and taking into account that ∥vk+1− vk+1∥ ≤ θkδ, from triangle inequalitywe have:

r2k+1 ≤ (∥vk+1 − x∗∥+ ∥vk+1 − vk+1∥)2

≤ (∥vk+1 − x∗∥+ θkδ)2,

which implies that −θkδ ≤ rk+1 − θkδ ≤ ∥vk+1 − x∗∥ and therefore:

(rk+1 − θkδ)2 ≤ max{∥vk+1 − x∗∥2, θ2kδ2}≤ ∥vk+1 − x∗∥2 + θ2kδ

2. (109)

To obtain an upper bound on ∥vk − x∗∥ we derive:

∥vk+1 − x∗∥2 = ∥λkvk + (1− λk)yk − x∗∥2 + θ2k∥[zk]X − yk∥2

+ 2θk⟨[zk]X − yk, λkvk + (1− λk)y

k − x∗⟩= ∥λkvk + (1− λk)y

k − x∗∥2 + θ2k∥[zk]X − yk∥2

+ 2θk⟨[zk]X − yk, yk − x∗+λkLf (θk − 1)

Lf − σfθk(yk − xk)⟩. (110)

In order to further refine the upper bound, we recall the optimality conditions:

⟨[zk]X − yk +1

Lf∇f(yk), z − [zk]X⟩ ≥ 0 ∀z ∈ X. (111)

On the other hand, notice that for any x ∈ X, we have:

2⟨[zk]X − yk,yk − x⟩+ ∥[zk]X − yk∥2

=2⟨[yk − (1/Lf )∇f(yk)]X − yk, [zk]X − x⟩ − ∥[zk]X − yk∥2

(111)

≤ − 2

Lf⟨∇f(yk), [zk]X − x⟩ − ∥[zk]X − yk∥2

= − 2

Lf

(⟨∇f(yk), [zk]X − yk⟩+

Lf

2∥[zk]X − yk∥2

)+

2

Lf⟨∇f(yk), x− yk⟩

≤ 2

Lf

(f(x)− f([zk]X)

)−σfLf

∥x− yk∥2. (112)

Taking now x = x∗ in (112) results in:

2θk⟨[zk]X − yk, yk − x∗⟩ ≤ 2θkLf

(f∗ − f([zk]X))−θkσfLf

∥yk − x∗∥2 − θk∥[zk]X − yk∥2. (113)

65

Page 66: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Further, taking x = [zk−1]X in (112) yields:

2θk(θk − 1)⟨[zk]X − yk,yk − [zk−1]X⟩ ≤2(θ2k − θk)

Lf[f([zk−1]X)− f([zk]X)]

− (θ2k − θk)∥[zk]X − yk∥2 −σfθk(θk − 1)

Lf∥yk − [zk−1]X∥2. (114)

Now, by using the definition of yk and βk ≤ γ we also obtain:

⟨[zk]X − yk, yk − xk⟩ ≤ ⟨[zk]X − yk, yk − [zk−1]X⟩+ δ(2βk + 1)(DX + δ)

≤ ⟨[zk]X − yk, yk − [zk−1]X⟩+ δ(2γ + 1)(DX + δ). (115)

By using (113) and (114)-(115) into relation (110), we obtain:

∥vk+1 − x∗∥2(113)+(114)

≤ ∥λkvk + (1− λk)yk − x∗∥2 + 2θk

Lf(f∗ − f([zk]X))

+2(θ2k − θk)

Lf

(f([zk−1]X)− f([zk]X)

)−σfθkLf

∥x∗ − yk∥2

+ 2θk(θk − 1)δ(2γ + 1)(DX + δ)

θ2k−θk=λkθ2k−1

≤ λkr2k +

2λkθ2k−1

Lf

(f([zk−1]X)− f∗

)+

2θ2kLf

(f∗ − f([zk]X)) + 20λkθ2k−1δDX ,

where in the second inequality we used convexity of ∥·∥, 1−λk =σfθkLf

, θ2k−θk = λkθ2k−1 and assumption

DX ≥ δ. Lastly, by combining the resulted inequality with the relation (109), we obtain our result.

Next we prove a second auxiliary result which will strongly facilitate the derivations of the convergence

rates for algorithm PFGM. Notice that, in the strongly convex case, we consider for simplicity θ1 =√

Lf

σf,

which also implies θk =√

Lf

σffor all k ≥ 1.

Lemma 2.25. Under the assumptions of Lemma 2.24, the following assertions hold:(i) In the convex case, i.e. σf = 0, the following bound holds:√

2θ2k−1

Lf(f([zk−1]X)− f∗) ≤ δθ2k−1 + δθk−1 +

[r21 + δ2θ4k−1 +

k−1∑i=1

(2θ2i δ

2 + 20θ2i−1δDX

)]1/2.

(ii) In the strongly convex case, i.e. σf > 0, and if we choose θ1 =√

Lf

σf, then the following bound

holds:√2

σf(f([zk−1]X)− f∗) ≤ 2δ

(1−√λ)2

+ δ

√Lf

σf

+

[λk−1

(r21 +

2

σf(f([z0]X)− f∗)

)+

δ2

(1− λ)3+

C

λ(1− λ)

]1/2,

where λ = 1−√

σf

Lfand C = 20λ

(1−λ)2δDX + δ2

(1−λ)2.

66

Page 67: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Proof. We use the same notations as in Lemma 2.24. Additionally, we denote:

bk =

√r2k +

2θ2k−1

Lf(f([zk−1]X)− f∗)− θk−1δ,

and we aim at finding an upper bound on this sequence. Based on the fact that f([zk−1]X)− f∗ ≥ 0,Lemma 2.24 implies:

b2k+1 ≤ r2k+1 − 2θkδrk+1 + θ2kδ2 +

2θ2kLf

(f([zk]X)− f∗)

= (rk+1 − θkδ)2 +

2θ2kLf

(f([zk]X)−f∗)

Lemma 2.24≤ λk

[r2k +

2θ2k−1

Lf(f([zk−1]X)− f∗)

]+ 20λkθ

2k−1δDX + θ2kδ

2

= λk(bk + θk−1δ)2 + Ck ∀k ≥ 1,

where Ck = 20λkθ2k−1δDX+θ2kδ

2. In order to obtain an upper bound on {bk}k≥0, we look at its slightly

modified variant {bk}k≥0 defined as:

b2k+1 = λk(bk + δθk−1)2 + Ck, (116)

assuming that b1 = b1. By assuming for some k ≥ 1 that bk ≤ bk, then from (116) it can be furtherderived:

b2k+1 ≤ λk(bk + δθk−1)2 + Ck ≤ λk(bk + δθk−1)

2 + Ck(116)= b2k+1. (117)

Thus, this fact leads to: bk ≤ bk for all k ≥ 1. On the other hand, it can be seen that {bk}k≥0 satisfies

b2k+1 = λk(bk + δθk−1)2 + Ck ≥ λkb

2k, which implies:

bj ≤ bk

k−1∏i=j

1√λj

∀j < k. (118)

(i) First assume σf = 0, which implies λk = 1 for all k ≥ 1. The relation (118) turns into bj ≤ bk for

all j < k. Due to monotonicity of {bk}k≥0, recurrence (116) implies that:

b2k+1

(116)= b21 + 2δ

(k∑

i=1

θi−1bi

)+

k∑i=1

(θ2i−1δ

2 + Ci

)≤ b21 + 2δbk+1

(k−1∑i=1

θi

)+

k∑i=1

(θ2i−1δ

2 + Ci

)= b21 + 2δθ2k−1bk+1 +

k∑i=1

(θ2i−1δ

2 + Ci

).

67

Page 68: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

The last inequality leads to:

bk+1

(117)

≤ bk+1 ≤ δθ2k−1 +

[δ2θ4k−1 + b21 +

k∑i=1

(θ2i−1δ

2 + Ci

)]1/2,

which confirms the part (i) of our statement.

(ii) Now assume the strongly convex case, that is σf > 0. This implies λk = λ := 1 −√

σf

Lfand

θk = θ := 11−λ for all k ≥ 0. Also notice that Ck = C := 20λ

(1−λ)2δDX + δ2

(1−λ)2. First, (118) implies

that:bj ≤ λ−

(k−j)2 bk ∀j < k. (119)

Similar to previous case, the relations (119) and (116) lead to:

b2k+1

(116)= λkb21 +

1− λ

(k∑

i=1

λk−i+1bi

)+

k∑i=1

λk−i+1

[δ2

(1− λ)2+C

λ

](119)

≤ λkb21 +2δ

1− λbk+1

(k∑

i=1

λk−i+1

2

)+

1

1− λ

[δ2

(1− λ)2+C

λ

]≤ λkb21 +

(1−√λ)2

bk+1 +δ2

(1− λ)3+

C

λ(1− λ).

The last inequality leads to

bk+1

(117)

≤ bk+1≤2δ

(1−√λ)2

+

[λkb21+

δ2

(1− λ)3+

C

λ(1− λ)

]1/2which confirms the second part of our statement.

Now we are ready to derive the convergence rates of algorithm PFGM under different assumptions onobjective function f .

Convergence analysis: smooth caseWe now prove the sublinear convergence rate of algorithm PFGM for the smooth convex case using therelations from Lemma 2.25.

Theorem 2.26. Let Assumption 2.16 hold and the feasible set X be bounded. Also let {xk, yk}k≥0 bethe sequences generated by algorithm PFGM. Given the inner accuracy δ > 0, the following estimateson primal feasibility and suboptimality hold for the last iterate xk:

(i) distX(xk) ≤ δ

(ii) − δ∥∇f(x∗)∥ ≤ f(xk)− f∗ ≤4Lfr

20

(k − 1)2+ 3δ2k2 + 22DXδk + 4δLfDX .

Proof. The feasibility bound results immediately: distX(xk) ≤ ∥xk − [zk−1]X∥ ≤ δ. The lower boundon suboptimality can be easily derived from the convexity relation:

f(x)− f∗ ≥ ⟨∇f(x∗), x− x∗⟩= ⟨∇f(x∗), [x]X − x∗⟩+ ⟨∇f(x∗), x− [x]X⟩≥ −∥∇f(x∗)∥∥x− [x]X∥.

68

Page 69: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

By taking x = xk, we obtain the lower bound on the suboptimality. On the other hand, by squaringboth sides in Lemma 2.25 (i), we get:

2θ2k−1

Lf(f([zk−1]X)− f∗) ≤ 2r21 + 6δ2θ4k−1 + 4δ2θ2k−1 + 22δDX

k−1∑i=1

θ2i−1,

which, by dividing with2θ2k−1

Lf, further implies:

f([zk−1]X)− f∗ ≤Lfr

21

θ2k−1

+ δ2(3θ2k−1 + 2) + 22δDX

k−1∑i=1

θ2i−1

θ2k−1

≤Lfr

21

θ2k−1

+ 3δ2θ2k + 22DXδk. (120)

By using the fact that k/2 ≤ θk ≤ k + 1, we obtain:

f([zk−1]X)− f∗ ≤4Lfr

20

(k − 1)2+ 3δ2k2 + 22DXδk.

Using now Lemma 2.18, it follows that:

f(xk)− f∗ ≤4Lfr

20

(k − 1)2+ 3δ2k2 + 22DXδk + δ[Lf (DX + δ) + ∥∇f(x∗)∥].

Lastly, by using assumption max{∥∇f(x∗)∥, Lfδ} ≤ LfDX , the last inequality confirms part (ii) ofthe theorem.

We now provide estimates on the accuracy of the inexact projection and the number of iterations requiredin order to obtain an ϵ-optimal point for (91). Let us assume for simplicity that r0 ≥ (Lf ϵ)

1/2. FromTheorem 2.26 it follows that if:

δ ≤ ϵ3/2

(17DX)2L1/2f

,

then after at most:

K =

⌈√12Lfr

20

ϵ

⌉iteration, the last primal iterate xK satisfies: distX(xK) ≤ ϵ and |f(xK)− f∗| ≤ ϵ.

Convergence analysis: smooth and strongly convex caseWe now derive linear convergence rate of algorithm PFGM for the smooth strongly convex case:

Theorem 2.27. Let Assumption 2.16 and 2.17 hold and the feasible set X be bounded. Also let{xk, yk}k≥0 be the sequences generated by algorithm PFGM. Given the inner accuracy δ > 0, the followestimates hold for primal feasibility and suboptimality for the last iterate xk:

(i) distX(xk) ≤ δ

(ii)− δ∥∇f(x∗)∥ ≤ f(xk)− f∗ ≤ λk−1[σfr

21+2(f([z0]X)−f∗)

]+40σfDXδ

(1− λ)3+

2σfδ2

λ(1− λ)3,

where λ = 1−√

σf

Lf.

69

Page 70: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Proof. The upper bound on the feasibility gap and the lower bound on the suboptimality gap areobtained using the same lines as in the proof of Theorem 2.26. By squaring both sides in Lemma 2.25(ii) and then dividing with 2

σf, results in:

f([zk−1]X)− f∗ ≤ λk−1σf

[r21 +

2

σf(f([z0]X)− f∗)

]+

4σfδ

(1−√λ)2

+2δ√Lfσf +

σfδ2

(1− λ)3+

Cσfλ(1− λ)

≤ λk−1σf

[r21 +

2

σf(f([z0]X)− f∗)

]+

4σfδ

(1−√λ)2

+ 2δ√Lfσf +

20σfδDX

(1− λ)3+

σfδ2

(1− λ)3+

σfδ2

λ(1− λ)3

≤ λk−1σf

[r21 +

2

σf(f([z0]X)− f∗)

]+

38σfDXδ

(1− λ)3+

2σfδ2

λ(1− λ)3. (121)

On the other hand, using Lemma 2.18 and the assumption max{δ, 1Lf

∇f(x∗)} ≤ DX , we get:

f(xk)− f∗ ≤ f([zk−1]X)− f∗ + δ(LfDX + ∥∇f(x∗)∥)≤ f([zk−1]X)− f∗ + 2δLfDX

≤ f([zk−1]X)− f∗ +2σfDXδ

(1− λ)3,

which together with (121) leads to the upper bound on the suboptimality gap.

Let us assume for simplicity that DX ≥ ϵ1/2(1−λ)3/2

40λ1/2 . From Theorem 2.27 it follows that if we choose:

δ ≤ ϵ(1− λ)3

120σfDX,

then after at most:

K =

⌈√Lf

σflog

(3σfr

21 + 6(f([z0]X)−f∗)

ϵ

)⌉iterations, the last primal iterate xK satisfies distX(xK) ≤ ϵ and |f(xK)− f∗| ≤ ϵ. Note that from theconvergence rates given in Theorems 2.21 and 2.26 we can conclude that in the smooth case algorithmPGM is more robust, allowing larger inner accuracy of order O(ϵ) for computing the inexact projection,than algorithm PFGM which requires O(ϵ3/2) accuracy for computing the inexact projection. On theother hand, in the smooth and strongly convex case the convergence rates from Theorems 2.22 and2.27 show that both algorithms PGM and PFGM require the same O(ϵ) accuracy for computing theinexact projection. Moreover, the reader should note that similar sublinear and linear convergence ratesas those given in the above theorems can be also derived for variants of PGM and PFGM with decreasingaccuracy δk for the inexact projection.

70

Page 71: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

2.2.7 Algorithms for inexact projection onto intersection of convex sets

In this section we present several existing algorithmic schemes for solving the best approximation problemwhen the set X is described as the intersection of a finite family of simple closed convex sets. Oneof the most powerful algorithms for computing the best approximation of (projection of) a given pointonto a finite intersection of convex sets X = ∩p

i=1Xi is the Dykstra algorithm [16]. Dykstra uses ateach iteration only projections onto individual sets Xi, which are assumed to be computationally easy.More precisely, Dykstra algorithm computes an approximate projection of a vector z = zk onto closedconvex set X = ∩p

i=1Xi:min

x∈∩pi=1Xi

∥x− z∥2,

by performing the following iterations [16]:

Algorithm CDA

Set y0 = z, q−(p−1) = · · · = q−1 = q0 = 0, for j ≥ 1:

1. yj+1 =[yj + qj+1−p

]X(j mod p)+1

2. qj+1 = yj + qj+1−p − yj+1.

Since there exist linear convergence results when Xi’s are halfspaces, we further consider a particularversion of optimization problem (91):

minx∈Rn

f(x) s.t. Cx ≤ c,

where now the feasible set X = ∩pi=1Xi is the intersection of halfspaces Xi = {x ∈ Rn : Cix ≤ ci}

, with Ci the ith row of matrix C for all i = 1, . . . , p. Note that many problems from systems andcontrol theory can be posed into this form. In this context, algorithms PGM and PFGM require at eachiteration k the solution of the following best approximation problem:

minx∈Rn

1

2∥x− zk∥2 (122)

s.t. x ∈ X = {x : Cx ≤ c} = ∩pi=1Xi.

The main computational burden is given by the projection from step 1), which in our case can beobtained easily by:

[y]Xi = y − [⟨Ci, y⟩ − ci]+∥Ci∥2

Ci ∀i = 1 : p.

The convergence behavior of algorithm CDA when Xi are halfspaces was analyzed in [16] and linearrate of convergence of the following form was derived:

∥yj − [z]X∥ ≤ rjη,

where r ∈ [0, 1) and η are constants depending on the matrix C. For example, when p = 2 the sequence{yj}j≥0 is either finite or satisfies:

∥yj − [z]X∥ ≤ rj−1∥z − [z]X∥,

where in this case r is the cosine of the angle between the two functionals which define the halfspaces.

71

Page 72: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

2.2.8 Illustrative example

When the stage costs ℓt are quadratic and the constraints Z and V are polyhedral, the MPC problem (89)is a standard condensed QP. Therefore, in order to test our algorithms we generate further QPs comingfrom condensed MPC formulations for random linear systems with box state and input constraints:

minx∈Rn

1

2xTQx+ qTx

s.t. : Cx ≤ c,

where C ∈ Rm×n, σf = λmin(Q) > 0 and Lf = λmax(Q) > 0. We compare the Matlab implemen-tations of PGM and PFGM from this section with the Dual Gradient method (DGM) and Dual FastGradient method (DFGM) from [43]. We recall that, from theoretical estimates, in order to compute

an average primal suboptimal point x satisfying

√m∑i=1

dist2Xi(x) ≤ ϵ and |f(x) − f∗| ≤ ϵ, the Dual

Gradient method requires O(∥C∥2∥u∗∥2

σf ϵ

)outer iterations and the Dual Fast Gradient method requires

O(∥C∥∥u∗∥(σf ϵ)1/2

)outer iterations (i.e. evaluations of dual gradients), see e.g. [43]. On the other hand,

for σf > 0 our methods PGM and PFGM have outer complexities (number of approximate primal

projections) of order O((

Lf

σf

)plog(∥x

∗∥2ϵ )

), where p = 1 or p = 1/2, respectively.

The inner projection problem for the primal methods minx:Cx≤c

∥x − zk∥2 is approximately solved by

quadprog solver. Note that our QP-MPC formulations have box constraints lb ≤ x ≤ ub, so that thefeasible set is indeed bounded. On the other hand, let u be the current Lagrange multiplier, then theinner subproblem for the dual methods is:

minx∈Rn

1

2xTQx+ qTx+ uT (Cx− c, )

which can be solved via a linear system since the optimality conditions are given by Qx(u)+q+CTu = 0.

In Table 7 we present the average number of iterations performed by the four algorithms, PGM, PFGM,DGM and DFGM, on 10 random QP-MPC problems of dimensions ranging from 50 to 103, to obtain anϵ−optimal point with ϵ = 10−2. We first compute f∗ with quadprog and then we count the numberof outer iterations performed until the criterion max{∥[Cxk − c]+∥, |f(xk) − f∗|} ≤ ϵ is met for eachalgorithm. Note that the stage costs in MPC lead to strongly convex objective functions. We observethat the primal methods significantly outperform the dual ones in terms of outer iterations. However, theouter complexities of the primal methods depend only on conditioning number of the primal objectivefunction f , while in the dual case the complexities are strongly affected by the constraints matrix C.

Given the accuracy ϵ = 10−2, we also tested algorithms PGM and PFGM for various inner accuracies δof the inexact projection, to check their sensitivities. The results are given in Figure 6. We consideredvarious accuracies δ for computing the inexact projection ranging from 10ϵ to 0.1ϵ and compared withthe theoretical estimates of δ given in the previous sections. Clearly, the theoretical δ allows to getan ϵ-optimal point of the QP, while the others have stationary behaviors before finding the required ϵ-optimal point. We also compared in the same Figure 6 the theoretical values of δ with variable accuraciesδk = 1/k for PGM and δk = 1/k2 for PFGM, respectively. Again we observe a better behavior of thetwo algorithms when running with the theoretical estimates of δ.

72

Page 73: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

150 200 250iterations (k)

PGM

δ theory

δ = 0.1* εδ = εδ = 10* εδ = 1/k

0 50 100 150 200

10−2

10−1

100

101

102

103

104

105

106

iterations (k)

max(|

f(xk )−

f* |,||

[Cx

k −c] +||)

PFGM

δ theory

δ = 0.1* εδ = εδ = 10* εδ = 1/k 2

150 200 250iterations (k)

PGM

δ theory

δ = 0.1* εδ = εδ = 10* εδ = 1/k

0 20 40 60 80 100

10−2

10−1

100

101

102

103

104

iterations (k)

max(|

f(xk )−

f* |,||[C

xk −c

] +||)

PFGM

δ theory

δ = 0.1* εδ = εδ = 10* εδ = 1/k 2

Figure 6: Convergence behavior of algorithms PGM and PFGM for various δ-accuracies of inexactprojection: top-convex QP, bottom-strongly convex QP.

Dim. \ Alg. DGM DFGM PGM PFGM

50 12849 885 30 18100 46036 2597 40 22200 57274 2675 50 25350 96127 3863 76 30500 176601 7586 82 41800 - - 89 451000 - - 97 51

Table 7: Performance comparison on strongly convex QPs.

73

Page 74: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

3 Structural analysis of optimization problems

In Task 3 we analyze the structural properties of optimization problems, coming e.g. from modelingand control applications, and derive efficient optimization algorithms that take into account the specificstructure arising in these problems. For example, in paper [P7] we derive linear convergence rates ofseveral first order methods for solving smooth non-strongly convex constrained optimization problems,i.e. involving an objective function with a Lipschitz continuous gradient that satisfies some relaxedstrong convexity condition. In particular, in the case of smooth constrained convex optimization, weprovide several relaxations of the strong convexity conditions and prove that they are sufficient forgetting linear convergence for several first order methods such as projected gradient, fast gradient andfeasible descent methods. We also provide examples of functional classes that satisfy our proposedrelaxations of strong convexity conditions. Finally, we show that the proposed relaxed strong convexityconditions cover important applications ranging from solving linear systems, Linear Programming, anddual formulations of linearly constrained convex problems arising in model predictive control. In paper[P8] we employ a parallel version of a randomized (block) coordinate descent method for minimizing thesum of a partially separable smooth convex function and a fully separable non-smooth convex function.Under the assumption of Lipschitz continuity of the gradient of the smooth function, this method has asublinear convergence rate. Linear convergence rate of the method is obtained for the newly introducedclass of generalized error bound functions. We prove that the new class of generalized error boundfunctions encompasses both global/local error bound functions and smooth strongly convex functions.We also show that the theoretical estimates on the convergence rate depend on the number of blockschosen randomly and a natural measure of separability of the smooth component of the objectivefunction.

3.1 Linear convergence of first order methods for non-strongly convex optimization

Recently, there emerges a surge of interests in accelerating first order methods for difficult optimizationproblems, for example the ones without strong convex objective function, arising in different applicationssuch as model predictive control for linear systems [45] or machine learning [33]. Algorithms based ongradient information have proved to be effective in these settings, such as projected gradient and itsfast variants [53], stochastic gradient descent [54] or coordinate gradient descent [80].The standard assumption for proving linear convergence of first order methods for smooth convexoptimization is the strong convexity of the objective function, an assumption which does not hold formany practical applications. In this work we derive linear convergence rates of several first order methodsfor solving smooth non-strongly convex constrained optimization problems, i.e. involving an objectivefunction with a Lipschitz continuous gradient that satisfies some relaxed strong convexity condition.In particular, in the case of smooth constrained convex optimization, we provide several relaxationsof the strong convexity conditions and prove that they are sufficient for getting linear convergence forseveral first order methods such as projected gradient, fast gradient and feasible descent methods. Wealso provide examples of functional classes that satisfy our proposed relaxations of strong convexityconditions. Finally, we show that the proposed relaxed strong convexity conditions cover importantapplications ranging from solving linear systems, Linear Programming, and dual formulations of linearlyconstrained convex problems arising in model predictive control.

74

Page 75: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

3.1.1 Problem formulation

In this work we consider the class of convex constrained optimization problems:

(P ) : f∗ = minx∈X

f(x),

where X ⊆ Rn is a simple closed convex set, that is the projection onto this set is easy, and f : X → Ris a closed convex function. We further denote by X∗ = argminx∈X f(x) the set of optimal solutionsof problem (P). We assume throughout this section that the optimal set X∗ is nonempty and closed andthe optimal value f∗ is finite. Moreover, in this work we assume that the objective function is smooth,that is f has Lipschitz continuous gradient with constant Lf > 0 on the set X:

∥∇f(x)−∇f(y)∥ ≤ Lf∥x− y∥ ∀x, y ∈ X. (123)

An immediate consequence of (123) is the following inequality [53]:

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩+Lf

2∥x− y∥2 ∀x, y ∈ X, (124)

while, under convexity of f , we also have:

0 ≤ ⟨∇f(x)−∇f(y), x− y⟩ ≤ Lf∥x− y∥2 ∀x, y ∈ X. (125)

It is well known that first order methods are converging sublinearly on the class of problems whoseobjective function f has Lipschitz continuous gradient with constant Lf on the set X, e.g. convergencerates in terms of function values of order [53]:

f(xk)− f∗ ≤Lf∥x0 − x∗∥2

2kfor projected gradient,

f(xk)− f∗ ≤2Lf∥x0 − x∗∥2

(k + 1)2for fast gradient,

(126)

where xk is the kth iterate generated by the method. Typically, in order to show linear convergence offirst order methods applied for solving smooth convex problems, we need to require strong convexity ofthe objective function. We recall that f is strongly convex function on the convex set X with constantσf > 0 if the following inequality holds [53]:

f(αx+ (1− α)y) ≤ αf(x) + (1− α)f(y)−σfα(1− α)

2∥x− y∥2 (127)

for all x, y ∈ X and α ∈ [0, 1]. Note that if σf = 0, then f is simply a convex function. We denote bySLf ,σf

(X) the class of σf -strongly convex functions with an Lf -Lipschitz continuous gradient on X.

First order methods are converging linearly on the class of problems (P) whose objective function f isin SLf ,σf

(X), e.g. convergence rates of order [53]:

f(xk)− f∗ ≤Lf∥x0 − x∗∥2

2

(1−

σfLf

)k

for projected gradient,

f(xk)− f∗ ≤ 2(f(x0)− f∗

)(1−

√σfLf

)k

for fast gradient.

(128)

75

Page 76: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

In the case of a differentiable function f with Lf -Lipschitz continuous gradient, each of the followingconditions below is equivalent to inclusion f ∈ SLf ,σf

(X) [53]:

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩+σf2∥x− y∥2 ∀x, y ∈ X,

σf∥x− y∥2 ≤ ⟨∇f(x)−∇f(y), x− y⟩ ∀x, y ∈ X.(129)

Let us give some properties of smooth strongly convex functions from the class SLf ,σf(X). Firstly,

using the optimality conditions for (P), that is ⟨∇f(x∗), y− x∗⟩ ≥ 0 for all y ∈ X and x∗ ∈ X∗, in thefirst inequality in (129) we get the following relation:

f(x)− f∗ ≥σf2∥x− x∗∥2 ∀x ∈ X. (130)

Further, the gradient mapping of a continuous differentiable function f with Lipschitz gradient in apoint x ∈ Rn is defined as [53]:

g(x) = Lf (x− [x− 1/Lf∇f(x)]X) ,

If additionally, the function f has also Lipschitz continuous gradient, then we obtain a second relationvalid for any f ∈ SLf ,σf

(X) [79][Lemma 22]:

σf2∥x− y∥ ≤ ∥g(x)− g(y)∥ ∀x, y ∈ X. (131)

However, in many applications the strong convexity condition (127) or equivalently one of the conditions(129) cannot be assumed to hold. Therefore, in the next sections we introduce some non-strongly convexconditions for the objective function f that are less conservative than strong convexity. These are basedon relaxations of strong convexity relations (129)–(131).

3.1.2 Non-strongly convex conditions for a function

In this section we introduce several functional classes that are relaxing the strong convexity properties(129)–(131) of a function and derive relations between these classes. More precisely, we observe thatstrong convexity relations (129) or (131) are valid for all x, y ∈ X. We propose in this section functionalclasses satisfying conditions of the form (129) or (131) that hold for some particular choices of x andy, or satisfying simply the condition (130).

Quasi-strong convexityThe first non-strongly convex relaxation we introduce is based on choosing a particular value for y in thefirst strong convexity inequality in (129), that is y = x ≡ [x]X∗ (recall that [x]X∗ denotes the projectionof x onto the optimal set X∗ of convex problem (P)):

Definition 3.1. Continuously differentiable function f is called quasi-strongly convex on set X if thereexists a constant κf > 0 such that for any x ∈ X and x = [x]X∗ we have:

f∗ ≥ f(x) + ⟨∇f(x), x− x⟩+κf2∥x− x∥2 ∀x ∈ X. (132)

76

Page 77: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Note that inequality (132) alone does not even imply convexity of function f . Moreover, our definitionof quasi-strongly convex functions does not ensure uniqueness of the optimal solution of problem (P)and does not require f to have Lipschitz continuous gradient. We denote the class of convex functionswith Lipschitz continuous gradient with constant Lf in (123) and satisfying the quasi-strong convexityproperty with constant κf in (132) by qSLf ,κf

(X). Clearly, for strongly convex functions with constant

κf , from the first condition in (129) with y = x∗ ∈ X∗, we observe that the following inclusion hold:

SLf ,κf(X) ⊆ qSLf ,κf

(X). (133)

Moreover, combining the inequalities (124) and (132), we obtain that the condition number of objectivefunction f ∈ qSLf ,κf

(X), defined as µf = κf/Lf , satisfies:

0 < µf ≤ 1. (134)

We will derive below other functional classes that are related to our newly introduced class of quasi-strongly convex functions qSLf ,κf

(X).

Quadratic under-approximationLet us define the class of functions satisfying a quadratic under-approximation on the set X, obtainedfrom relaxing the first inequality in (129) by choosing y = x and x = x ≡ [x]X∗ :

Definition 3.2. Continuously differentiable function f has a quadratic under-approximation on X ifthere exists a constant κf > 0 such that for any x ∈ X and x = [x]X∗ we have:

f(x) ≥ f∗ + ⟨∇f(x), x− x⟩+κf2∥x− x∥2 ∀x ∈ X. (135)

We denote the class of convex functions with Lipschitz continuous gradient and satisfying the quadraticunder-approximation property (135) on X by ULf ,κf

(X). Then, we have the following inclusion:

Theorem 3.3. Inequality (132) implies inequality (135). Therefore, the following inclusion holds:

qSLf ,κf(X) ⊆ ULf ,κf

(X). (136)

Proof. Let f ∈ qSLf ,κf(X). Since f is convex function, it satisfies the inequality (135) with some

constant κf (0) ≥ 0, i.e.:

f(x) ≥ f∗ + ⟨∇f(x), x− x⟩+κf (0)

2∥x− x∥2. (137)

77

Page 78: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Using first order Taylor approximation in the integral form we have:

f(x) = f(x) +

∫ 1

0⟨∇f(x+ τ(x− x)), x− x⟩dτ

= f(x) +

∫ 1

0

1

τ⟨∇f(x+ τ(x− x)), τ(x− x)⟩dτ

(132) in x+τ(x−x)

≥ f(x) +

∫ 1

0

1

τ

(f(x+ τ(x− x))− f(x) +

κf2∥τ(x− x)∥2

)dτ

(137)

≥ f(x)+

∫ 1

0

1

τ

(⟨∇f(x), τ(x− x)⟩+

κf (0)

2∥τ(x− x)∥2

)+1

τ

κf2∥τ(x− x)∥2dτ

= f(x) +

∫ 1

0⟨∇f(x), x− x⟩+

τκf (0)

2∥x− x∥2 +

τκf2

∥x− x∥2dτ

= f(x) + ⟨∇f(x), x− x⟩+κf (0) + κf

2· 12∥x− x∥2.

If we denote κf (1) =κf (0)+κf

2 , then we get that inequality (137) also holds for κf (1). Repeating thesame argument as above for f ∈ qSLf ,κf

(X) and satisfying (137) for κf (1) we get that inequality (137)

also holds for κf (2) =κf (1)+κf

2 =κf (0)+3κf

4 . Iterating this procedure we obtain that after t steps:

κf (t) =κf (t− 1) + κf

2=κf (0) + (2t − 1)κf

2t→ κf as t→ ∞.

Since after any t steps the inequality (137) holds with κf (t), using continuity of κf (t) in (137) weobtain (135). This proves our statement.

Moreover, combining the inequalities (124) and (135), we obtain that the condition number ofobjective function f ∈ ULf ,κf

(X), defined as µf = κf/Lf , satisfies:

0 < µf ≤ 1. (138)

Quadratic gradient growthLet us define the class of functions satisfying a bound on the variation of gradients over the set X. Itis obtained by relaxing the second inequality in (129) by choosing y = x ≡ [x]X∗ :

Definition 3.4. Continuously differentiable function f has a quadratic gradient growth on set X if thereexists a constant κf > 0 such that for any x ∈ X and x = [x]X∗ we have:

⟨∇f(x)−∇f(x), x− x⟩ ≥ κf∥x− x∥2 ∀x ∈ X. (139)

Now, let us denote the class of convex differentiable functions with Lipschitz gradient and satisfying thequadratic gradient growth (139) by GLf ,κf

(X). In [78] the authors analyzed a similar class of objectivefunctions, but for unconstrained optimization problems, that is X = Rn, which was called restrictedstrong convexity and was defined as: there exists a constant κf > 0 such that ⟨∇f(x), x − x⟩ ≥κf∥x− x∥2 for all x ∈ Rn. An immediate consequence of Theorem 3.3 is the following inclusion:

Theorem 3.5. Inequality (132) implies inequality (139). Therefore, the following inclusion holds:

qSLf ,κf(X) ⊆ GLf ,κf

(X). (140)

78

Page 79: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Proof. If f ∈ qSLf ,κf(X), then f satisfies the inequality (132). From Theorem 3.3 we also have that

f satisfies inequality (135). By adding the two inequalities (132) and (135) in x we get:

⟨∇f(x)−∇f(x), x− x⟩ ≥ κf∥x− x∥2 ∀x ∈ X, (141)

which proves that inequality (139) holds.

We prove below that (132) or (135) alone and convexity of f implies (139) with constant κf/2. Indeed,let us assume for example that (135) holds, then we have:

f(x) ≥ f∗ + ⟨∇f(x), x− x⟩+κf2∥x− x∥2

≥ f(x) + ⟨∇f(x), x− x⟩+ ⟨∇f(x), x− x⟩+κf2∥x− x∥2

= f(x) + ⟨∇f(x)−∇f(x), x− x⟩+κf2∥x− x∥2,

which leads to (139) with constant κf/2. Combining the inequalities (125) and (139), we obtain thatthe condition number of objective function f ∈ GLf ,κf

(X), satisfies:

0 < µf ≤ 1. (142)

Theorem 3.6. Inequality (139) implies inequality (135). Therefore, the following inclusion holds:

GLf ,κf(X) ⊆ ULf ,κf

(X). (143)

Proof. Let f ∈ GLf ,κf(X), then from first order Taylor approximation in the integral form we get:

f(x) = f(x) +

∫ 1

0⟨∇f(x+ t(x− x)), x− x⟩dt

= f(x) + ⟨∇f(x), x− x⟩+∫ 1

0⟨∇f(x+ t(x− x))−∇f(x), x− x⟩dt

=f(x) + ⟨∇f(x), x− x⟩+∫ 1

0

1

t⟨∇f(x+ t(x− x))−∇f(x), t(x− x)⟩dt

(139)

≥ f(x) + ⟨∇f(x), x− x⟩+∫ 1

0

1

tκf∥t(x− x)∥2dt

= f(x) + ⟨∇f(x), x− x⟩+κf2∥x− x∥2,

where we used that [x + t(x − x)]X∗ = x for any t ∈ [0, 1]. This chain of inequalities proves that fsatisfies inequality (135) with the same constant κf .

Quadratic functional growthWe further define the class of functions satisfying a quadratic functional growth property on the set X.It shows that the objective function grows faster than the squared distance between any feasible pointand the optimal set. More precisely, since ⟨∇f(x∗), y− x∗⟩ ≥ 0 for all y ∈ X and x∗ ∈ X∗, then usingthis relation and choosing y = x and x = x ≡ [x]X∗ in the first inequality in (129), we get a relaxationof this strong convexity condition similar to inequality (130):

79

Page 80: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Definition 3.7. Continuously differentiable function f has a quadratic functional growth on X if thereexists a constant κf > 0 such that for any x ∈ X and x = [x]X∗ we have:

f(x)− f∗ ≥κf2∥x− x∥2 ∀x ∈ X. (144)

Since the above quadratic functional growth inequality is given in x, this does not mean that f growseverywhere faster than the quadratic function κf/2∥x−x∥2. We denote the class of convex differentiablefunctions with Lipschitz continuous gradient and satisfying the quadratic functional growth (144) byFLf ,κf

(X). We now derive inclusion relations between the functional classes we have introduced so far:

Theorem 3.8. The following chain of implications are valid:

(129) ⇒ (132) ⇒ (139) ⇒ (135) ⇒ (144).

Therefore, the following inclusions hold:

SLf ,κf(X) ⊆ qSLf ,κf

(X) ⊆ GLf ,κf(X) ⊆ ULf ,κf

(X) ⊆ FLf ,κf(X). (145)

Proof. From the optimality conditions for problem (P) we have ⟨∇f(x), x− x⟩ ≥ 0 for all x ∈ X. Then,for any objective function f satisfying (135), i.e. f ∈ ULf ,κf

(X), we also have (144). In conclusion,

from previous derivations, (133) and Theorems 3.5 and 3.6 we obtain our chain of inclusions.

Let us define the condition number of objective function f ∈ FLf ,κf(X) as µf =

κf

Lf. If the feasible set

X is unbounded, then combining (124) with (144) and considering ∥x− x∥ → ∞, we conclude that:

0 < µf ≤ 1. (146)

However, if the feasible set X is bounded, we may have κsf ≫ Lf , provided that ∥∇f(x)∥ is large, andthus the condition number might be greater than 1:

µf ≥ 1. (147)

Moreover, from the inclusions given by Theorem 3.8 we conclude that:

µf (S) ≤ µf (qS) ≤ µf (G) ≤ µf (U) ≤ µf (F).

Let us denote the projected gradient step from x with:

x+ = [x− 1/Lf∇f(x)]X ,

and its projection onto the optimal set X∗ with x+ = [x+]X∗ . Then, we will show that if x+ is closerto X∗ than x, then the objective function f must satisfy the quadratic functional growth (144):

Theorem 3.9. Let f be a convex function with Lipschitz continuous gradient with constant Lf . Ifthere exists some positive constant β < 1 such that the following inequality holds:

∥x+ − x+∥ ≤ β∥x− x∥ ∀x ∈ X,

then f satisfies the quadratic functional growth (144) on X with the constant κf = Lf (1− β)2.

80

Page 81: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Proof. On the one hand, from triangle inequality for the projection we have:

∥x− x∥ ≤ ∥x− x+∥ ≤ ∥x− x+∥+ ∥x+ − x+∥.

Combining this relation with the condition from the theorem, that is ∥x+ − x+∥ ≤ β∥x− x∥, we get:

(1− β)∥x− x∥ ≤ ∥x− x+∥. (148)

On the other hand, we note that x+ is the optimal solution of the problem:

x+ = argminz∈X

[f(x) + ⟨∇f(x), z − x⟩+

Lf

2∥z − x∥2

]. (149)

From (124) we have:

f(x+) ≤ f(x) + ⟨∇f(x), x+ − x⟩+Lf

2∥x+ − x∥2

and combining with the optimality conditions of (149) in x, that is ⟨∇f(x)+Lf (x+−x), x−x+⟩ ≤ 0,

we get the following decrease in terms of the objective function:

f(x+) ≤ f(x)−Lf

2∥x+ − x∥2. (150)

Finally, combining (148) with (150), and using f(x+) ≥ f∗, we get our statement.

Error bound propertyLet us recall the gradient mapping of a continuous differentiable function f with Lipschitz continuousgradient in a point x ∈ Rn: g(x) = Lf (x − x+), where x+ = [x − 1/Lf∇f(x)]X is the projectedgradient step from x. Note that g(x∗) = 0 for all x∗ ∈ X∗. Moreover, if X = Rn, then g(x) = ∇f(x).Recall that the main property of the gradient mapping for convex objective functions with Lipschitzcontinuous gradient of constant Lf is given by the following inequality [53][Theorem 2.2.7]:

f(y) ≥ f(x+) + ⟨g(x), y − x⟩+ 1

2Lf∥g(x)∥2 ∀y ∈ X and x ∈ Rn. (151)

Taking y = x in (151) and using that f(x+) ≥ f∗, we get the simpler inequality:

⟨g(x), x− x⟩ ≥ 1

2Lf∥g(x)∥2 ∀x ∈ Rn. (152)

In [72] Tseng introduced an error bound condition that estimates the distance to the solution set fromany feasible point by the norm of the proximal residual: there exists a constant κ > 0 such that∥x − x∥ ≤ κ∥x − [x − ∇f(x)]X∥ for all x ∈ X. This notion was further extended and analyzedin [45,48,79]. Next, we define an error bound type condition, obtained from the relaxation of the strongconvex inequality (131) for the particular choice y = x ≡ [x]X∗ .

Definition 3.10. The continuous differentiable function f has a global error bound on X if there existsa constant κf > 0 such that for any x ∈ X and x = [x]X∗ we have:

∥g(x)∥ ≥ κf∥x− x∥ ∀x ∈ X. (153)

81

Page 82: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

We denote the class of convex functions with Lipschitz continuous gradient and satisfying the errorbound (153) by ELf ,κf

. Let us define the condition number of the objective function f ∈ ELf ,κf(X) as

µf =κf

Lf. Combining inequality (152) and (153) we conclude that the condition number satisfies the

inequality:

0 < µf ≤ 2. (154)

However, for the unconstrained case, i.e. X = Rn and ∇f(x) = 0, from (123) and (153) we get0 < µf ≤ 1. We now determine relations between the quadratic functional growth condition and theerror bound condition.

Theorem 3.11. Inequality (153) implies inequality (144) with constant µf ·κf . Therefore, the followinginclusion holds for the functional class ELf ,κf

(X):

ELf ,κf(X) ⊆ FLf ,µf ·κf

(X). (155)

Proof. Combining (150) and (153) we obtain:

κ2f∥x− x∥2 ≤ ∥g(x)∥2 ≤ 2Lf (f(x)− f(x+)) ≤ 2Lf (f(x)− f∗) ∀x ∈ X.

In conclusion, inequality (144) holds with the constantκ2f

Lf= µf ·κf , where we recall µf = κf/Lf . This

also proves the inclusion: ELf ,κf(X) ⊆ FLf ,µf ·κf

(X).

Theorem 3.12. Inequality (144) implies inequality (153) with constant 11+µf+

√1+µf

· κf . Therefore,

the following inclusion holds for the functional class FLf ,κf(X):

FLf ,κf(X) ⊆ E

Lf ,1

1+µf+√

1+µf·κf

(X). (156)

Proof. From the gradient mapping property (151) evaluated at the point y = x+ ≡ [x+]X∗ , we get:

f∗ ≥ f(x+) + ⟨g(x), x+ − x⟩+ 1

2Lf∥g(x)∥2

= f(x+) + ⟨g(x), x+ − x+⟩ − 1

2Lf∥g(x)∥2.

Further, combining the previous inequality and (144), we obtain:

⟨g(x), x+ − x+⟩+ 1

2Lf∥g(x)∥2 ≥ f(x+)− f∗ ≥

κf2∥x+ − x+∥2.

Using Cauchy-Schwartz inequality for the scalar product and then rearranging the terms we obtain:

1

2Lf

(∥g(x)∥+ Lf∥x+ − x+∥

)2 ≥ κf + Lf

2∥x+ − x+∥2

or equivalently

∥g(x)∥+ Lf∥x+ − x+∥ ≥√Lf (κf + Lf )∥x+ − x+∥.

82

Page 83: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

We conclude that:

∥g(x)∥ ≥(√

Lf (κf + Lf )− Lf

)∥x+ − x+∥.

Since

∥x− x∥ ≤ ∥x− x+∥ ≤ ∥x− x+∥+ ∥x+ − x+∥ =1

Lf∥g(x)∥+ ∥x+ − x+∥,

then we obtain:

∥g(x)∥ ≥(√

Lf (κf + Lf )− Lf

)(∥x− x∥ − 1

Lf∥g(x)∥

).

After simple manipulations and using that µf = κf/Lf , we arrive at:

∥g(x)∥ ≥κf

1 + µf +√1 + µf

∥x− x∥,

which shows that inequality (153) is valid for the constant 11+µf+

√1+µf

· κf .

Note that the functional classes we have introduced previously were obtained by relaxing the strongconvexity inequalities (129)–(131) for some particular choices of x and y. The reader can find otherfavorable examples of relaxations of strong convexity inequalities and we believe that this work opensan window of opportunity for algorithmic research in non-strongly convex optimization settings. In thenext section we provide concrete examples of objective functions that can be found in the functionalclasses introduced above.

3.1.3 Functional classes in qSLf ,κf(X), GLf ,κf

(X) and FLf ,κf(X)

We now provide examples of structured convex optimization problems whose objective function satisfiesone of our relaxations of strong convexity conditions that we have introduced in the previous sections.We start first recalling some error bounds for the solutions of a system of linear equalities and inequalities.Let A ∈ Rp×n, C ∈ Rm×n and the arbitrary norms ∥·∥α and ∥·∥β in Rm+p and Rn. Given the nonemptypolyhedron:

P = {x ∈ Rn : Ax = b, Cx ≤ d},

then there exists a constant θ(A,C) > 0 such that Hoffman inequality holds (for a proof of the Hoffmaninequality see e.g. [79]):

∥x− x∥ ≤ θ(A,C)

∥∥∥∥ Ax− b[Cx− d]+

∥∥∥∥α

∀x ∈ Rn,

where x = [x]P ≡ argminz∈P∥z − x∥β. The constant θ(A,C) is the Hoffman constant for thepolyhedron P with respect to the pair of norms (∥·∥α, ∥·∥β). In [28], the authors provide severalestimates for the Hoffman constant. Assume that A has full row rank and define the following quantity:

ζα,β(A,C) := minI∈J

minu,v

{∥ATu+ CT v∥β∗ :

∥∥∥∥uv∥∥∥∥α∗

= 1, vI ≥ 0, v[m]\I = 0

},

83

Page 84: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

where J = {I ∈ 2[m] : card I = r − p, rank[AT , CTI ] = r} and r = rank[AT , CT ]. An alternative

formulation of the above quantity is:

1

ζα,β(A,C)=sup

∥∥∥∥uv∥∥∥∥α∗:

∥ATu+ CT v∥β∗ = 1, rows of Ccorresponding to nonzero components of vand rows of A are linearly independent

. (157)

In [28] it was proved that ζα,β(A,C)−1, where ζα,β(A,C) is defined in (157), is the Hoffman constant

for the polyhedral set P w.r.t. the norms (∥·∥α, ∥·∥β).Considering the Euclidean setting (α = β = 2) and the above assumptions, then from previous discussionwe have:

θ(A,C) = maxI∈J

1

σmin([AT , CTI ]

T ).

Under some regularity condition we can state a simpler form for ζα,2(A,C). Assume that A has fullrow rank and that the set {h ∈ Rn : Ah = 0, Ch < 0} = ∅, then, we have [28]:

ζα,2(A,C) := min

{∥ATu+ CT v∥2 :

∥∥∥∥uv∥∥∥∥α∗

= 1, v ≥ 0

}. (158)

Thus, for the special case m = 0, i.e. there are no inequalities, we have ζ2,2(A, 0) = σmin(A), whereσmin(A) denotes the smallest nonzero singular value of A, and the Hofman constant is:

θ(A, 0) =1

σmin(A). (159)

Composition of strongly convex function with linear map is in qSLf ,κf(X)

Let us consider the class of optimization problems (P) having the following structured form:

f∗ =minxf(x) ≡ g(Ax) (160)

s.t. : x ∈ X ≡ {x ∈ Rn : Cx ≤ d},

i.e. the objective function is in the form f(x) = g(Ax), where g is a smooth and strongly convexfunction and A ∈ Rm×n is a nonzero general matrix. Problems of this form arise in various applicationsincluding dual formulations of linearly constrained convex problems, convex quadratic problems, routingproblems in data networks, statistical regression and many others. Note that if A has full column rank,then g(Ax) is strongly convex function. However, if A is rank deficient, then g(Ax) is not stronglyconvex. We prove next that the objective function of problem (160) belongs to the class qSLf ,κf

.

Theorem 3.13. Let X = {x ∈ Rn : Cx ≤ d} be a polyhedral set, function g : Rm → R be σg-stronglyconvex with Lg-Lipschitz continuous gradient on X, and A ∈ Rm×n be a nonzero matrix. Then, theconvex function f(x) = g(Ax) belongs to the class qSLf ,κf

(X), with constants Lf = Lg∥A∥2 and

κf =σg

θ2(A,C), where θ(A,C) is the Hoffman constant for the polyhedral optimal set X∗.

Proof. The fact that f has Lipschitz continuous gradient follows immediately from the definition (123):

∥∇f(x)−∇f(y)∥ = ∥AT∇g(Ax)−AT∇g(Ay)∥ ≤ ∥A∥∥∇g(Ax)−∇g(Ay)∥≤ ∥A∥Lg∥Ax−Ay∥ ≤ ∥A∥2Lg∥x− y∥.

84

Page 85: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Thus, Lf = Lg∥A∥2. Further, under assumptions of the theorem, there exists a unique pair (t∗, T ∗) ∈Rm × Rn such that the following relations hold:

Ax∗ = t∗, ∇f(x∗) = T ∗ ∀x∗ ∈ X∗. (161)

For completeness, we give a short proof of this well known fact: let x∗1, x∗2 be two optimal points for the

optimization problem (160). Then, from convexity of f and definition of optimal points, it follows that:

f

(x∗1 + x∗2

2

)=f(x∗1) + f(x∗2)

2.

Since f(x) = g(Ax) we get from previous relation that:

g

(Ax∗1 +Ax∗2

2

)=g(Ax∗1) + g(Ax∗2)

2.

On the other hand using the definition of strong convexity (127) for g we have:

g

(Ax∗1 +Ax∗2

2

)≤ g(Ax∗1) + g(Ax∗2)

2− σg

8∥Ax∗1 −Ax∗2∥2.

Combining the previous two relations, we obtain that Ax∗1 = Ax∗2. Moreover, ∇f(x∗) = AT∇g(Ax∗).In conclusion, Ax and the gradient of f are constant over the set of optimal solutions X∗ for (160),i.e. the relations (161) hold. Moreover, we have that f∗ = f(x∗) = g(Ax∗) = g(t∗) for all x∗ ∈ X∗.In conclusion, the set of optimal solutions X∗ is described by the following polyhedral set:

X∗ = {x∗ : Ax∗ = t∗, Cx∗ ≤ d}.

Since we assume that our optimization problem (P) has at least one solution, i.e. the optimal polyhedralset X∗ is non-empty, then from Hoffman inequality we have that there exists some positive constantdepending on the matrices A and C describing the polyhedral set X∗, i.e. θ(A,C) > 0, such that:

∥x− x∥ ≤ θ(A,C)

∥∥∥∥[ Ax− t∗

[Cx− d]+

]∥∥∥∥ ∀x ∈ Rn,

where x = [x]X∗ (the projection of the vector x onto the optimal set X∗). Then, for any feasible x, i.e.x satisfying Cx ≤ d, we have:

∥x− x∥ ≤ θ(A,C)∥Ax−Ax∥ ∀x ∈ X.

On the other hand, since g is strongly convex, it follows that:

g(Ax)(129)

≥ g(Ax) + ⟨∇g(Ax), Ax−Ax⟩+ σg2∥Ax−Ax∥2.

Combining the previous two relations and keeping in mind that f(x) = g(Ax) and∇f(x) = AT∇g(Ax),we obtain:

f∗ ≥ f(x) + ⟨∇f(x), x− x⟩+ σg2θ2(A,C)

∥x− x∥2 ∀x ∈ X,

which proves that the quasi-strong convex inequality (132) holds for the constant κf = σg/θ2(A,C).

85

Page 86: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Note that we can relax the requirements for g in Theorem 3.13. For example, we can replace thestrong convexity assumption on g with the conditions that g has unique minimizer t∗ and it satisfiesthe quasi-strong convex condition (132) with constant κg > 0. Then, using the same arguments as inthe proof of Theorem 3.13, we can show that for objective functions f(x) = g(Ax) of problem (P), theoptimal set is X∗ = {x∗ : Ax∗ = t∗, Cx∗ ≤ d} and f satisfies (132) with constant κf =

κg

θ2(A,C),

provided that the corresponding optimal set X∗ is nonempty.Moreover, in the unconstrained case, that is X = Rn, and for objective function f(x) = g(Ax), we getfrom (159) the following expression for the quasi-strong convexity constant:

κf = σgσ2min(A). (162)

Below we prove two extensions that belong to other functional classes we have introduced in this work.

Composition of strongly convex function with linear map plus a linear term for X = Rn is inGLf ,κf

(X)

Let us now consider the class of unconstrained optimization problems (P), i.e. X = Rn, having theform:

f∗ = minx∈Rn

f(x) ≡ g(Ax) + cTx, (163)

i.e. the objective function is in the form f(x) = g(Ax)+ cTx, where g is a smooth and strongly convexfunction, A ∈ Rm×n is a nonzero general matrix and c ∈ Rn. We prove in the next theorem that thistype of objective function for problem (163) belongs to the class GLf ,κf

:

Theorem 3.14. Under the same assumptions as in Theorem 3.13 with X = Rn, the objective functionof the form f(x) = g(Ax) + cTx belongs to the class GLf ,κf

(X), with constants Lf = Lg∥A∥2 and

κf =σg

θ2(A,0), where θ(A, 0) is the Hoffman constant for the optimal set X∗.

Proof. Since g is σg-strongly convex and with Lg-Lipschitz continuous gradient, then by the samereasoning as in the proof of Theorem 3.13 we get that there exists unique vector t∗ such that Ax∗ = t∗

for all x∗ ∈ X∗. Similarly, there exists unique scalar s∗ such that cTx∗ = s∗ for all x∗ ∈ X∗. Indeed,for x∗1, x

∗2 ∈ X∗ we have:

f∗ = g(t∗) + cTx∗1 = g(t∗) + cTx∗2,

which implies that cTx∗1 = cTx∗2. On the other hand, since problem (P) is unconstrained, for anyx∗ ∈ X∗ we have:

0 = ∇f(x∗) = AT∇g(t∗) + c,

which implies that cTx∗ = −(∇g(t∗))TAx∗ = −(∇g(t∗))T t∗. Therefore, the set of optimal solutionsX∗ is described in this case by the following polyhedron:

X∗ = {x∗ : Ax∗ = t∗}.

Then, there exists θ(A, 0) > 0 such that the Hoffman inequality holds:

∥x− x∥ ≤ θ(A, 0)∥Ax−Ax∥ ∀x ∈ Rn.

From the previous inequality and strong convexity of g, we have:

σgθ2(A, 0)

∥x− x∥2 ≤ σg∥Ax−Ax∥2(129)

≤ ⟨∇g(Ax)−∇g(Ax), Ax−Ax⟩

= ⟨AT∇g(Ax) + c−AT∇g(Ax)− c, x− x⟩= ⟨∇f(x)−∇f(x), x− x⟩.

86

Page 87: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Finally, we conclude that the inequality on the variation of gradients (139) holds with constant κf =σg

θ2(A,0).

Composition of strongly convex function with linear map plus a linear term is in FLf ,κf(XM )

Finally, let us now consider the class of optimization problems (P) of the form:

f∗ =minxf(x) ≡ g(Ax) + cTx (164)

s.t. : x ∈ X ≡ {x ∈ Rn : Cx ≤ d},

i.e. the objective function is in the form f(x) = g(Ax)+ cTx, where g is a smooth and strongly convexfunction, A ∈ Rm×n is a nonzero matrix and c ∈ Rn. We now prove that the objective function ofproblem (164) belongs to class FLf ,κf

, provided that some boundedness assumption is imposed on f .

Theorem 3.15. Under the same assumptions as in Theorem 3.13, the objective function f(x) =g(Ax) + cTx belongs to the class FLf ,κf

(XM ) for any constant M > 0 such that XM = {x : x ∈X, f(x) − f∗ ≤ M}, with constants Lf = Lg∥A∥2 and κf =

σg

θ2(A,c,C)(1+Mσg+2c2g), where θ(A, c, C)

is the Hoffman constant for the polyhedral optimal set X∗ and cg = ∥∇g(Ax∗)∥, with x∗ ∈ X∗.

Proof. From the proof of Theorem 3.14 it follows that there exist unique t∗ and s∗ such that the optimalset of (164) is given as follows:

X∗ = {x∗ : Ax∗ = t∗, cTx∗ = s∗, Cx∗ ≤ d}.

From Hoffman inequality we have that there exists some positive constant depending on the matricesA,C and c describing the polyhedral set X∗, i.e. θ(A,C, c) > 0, such that:

∥x− x∥ ≤ θ(A, c, C)

∥∥∥∥∥∥ Ax− t∗

cTx− s∗

[Cx− d]+

∥∥∥∥∥∥ ∀x ∈ Rn,

where recall that x = [x]X∗ . Then, for any feasible x, i.e. satisfying Cx ≤ d, we have:

∥x− x∥2 ≤ θ2(A, c, C)(∥Ax−Ax∥2 + (cTx− cT x)2

)∀x ∈ X. (165)

Since f(x) = g(Ax) + cTx and g is strongly convex, it follows from (129) that:

g(Ax)− g(Ax) ≥ ⟨∇g(Ax), Ax−Ax⟩+ σg2∥Ax−Ax∥2

= ⟨AT∇g(Ax) + c, x− x⟩ − ⟨c, x− x⟩+ σg2∥Ax−Ax∥2

= ⟨∇f(x), x− x⟩ − ⟨c, x− x⟩+ σg2∥Ax−Ax∥2.

Using that ⟨∇f(x), x− x⟩ ≥ 0 for all x ∈ X, and definition of f , we obtain:

f(x)− f∗ ≥ σg2∥Ax−Ax∥2 ∀x ∈ X. (166)

87

Page 88: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

It remains to bound (cTx− cT x)2. It is easy to notice that θ(A, c, C) ≥ 1/∥c∥. We also observe that:

cTx− cT x = ⟨∇f(x), x− x⟩ − ⟨∇g(Ax), Ax−Ax⟩.

Since f(x)− f∗ ≥ ⟨∇f(x), x− x⟩ ≥ 0 for all x ∈ X, then we obtain:

|cTx− cT x| ≤ f(x)− f∗ + ∥∇g(Ax)∥ ∥Ax−Ax∥,

and then using inequality (α + β)2 ≤ 2α2 + 2β2 and considering f(x)− f∗ ≤ M , cg = ∥∇g(t∗)∥ and(166), we get:

(cTx− cT x)2 ≤ 2(f(x)− f∗)2 + 2c2g∥Ax−Ax∥2

(2M +

4c2gσg

)(f(x)− f∗) ∀x ∈ X, f(x)− f∗ ≤M.

Finally, we conclude that:

∥x− x∥2 ≤ 2θ2(A, c, C)

σg

(1 +Mσg + 2c2g

)(f(x)− f∗) ∀x ∈ X, f(x)− f∗≤M.

This proves the statement of the theorem.

Typically, for feasible descent methods we take M = f(x0) − f∗ in the previous theorem, where x0 isthe starting point of the method. Moreover, if X is bounded, then there exists always M such thatf(x) − f∗ ≤ M for all x ∈ X. Note that the requirement f(x) − f∗ ≤ M for having a second ordergrowth inequality (144) for f is necessary, as shown in the following example:

Example 3.16. Let us consider problem (P) in the form (164) given by:

minx∈R2

+

1

2x21 + x2

which has X∗ = {0} and f∗ = 0. Clearly, there is no constant κf <∞ such that the following inequalityto be valid:

f(x) ≥κf2∥x∥2 ∀x ≥ 0.

We can take for example x1 = 0 and x2 → +∞. However, for any M > 0 there exists κf (M) < ∞satisfying the above inequality for all x ≥ 0 with f(x) ≤M . For example, we can take:

κf (M) = min{1, 1

M} ⇒ µf (M) =

1

Mfor M ≥ 1.

Note that for this example θ(A, c, C) = 1∥c∥ = 1.

In the sequel we analyze the convergence rate of several first order methods for solving convex constrainedoptimization problem (P) having the objective function in one of the functional classes introduced inthis section.

88

Page 89: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

3.1.4 Linear convergence of projected gradient method (GM)

We show in this section that projected gradient method has linear convergence rates on optimizationproblems (P), whose objective function satisfies one of the non-strongly convex conditions given above.Let us consider the projected gradient algorithm with variable step size:

Algorithm (GM)

Given x0 ∈ X for k ≥ 1 do:

1. Compute xk+1 =[xk − αk∇f(xk)

]X

where αk is a step size such that αk ∈ [L−1f , L−1

f ], with Lf ≥ Lf .

Linear convergence of (GM) for qSLf ,κf

Let us show that the projected gradient method converges linearly on optimization problems (P) whoseobjective functions belong to the class qSLf ,κf

.

Theorem 3.17. Let the optimization problem (P) have the objective function belonging to the classqSLf ,κf

. Then, the sequence xk generated by the projected gradient method (GM) with constant step

size αk = 1/Lf on (P) converges linearly to some optimal point in X∗ with the rate:

∥xk − xk∥2 ≤(1− µf1 + µf

)k

∥x0 − x0∥2, where µf =κfLf. (167)

Proof. From Lipschitz continuity of the gradient of f given in (124) we have:

f(xk+1) ≤ f(xk) + ⟨∇f(xk), xk+1 − xk⟩+Lf

2∥xk+1 − xk∥2. (168)

The optimality conditions for xk+1 are:

⟨xk+1 − xk + αk∇f(xk), x− xk+1⟩ ≥ 0 ∀x ∈ X. (169)

Taking x = xk in (169) and replacing the corresponding expression in (168), we get:

f(xk+1) ≤ f(xk) + (Lf

2− 1

αk)∥xk+1 − xk∥2

αk≤L−1f

≤ f(xk)−Lf

2∥xk+1 − xk∥2.

89

Page 90: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Further, we have:

∥xk+1 − xk∥2 = ∥xk − xk∥2 + 2⟨xk − xk, xk+1 − xk⟩+ ∥xk+1 − xk∥2

= ∥xk − xk∥2 + 2⟨xk+1 − xk, xk+1 − xk⟩ − ∥xk+1 − xk∥2

(169)

≤ ∥xk − xk∥2 + 2αk⟨∇f(xk), xk − xk+1⟩ − ∥xk+1 − xk∥2

= ∥xk− xk∥2+2αk⟨∇f(xk), xk− xk⟩+2αk⟨∇f(xk), xk− xk+1⟩− ∥xk+1 − xk∥2

(132)

≤ ∥xk − xk∥2 + 2αk

(f∗ − f(xk)−

κf2∥xk − xk∥2

)− 2αk

(⟨∇f(xk), xk+1 − xk⟩+ 1

2αk∥xk+1 − xk∥2

)= (1− αkκf )∥xk − xk∥2 + 2αkf

− 2αk

(f(xk) + ⟨∇f(xk), xk+1 − xk⟩+ 1

2αk∥xk+1 − xk∥2

)Lf≤1/αk

≤ (1− αkκf )∥xk − xk∥2 + 2αkf∗

− 2αk

(f(xk) + ⟨∇f(xk), xk+1 − xk⟩+

Lf

2∥xk+1 − xk∥2

)(168)

≤ (1− αkκf )∥xk − xk∥2 − 2αk(f(xk+1)− f∗).

Since (132) holds for the function f , then from Theorem 3.8 we also have that (144) holds and thereforef(xk+1) − f∗ ≥ κf

2 ∥xk+1 − xk+1∥2. Combining the last inequality with the previous one and takinginto account that ∥xk+1 − xk+1∥ ≤ ∥xk+1 − xk∥ , we get:

∥xk+1 − xk+1∥2 ≤ (1− αkκf )∥xk − xk∥2 − αkκf∥xk+1 − xk+1∥2,

or equivalently

∥xk+1 − xk+1∥2 ≤1− αkκf1 + αkκf

· ∥xk − xk∥2. (170)

However, the best decrease is obtained for the constant step size αk = 1/Lf and using the definitionof the condition number µf = κf/Lf , we get:

∥xk+1 − xk+1∥2 ≤1− µf1 + µf

· ∥xk − xk∥2.

This proves our statement.

90

Page 91: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Based on Theorem 3.17 we can easily derive linear convergence for the projected gradient algorithm(GM) in terms of the function values:

f(xk+1)(168)

≤ f(xk) + ⟨∇f(xk), xk+1 − xk⟩+Lf

2∥xk+1 − xk∥2

Lf≤1/αk

≤ minx∈X

f(xk) + ⟨∇f(xk), x− xk⟩+ 1

2αk∥xk − x∥2

≤ minx∈X

f(x) +1

2αk∥xk − x∥2 ≤ f(xk) +

1

2αk∥xk − xk∥2

(167)

≤ f∗ +Lf

2

(1− µf1 + µf

)k

∥x0 − x0∥2.

Finally, the best convergence rate is obtained for constant step size αk = 1/Lf :

f(xk)− f∗(173)

≤Lf∥x0 − x0∥2

2

(1− µf1 + µf

)k−1

∀k ≥ 1. (171)

However, this rate is not continuous as µf → 0. For simplicity, let us assume constant step sizeαk = 1/Lf , and then, using that (GM) is a descent method, i.e. f(xk) − f∗ ≤ f(xk−j) − f∗ for allj < k and iterating the main inequality from the proof of Theorem 3.17, we obtain:

∥xk − xk∥2 ≤ (1− µf )∥xk−1 − xk−1∥2 − 2

Lf

(f(xk)− f∗

)≤ (1− µf )

k∥x0 − x0∥2 − 2

Lf

k∑j=0

(1− µf )j(f(xk−j)− f∗)

≤ (1− µf )k∥x0 − x0∥2 − 2

Lf

(f(xk)− f∗

) k∑j=0

(1− µf )j .

Finally, we get linear convergence in terms of the function values:

f(xk)− f∗ ≤Lf∥x0 − x0∥2

µf(1− µf )−k − 1

. (172)

Since (1 + α)k → 1 + αk as α→ 0, then we see that:

µf(1− µf )−k − 1

≤ 1

kas µf → 0,

and thus from (172) we recover the classical sublinear rate for (GM) as µf → 0:

f(xk)− f∗ ≤Lf∥x0 − x0∥2

2kas µf → 0.

Linear convergence of (GM) for FLf ,κf

We now show that the projected gradient method converges linearly on optimization problems (P) whoseobjective functions belong to the class FLf ,κf

.

91

Page 92: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Theorem 3.18. Let optimization problem (P) have objective function belonging to the class FLf ,κf.

Then, the sequence xk generated by the projected gradient method (GM) with constant step sizeαk = 1/Lf on (P) converges linearly to some optimal point in X∗ with the rate:

∥xk − xk∥2 ≤(

1

1 + µf

)k

∥x0 − x0∥2, where µf =κfLf. (173)

Proof. Using similar arguments as in the previous Theorem 3.17, we have:

∥xk+1 − x∥2 = ∥xk − x∥2 + 2⟨xk − x, xk+1 − xk⟩+ ∥xk+1 − xk∥2

= ∥xk − x∥2 + 2⟨xk+1 − x, xk+1 − xk⟩ − ∥xk+1 − xk∥2

(169)

≤ ∥xk − x∥2 + 2αk⟨∇f(xk), x− xk+1⟩ − ∥xk+1 − xk∥2

≤ ∥xk − x∥2 − 2αk

(⟨∇f(xk), xk+1 − x⟩+

Lf

2∥xk+1 − xk∥2

+ (1

2αk−Lf

2)∥xk+1 − xk∥2

)= ∥xk − x∥2 + (Lfαk − 1)∥xk+1 − xk∥2

− 2αk

(⟨∇f(xk), xk−x⟩+⟨∇f(xk), xk+1−xk⟩+

Lf

2∥xk+1−xk∥2

)(168)

≤ ∥xk − x∥2 + (Lfαk − 1)∥xk+1 − xk∥2

+ 2αk(f(x)− f(xk)) + 2αk(f(xk)− f(xk+1))

αk≤L−1f

≤ ∥xk − x∥2 − 2αk(f(xk+1)− f(x)) ∀x ∈ X.

Taking now in the previous relations x = xk, using ∥xk+1 − xk+1∥ ≤ ∥xk+1 − xk∥ and the quadraticfunctional growth of f (144), we get:

∥xk+1 − xk+1∥2(144)

≤ ∥xk − xk∥2 − κfαk∥xk+1 − xk+1∥2

or equivalently

∥xk+1 − xk+1∥2 ≤ 1

1 + κfαk∥xk − xk∥2. (174)

However, the best decrease is obtained for the constant step size αk = 1/Lf and using the definitionof the condition number µf = κf/Lf , we get:

∥xk+1 − xk+1∥2 ≤ 1

1 + µf∥xk − xk∥2.

Thus, we have obtained the linear convergence rate for (GM) with constant step size αk = 1/Lf fromthe theorem.

92

Page 93: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Using similar arguments as for (171) and combining with (174) we can also derive linear convergenceof (GM) in terms of the function values:

f(xk+1)− f∗ ≤ 1

2αk∥xk − xk∥2

(174)

≤ 1

2αk

(1

1 + κfαk

)∥xk−1 − xk−1∥2,

and the best convergence rate is obtained for constant step size αk = 1/Lf :

f(xk)− f∗(173)

≤Lf∥x0 − x0∥2

2

(1

1 + µf

)k−1

∀k ≥ 1. (175)

However, this rate is not continuous as µf → 0. We can interpolate between the right hand side termsin (126) and (175) to obtain convergence rates in terms of function values of the form:

f(xk)− f∗ ≤Lf∥xt − xt∥2

2(k − t)≤Lf∥x0 − x0∥2

2(k − t)

1

(1 + µf )t∀t = 0 : k − 1,

or equivalently

f(xk)− f∗ ≤Lf∥x0 − x0∥2

2min

t=0:k−1

1

(1 + µf )t(k − t).

Finally, in the next theorem we establish necessary and sufficient conditions for linear convergence ofthe gradient method (GM).

Theorem 3.19. On the class of optimization problems (P) the sequence generated by the gradientmethod (GM) with constant step size is converging linearly to some optimal point in X∗ if and only ifthe objective function f satisfies the quadratic functional growth (144), i.e f belongs to the functionalclass FLf ,κf

.

Proof. The fact that linear convergence of the gradient method implies f satisfying the second ordergrowth property (144) follows from Theorem 3.9. The other implication follows from Theorem 3.18,eq. (174).

3.1.5 Linear convergence of fast gradient method (FGM)

In this section we consider the following fast gradient algorithm, which is a version of Nesterov’s optimalgradient method [53]:

Algorithm (FGM)

Given x0 = y0 ∈ X, for k ≥ 1 do:

1. Compute xk+1 =[yk − 1

Lf∇f(yk)

]X

and

2. yk+1 = xk+1 + βk(xk+1 − xk

)

93

Page 94: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

for appropriate choice of the parameter βk > 0 for all k ≥ 0.Linear convergence of (FGM) for qSLf ,κf

.

When the objective function f ∈ qSLf ,κf(X) we take the following expression for the parameter βk:

βk =

√Lf −√

κf√Lf +

√κf

∀k ≥ 0.

First of all we can easily observe that if f ∈ qSLf ,κf(X), then the gradient mapping g(x) satisfies the

following inequality:

f∗ ≥ f(x+) + ⟨g(x), x− x⟩+ 1

2Lf∥g(x)∥2 +

κf2∥x− x∥ ≡ qLf ,κf

(x, x) (176)

for all x ∈ Rn (recall that x = [x]X∗ and x+ = [x − 1/Lf∇f(x)]X). The convergence proof followssimilar steps as in [53][Section 2.2.4].

Lemma 3.20. Let optimization problem (P) have the objective function f belonging to the class qSLf ,κf

and an arbitrary sequence {yk}k≥0 satisfying yk = [yk]X∗ = y∗ for all k ≥ 0. Define an initial function:

ϕ0(x) = ϕ∗0 +γ02∥x− v0∥2, where γ0 = κf , v0 = y0 and ϕ∗0 = f(y0),

and a sequence {αk}k≥0 satisfying αk ∈ (0, 1). Then, the following two sequences, iteratively definedas:

λk+1 = (1− αk)λk, with λ0 = 1,

ϕk+1(x) = (1− αk)ϕk(x) (177)

+ αk

(f(xk+1)+

1

2Lf∥g(yk)∥2+⟨g(yk), x−yk⟩+

κf2∥x−yk∥2

),

where x0 = y0 and xk+1 =[yk − 1

Lf∇f(yk)

]X, satisfy the following property:

ϕk(y∗) ≤ (1− λk)f

∗ + λkϕ0(y∗) ∀k ≥ 0. (178)

Proof. We prove this statement by induction. Since λ0 = 1, we observe that:

ϕ0(y∗) = (1− λ0)f

∗ + λ0ϕ0(y∗).

Assume that the following inequality is valid:

ϕk(y∗) ≤ (1− λk)f

∗ + λkϕ0(y∗), (179)

then we have:

ϕk+1(y∗) = ϕk+1(y

k) = (1− αk)ϕk(yk) + αkqLf ,κf

(yk, yk)

(176)

≤ (1− αk)ϕk(yk) + αkf

= [1− (1− αk)λk]f∗ + (1− αk)

(ϕk(y

k)− (1− λk)f∗)

yk=y∗+(179)

≤ (1− λk+1)f∗ + λk+1ϕ0(y

∗).

which proves our statement.

94

Page 95: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Lemma 3.21. Under the same assumptions as in Lemma 3.20 and assuming also that the sequence

{xk}k≥0, defined as x0 = y0 and xk+1 =[yk − 1

Lf∇f(yk)

]X, satisfies:

f(xk) ≤ ϕ∗k = minx∈Rn

ϕk(x) ∀k ≥ 0, (180)

then we obtain the following convergence:

f(xk)− f∗ ≤ λk

(f(x0)− f∗ +

γ02∥y∗ − y0∥

). (181)

Proof. Indeed we have:

f(xk)− f∗ ≤ ϕ∗k − f∗ = minx∈Rn

ϕk(x)− f∗ ≤ ϕk(y∗)− f∗

(178)

≤ (1− λk)f∗ + λkϕ0(y

∗)− f∗ = λk (ϕ0(y∗)− f∗) ,

which proves the statement of the lemma.

Theorem 3.22. Under the same assumptions as in Lemma 3.20, the sequence xk generated by fastgradient method (FGM) with constant parameter βk = (

√Lf −

√κf )/(

√Lf +

√κf ) converges linearly

in terms of function values with the rate:

f(xk)− f∗ ≤(1−√

µf)k · 2 (f(x0)− f∗

), where µf =

κfLf, (182)

provided that all iterates yk produce the same projection onto optimal set X∗.

Proof. Let us consider x0 = y0 = v0 ∈ X. Further, for the sequence of functions ϕk(x) as defined in(177) take αk =

√µf ∈ (0, 1) for all k ≥ 0 and denote α =

õf . First, we need to show that the

method (FGM) defined above generates a sequence xk satisfying ϕ∗k ≥ f(xk). Assuming that ϕk(x)has the following two properties:

ϕk(x) = ϕ∗k +κf2∥x− vk∥2 and ϕ∗k ≥ f(xk),

where ϕ∗k = minx∈Rn ϕk(x) and vk = argminx∈Rn ϕk(x), then we will show that ϕk+1(x) has similarproperties. First of all, from the definition of ϕk+1(x), we get:

∇2ϕk+1(x) = ((1− α)κf + ακf ) In = κfIn,

i.e. ϕk+1(x) is also a quadratic function of the same form as ϕk(x):

ϕk+1(x) = ϕ∗k+1 +κf2∥x− vk+1∥2,

where the expression of vk+1 = argminx∈Rn ϕk+1(x) is obtained from the equation ∇ϕk+1(x) = 0,which leads to:

vk+1 =1

κf

((1− α)κfv

k + ακfyk − αg(yk)

).

95

Page 96: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Evaluating ϕk+1 in yk leads to:

ϕ∗k+1 +κf2∥yk − vk+1∥2 =(1− α)

(ϕ∗k +

κf2∥yk − vk∥2

)+ α

(f(xk+1) +

1

2Lf∥g(yk)∥2

).

On the other hand, we have:

vk+1 − yk =1

κf

(κf (1− α)(vk − yk)− αg(yk)

).

If we substitute this expression above, we obtain:

ϕ∗k+1 =(1− α)ϕ∗k + αf(xk+1) +

2Lf− α2

2κf

)∥g(yk)∥2

+ α(1− α)(κf2∥yk − vk∥2 + ⟨g(yk), vk − yk⟩

).

Using the main property of the gradient mapping (151), valid for functions with Lipschitz continuousgradient, we have:

ϕ∗k ≥ f(xk) ≥ f(xk+1) + ⟨g(yk), xk − yk⟩+ 1

2Lf∥g(yk)∥2.

Substituting this inequality in the previous one we get:

ϕ∗k+1 ≥ f(xk+1) +

(1

2Lf− α2

2κf

)∥g(yk)∥2 + (1− α)⟨g(yk), α(vk − yk) + xk − yk⟩.

Since α =√µf , then

12Lf

− α2

2κf= 0. Moreover, we have the freedom to choose yk, which is obtained

from the condition α(vk − yk) + xk − yk = 0:

yk =1

1 + α(αvk + xk).

Then, we can conclude that ϕ∗k+1 ≥ f(xk+1). Moreover, replacing the expression of yk in vk+1

leads to the conclusion that we can eliminate the sequence vk since it can be expressed as: vk+1 =xk + 1

α(xk+1 − xk). Then, we find that yk+1 has the expression as in our scheme (FGM) above with

βk = (√Lf − √

κf )/(√Lf +

√κf ). Using, now Lemmas 3.20 and 3.21 we get the convergence rate

from (182) (we also use thatκf

2 ∥x0 − x0∥2 ≤ f(x0)− f∗).

Remark 3.23. For unconstrained problem minx∈Rn g(Ax), the gradient in some point y is given byAT∇g(Ay) ∈ Range(AT ). Then, the method (FGM) generates in this case a sequence yk of the form:

yk = y0 +AT zk, zk ∈ Rm ∀k ≥ 0.

Moreover, for this problem the optimal set X∗ = {x : Ax = t∗} and the projection onto this affinesubspace is given by:

[ · ]X∗ =(In −AT (AAT )−1A

)(·) +AT (AAT )−1t∗.

96

Page 97: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

In conclusion, all vectors yk generated by algorithm (FGM) produce the same projection onto the optimalset X∗:

yk = y0 −AT (AAT )−1Ay0 +AT (AAT )−1t∗ ∀k ≥ 0,

i.e. the assumptions of Theorem 3.22 are valid for this optimization problem.

Linear convergence of restart (FGM) for FLf ,κf.

It is known that for the convex optimization problem (P), whose objective function f has Lipschitzcontinuous gradient, and for the choice:

βk =θk − 1

θk+1, with θ1 = 1 and θk+1 =

1 +√

1 + 4θ2k

2,

the algorithm (FGM) has the following convergence rate [53]:

f(xk)− f∗ ≤2Lf∥x0 − x0∥2

(k + 1)2∀k > 0. (183)

We will show next that on the optimization problem (P) whose objective function satisfies additionallythe quadratic functional growth (144), i.e. f ∈ FLf ,κf

, a restarting version of algorithm (FGM) with

the above choice of βk has linear convergence without the assumption yk = y∗ for all k ≥ 0. By fixinga positive constant c ∈ (0, 1) and then combining (183) and (144), we get:

f(xk)− f∗ ≤2Lf

(k + 1)2∥x0 − x0∥2 ≤

4Lf

κf (k + 1)2(f(x0)− f∗) ≤ c(f(x0)− f∗),

which leads to the following expression:

c =4Lf

κfk2.

Then, for fixed c, the number of iterations Kc that we need to perform in order to obtain f(xKc)−f∗ ≤c(f(x0)− f∗) is given by:

Kc =

⌈√4Lf

cκf

⌉=

⌈√4

cµf

⌉.

Therefore, after each Kc steps of Algorithm (FGM) we restart it obtaining the following scheme:

Algorithm (R-FGM)

Given x0,0 = y0,0 = x0 ∈ X and restart interval Kc. For j ≥ 0 do:

1. Run Algorithm (FGM) for Kc iterations to get xKc,j

2. Restart: x0,j+1 = xKc,j , y0,j+1 = xKc,j and θ1 = 1.

97

Page 98: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Then, after p restarts of Algorithm (R-FGM) we obtain the linear convergence:

f(x0,p)− f∗ = f(xKc,p−1)− f∗ ≤2Lf∥x0,p−1 − x0,p−1∥2

(Kc + 1)2

≤ c(f(x0,p−1)− f∗) ≤ · · · ≤ cp(f(x0,0)− f∗) = cp(f(x0)− f∗).

Thus, total number of iterations is k = pKc and denote xk = x0,p. Then, we have:

f(xk)− f∗ ≤(c

1Kc

)k(f(x0)− f∗).

We want to optimize e.g. the number of iteration Kc:

minKc

c1

Kc ⇔ minKc

1

Kclog c ⇔ min

Kc

1

Kclog

4

µfK2c

,

which leads to

K∗c =

2eõf

and c = e−2.

In conclusion, we get the following convergence rate for (R-FGM) method:

f(xk)− f∗ ≤(e−2

õf

2e

)k

(f(x0)− f∗) =

(e−

õf

e

)k

(f(x0)− f∗), (184)

and since eα ≈ 1 + α as α ≈ 0, then for√µf

e ≈ 0 we get:

f(xk)− f∗ ≤(e−

õf

e

)k

(f(x0)− f∗) ≈(1−

õf

e

)k

(f(x0)− f∗). (185)

Note that if the optimal value f∗ is known in advance, then we just need to restart algorithm (R-FGM)at the iteration Kc ≤ K∗

c when the following condition holds:

f(xKc,j)− f∗ ≤ c(f(x0,j)− f∗),

which can be practically verified. Using the second order growth property (144) we can also obtaineasily linear convergence of the generated sequence xk to some optimal point in X∗.

3.1.6 Linear convergence of feasible descent methods (FDM)

We now consider a more general descent version of Algorithm (GM) where the gradients are perturbed:

Algorithm (FDM)

Given x0 ∈ X and β, L > 0 for k ≥ 0 do:

1. Compute xk+1 =[xk − αk∇f(xk) + ek

]X

such that

2. ∥ek∥ ≤ β∥xk+1 − xk∥ and f(xk+1) ≤ f(xk)− L2 ∥x

k+1 − xk∥2,

98

Page 99: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

where the stepsize αk is chosen such that αk ≥ L−1f > 0 for all k. It has been showed in [79] that

algorithm (FDM) covers important particular schemes: e.g. proximal point minimization, random/cycliccoordinate descent, extragradient descent and matrix splitting methods are all feasible descent methods.Note that linear convergence of algorithm (FDM) under the error bound assumption (153), i.e. f ∈ELf ,κf

, is proved e.g. in [79]. Hence, in the next theorem we prove that the feasible descent method

(FDM) converges linearly in terms of function values on optimization problems (P) whose objectivefunctions belong to the class FLf ,κf

.

Theorem 3.24. Let the optimization problem (P) have the objective function belonging to the classFLf ,κf

. Then, the sequence xk generated by the feasible descent method (FDM) on (P) convergeslinearly in terms of function values with the rate:

f(xk)− f∗ ≤

1

1 +Lκf

4(Lf+Lf+βLf )2

k

(f(x0)− f∗). (186)

Proof. The optimality conditions for computing xk+1 are:

⟨xk+1 − xk + αk∇f(xk)− ek, x− xk+1⟩ ≥ 0 ∀x ∈ X. (187)

Then, using convexity of f and Cauchy-Schwartz inequality, we get:

f(xk+1)− f(xk+1) ≤ ⟨∇f(xk+1), xk+1 − xk+1⟩= ⟨∇f(xk+1)−∇f(xk) +∇f(xk), xk+1 − xk+1⟩(123)+(187)

≤ Lf∥xk+1− xk∥∥xk+1− xk+1∥+ 1

αk⟨xk+1− xk− ek, xk+1− xk+1⟩

≤ (Lf + Lf )∥xk+1 − xk∥∥xk+1 − xk+1∥+ Lf∥ek∥∥xk+1 − xk+1∥≤ (Lf + Lf + βLf )∥xk+1 − xk∥∥xk+1 − xk+1∥.

Since f ∈ FLf ,κfthen it satisfies the second order growth property, i.e. f(xk+1)−f(xk+1) ≥ κf

2 ∥xk+1−xk+1∥2, and using it in the previous derivations we obtain:

f(xk+1)− f(xk+1) ≤2(Lf + Lf + βLf )

2

κf∥xk+1 − xk∥2. (188)

Combining (188) with the descent property of the algorithm (FDM), that is the inequality ∥xk+1−xk∥2 ≤2L

(f(xk)− f(xk+1)

), we get:

f(xk+1)− f(xk+1) ≤4(Lf + Lf + βLf )

2

Lκf

(f(xk)− f(xk+1)

),

which leads to

f(xk+1)− f(xk+1) ≤ 1

1 +Lκf

4(Lf+Lf+βLf )2

(f(xk)− f(xk)

).

Using an inductive argument we get the statement of the theorem.

Note that, once we have obtained linear convergence in terms of function values for the algorithm(FDM), we can also obtain linear convergence of the generated sequence xk to some optimal point inX∗ by using the second order growth property (144).

99

Page 100: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

3.1.7 Applications

In this section we present several applications having the objective function in one of the structuredfunctional classes of Section 3.1.3.Solution of linear systemsIt is well known that finding a solution of a symmetric linear system Qx+q = 0, where Q ≽ 0 (notationfor positive semi-definite matrix), is equivalent to solving a convex quadratic program (QP):

minx∈Rn

f(x)

(=

1

2xTQx+ qTx

).

Let Q = LTQLQ be the Cholesky decomposition of Q. For simplicity, let us assume that our symmetric

linear system has a solution, e.g. xs, then q is in the range of Q, i.e. q = −Qxs = −LTQLQxs.

Therefore, if we define the strongly convex function g(z) = 12∥z∥

2 − (LQxs)T z, having Lg = σg = 1,

then our objective function is the composition of g with the linear map LQx:

f(x) =1

2∥LQx∥2 − (LT

QLQxs)Tx = g(LQx).

Thus, our convex quadratic problem is in the form of unconstrained structured optimization problem(160) and from Section 3.1.3 we conclude that the objective function of this QP is in the class qSLf ,κf

with:

Lf = λmax(Q) and κf =σ2min(LQ)=λmin(Q) ⇒ µf =

λmin(Q)

λmax(Q)≡ 1

cond(Q),

where λmin(Q) denotes the smallest non-zero eigenvalue of Q and λmax(Q) is the largest eigenvalue ofQ. Since we assume that our symmetric linear system has a solution, i.e. f∗ = 0, from Theorem 3.22and Remark 3.23 we conclude that when solving this convex QP with the algorithm (FGM) we get theconvergence rate in terms of function values:

f(xk) ≤

(1−

√1

cond(Q)

)k

· 2f(x0)

or in terms of residual (gradient) or distance to the solution:

∥Qxk + q∥2 = ∥∇f(xk)∥2 ≤ L2f∥xk − xk∥2 ≤

2L2f

κf

(f(xk)− f∗

)≤

(1−

√1

cond(Q)

)k

· λmax(Q) · cond(Q)

(1

2(x0)TQx0 + qTx0

).

Therefore, the usual (FGM) algorithm without restart attains an ϵ optimal solution in a number ofiterations of order

√cond(Q) log 1

ϵ , i.e. the condition number cond(Q) of the matrix Q is squarerooted. From our knowledge, this is one of the first results showing linear convergence depending on thesquare root of the condition number for the fast gradient method on solving a symmetric linear systemwith positive semi-definite matrix Q ≽ 0. Note that the linear conjugate gradient method can alsoattain an ϵ approximate solution in much fewer than n steps, i.e. the same

√cond(Q) log 1

ϵ iterations.Usually, in the literature the condition number appears linearly in the convergence rate of first order

100

Page 101: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

methods for solving linear systems with positive semi-definite matrices. For example, the coordinatedescent method from [30] requires

√n · cond(Q) log 1

ϵ iterations for obtaining an ϵ optimal solution.Our results can be extended for solving general linear systems Ax + b = 0, where A ∈ Rm×n. In thiscase we can formulate the equivalent unconstrained optimization problem:

minx∈Rn

∥Ax+ b∥2

which is a particular case of (160) and from Section 3.1.3 we can also conclude that the objectivefunction of this QP is in the class qSLf ,κf

with:

Lf = σ2max(A) and κf =σ2min(A) ⇒ µf =

σ2min(A)

σ2max(A),

where σmin(A) denotes the smallest non-zero singular value of A and σmax(A) is the largest singularvalue of A. In this case the usual (FGM) algorithm attains and ϵ optimal solution in a number of

iterations of order σmax(A)σmin(A) log

1ϵ .

Dual of linearly constrained convex problemsLet (P) be the dual formulation of a linearly constrained convex problem:

minug(u)

s.t. : c−ATu ∈ K = Rn1 × Rn2+ .

Then, the dual of this optimization problem can be written in the form of structured problem (164),where g is the convex conjugate of g. From duality theory we know that g is strongly convex and withLipschitz gradient, provided that g is strongly convex and with Lipschitz gradient. The reader shouldnote that the dual of the model predictive control problem for linear systems can be formulated in theform of the previous optimization problem, see [45] for more details.

Lasso problemThe Lasso problem is defined as:

minx:Cx≤d

∥Ax− b∥2 + λ∥x∥1.

Then, the Lasso problem is a particular case of the structured optimization problem (164), providedthat e.g. the feasible set of this problem is bounded (polytope).

Linear programmingFinding a primal-dual solution of a linear cone program can also be written in the form of a structuredoptimization problem (160). Indeed, let c ∈ RN , b ∈ Rm and K ⊆ RN be a closed convex cone, thenwe define the linear cone programming:

minu

⟨c, u⟩ s.t. Eu = b, u ∈ K, (189)

and its associated dual problem

minv,s

⟨b, v⟩ s.t. ET v + s = c, s ∈ K∗, (190)

101

Page 102: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

where K∗ denotes the dual cone. We assume that the pair of cone programming (189)–(190) haveoptimal solutions and their associated duality gap is zero. Therefore, a primal-dual solution of (189)–(190) can be found by solving the following convex feasibility problem, also called homogeneous self-dualembedding:

find (u, v, s) such that

{ET v + s = c, Eu = b, ⟨c, u⟩ = ⟨b, v⟩u ∈ K, s ∈ K∗, v ∈ Rm,

(191)

or, in a more compact formulation:

find x such that

{Ax = d

x ∈ K,

where x =

uvs

, A =

0 ET InE 0 0cT −bT 0

, d =

cb0

, K = K × Rm × K∗. In this work we propose

solving a linear program in the homogeneous self-dual embedding form using the first order methodspresented above. A simple reformulation of this constrained linear system as optimization problem is:

minx∈K

∥Ax− d∥2. (192)

Denote the dimension of the variable x as n = 2N+m. Let us note that the optimization problem (192)is a particular case of (160) with objective function of the form f(x) = g(Ax), with g(·) = ∥ · −d∥2.Moreover, the conditions of Theorem 3.13 hold provided that K = RN

+ . We conclude that we can alwayssolve a linear program in linear time using the first order methods described in the present work.

3.1.8 Numerical simulations

We test the performance of first order algorithms described above on randomly generated Linear Pro-grams (189) with K = RN

+ . We assume Linear Programs with finite optimal values. Then, we canreformulate (189) as the quadratic convex problem (192) for which f∗ = 0. We compare the followingalgorithms for problem (192) (the results are given in Figures 1 and 2):

1. Projected gradient algorithm with fixed stepsize (GM): αk = ∥A∥−2 (in this case the Lipschitzconstant is Lf = ∥A∥2).

2. Fast gradient algorithm with restart (R-FGM): where c = 10−1 and we restart when ∥AxK∗c ,j −

d∥ ≤ c∥Ax0,j − d∥.

3. Exact cyclic coordinate descent algorithm (Cyclic CD):

xk+1i = arg min

xi∈Ki

∥Axi(k)− d∥2,

where Ki is either R+ or R and xi(k) = [xk+11 · · ·xk+1

i−1 xi xki+1 · · ·xkn]. It has been proved in [?]

that this algorithm is a particular version of the feasible descent method (FDM) with parameters:

αk = 1, β = 1 + Lf

√n, L = min

i∥Ai∥2,

provided that all the columns of A are nonzeros, i.e. ∥Ai∥ > 0 for all i = 1 : n.

102

Page 103: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Figure 7: Linear convergence of algorithms R-FGM (left) and GM (right): log scale of the error∥Axk − d∥2. We also compare with the theoretical sublinear estimates (dot lines) for the convergencerate of algorithms FGM (O(LfR

2f/k

2)) and GM (O(LfR2f/k)) for smooth convex problems. The plots

clearly show our theoretical findings, i.e. linear convergence.

0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2

x 104k

LfRf 2/k2

R−FGM

0.5 1 1.5 2 2.5 3 3.5 4

x 104k

LfRf2/kGM

The comparisons use Linear Programs whose data (E, b, c) are generated randomly from the standardGaussian distribution with full or sparse matrix E. Matrix E has 100 rows and 150 columns in the fullcase and 900 rows and 1000 columns in the sparse case. Figure 7 depicts the error ∥Axk − d∥. We canobserve that the gradient method has a slower convergence than the fast gradient method with restart,but both have a linear behaviour as we can see from the comparison with the theoretical sublinearestimates, see Figure 7. Moreover, the fast gradient method with restart is performing much faster thanthe gradient or cyclic coordinate descent methods on sparse and full Linear Programs, see Figure 8.

Figure 8: The behavior of algorithms GM, R-FGM and Cyclic CD: log scale of the error ∥Axk − d∥along iterations k (left - full E, right - sparse E).

1000 2000 3000 4000 5000 6000k

Cyclic CD R−FGM GM

0 2000 4000 6000 8000 10000

10−2

100

102

104

log

(||A

x−d

||)

Cyclic CDR−FGMGM

103

Page 104: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

3.2 Parallel random coordinate descent for composite minimization: convergenceanalysis and error bounds

Despite widespread use of coordinate descent methods for solving large convex problems [47,51], thereare some aspects that have not been fully studied. In particular, in applications, the assumption ofLipschitz gradient and strong convexity is restrictive and the main interest is in finding larger classes offunctions for which we can still prove linear convergence. We are also interested in providing schemesbased on parallel and/or distributed computations and analyzing in which manner the number of com-ponents to be updated enters into the convergence rate. Finally, the convergence analysis has beenalmost exclusively limited to centralized step size rules and local convergence results. These representthe main issues we pursue here.In this work we employ a parallel version of a randomized (block) coordinate descent method forminimizing the sum of a partially separable smooth convex function and a fully separable non-smoothconvex function. Under the assumption of Lipschitz continuity of the gradient of the smooth function,this method has a sublinear convergence rate. Linear convergence rate of the method is obtained for thenewly introduced class of generalized error bound functions. We prove that the new class of generalizederror bound functions encompasses both global/local error bound functions and smooth strongly convexfunctions. We also show that the theoretical estimates on the convergence rate depend on the number ofblocks chosen randomly and a natural measure of separability of the smooth component of the objectivefunction.

3.2.1 Problem formulation

In many big data applications arising from e.g. networks, control or data ranking, we have a systemformed from several entities, with a communication graph which indicates the interconnections betweenentities (e.g. sources and links in network optimization [57], website pages in data ranking [8] orsubsystems in control [40]). We denote this bipartite graph as G = ([N ] × [N ], E), where [N ] =

{1, . . . , N}, [N ] ={1, . . . , N

}and E ∈ {0, 1}N×N is an incidence matrix. We also introduce two sets

of neighbors Nj and Ni associated to the graph, defined as:

Nj = {i ∈ [N ] : Eij = 1} ∀j ∈ [N ] and Ni = {j ∈ [N ] : Eij = 1} ∀i ∈ [N ].

The index sets Nj and Ni, which e.g. in the context of network optimization may represent the setof sources which share the link j ∈ [N ] and the set of links which are used by the source i ∈ [N ],respectively, describe the local information flow in the graph. We denote the entire vector of variablesfor the graph as x ∈ Rn. The vector x can be partitioned accordingly in block components xi ∈ Rni , withn =

∑Ni=1 ni. In order to easily extract subcomponents from the vector x, we consider a partition of the

identity matrix In = [U1 . . . UN ], with Ui ∈ Rn×ni , such that xi = UTi x and matrices UNi ∈ Rn×nNi ,

such that xNi = UTNix, with xNi being the vector containing all the components xj with j ∈ Ni. In

this work we address problems arising from such systems, where the objective function can be writtenin a general form as:

F ∗ = minx∈Rn

F (x)

=N∑j=1

fj(xNj ) +N∑i=1

Ψi(xi)

, (193)

where fj : RnNj → R and ψi : Rni → R. We denote f(x) =∑N

j=1 fj(xNj ) and Ψ(x) =∑N

i=1Ψi(xi).The function f(x) is a smooth partially separable convex function, while Ψ(x) is fully separable convex

104

Page 105: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

non-smooth function. The local information structure imposed by the graph G should be consideredas part of the problem formulation. We consider the following natural measure of separability of theobjective function F :

(ω, ω) = (maxj∈[N ]

|Nj |, maxi∈[N ]

|Ni|).

Note that 1 ≤ ω ≤ N , 1 ≤ ω ≤ N and the definition of the measure of separability (ω, ω) is moregeneral than the one considered in [61] that is defined only in terms of ω. It is important to notethat coordinate gradient descent type methods for solving problem (193) are appropriate where ω isrelatively small, otherwise incremental type methods [57] should be considered for solving (193). Indeed,difficulties may arise when f is the sum of a large number of component functions and ω is large, sincein that case exact computation of the components of gradient (i.e. ∇if(x) =

∑j∈Ni

∇ifj(xNj )) canbe either very expensive or impossible due to noise. In conclusion, we assume that the algorithm isemployed for problems (193), with (ω, ω) relatively small, i.e. ω, ω ≪ n (see Section 3.2.2 for practicalapplications satisfying this condition).By x∗ we denote an optimal solution of problem (193) and by X∗ the set of optimal solutions. Wedefine the index and the set indicator functions as:

1Nj (i) =

{1, if i ∈ Nj

0, otherwise,and IX(x) =

{0, if x ∈ X

+∞, otherwise.

Also, by ∥ · ∥ we denote the standard Euclidean norm and we introduce an additional norm ∥x∥2W =xTWx, where W ∈ Rn×n is a positive diagonal matrix. Considering these, we denote by ΠW

X (x)the projection of a point x onto a set X in the norm ∥ · ∥W , i.e.: ΠW

X (x) = argminy∈X ∥y − x∥2W .Furthermore, for simplicity of exposition, we denote by x the projection of a point x on the optimal setX∗, i.e. x = ΠW

X∗(x). In this work we consider that the smooth component f(x) of (193) satisfies theassumption:

Assumption 3.25. We assume the functions fj(xNj ) have LNj -Lipschitz gradient:

∥∇fj(xNj )−∇fj(yNj )∥ ≤ LNj∥xNj − yNjj∥ ∀xNj , yNj ∈ RnNj . (194)

Note that our assumption is different from the ones in [35, 47, 51, 61], where the authors consider thatthe gradient of the function f is coordinate-wise Lipschitz continuous, which states the following: if wedefine the partial gradient ∇if(x) = UT

i ∇f(x), then there exists some constants Li > 0 such that

∥∇if(x+ Uiyi)−∇if(x)∥ ≤ Li∥yi∥ ∀x ∈ Rn, yi ∈ Rni . (195)

As a consequence of Assumption 3.25 we have that [53]:

fj(xNj + yNj ) ≤ fj(xNj ) + ⟨∇fj(xNj ), yNj ⟩+LNj

2∥yNj∥2. (196)

From Assumption 3.25 we derive the following descent lemma, which is central in our derivation of aparallel coordinate descent method and proving its convergence rate.

Lemma 3.26. Under Assumption 3.25 the following holds for f(x)=N∑j=1

fj(xNj ):

f(x+ y) ≤ f(x) + ⟨∇f(x), y⟩+ 1

2∥y∥2W ∀x, y ∈ Rn, (197)

where W ≻0 is block-diagonal with its blocks Wii∈Rni×ni , Wii=∑

j∈Ni

LNjIni , i∈ [N ].

105

Page 106: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Proof. If we sum up (196) for j ∈ [N ] and by the definition of f we have that:

f(x+ y) ≤ f(x) +N∑j=1

[⟨∇fj(xNj ), yNj ⟩+

LNj

2∥yNj∥2

]. (198)

Given matrices UNj , we can express the first term in the right hand side as follows:

N∑j=1

⟨∇fj(xNj ), yNj

⟩=

N∑j=1

⟨∇fj(xNj ), U

TNjy⟩=

N∑j=1

⟨UNj∇fj(xNj ), y

⟩=⟨∇f(x), y⟩.

Since W is a diagonal matrix we can express the norm ∥ · ∥W as: ∥y∥2W =∑N

i=1

(∑j∈Ni

LNj

)∥yi∥2.

From the definition of Nj and Ni, note that 1Nj (i) is equivalent to 1Ni(j). Thus, for the final term of

the right hand side of (198) we have:

1

2

N∑j=1

LNj∥yNj∥2 =1

2

N∑j=1

LNj

∑i∈Nj

∥yi∥2 =1

2

N∑j=1

LNj

N∑i=1

∥yi∥21Nj (i)

=1

2

N∑i=1

∥yi∥2N∑j=1

LNj1Ni(j) =

1

2

N∑i=1

∥yi∥2∑j∈Ni

LNj =1

2∥y∥2W ,

which proves the statement of the lemma.

Note that the convergence results from this section hold for any descent lemma in the form (197) andthus the expression of the matrixW above can be replaced with any other block-diagonal matrixW ≻ 0for which (197) is valid. Based on (195) a similar inequality as in (197) can be derived, but the matrixW is replaced in this case with the matrix ωW ′ = ωdiag(LiIni ; i ∈ [N ]). These differences in thematrices will lead to different step sizes in the algorithms of our work and of e.g. [61]. The followingrelation establishes Lipschitz continuity for ∇f but in the norm ∥ ·∥W , whose proof can be derived usingsimilar arguments as in [53]:

∥∇f(x)−∇f(y)∥W−1 ≤ ∥x− y∥W ∀x, y ∈ Rn. (199)

3.2.2 Motivating practical applications

We now present important applications from which the interest for problems of type (193) stems. Oneapplication is found in data mining or machine learning [58], where we must solve a sparse problem:

minx∈Rn

f(x) + λ ∥x∥1 ,

where λ > 0, ∥·∥1 denotes the 1-norm and f(x) is the loss function. E.g., in the sparse logistic regression,

the average logistic loss function is: f(x) =∑N

j=1 fj(xNj ) =1N

∑Nj=1 log

(1 + exp

(−bj⟨aj , x⟩

)), where

the vectors aj ∈ Rn represent N samples, and bj represent the binary class labels with bj ∈ {−1,+1}.Note that Ψ(x) = λ∥x∥1 is the separable non-smooth component which promotes the sparsity of thedecision variable x. If we associate to this problem a bipartite graph G where the incidence matrix E isdefined such that Eij = 1 provided that aji = 0, then the vectors aj have a certain sparsity according

106

Page 107: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

to this graph, i.e. they only have nonzero components in ajNj. It can be easily proven that the objective

function f in this case satisfies (194) with LNj =∑

l∈Nj∥ajl ∥

2/4 and (195) with Li =∑

j∈Ni∥aji∥2/4.

Another classical problem which implies functions fj of type (194) is:

minxi∈Xi⊆Rni

F (x)

(=

1

2∥Ax− b∥2 +

N∑i=1

λi∥xi∥1

), (200)

where A ∈ RN×n, the sets Xi are convex, n =∑N

i=1 ni and λ > 0. This problem is also knownas the constrained lasso problem and is widely used e.g. in signal processing, fused or generalizedlasso and monotone curve estimation [14]. For example, in image restoration, incorporating a prioriinformation (such as box constraints on x) can lead to substantial improvements in the restoration andreconstruction process (see [14] for more details). Note that this problem is a special case of problem(193), with Ψ(x) =

∑Ni=1[λi∥xi∥1 + IXi(xi)] being block separable and with fj defined as: fj(xNj ) =

12(a

TNjxNj − bj)

2, where aNj are the nonzero components of row j of A, corresponding to Nj . In this

application the functions fj satisfy (194) with Lipschitz constants LNj = ∥aNj∥2. Given these constants,

we find that f in this case satisfies (197) with the matrixW = diag(∑

j∈Ni∥aNj∥2Ini ; i ∈ [N ]

). Also,

note that functions of type (200) satisfy Lipschitz continuity (195) with Li = ∥Ai∥2, where Ai ∈ RN×ni

denotes block column i of the matrix A.A third problem which falls under the same category is derived from the primal:

f∗ = minu∈Rm

N∑j=1

gj(uj), s.t: Au ≤ b, (201)

where A ∈ Rn×m, uj ∈ Rmj and the functions gj are strongly convex with convexity parameters σj . Thistype of problem is often found in network control [40], network optimization or utility maximization [57].We formulate the dual problem of (201) as:

f∗ =maxx∈Rn

N∑j=1

−g∗j (−ATj x)− ⟨x, b⟩ −Ψ(x), (202)

where Aj ∈ Rn×mj is the jth block column of A, x denotes the Lagrange multiplier, Ψ(x) = IRn+(x)

is the indicator function for the nonnegative orthant Rn+ and g∗j (z) the convex conjugate of the func-

tion gj(uj). Note that, given the strong convexity of gj(uj), then the functions g∗j (z) have Lipschitz

continuous gradient in z of type (194) with constants 1σj

[53]. Now, if the matrix A has some sparsity

induced by a graph, i.e. the blocks Aij = 0 if the corresponding incidence matrix has Eij = 0, whichin turn implies that the block columns Aj are sparse according to some index set Nj , then the matrix-

vector products ATj x depend only on xNj , such that fj(xNj ) = −g∗j

(−AT

NjxNj

)− ⟨xNj , bNj ⟩, with∑

j⟨xNj , bNj ⟩ = ⟨x, b⟩. Then, fj has Lipschitz continuous gradient of type (194) with LNj =∥ANj

∥2

σj.

For this problem we also have componentwise Lipschitz continuous gradient of type (195) with Li =∑j∈Ni

∥Aij∥2σj

. Note that there are many applications in the form (201) with matrix A given as column

linked block angular form for which ω = N and ω = 2 is small.

107

Page 108: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

3.2.3 Parallel random coordinate descent method

In this section we employ a parallel version of the random coordinate descent method [35,51,62], whichwe call P-RCD. Before we discuss the method however, we first need to introduce some concepts. Fora function F (x) as defined in (193), we introduce the following mapping in the norm ∥ · ∥W :

t[N ](x, y) = f(x) + ⟨∇f(x), y − x⟩+ 1

2∥y − x∥2W +Ψ(y). (203)

Note that the mapping t[N ](x, y) is a fully separable and strongly convex in y w.r.t. to the norm ∥ · ∥Wwith the constant 1. We denote by T[N ](x) the proximal step for function F (x), which is the optimalpoint of the mapping t[N ](x, y), i.e.:

T[N ](x) = arg miny∈Rn

t[N ](x, y). (204)

The proximal step T[N ](x) can also be defined via the proximal operator of Ψ:

proxΨ(x) = arg minu∈Rn

Ψ(u) +1

2∥u− x∥2W .

We recall an important property of the proximal operator: [64]:

∥proxΨ(x)− proxΨ(y)∥W ≤ ∥x− y∥W . (205)

Based on this proximal operator, note that we can write:

T[N ](x) = proxΨ(x−W−1∇f(x)). (206)

Given that Ψ(x) is generally not differentiable, we denote by ∂Ψ(x) a vector belonging to the set ofsubgradients of Ψ(x). Evidently, in both definitions, the optimality conditions of the resulting problemfrom which we obtain T[N ](x) are the same, i.e.:

0 ∈∇f(x)+W (T[N ](x)−x)+∂Ψ(T[N ](x)). (207)

It will become evident further on that the optimal solution T[N ](x) will play a crucial role in the parallelrandom coordinate descent method. We now establish some properties which involve the function F (x),the mapping t[N ](x, y) and the proximal step T[N ](x). Given that t[N ](x, y) is strongly convex in y andthat T[N ](x) is an optimal point when minimizing over y, we have the following inequality:

F (x)− t[N ](x, T[N ](x)) = t[N ](x, x)− t[N ](x, T[N ](x))≥1

2∥x− T[N ](x)∥2W . (208)

Further, given that f is convex and by definition of t[N ](x, y) we get:

t[N ](x, T[N ](x)) ≤ miny∈Rn

F (y) +1

2∥y − x∥2W . (209)

In the algorithm that we discuss, at a step k, the (block) components of the iterate xk which are to beupdated are dictated by a set of indices Jk ⊆ [N ] which is randomly chosen. Let us denote by xJ ∈ Rn

the vector whose blocks xi, with i ∈ J ⊆ [N ], are identical to those of x, while the remaining blocksare zeroed out, i.e. xJ =

∑i∈J Uixi. Also, for the separable function Ψ(x), we denote the partial

108

Page 109: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

sum ΨJ(x)=∑

i∈J Ψi(xi) and the vector ∂JΨ(x) = [∂Ψ(x)]J ∈ Rn. A random variable J is uniquelycharacterized by the probability density function:

PJ = P (J = J) where J ⊆ [N ].

For the random variable J , we also define the probability with which a subcomponent i ∈ [N ] can befound in J as: pi = P(i ∈ J). In our algorithm, we consider a uniform sampling of τ unique coordinatesi, 1 ≤ τ ≤ N that make up J , i.e. |J | = τ . For a random variable J with |J | = τ , we observe thatwe have a total number of

(Nτ

)possible values that J can take, and with the uniform sampling we have

that PJ = 1

(Nτ ). Given that J is random, we can express the probability that i ∈ J as: pi =

∑J : i∈J PJ .

For a single index i, note that we have a total number of(N−1τ−1

)possible sets that J can take which

will include i and therefore the probability that this index is included in J is:

pi =

(N−1τ−1

)(Nτ

) =τ

N. (210)

We can also consider other ways in which J can be chosen, however due to space limitations we restrictour presentation to uniform sampling. Having defined the proximal step as T[N ](x

k) in (204), in the

algorithm that follows we generate randomly at step k an index set Jk of cardinality 1 ≤ τ ≤ N . Wedenote the vector TJk(xk) = [T[N ](x

k)]Jk which will be used to update xk+1, i.e. in the sense that

[xk+1]Jk = TJk(xk). Also, by Jk we denote the complement set of Jk, i.e. Jk = {i ∈ [N ] : i /∈ Jk}.Thus, the parallel algorithm we propose below consists of the following steps:

Distributed and parallel random coordinate descent method (P-RCD)

1. Consider an initial point x0 ∈ Rn and 1 ≤ τ ≤ N . For k ≥ 0:

2. Generate with uniform probability a random set of indices Jk ⊆ [N ], with |Jk| = τ

3. Compute the update: xk+1Jk = TJk(xk) and xk+1

Jk = xkJk .

Clearly, the optimization problem from which we compute the iterate of (P-RCD) is fully separable.Then, it follows that for updating component i ∈ Jk of xk+1 we need the following data: Ψi(x

ki ),Wii

and ∇if =∑

j∈Ni∇ifj . Therefore, if algorithm (P-RCD) runs on a multi-core machine or as a multi-

thread process, it can be observed that component updates can be done in parallel by each core/threadusing the communication graph G. Note that the iterate update of (P-RCD) method can also beexpressed in the following ways:{

xk+1 = xk + TJk(xk)− xkJk = proxΨ

Jk

(xk −W−1∇Jkf(xk)

)xk+1 = argminy∈Rn⟨∇Jkf(xk), y − xk⟩+ 1

2∥y − xk∥2W +ΨJk(y)(211)

Note that the right hand sides of the last two equalities contain the same optimization problem whoseoptimality conditions are:

W [xk − xk+1]Jk ∈ ∇Jkf(xk) + ∂ΨJk(xk+1) and [xk+1]Jk = [xk]Jk . (212)

109

Page 110: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

We now establish that method (P-RCD) is a descent method, i.e. F (xk+1) ≤ F (xk) for all k ≥ 0.From the convexity of Ψ(·) and (197) we obtain the following:

F (xk+1) ≤ F (xk) + ⟨∇Jkf(xk) + ∂ΨJk(xk+1), [xk+1 − xk]Jk⟩+

1

2∥xk+1 − xk∥2W

(212)= F (xk) + ⟨W [xk − xk+1]Jk , [xk+1 − xk]Jk⟩+

1

2∥xk+1 − xk∥2W

= F (xk)− 1

2∥xk+1 − xk∥2W . (213)

With (P-RCD) being a descent method, we can now introduce the following term:

RW (x0) = maxx: F (x)≤F (x0)

minx∗∈X∗

∥x− x∗∥W . (214)

and assume it to be bounded. We also define the random variable comprising the whole history ofprevious events as:

ηk = {J0, . . . , Jk}.

3.2.4 Sublinear convergence for smooth convex minimization

In this section we establish the sublinear convergence rate of method (P-RCD) for problems of type (193)with the objective function satisfying Assumption 3.25. First we recall a basic relation from probabilitytheory proven e.g. in [61, Lemma 3]: let there be some constants θi with i = 1, . . . , N , and a samplingJ chosen as described above and define the sum

∑i∈J θi, then the expected value of the sum satisfies

E

[∑i∈J

θi

]=

N∑i=1

piθi. (215)

For any vector d ∈ Rn we consider its counterpart dJ for a sampling J taken as described above. Giventhe previous relation and by taking into account the separability of the inner product and of the squarednorm ∥ · ∥2W it follows immediately that:

E [⟨x, dJ⟩] =τ

N⟨x, d⟩ and E

[∥dJ∥2W

]=

τ

N∥d∥2W . (216)

Based on relations (216), the separability of the function Ψ(x), and the properties of the expectationoperator, the following inequalities can be immediately derived:

E [Ψ(x+ dJ)] =τ

NΨ(x+ d) +

(1− τ

N

)Ψ(x) (217)

E [F (x+ dJ)] ≤(1− τ

N

)F (x) +

τ

Nt[N ](x, d). (218)

By the definition of the operator t[N ](x, y), the convexity of f and Ψ and the optimality conditions

110

Page 111: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

(207) we have the following inequalities:

t[N ](x, T[N ](x)) ≤ f(y) + ⟨∇f(x), x− y⟩+ ⟨∇f(x), T[N ](x)− x⟩

+1

2∥T[N ](x)− x∥2W +Ψ(y) +

⟨∂Ψ(T[N ](x)), T[N ](x)− y

⟩(207)

≤ f(y) + ⟨∇f(x), x− y⟩+ ⟨∇f(x), T[N ](x)− x⟩+ 1

2∥T[N ](x)− x∥2W

+Ψ(y) +⟨−∇f(x)−W

(T[N ](x)− x

), T[N ](x)− y

⟩= F (y)−

⟨W(T[N ](x)− x

), x− y

⟩− 1

2∥T[N ](x)− x∥2W . (219)

This property will prove useful in the following theorem, which provides the sublinear convergence ratefor method (P-RCD).

Theorem 3.27. If Assumption 3.25 holds and considering that RW (x0) defined in (214) is bounded,then for the sequence xk generated by algorithm (P-RCD) we have:

E[F (xk)]− F ∗ ≤N(1/2(RW (x0))2 + F (x0)− F ∗)

τk +N. (220)

Proof. Our proof uses the tools developed above and generalizes the proof of Theorem 1 in [35] fromone component update per iterate to the case of τ component updates, based on uniform sampling andon Assumption 3.25, and consequently on a different descent lemma. Thus, by taking expectation inboth sides of (213) w.r.t. Jk conditioned on ηk−1 we arrive at:

E[F (xk+1)] ≤ F (xk)− 1

2E[∥xk+1 − xk∥2W

]≤ F (xk). (221)

Now, if we take x = xk, J = Jk and dJk = TJk(xk)− xkJk in (218) we get:

E[F (xk+1)

]≤(1− τ

N

)F (xk) +

τ

Nt[N ]

(xk, T[N ](x

k)). (222)

From this and (219) we obtain:

τ

NF (y) +

N − τ

NF (xk) ≥E

[F (xk+1)

]+τ

N

⟨W(T[N ](x

k)− xk), xk − y

⟩(223)

2N∥T[N ](x

k)−xk∥2W .

Denote rk = ∥xk − x∗∥W . From the definition of xk+1 we have that:

(rk+1)2 = (rk)2 +∑i∈Jk

[2Wii⟨Ti(xk)− xki , x

ki − x∗i ⟩+Wii∥Ti(xk)− xki ∥2

].

If we divide both sides of the above inequality by 2 and take expectation, we obtain:

E[1

2(rk+1)2

]=

(rk)2

2+τ

N

⟨W(T[N ](x

k)− xk), xk − x∗

⟩+

τ

2N∥T[N ](x

k)− xk∥2W .

111

Page 112: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

Through this inequality and (223) we arrive at:

E[1

2(rk+1)2

]≤ (rk)2

2+τ

NF ∗ +

N − τ

NF (xk)− E

[F (xk+1)

].

After some rearranging of terms we obtain the following inequality:

E[1

2(rk+1)2 + F (xk+1)− F ∗

]≤((rk)2

2+ F (xk)− F ∗

)− τ

N(F (xk)− F ∗).

By applying this inequality repeatedly, taking expectation over ηk−1 and from the fact that E[F (xk)] isdecreasing from (221), we obtain the following:

E[F (xk+1)

]− F ∗ ≤ E

[1

2(rk+1)2 + F (xk+1)− F ∗

]≤ (r0)2

2+ F (x0)− F ∗

− τ

N

k∑j=0

(E[F (xj)

]− F ∗) ≤ (r0)2

2+ F (x0)− F ∗ − τ(k + 1)

N

(E[F (xk+1)

]− F ∗

).

Rearranging some items and since (r0)2 ≤ (RW (x0))2, we arrive at (220).

We notice that the sublinear convergence rate (220) of order O(N/τk) depends linearly on the choiceof τ = |J |, so that if the algorithm is implemented on a cluster, then τ reflects the available numberof cores. Furthermore, given a suboptimality level ϵ and a confidence level 0 < ρ < 1, using standardarguments as in [51, 61] we can easily establish a total number of iterations kϵρ which will ensure anϵ-suboptimal solution with probability at least 1 − ρ. More precisely, for the iterates generated byalgorithm (P-RCD) and a kϵρ that satisfies:

kϵρ ≥ c

ϵ

(1 + log

(N

τ

(RW (x0)

)2+ 2

(F (x0)− F ∗)

4cρ

))+ 2−N, (224)

with c = 2Nτ max

{(RW (x0))2, F (x0)− F ∗}, we get the following result in probability:

P(F(xk

ϵρ

)− F ∗ ≤ ϵ

)≥ 1− ρ.

We notice that in the smooth case, given the choice of τ , we obtain different sublinear convergenceresults of order O(1/k) . E.g., for τ = 1 we obtain a similar sublinear convergence rate to that of therandom coordinate descent method in [35, 47, 51, 62], i.e. of the form O(NR2

W /k), while for τ = Nwe get a similar convergence rate to that of the full composite gradient method of [52]. However,the distances are measured in different norms in these papers. E.g., when τ = N the comparison, ofconvergence rates in our work and [52] is reduced to comparing the quantities LfR(x

0)2 of [52] with ourRW (x0)2, where Lf is the Lipschitz constant of the smooth component of the objective function, i.e. off , while R(x0) is defined in a similar fashion as our RW (x0) but in the Euclidean norm, instead of thenorm ∥ · ∥W . Let us now consider the two extreme cases. First, consider the smooth component of the

objective function as follows: f(x) =∑N

j=1 fj(xj). In this case, it can be seen that Lf = maxj∈[N ] LNj .

Thus considering the definition of the matrix W = diag(LNj ; j ∈ [N ]) in Lemma 3.26 we have that:Lf∥x0 − x∗∥2 ≥ ∥x0 − x∗∥2W , i.e. our convergence rate is usually better. On the other hand, if

112

Page 113: Modeling, Control and Optimization for Big Data Systems ...141.85.225.150/raport2016.pdf · Smart Grid Technologies Europe, 2016. P3: I. Necoara, V. Nedelcu, D. Clipici, L. Toma,

we have f defined as f(x) =∑N

j=1 fj(x), then it can be easily proven that Lf =∑

j∈N LNj and

W = LfIn and the quantities LfR(x0)2 and RW (x0)2 would be the same. Thus, we get better rates

of convergence when ω < N . Furthermore, our sublinear convergence results are also similar with thoseof [61], but are obtained under further knowledge regarding the objective function and with a modifiedanalysis. In particular, using a similar reasoning as in [35], we can argue, based on our analysis, thatthe expected value type of convergence rate given in (220) has better constants than the one in [61]under certain separability properties given below. E.g., the convergence rate of the algorithm, apartfrom essentially being of order O

(1k

), depends on the step sizes involved when computing the next

iterate xk+1, see (211). Thus, let us compare the weighted step sizes W = diag(Wii) in our algorithm(P-RCD) and the ones in [61]. To this purpose, let us consider the smooth component in (193) in theform f(x) = 1

2∥Ax − b∥2 and ni = 1. Under these considerations, we observe from (225) below thatour step sizes are better than those in [61] as τ increases and ω ≪ ω:

Wii =∑

j:i∈Nj

n∑t=1

A2jt and W

[16]ii =

N∑j=1

βA2ji, (225)

where β = 1 + (ω−1)(τ−1)max{1,n−1} or β = min(ω, τ) depending whether monotonicity is enforced or not in the

algorithm of [61].

3.2.5 Linear convergence for error bound convex minimization

In this section we prove that, for certain minimization problems, the sublinear convergence rate of (P-RCD) from the previous section can be improved to a linear convergence rate. In particular, we prove thatunder additional assumptions on the objective function, which are often satisfied in practical applications(e.g. the dual of a linearly constrained smooth convex problem, a control problem or constrained lassoproblem), we have a generalized error bound property for our optimization problem. In these settings weare able to provide for the first time global linear convergence rate for algorithm (P-RCD), as opposed tothe results in [72] where only local linear convergence was derived for deterministic descent methods orthe results in [79] where global linear convergence is proved for gradient type methods but applied onlyto problems where Ψ is the set indicator function of a polyhedron. Therefore, we proceed by introducingthe proximal gradient mapping of the objective function F (x):

∇+F (x) = x− proxΨ(x−W−1∇f(x)

). (226)

Clearly, a point x∗ is an optimal solution of the original problem (193) if and only if ∇+F (x∗) = 0. Inthe following definition we introduce the new concept of Generalized Error Bounded Property (GEBP)for problem (193):

Definition 3.28. Problem (193) has the generalized error bound property (GEBP) w.r.t. the norm ∥·∥_W if there exist two nonnegative constants κ₁ and κ₂ such that the composite objective function F satisfies the relation (we use x̄ = Π^W_{X*}(x)):

    ∥x − x̄∥_W ≤ (κ₁ + κ₂∥x − x̄∥²_W) ∥∇⁺F(x)∥_W   ∀x ∈ ℝⁿ.     (227)
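For intuition about the quantities in Definition 3.28, here is a minimal Python sketch (ours, with NumPy) of the mapping (226); it assumes a diagonal W = diag(w) and, purely as an illustrative choice, Ψ(x) = λ∥x∥₁, whose W-weighted prox is componentwise soft-thresholding.

    import numpy as np

    # Minimal sketch (our notation, not the authors' code): the proximal
    # gradient mapping (226) for F(x) = f(x) + Psi(x), with W = diag(w) and
    # the illustrative choice Psi(x) = lam*||x||_1.

    def prox_l1_weighted(u, lam, w):
        # argmin_y lam*||y||_1 + 0.5*||y - u||_W^2, for W = diag(w)
        return np.sign(u) * np.maximum(np.abs(u) - lam / w, 0.0)

    def prox_grad_mapping(x, grad_f, lam, w):
        # nabla^+ F(x) = x - prox_Psi(x - W^{-1} grad f(x)); zero iff x optimal
        return x - prox_l1_weighted(x - grad_f(x) / w, lam, w)

    # Usage on a toy least-squares term f(x) = 0.5*||Ax - b||^2:
    A = np.array([[1.0, 2.0], [0.0, 1.0]]); b = np.array([1.0, 0.0])
    w = (A ** 2).sum(axis=0)                 # diagonal weights, cf. (225)
    grad_f = lambda x: A.T @ (A @ x - b)
    print(prox_grad_mapping(np.zeros(2), grad_f, lam=0.1, w=w))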

Note that the class of problems introduced in Definition 3.28 includes other known categories of problems: e.g., problems with objective functions F composed of a smooth strongly convex function f with a constant σ_W w.r.t. the norm ∥·∥_W and a general convex function Ψ satisfy our definition (227) with κ₁ = 2/σ_W and κ₂ = 0; likewise, problems satisfying the classical error bound property [72, 79], i.e. ∥x − x̄∥_W ≤ κ∥∇⁺F(x)∥_W for all x ∈ ℝⁿ, satisfy our definition (227) with κ₁ = κ and κ₂ = 0 (see Section 3.2.6 for more details and for other classes of problems (193) satisfying the (GEBP) property). Next, we prove that on optimization problems having the (GEBP) property (227) our algorithm (P-RCD) has global linear convergence. Our analysis will employ ideas from the convergence proof of deterministic descent methods in [72]. However, the random nature of our method and the nonsmooth property of the objective function require a new approach. For example, the typical proof for linear convergence of gradient descent type methods for solving convex problems with an error bound like property is based on deriving an inequality of the form F(x^{k+1}) − F^* ≤ c∥x^{k+1} − x^k∥ (see e.g. [72, 79]). Under our settings, we cannot derive this type of inequality; instead we obtain a weaker inequality, where ∥x^{k+1} − x^k∥ is replaced with another term, which still allows us to prove linear convergence. We start with the following lemma, which shows an important property of algorithm (P-RCD) when it is applied to problems (193) having generalized error bound objective functions:

Lemma 3.29. If problem (193) satisfies (GEBP) given in (227), then a point x^k generated by algorithm (P-RCD) and its projection onto X^*, denoted x̄^k, satisfy:

    ∥x^k − x̄^k∥²_W ≤ (κ₁ + κ₂∥x^k − x̄^k∥²_W)² (N/τ) E[∥x^{k+1} − x^k∥²_W].     (228)

Proof. For the iteration defined by algorithm (P-RCD) we have:

    E[∥x^{k+1} − x^k∥²_W] = E[∥x^k + T_{J_k}(x^k) − x^k_{J_k} − x^k∥²_W] = E[∥x^k_{J_k} − T_{J_k}(x^k)∥²_W]
    (216)= (τ/N)∥x^k − T_{[N]}(x^k)∥²_W = (τ/N)∥x^k − prox_Ψ(x^k − W⁻¹∇f(x^k))∥²_W = (τ/N)∥∇⁺F(x^k)∥²_W.

Through this equality and (227) we have that:

    ∥x^k − x̄^k∥²_W ≤ (κ₁ + κ₂∥x^k − x̄^k∥²_W)² ∥∇⁺F(x^k)∥²_W
                   ≤ (κ₁ + κ₂∥x^k − x̄^k∥²_W)² (N/τ) E[∥x^{k+1} − x^k∥²_W],     (229)

and the proof is complete.

Remark 3.30. Note that if the iterates of an algorithm satisfy ∥x^k − x^*∥ ≤ ∥x^0 − x^*∥ for all k ≥ 0 (see e.g. the case of the full gradient method [53]), then by taking κ(x^0) = (κ₁ + κ₂∥x^0 − x^*∥²_W)² we have the inequality:

    ∥x^k − x̄^k∥²_W ≤ κ(x^0) (N/τ) E[∥x^{k+1} − x^k∥²_W]   ∀k ≥ 0.     (230)

On the other hand, if the iterates of an algorithm satisfy (214) with R_W(x^0) bounded, see e.g. the case of our algorithm (P-RCD), which is a descent method as proven in (213), then (230) is satisfied with κ(x^0) = (κ₁ + κ₂ R_W(x^0)²)².


Let us now note that, given the separability of the function Ψ : ℝⁿ → ℝ, for any vector d ∈ ℝⁿ, if we consider their counterparts Ψ_J and d_J for a sampling J taken as described above, the expected value E[Ψ_J(d_J)] satisfies:

    E[Ψ_J(d_J)] = ∑_{J⊆[N]} (∑_{i∈J} Ψ_i(d_i)) P_J = ∑_{J⊆[N]} (∑_{i=1}^N Ψ_i(d_i) 1_J(i)) P_J     (231)
               = ∑_{i=1}^N Ψ_i(d_i) ∑_{J⊆[N]: i∈J} P_J = ∑_{i=1}^N p_i Ψ_i(d_i) (210)= (τ/N) ∑_{i=1}^N Ψ_i(d_i) = (τ/N) Ψ(d).
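A quick Monte Carlo sanity check of identity (231) is easy to run; the snippet below (our own sketch, with Ψ_i(d_i) = |d_i| as an arbitrary separable choice) compares the empirical mean of Ψ_J(d_J) under uniform sampling of size-τ subsets with (τ/N)Ψ(d).

    import numpy as np

    # Sketch (ours): empirical check of E[Psi_J(d_J)] = (tau/N) * Psi(d)
    # for uniform sampling of subsets J of fixed size tau, Psi_i = |.|.
    rng = np.random.default_rng(0)
    N, tau = 20, 5
    d = rng.standard_normal(N)
    trials = 50_000
    est = np.mean([np.abs(d[rng.choice(N, size=tau, replace=False)]).sum()
                   for _ in range(trials)])
    print(est, tau / N * np.abs(d).sum())   # the two values should nearly match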

Furthermore, considering that x̄^k ∈ X^*, from (228) we obtain:

    ∥x^k − x̄^k∥_W ≤ c_κ(τ) √(E[∥x^{k+1} − x^k∥²_W]),     (232)

where c_κ(τ) = (κ₁ + κ₂ R_W(x^0)²) √(N/τ). We now need to express E[Ψ(x^{k+1})] explicitly, where x^{k+1} is generated by algorithm (P-RCD). Note that x^{k+1}_i = x^k_i for all i ∉ J_k. As a result, we have:

    E[Ψ(x^{k+1})] = E[∑_{i∈J_k} Ψ_i([T_{J_k}(x^k)]_i) + ∑_{i∉J_k} Ψ_i([x^k]_i)]
    (231)= (τ/N) Ψ(T_{[N]}(x^k)) + ((N−τ)/N) Ψ(x^k).     (233)

The following lemma establishes an important upper bound for E[F(x^{k+1}) − F(x^k)].

Lemma 3.31. If problem (193) satisfies Assumption 3.25 and the (GEBP) property (227), then the iterate x^k generated by the (P-RCD) method has the following property:

    E[F(x^{k+1}) − F(x^k)] ≤ E[Λ^k]   ∀k ≥ 0,     (234)

where Λ^k = ⟨∇f(x^k), x^{k+1} − x^k⟩ + (1/2)∥x^{k+1} − x^k∥²_W + Ψ(x^{k+1}) − Ψ(x^k). Furthermore, we have that:

    (1/2)∥x^{k+1} − x^k∥²_W ≤ −Λ^k   ∀k ≥ 0.     (235)

Proof. Taking x = x^k and y = x^{k+1} − x^k in (197) we get:

    f(x^{k+1}) ≤ f(x^k) + ⟨∇f(x^k), x^{k+1} − x^k⟩ + (1/2)∥x^{k+1} − x^k∥²_W.

By adding Ψ(x^{k+1}) and subtracting Ψ(x^k) in both sides of this inequality and by taking expectation in both sides we obtain (234). Recall the iterate update (211) of our algorithm (P-RCD):

    x^{k+1} = arg min_{y∈ℝⁿ} ⟨∇_{J_k} f(x^k), y − x^k⟩ + (1/2)∥y − x^k∥²_W + Ψ_{J_k}(y).

Given that x^{k+1} is optimal for the problem above, taking the vector y = αx^{k+1} + (1−α)x^k, with α ∈ [0, 1], we have that:

    ⟨∇_{J_k} f(x^k), x^{k+1} − x^k⟩ + (1/2)∥x^{k+1} − x^k∥²_W + Ψ_{J_k}(x^{k+1})
    ≤ α⟨∇_{J_k} f(x^k), x^{k+1} − x^k⟩ + (α²/2)∥x^{k+1} − x^k∥²_W + Ψ_{J_k}(αx^{k+1} + (1−α)x^k).

Further, rearranging the terms and using the convexity of Ψ_{J_k}, we obtain:

    (1−α)[⟨∇_{J_k} f(x^k), x^{k+1} − x^k⟩ + ((1+α)/2)∥x^{k+1} − x^k∥²_W + Ψ_{J_k}(x^{k+1}) − Ψ_{J_k}(x^k)] ≤ 0.

If we divide this inequality by (1−α) and let α ↑ 1, we have that:

    ⟨∇_{J_k} f(x^k), x^{k+1} − x^k⟩ + Ψ_{J_k}(x^{k+1}) − Ψ_{J_k}(x^k) ≤ −∥x^{k+1} − x^k∥²_W.

By adding (1/2)∥x^{k+1} − x^k∥²_W in both sides of this inequality and observing that:

    ⟨∇_{J_k} f(x^k), x^{k+1} − x^k⟩ = ⟨∇f(x^k), x^{k+1} − x^k⟩   and   Ψ_{J_k}(x^{k+1}) − Ψ_{J_k}(x^k) = Ψ(x^{k+1}) − Ψ(x^k),

we obtain (235).

Additionally, note that by applying expectation in J_k to Λ^k we get:

    E[Λ^k] (216)= (τ/N)⟨∇f(x^k), T_{[N]}(x^k) − x^k⟩ + (1/2)E[∥x^{k+1} − x^k∥²_W] + E[Ψ(x^{k+1})] − Ψ(x^k)
          (233)= (τ/N)⟨∇f(x^k), T_{[N]}(x^k) − x^k⟩ + (1/2)E[∥x^{k+1} − x^k∥²_W] + (τ/N)(Ψ(T_{[N]}(x^k)) − Ψ(x^k)).     (236)

The following theorem, which is the main result of this section, proves the linear convergence rate of algorithm (P-RCD) on optimization problems having the generalized error bound property (GEBP) (227).

Theorem 3.32. On optimization problems (193) with the objective function satisfying Assumption 3.25 and the generalized error bound property (227), the algorithm (P-RCD) has the following global linear convergence rate for the expected values of the objective function:

    E[F(x^k) − F^*] ≤ θ^k (F(x^0) − F^*)   ∀k ≥ 0,     (237)

where θ < 1 is a constant depending on N, τ, κ₁, κ₂ and R_W(x^0).


Proof. We first need to establish an upper bound for E[F(x^{k+1})] − F(x̄^k). By the definition of F and its convexity we have that:

    F(x^{k+1}) − F(x̄^k) = f(x^{k+1}) − f(x̄^k) + Ψ(x^{k+1}) − Ψ(x̄^k)
    ≤ ⟨∇f(x^{k+1}), x^{k+1} − x̄^k⟩ + Ψ(x^{k+1}) − Ψ(x̄^k)
    = ⟨∇f(x^{k+1}) − ∇f(x^k), x^{k+1} − x̄^k⟩ + ⟨∇f(x^k), x^{k+1} − x̄^k⟩ + Ψ(x^{k+1}) − Ψ(x̄^k)
    ≤ ∥∇f(x^{k+1}) − ∇f(x^k)∥_{W⁻¹} ∥x^{k+1} − x̄^k∥_W + ⟨∇f(x^k), x^{k+1} − x̄^k⟩ + Ψ(x^{k+1}) − Ψ(x̄^k)
    (199)≤ ∥x^{k+1} − x^k∥_W ∥x^{k+1} − x̄^k∥_W + ⟨∇f(x^k), x^{k+1} − x̄^k⟩ + Ψ(x^{k+1}) − Ψ(x̄^k)
    ≤ ∥x^{k+1} − x^k∥²_W + ∥x^{k+1} − x^k∥_W ∥x^k − x̄^k∥_W + ⟨∇f(x^k), x^{k+1} − x̄^k⟩ + Ψ(x^{k+1}) − Ψ(x̄^k).

By taking expectation in both sides of the previous inequality we have:

    E[F(x^{k+1})] − F(x̄^k) ≤ E[∥x^{k+1} − x^k∥²_W] + E[∥x^{k+1} − x^k∥_W ∥x^k − x̄^k∥_W]
                           + E[⟨∇f(x^k), x^{k+1} − x̄^k⟩ + Ψ(x^{k+1})] − Ψ(x̄^k).     (238)

From (214) we have that ∥x^k − x̄^k∥_W ≤ R_W(x^0), and we derive the following:

    E[∥x^{k+1} − x^k∥_W ∥x^k − x̄^k∥_W] = ∥x^k − x̄^k∥_W E[∥x^{k+1} − x^k∥_W]
    (232)≤ c_κ(τ) √(E[∥x^{k+1} − x^k∥²_W]) √((E[∥x^{k+1} − x^k∥_W])²) ≤ c_κ(τ) E[∥x^{k+1} − x^k∥²_W],

where the last step comes from Jensen's inequality. Thus, (238) becomes:

    E[F(x^{k+1})] − F(x̄^k) ≤ c₁(τ) E[∥x^{k+1} − x^k∥²_W] + E[⟨∇f(x^k), x^{k+1} − x̄^k⟩] + E[Ψ(x^{k+1})] − Ψ(x̄^k),     (239)

where c₁(τ) = 1 + c_κ(τ). We now explicitly express the second term in the right-hand side of the above inequality:

    E[⟨∇f(x^k), x^{k+1} − x̄^k⟩] (211)= E[⟨∇f(x^k), x^k + T_{J_k}(x^k) − x^k_{J_k} − x̄^k⟩]
    = ⟨∇f(x^k), x^k − x̄^k⟩ + E[⟨∇f(x^k), T_{J_k}(x^k) − x^k_{J_k}⟩]
    (216)= ⟨∇f(x^k), x^k − x̄^k⟩ + (τ/N)⟨∇f(x^k), T_{[N]}(x^k) − x^k⟩
    = (1 − τ/N)⟨∇f(x^k), x^k − x̄^k⟩ + (τ/N)⟨∇f(x^k), T_{[N]}(x^k) − x̄^k⟩.

So, by replacing it in (239) and through (233) we get:

    E[F(x^{k+1})] − F(x̄^k) ≤ c₁(τ) E[∥x^{k+1} − x^k∥²_W] + (τ/N)⟨∇f(x^k), T_{[N]}(x^k) − x̄^k⟩     (240)
    + (1 − τ/N)⟨∇f(x^k), x^k − x̄^k⟩ + (τ/N)Ψ(T_{[N]}(x^k)) + (1 − τ/N)Ψ(x^k) − Ψ(x̄^k).


By taking y = x̄^k − x^k and x = x^k in (197) we obtain:

    f(x̄^k) ≤ f(x^k) + ⟨∇f(x^k), x̄^k − x^k⟩ + (1/2)∥x̄^k − x^k∥²_W.

Through this and by rearranging terms in (240), we obtain:

    E[F(x^{k+1})] − F(x̄^k) ≤ c₁(τ) E[∥x^{k+1} − x^k∥²_W] + (1/2)(1 − τ/N)∥x^k − x̄^k∥²_W
    + (1 − τ/N)(F(x^k) − F(x̄^k)) + (τ/N)(Ψ(T_{[N]}(x^k)) + ⟨∇f(x^k), T_{[N]}(x^k) − x̄^k⟩ − Ψ(x̄^k)).

Furthermore, from (232) we obtain:

    E[F(x^{k+1})] − F(x̄^k) ≤ (c₁(τ) + (1/2)(1 − τ/N) c_κ(τ)²) E[∥x^{k+1} − x^k∥²_W]     (241)
    + (1 − τ/N)(F(x^k) − F(x̄^k)) + (τ/N)(Ψ(T_{[N]}(x^k)) + ⟨∇f(x^k), T_{[N]}(x^k) − x̄^k⟩ − Ψ(x̄^k)).

Through the convexity of Ψ(x) we have:

    Ψ(T_{[N]}(x^k)) − Ψ(x̄^k) ≤ ⟨∂Ψ(T_{[N]}(x^k)), T_{[N]}(x^k) − x̄^k⟩.

From this, the optimality condition (207) and by replacing in (241), we obtain:

    E[F(x^{k+1})] − F(x̄^k) ≤ (c₁(τ) + (1/2)(1 − τ/N) c_κ(τ)²) E[∥x^{k+1} − x^k∥²_W]     (242)
    + (1 − τ/N)(F(x^k) − F(x̄^k)) + (τ/N)⟨−W(T_{[N]}(x^k) − x^k), T_{[N]}(x^k) − x̄^k⟩.

By rearranging some terms and through the Cauchy-Schwarz inequality we get:

    ⟨−W(T_{[N]}(x^k) − x^k), T_{[N]}(x^k) − x̄^k⟩ = ⟨−W(T_{[N]}(x^k) − x^k), T_{[N]}(x^k) − x^k + x^k − x̄^k⟩
    ≤ ⟨W(T_{[N]}(x^k) − x^k), x̄^k − x^k⟩ ≤ ∥W(T_{[N]}(x^k) − x^k)∥_{W⁻¹} ∥x^k − x̄^k∥_W
    = ∥T_{[N]}(x^k) − x^k∥_W ∥x^k − x̄^k∥_W.

Now, recall that:

    E[∥x^{k+1} − x^k∥²_W] = (τ/N)∥x^k − T_{[N]}(x^k)∥²_W.

Thus, from this and (232) we get:

    (τ/N)∥T_{[N]}(x^k) − x^k∥_W ∥x^k − x̄^k∥_W ≤ c_κ(τ) √(τ/N) E[∥x^{k+1} − x^k∥²_W].

By replacing this in (242) we obtain:

    E[F(x^{k+1})] − F(x̄^k) ≤ (c₁(τ) + (1/2)(1 − τ/N) c_κ(τ)² + c_κ(τ)√(τ/N)) E[∥x^{k+1} − x^k∥²_W]
                           + (1 − τ/N)(F(x^k) − F(x̄^k))
                           = c₂(τ) E[∥x^{k+1} − x^k∥²_W] + (1 − τ/N)(F(x^k) − F(x̄^k)),     (243)

where we denoted c₂(τ) = c₁(τ) + (1/2)(1 − τ/N) c_κ(τ)² + c_κ(τ)√(τ/N).


From (235) we have E[∥x^{k+1} − x^k∥²_W] ≤ −2E[Λ^k]. Now, through this and by rearranging some terms in (243), we obtain:

    (τ/N)(E[F(x^{k+1})] − F(x̄^k)) ≤ −2c₂(τ) E[Λ^k] + (1 − τ/N)(F(x^k) − E[F(x^{k+1})]).

Furthermore, from (234) we obtain:

    E[F(x^{k+1})] − F(x̄^k) ≤ (N/τ)(2c₂(τ) + (1 − τ/N)) (F(x^k) − E[F(x^{k+1})]) = c₃(τ)(F(x^k) − E[F(x^{k+1})]),

where c₃(τ) = (N/τ)(2c₂(τ) + 1 − τ/N). By rearranging this inequality, we obtain:

    E[F(x^{k+1})] − F(x̄^k) ≤ (c₃(τ)/(1 + c₃(τ))) (F(x^k) − F(x̄^k)).     (244)

We denote θ = c₃(τ)/(1 + c₃(τ)) < 1 and define δ^k = F(x^{k+1}) − F(x̄^k). By taking expectation over η^{k−1} in (244) we arrive at:

    E[δ^k] ≤ θ E[δ^{k−1}] ≤ · · · ≤ θ^k E[δ^0],

and linear convergence is proved.

Finally, we establish the number of iterations k_{ϵρ} which ensures an ϵ-suboptimal solution with probability at least 1 − ρ. We first recall the well-known inequality: for constants ϵ > 0 and γ ∈ (0, 1) such that δ^0 > ϵ > 0 and k ≥ (1/γ) log(δ^0/ϵ) we have:

    (1−γ)^k δ^0 = (1 − 1/(1/γ))^{(1/γ)(γk)} δ^0 ≤ exp(−γk) δ^0 ≤ exp(−log(δ^0/ϵ)) δ^0 = ϵ.     (245)

Now, for problem (193) satisfying Assumption 3.25 and the (GEBP) property (227), consider a probability level ρ ∈ (0, 1), a suboptimality 0 < ϵ < δ^0 and an iteration counter:

    k_{ϵρ} ≥ (1/(1−θ)) log(δ^0/(ϵρ)),

where δ^0 = F(x^0) − F^* and θ is defined in Theorem 3.32. Then, from Markov's inequality and (245) we have that the iterate x^{k_{ϵρ}} generated by (P-RCD) satisfies:

    P(F(x^{k_{ϵρ}}) − F^* ≤ ϵ) ≥ 1 − ρ.
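In practice, the bound above is a one-line computation; the following short sketch (ours, for hypothetical values of the contraction factor θ, initial gap δ^0, accuracy ϵ and confidence level ρ) evaluates k_{ϵρ}.

    import math

    # Sketch (ours): iteration count k_eps_rho from the confidence bound above.
    def iters_for_confidence(theta, delta0, eps, rho):
        # k >= 1/(1-theta) * log(delta0/(eps*rho)) guarantees, via Markov's
        # inequality, P(F(x^k) - F* <= eps) >= 1 - rho
        return math.ceil(math.log(delta0 / (eps * rho)) / (1.0 - theta))

    print(iters_for_confidence(theta=0.99, delta0=1e2, eps=1e-4, rho=1e-2))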

Note that we have obtained global linear convergence for our parallel random coordinate descent method (P-RCD) on the general class of problems satisfying the generalized error bound property (GEBP) given in (227), as opposed to the results in [72], where the authors show only local linear convergence for deterministic coordinate descent methods applied to local error bound problems, i.e. for all k ≥ k₀ > 1, where k₀ is an iterate after which some error bound condition of the form ∥x^k − x̄^k∥ ≤ κ∥∇⁺F(x^k)∥ is implicitly satisfied. In [79] global linear convergence is also proved for the full gradient method, but applied only to problems with Ψ the indicator function of a polyhedron and having an error bound property.


Further, our convergence results are also more general than the ones in [35, 49, 51, 61], in the sense that we can show linear convergence of algorithm (P-RCD) for larger classes of problems than in these papers, where linear convergence is proved for the more restricted class of problems having smooth and strongly convex objective functions. For example, up to our knowledge, the best global convergence rate results known for gradient type methods for solving the constrained lasso (200) or the dual formulation of a linearly constrained convex problem (202) were of the sublinear form O(1/k²) [41, 52]. In this section we proved a global linear convergence rate for random coordinate gradient descent methods for solving this type of problems (200) or (202). Note that for the particular case of least-squares problems min_{x∈ℝⁿ} ∥Ax − b∥², the authors in [30], using also an error bound like property, were able to show linear convergence for a random coordinate gradient descent method. Our results can be viewed as a generalization of the results from [30] to more general optimization problems (193). Moreover, our proof of linear convergence is different from those in [35, 51, 61]. Finally, our approach allows us to analyze in the same framework several methods: the full gradient method, serial coordinate descent and any parallel coordinate descent method in between.

3.2.6 Conditions for generalized error bound functions

In this section we investigate under which conditions an objective function F of (193) satisfying Assumption 3.25 has the generalized error bound property (GEBP) given in Definition 3.28.

Case 1: f strongly convex and Ψ convex
In the first case we consider f satisfying Assumption 3.25 and additionally strong convexity, while Ψ is a general convex function. Then, F has the generalized error bound property defined in (227). Indeed, let us consider f to be σ_W-strongly convex w.r.t. the norm ∥·∥_W, i.e.:

    f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (σ_W/2)∥y − x∥²_W.     (246)

Then, we can prove the following error bound property:

    ∥x − x̄∥_W ≤ (2/σ_W)∥∇⁺F(x)∥_W   ∀x ∈ ℝⁿ,

i.e. we have κ₁ = 2/σ_W and κ₂ = 0 in our Definition 3.28 of the generalized error bound property. It follows that objective functions of (193), written as the sum between a nonsmooth and a strongly convex function, are included in our class of generalized error bound problems (227). Combining (197) with (246) we get σ_W ≤ 1. In this case we have the following linear convergence rate for algorithm (P-RCD):

    E[F(x^{k+1}) − F^*] ≤ (1 − γ_ebsc)^k (F(x^0) − F^*),     (247)

where γ_ebsc = τσ_W/N. We notice that, given the choice of τ, we obtain different linear convergence results of order O(θ^k). E.g., for τ = 1 we obtain a similar linear convergence rate to that of the random coordinate descent methods in [35, 47, 51, 62], i.e. γ_ebsc = O(σ_W/N), while for τ = N we get a similar convergence rate to that of the full composite gradient method of [52], i.e. γ_ebsc = O(σ_W). Finally, if we consider f to be σ_{W′}-strongly convex in the norm ∥·∥_{W′}, where the matrix W′ = diag(L_i I_{n_i}; i ∈ [N]), then algorithm (PCDM1) in [61] has a convergence rate as above with γ_sc = τσ_{W′}/(N + ωτ). However, the distances are measured in different norms in all these papers.
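To visualize the linear decay (247), the following toy Python sketch (ours, not the project's MPI implementation) runs the (P-RCD) iteration on a small strongly convex lasso-type instance, using the step-size weights of (225); all sizes and parameters are illustrative assumptions.

    import numpy as np

    # Toy sketch (ours) of the (P-RCD) iteration for
    # F(x) = 0.5*||Ax - b||^2 + lam*||x||_1 on a random instance.
    rng = np.random.default_rng(1)
    m, n, tau, lam = 60, 20, 4, 0.1
    A = rng.standard_normal((m, n)) * (rng.random((m, n)) < 0.3)  # sparse A
    b = rng.standard_normal(m)
    w = (A != 0).T.astype(float) @ (A ** 2).sum(axis=1)  # W_ii of (225)
    w = np.maximum(w, 1e-8)                              # guard empty columns

    F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.abs(x).sum()
    x = np.zeros(n)
    for k in range(5001):
        J = rng.choice(n, size=tau, replace=False)   # uniform tau-block sampling
        g = A[:, J].T @ (A @ x - b)                  # partial gradient over J
        u = x[J] - g / w[J]
        x[J] = np.sign(u) * np.maximum(np.abs(u) - lam / w[J], 0.0)
        if k % 1000 == 0:
            print(k, F(x))                           # F(x^k) decays toward F*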


Case 2: Ψ indicator function of a polyhedral set
Another important category of optimization problems (193) that we consider has the following form:

    min_{x∈ℝⁿ} F(x) (= f̄(Px) + cᵀx + I_X(x)),     (248)

where f(x) = f̄(Px) + cᵀx is a smooth convex function, P ∈ ℝ^{p×n} \ {0} is a constant matrix upon which we make no assumptions and Ψ(x) = I_X(x) is the indicator function of the polyhedral set X. Note that an objective function F with the structure (248) appears in many applications, see e.g. the dual problem (202) obtained from the primal formulation (201) given in Section 3.2.2. Now, for proving the generalized error bound property (227), we require that f satisfies the following assumption:

Assumption 3.33. We consider that f(x) = f̄(Px) + cᵀx satisfies Assumption 3.25. We also assume that f̄(z) is σ-strongly convex in z, that the set of optimal solutions X^* of problem (193) is bounded and that P ≠ 0.

For problem (248), functions f under which the set X^* is bounded include e.g. continuously differentiable coercive functions [64]. Also, if (248) is the dual formulation of the primal problem (201) for which the Slater condition holds, then by Gauvin's theorem we have that the set of optimal Lagrange multipliers, i.e. X^* in this case, is compact [64]. Note that for the nonsmooth component Ψ(x) = I_X(x) we only assume that X is a polyhedron (possibly unbounded). Our generalized error bound property for problem (248) is in a way similar to the one in [72, 79]. However, our results are more general in the sense that they hold globally, while in [72] the authors prove their results only locally, and in the sense that we allow the constraint set X to be an unbounded polyhedron, while in [79] an error bound like property is proved only for bounded polyhedra or ℝⁿ. This extension is very important since it allows us e.g. to tackle the dual formulation of a primal problem (201), in which X = ℝⁿ₊ (the nonnegative orthant), appearing in many practical applications. Last but not least, our error bound definition is more general than the one used in [72, 79], as we can see from the following example:

Example 3.34. Let us consider the following quadratic problem: min_{x∈ℝ²₊} (1/2)(x₁ − x₂)² + x₁ + x₂. We can easily see that X^* = {0} and thus this example satisfies Assumption 3.33. Clearly, for this example the generalized error bound property (227) holds with e.g. κ₁ = κ₂ = 1. However, there is no finite constant κ satisfying the classical error bound property [72, 79]: ∥x − x̄∥_W ≤ κ∥∇⁺F(x)∥_W for all x ∈ ℝ²₊ (we can see this by taking x₁ = x₂ ≥ 1 in the previous inequality).
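The failure of the classical error bound in Example 3.34 is easy to observe numerically; the sketch below (ours, taking W = I for simplicity) shows the ratio ∥x − x̄∥/∥∇⁺F(x)∥ growing without bound along x₁ = x₂ = t, while the generalized bound (227) with κ₁ = κ₂ = 1 keeps holding.

    import numpy as np

    # Sketch (ours, W = I) illustrating Example 3.34.
    grad_f = lambda x: np.array([x[0] - x[1] + 1.0, x[1] - x[0] + 1.0])
    proj = lambda x: np.maximum(x, 0.0)              # projection on R^2_+
    gmap = lambda x: x - proj(x - grad_f(x))         # prox-gradient mapping (226)

    for t in [1.0, 10.0, 100.0, 1000.0]:
        x = np.array([t, t]); dist = np.linalg.norm(x)   # xbar = 0, so dist = ||x||
        g = np.linalg.norm(gmap(x))
        print(t, dist / g, (1 + dist ** 2) * g >= dist)  # ratio grows; GEBP holds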

Since Ψ(x) is a set indicator function, the gradient mapping of F can be expressed as:

    ∇⁺F(x) = x − Π^W_X(x − W⁻¹∇f(x)).

The next lemma establishes the Lipschitz continuity of the proximal gradient mapping.

Lemma 3.35. For the composite function F of (248) satisfying Assumption 3.25, we have:

    ∥∇⁺F(x) − ∇⁺F(y)∥_W ≤ 3∥x − y∥_W   ∀x, y ∈ X.     (249)


Proof. By the definition of ∇⁺F(x) we have that:

    ∥∇⁺F(x) − ∇⁺F(y)∥_W = ∥x − y + T_{[N]}(y) − T_{[N]}(x)∥_W
    (206)≤ ∥x − y∥_W + ∥prox_Ψ(x − W⁻¹∇f(x)) − prox_Ψ(y − W⁻¹∇f(y))∥_W
    (205)≤ ∥x − y∥_W + ∥x − y + W⁻¹(∇f(y) − ∇f(x))∥_W
    ≤ 2∥x − y∥_W + ∥∇f(x) − ∇f(y)∥_{W⁻¹}
    (199)≤ 3∥x − y∥_W,

and the proof is complete.

The following lemma introduces an important property of the projection operator Π^W_X.

Lemma 3.36. Given a convex set X, its projection operator Π^W_X satisfies:

    ⟨W(Π^W_X(x) − x), Π^W_X(x) − y⟩ ≤ 0   ∀y ∈ X.     (250)

Proof. Following the definition of Π^W_X, we have that:

    ∥x − Π^W_X(x)∥²_W ≤ ∥x − d∥²_W   ∀d ∈ X.     (251)

Since X is a convex set, consider a point d = αy + (1−α)Π^W_X(x) ∈ X, with y ∈ X and α ∈ [0, 1]; by (251) we obtain:

    ∥x − Π^W_X(x)∥²_W ≤ ∥x − (αy + (1−α)Π^W_X(x))∥²_W.

If we elaborate the squared norms in the inequality above we arrive at:

    0 ≤ α⟨W(Π^W_X(x) − x), y − Π^W_X(x)⟩ + (1/2)α²∥y − Π^W_X(x)∥²_W.

If we divide both sides by α and let α ↓ 0, we get (250).
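Inequality (250) can also be spot-checked numerically; the sketch below (ours) does so for the W-weighted projection onto a box, which for diagonal W reduces to plain componentwise clipping.

    import numpy as np

    # Sketch (ours): numerical spot-check of the variational inequality (250)
    # for X = [lb, ub]^n and diagonal W, where the W-projection is clipping.
    rng = np.random.default_rng(3)
    n = 4; lb, ub = -1.0, 1.0
    W = np.diag(rng.uniform(0.5, 2.0, size=n))
    proj = lambda x: np.clip(x, lb, ub)

    for _ in range(1000):
        x = 5 * rng.standard_normal(n)
        y = rng.uniform(lb, ub, size=n)          # arbitrary point of X
        p = proj(x)
        assert (W @ (p - x)) @ (p - y) <= 1e-12  # inequality (250) holds
    print("inequality (250) verified on random samples")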

The following lemma establishes an important property relating ∇f(x) and ∇⁺F(x).

Lemma 3.37. Given a function f that satisfies (199) and a convex set X, the following inequality holds:

    ⟨∇f(x) − ∇f(y), x − y⟩ ≤ 2∥∇⁺F(x) − ∇⁺F(y)∥_W ∥x − y∥_W   ∀x, y ∈ X.

Proof. Denote z = x − W⁻¹∇f(x); then, by replacing x with z and y with Π^W_X(y − W⁻¹∇f(y)) in Lemma 3.36, we obtain the following inequality:

    ⟨W(Π^W_X(z) − x) + ∇f(x), Π^W_X(z) − Π^W_X(y − W⁻¹∇f(y))⟩ ≤ 0.

From the definition of the projected gradient mapping, this inequality can be written as:

    ⟨∇f(x) − W∇⁺F(x), x − ∇⁺F(x) − y + ∇⁺F(y)⟩ ≤ 0.

If we further elaborate the inner product we obtain:

    ⟨∇f(x), x − y⟩ ≤ ⟨W∇⁺F(x), x − y⟩ + ⟨∇f(x), ∇⁺F(x) − ∇⁺F(y)⟩ − ⟨W∇⁺F(x), ∇⁺F(x) − ∇⁺F(y)⟩.     (252)

By adding two copies of (252) with x and y interchanged we have the inequality:

    ⟨∇f(x) − ∇f(y), x − y⟩ ≤ ⟨W(∇⁺F(x) − ∇⁺F(y)), x − y⟩ + ⟨∇f(x) − ∇f(y), ∇⁺F(x) − ∇⁺F(y)⟩.

From this inequality, through Cauchy-Schwarz and (199), we arrive at:

    ⟨∇f(x) − ∇f(y), x − y⟩ ≤ ∥∇⁺F(x) − ∇⁺F(y)∥_W (∥x − y∥_W + ∥∇f(x) − ∇f(y)∥_{W⁻¹})
                           ≤ 2∥∇⁺F(x) − ∇⁺F(y)∥_W ∥x − y∥_W,

and the proof is complete.

We now introduce the following lemma regarding the optimal set X^*.

Lemma 3.38. [72] Under Assumption 3.33, there exists a unique z^* such that:

    Px^* = z^*   ∀x^* ∈ X^*   and   ∇f(x) = Pᵀ∇f̄(z^*) + c   for all x ∈ Q = {y ∈ X : Py = z^*}.

Consider now a point x ∈ X and denote by q = Π^W_Q(x) the projection of the point x onto the set Q = {y ∈ X : Py = z^*}, as defined in Lemma 3.38, and by q̄ its projection onto the optimal set X^*, i.e. q̄ = Π^W_{X*}(q). Given the set Q, the distance to the optimal set can be decomposed as:

    ∥x − x̄∥_W ≤ ∥x − q̄∥_W ≤ ∥x − q∥_W + ∥q − q̄∥_W.

Given this inequality, the outline for proving the generalized error bound property (GEBP) from (227) in this case is to obtain appropriate upper bounds for ∥x − q∥_W and ∥q − q̄∥_W. In the sequel we introduce lemmas establishing bounds for each of these two terms.

Lemma 3.39. Under Assumption 3.33, there exists a constant γ₁ such that:

    ∥x − q∥²_W ≤ γ₁² (2/σ) ∥∇⁺F(x)∥_W ∥x − x̄∥_W   ∀x ∈ X.

Proof. Corollary 2.2 in [63] states that if we have two polyhedra:

    Ay ≤ b₁, Py = d₁   and   Ay ≤ b₂, Py = d₂,     (253)

then there exists a finite constant (the so-called Hoffman constant) γ₁ > 0 such that for a point y₁ satisfying the first set of constraints and a point y₂ satisfying the second one we have:

    ∥y₁ − y₂∥_W ≤ γ₁ ∥[Π_{ℝ₊}(b₁ − b₂); d₁ − d₂]∥_W.     (254)

Furthermore, the constant γ₁ depends only on the matrices A and P. Given that X is a polyhedral set, we can express it as X = {x ∈ ℝⁿ : Ax ≤ b}. Thus, for any x ∈ X, we can take (b₁ = b, d₁ = Px) and (b₂ = b, d₂ = z^*) in (253) such that:

    Ay ≤ b, Py = Px   and   Ay ≤ b, Py = z^*.     (255)

Evidently, the point x ∈ X is feasible for the first polyhedron in (255). Consider now a point y₂ feasible for the second polyhedron in (255). Therefore, from (254) there exists a Hoffman constant γ₁ such that:

    ∥x − y₂∥_W ≤ γ₁∥Px − z^*∥_W   ∀x ∈ X.

Furthermore, from the definition of q we get:

    ∥x − q∥²_W ≤ ∥x − y₂∥²_W ≤ γ₁²∥Px − z^*∥²_W   ∀x ∈ X.     (256)

From the strong convexity of f̄(z) we have the following property:

    σ∥Px − z^*∥²_W ≤ ⟨∇f̄(Px) − ∇f̄(Px̄), Px − Px̄⟩ = ⟨∇f(x) − ∇f(x̄), x − x̄⟩

for all x̄ ∈ X^*. From this inequality and Lemma 3.37 we obtain:

    σ∥Px − z^*∥²_W ≤ 2∥∇⁺F(x) − ∇⁺F(x̄)∥_W ∥x − x̄∥_W.

Since x̄ ∈ X^*, we have ∇⁺F(x̄) = 0. Thus, from the inequality above and (256) we get:

    ∥x − q∥²_W ≤ γ₁² (2/σ) ∥∇⁺F(x)∥_W ∥x − x̄∥_W

and the proof is complete.

Note that, if in (248) we have c = 0, then by definition Q = X^*, and thus ∥q − q̄∥_W = 0. In such a case, also note that q̄ = x̄ and through the previous lemma, in which we established an upper bound for ∥x − q∥_W, we can prove outright the error bound property (227) with κ₁ = 2γ₁²/σ and κ₂ = 0. If c ≠ 0, the following two lemmas are introduced to investigate the distance between a point and the solution set of a linear program and then to establish a bound for ∥q − q̄∥_W.

Lemma 3.40. Consider a linear program over a nonempty polyhedral set Y:

    min_{y∈Y} bᵀy,     (257)

and assume that the optimal set Y^* ⊆ Y is nonempty, convex and bounded. Let ȳ be the projection of a point y ∈ Y onto the optimal set Y^*. For this problem we have that:

    ∥y − ȳ∥_W ≤ γ₂ (∥y − ȳ∥_W + ∥b∥_{W⁻¹}) ∥y − Π^W_Z(y − W⁻¹b)∥_W   ∀y ∈ Y,     (258)

where Z is any convex set satisfying Y ⊆ Z and γ₂ is a constant depending on (Y, b).

Proof. Because the solution set Y^* is nonempty, convex and bounded, the linear program (257) is equivalent to the problem min_{y∈Y^*} bᵀy, and as a result (257) is solvable. By the duality theorem of linear programming, the dual problem of (257) is well defined, solvable, and strong duality holds for the dual:

    max_{µ∈Y′} l(µ),     (259)

where Y′ ⊆ ℝᵐ is the dual feasible set. For any pair of primal-dual feasible points (y, µ) for problems (257) and (259), we have a corresponding pair of optimal solutions (y^*, µ^*). By the solvability of (257) we have from Theorem 2 of [63] that there exists a constant γ₂, depending on Y and b, such that we have the bound:

    ∥[y − y^*; µ − µ^*]∥_{diag(W, I_m)} ≤ γ₂ |bᵀy − l(µ)|.

By strong duality, we have that l(µ^*) = bᵀȳ. Thus, taking µ = µ^* and through the optimality conditions of (257) we obtain ∥y − y^*∥_W ≤ γ₂⟨b, y − ȳ⟩. From this inequality and ∥y − ȳ∥_W ≤ ∥y − y^*∥_W we arrive at:

    ∥y − ȳ∥_W ≤ γ₂⟨b, y − ȳ⟩.     (260)

By Lemma 3.36, we have that:

    ⟨W(Π^W_Z(y − W⁻¹b) − (y − W⁻¹b)), Π^W_Z(y − W⁻¹b) − ȳ⟩ ≤ 0.

This inequality can be rewritten as:

    ⟨b, y − ȳ⟩ ≤ ⟨W(y − Π^W_Z(y − W⁻¹b)), y − ȳ + W⁻¹b + Π^W_Z(y − W⁻¹b) − y⟩
              ≤ ∥y − Π^W_Z(y − W⁻¹b)∥_W (∥y − ȳ∥_W + ∥b∥_{W⁻¹}).

From this inequality and (260) we obtain:

    ∥y − ȳ∥_W ≤ γ₂ (∥y − ȳ∥_W + ∥b∥_{W⁻¹}) ∥y − Π^W_Z(y − W⁻¹b)∥_W,

and the proof is complete.

Lemma 3.41. If Assumption 3.33 holds for optimization problem (248), then there exists a constant γ₂ > 0 such that:

    ∥q − q̄∥_W ≤ γ₂ (∥q − q̄∥_W + ∥∇f(x̄)∥_{W⁻¹}) ∥∇⁺F(q)∥_W   ∀x ∈ X.     (261)

Proof. By Lemma 3.38, we have that Px = z^* for all x ∈ Q. As a result, the optimization problem min_{x∈Q} f̄(z^*) + cᵀx has the same solution set as problem (248), due to the fact that X^* ⊆ Q ⊆ X. Since z^* is a constant, we can formulate the equivalent problem:

    min_{x∈Q} ∇f(x̄)ᵀx (= ∇f̄(z^*)ᵀz^* + cᵀx).

Note that ∇f(x̄) = Pᵀ∇f̄(z^*) + c is constant and under Assumption 3.33 we have that X^* is convex and bounded. Furthermore, since x̄, q ∈ Q, we have ∇f(x̄) = ∇f(q). Considering these details, and by taking Y = Q, Z = X, y = q and b = ∇f(x̄) in Lemma 3.40 and applying it to the previous problem, we obtain (261).

The next theorem establishes the generalized error bound property for optimization problems of the form (248) having objective functions satisfying Assumption 3.33.

Theorem 3.42. Under Assumption 3.33, the optimization problem (248) with F(x) = f̄(Px) + cᵀx + I_X(x) satisfies the following global generalized error bound property:

    ∥x − x̄∥_W ≤ (κ₁ + κ₂∥x − x̄∥²_W)∥∇⁺F(x)∥_W   ∀x ∈ X,     (262)

where κ₁ and κ₂ are two nonnegative constants.


Proof. Since x̄ ∈ X^*, we have ∇⁺F(x̄) = 0 and by Lemma 3.35:

    ∥∇⁺F(x)∥_W = ∥∇⁺F(x) − ∇⁺F(x̄)∥_W ≤ 3∥x − x̄∥_W.

From this inequality and by applying Lemma 3.35 again, we also have:

    ∥∇⁺F(q)∥²_W ≤ (∥∇⁺F(x)∥_W + ∥∇⁺F(q) − ∇⁺F(x)∥_W)²
               ≤ 2∥∇⁺F(x)∥²_W + 2∥∇⁺F(q) − ∇⁺F(x)∥²_W
               ≤ 6(∥∇⁺F(x)∥_W ∥x − x̄∥_W + 3∥q − x∥²_W).

From this and Lemma 3.41, we arrive at the following:

    ∥q − q̄∥²_W ≤ γ₂² (∥q − q̄∥_W + ∥∇f(x̄)∥_{W⁻¹})² ∥∇⁺F(q)∥²_W     (263)
              ≤ 6γ₂² (∥q − q̄∥_W + ∥∇f(x̄)∥_{W⁻¹})² (∥∇⁺F(x)∥_W ∥x − x̄∥_W + 3∥q − x∥²_W).

Note that since ∇f(x̄) is constant on Q ⊇ X^*, we can define the bound:

    ∥∇f(x̄)∥_{W⁻¹} = β   ∀x̄ ∈ X^*.

Furthermore, q̄ ∈ Q since X^* ⊆ Q. From this and through the nonexpansiveness of the projection operator we obtain:

    ∥q − q̄∥_W ≤ ∥q − x∥_W + ∥x − q̄∥_W ≤ ∥x − x̄∥_W + ∥x − q̄∥_W ≤ 2∥x − x̄∥_W + ∥x̄ − q̄∥_W
             ≤ 2∥x − x̄∥_W + ∥x − q∥_W ≤ 3∥x − x̄∥_W.

From this and (263) we get the following bound:

    ∥q − q̄∥²_W ≤ 6γ₂² (3∥x − x̄∥_W + β)² (∥∇⁺F(x)∥_W ∥x − x̄∥_W + 3∥q − x∥²_W)     (264)
              ≤ 6γ₂² (18∥x − x̄∥²_W + 2β²) (∥∇⁺F(x)∥_W ∥x − x̄∥_W + 3∥q − x∥²_W).

Given the definition of x̄ we have that:

    ∥x − x̄∥²_W ≤ ∥x − q̄∥²_W ≤ (∥x − q∥_W + ∥q − q̄∥_W)² ≤ 2∥x − q∥²_W + 2∥q − q̄∥²_W.

From Lemma 3.39 and (264), we can establish an upper bound for the right-hand side of the above inequality:

    ∥x − x̄∥²_W ≤ (κ₁ + κ₂∥x − x̄∥²_W) ∥∇⁺F(x)∥_W ∥x − x̄∥_W,     (265)

where:

    κ₁ = 24γ₂²β² (1 + 6γ₁²/σ) + 4γ₁²/σ   and   κ₂ = 256γ₂² (1 + 6γ₁²/σ).

If we divide both sides of (265) by ∥x − x̄∥_W, the proof is complete.


Case 3: Ψ polyhedral function
We now consider general optimization problems of the form:

    min_{x∈ℝⁿ} F(x) (= f̄(Px) + cᵀx + Ψ(x)),     (266)

where Ψ(x) is a polyhedral function. A function Ψ : ℝⁿ → ℝ is polyhedral if its epigraph, epi Ψ = {(x, ζ) : Ψ(x) ≤ ζ}, is a polyhedral set. There are numerous polyhedral functions Ψ, e.g. I_X(x) with X a polyhedral set, ∥x∥₁, ∥x∥_∞, or combinations of these functions. Note that an objective function with the structure (266) appears in many applications (see e.g. the constrained lasso problem (200) in Section 3.2.2). Now, for proving the generalized error bound property, we require that the objective function F satisfies the following assumption.

Assumption 3.43. We consider that f(x) = f̄(Px) + cᵀx satisfies Assumption 3.25. Further, we assume that f̄(z) is σ-strongly convex in z and the optimal set X^* is bounded. We also assume that Ψ(x) is bounded above on its domain, i.e. Ψ(x) ≤ Ψ̄ < ∞ for all x ∈ dom Ψ, and is L_Ψ-Lipschitz continuous w.r.t. the norm ∥·∥_W.

The proof of the generalized error bound property under Assumption 3.43 is similar to that of [72], but it requires new proof ideas and is done under different assumptions, e.g. that Ψ(x) is bounded above on its domain. Boundedness of Ψ is usually not restrictive in practical applications. Since Ψ(x) ≤ Ψ̄ is satisfied for any x ∈ dom Ψ, problem (266) is equivalent to the following one:

    min_{x∈ℝⁿ} f(x) + Ψ(x)   s.t. Ψ(x) ≤ Ψ̄.

Consider now an additional variable ζ ∈ ℝ. Then, the previous problem is equivalent to the following problem:

    min_{x∈ℝⁿ, ζ∈ℝ} f(x) + ζ   s.t. Ψ(x) ≤ ζ, Ψ(x) ≤ Ψ̄.     (267)

Take an optimal pair (x^*, ζ^*) for problem (267). We now prove that ζ^* = Ψ(x^*). Suppose that (x^*, ζ^*) is strictly feasible, i.e. Ψ(x^*) < ζ^*. Then (x^*, Ψ(x^*)) is feasible for (267) and the following inequality holds:

    f(x^*) + Ψ(x^*) < f(x^*) + ζ^*,

which contradicts the fact that (x^*, ζ^*) is optimal. Thus, the only remaining possibility is Ψ(x^*) = ζ^*. Further, it can be easily proved that (267) is equivalent to the following problem:

    min_{x∈ℝⁿ, ζ∈ℝ} f(x) + ζ   s.t. Ψ(x) ≤ ζ, ζ ≤ Ψ̄.     (268)

Now, if we denote z = [xᵀ ζ]ᵀ, then problem (268) can be rewritten as:

    min_{z∈Z⊆ℝ^{n+1}} F̃(z) (= f̄(P̃z) + c̃ᵀz),     (269)

where P̃ = [P 0] and c̃ = [cᵀ 1]ᵀ. The constraint set for this problem is:

    Z = {z = [xᵀ ζ]ᵀ : z ∈ epi Ψ, ζ ≤ Ψ̄}.


Recall that from Assumption 3.43 we have that epi Ψ is polyhedral, i.e. there exist a matrix C and a vector d such that we can express epi Ψ = {(x, ζ) : C[xᵀ ζ]ᵀ ≤ d}. Thus, we can write the constraint set Z as:

    Z = {z = [xᵀ ζ]ᵀ : [C; e_{n+1}ᵀ] z ≤ [d; Ψ̄]},

i.e. Z is polyhedral. Denote by Z^* the set of optimal points of problem (268). Then, from X^* being bounded in accordance with Assumption 3.43, and the fact that Ψ(x^*) = ζ^*, with Ψ a continuous function, it can be observed that Z^* is also bounded. We now denote z̄ = Π^{W̃}_{Z*}(z), where W̃ = diag(W, 1). Since problems (267) and (269) are equivalent, we can apply the theory of the previous subsection to problem (269). That is, we can find two nonnegative constants κ̃₁ and κ̃₂ such that:

    ∥z − z̄∥_{W̃} ≤ (κ̃₁ + κ̃₂∥z − z̄∥²_{W̃}) ∥∇⁺F̃(z)∥_{W̃}   ∀z ∈ Z.     (270)
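To make the polyhedral-epigraph representation above concrete, here is a tiny Python sketch (ours) for Ψ(x) = ∥x∥_∞ in ℝ²: membership of (x, ζ) in epi Ψ is checked through the linear inequalities ±x_i − ζ ≤ 0, i.e. through a matrix C and vector d as in the representation of Z.

    import numpy as np

    # Sketch (ours): epi of Psi(x) = ||x||_inf in R^2 as {(x, zeta): C z <= d}.
    C = np.array([[ 1.0,  0.0, -1.0],
                  [-1.0,  0.0, -1.0],
                  [ 0.0,  1.0, -1.0],
                  [ 0.0, -1.0, -1.0]])
    d = np.zeros(4)

    z = np.array([0.3, -0.8, 0.9])     # x = (0.3, -0.8), zeta = 0.9
    print(np.all(C @ z <= d))          # True: ||x||_inf = 0.8 <= zeta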

The proximal gradient mapping ∇⁺F̃(z) is defined in this case as:

    ∇⁺F̃(z) = z − Π^{W̃}_Z(z − W̃⁻¹∇F̃(z)),

where the projection operator Π^{W̃}_Z is defined in the same manner as Π^W_X. We now show that from the error bound inequality (270) we can derive an error bound inequality for problem (266). From the definitions of z, z̄ and W̃, we derive the following lower bound for the distance term in (270):

    ∥z − z̄∥_{W̃} = ∥[x − x̄; ζ − ζ̄]∥_{W̃} ≥ ∥x − x̄∥_W.     (271)

Further, note that we can express:

    ∥z − z̄∥²_{W̃} = ∥x − x̄∥²_W + (ζ − ζ̄)² = ∥x − x̄∥²_W + |ζ − ζ̄|².     (272)

Now, if ζ ≤ ζ̄, then from ζ̄ = Ψ(x̄) and the Lipschitz continuity of Ψ we have that:

    |ζ − ζ̄| = ζ̄ − ζ ≤ Ψ(x̄) − Ψ(x) ≤ L_Ψ∥x − x̄∥_W.

Otherwise, if ζ > ζ̄, we have that:

    |ζ − ζ̄| = ζ − ζ̄ ≤ Ψ̄ − ζ̄ ≤ |Ψ̄| + |ζ̄| =: κ′₁.

From these two inequalities we derive the following inequality for |ζ − ζ̄|²:

    |ζ − ζ̄|² ≤ (κ′₁ + L_Ψ∥x − x̄∥_W)² ≤ 2κ′₁² + 2L²_Ψ∥x − x̄∥²_W.

Therefore, the following upper bound for ∥z − z̄∥²_{W̃} is established:

    ∥z − z̄∥²_{W̃} ≤ 2κ′₁² + (2L²_Ψ + 1)∥x − x̄∥²_W.     (273)

We are now ready to present the main result of this section, which shows a generalized error bound property for problems (266) under general polyhedral functions Ψ:

Theorem 3.44. Under Assumption 3.43, problem (266) with F(x) = f̄(Px) + cᵀx + Ψ(x) satisfies the following global generalized error bound property:

    ∥x − x̄∥_W ≤ (κ₁ + κ₂∥x − x̄∥²_W)∥∇⁺F(x)∥_W   ∀x ∈ dom Ψ,     (274)

where κ₁ = (κ̃₁ + 2κ′₁²κ̃₂)(2L_Ψ + 1) and κ₂ = 2κ̃₂(2L_Ψ + 1)(2L²_Ψ + 1).

Proof. From the previous discussion, it remains to show that we can find an appropriate upper bound for ∥∇⁺F̃(z)∥_{W̃}. Given a point z = [xᵀ ζ]ᵀ, it can be observed that the gradient of F̃(z) is:

    ∇F̃(z) = [Pᵀ∇f̄(Px) + c; 1] = [∇f(x); 1].

Now, denote z⁺ = Π^{W̃}_Z(z − W̃⁻¹∇F̃(z)). Following the definitions of the projection operator and of ∇⁺F̃, note that z⁺ is expressed as:

    z⁺ = arg min_{y∈ℝⁿ, ζ′∈ℝ} (1/2)∥[y − (x − W⁻¹∇f(x)); ζ′ − (ζ − 1)]∥²_{W̃}   s.t. Ψ(y) ≤ ζ′, ζ′ ≤ Ψ̄.

Furthermore, from the definition of ∥·∥_{W̃}, note that we can also express z⁺ as:

    z⁺ = arg min_{y∈ℝⁿ, ζ′∈ℝ} ⟨∇f(x), y − x⟩ + (1/2)∥y − x∥²_W + (1/2)(ζ′ − ζ + 1)²   s.t. Ψ(y) ≤ ζ′, ζ′ ≤ Ψ̄.

Also, given the structure of z, consider that z⁺ = [T̃_{[N]}(x)ᵀ ζ′′]ᵀ. Now, by a simple change of variables, we can define the pair (T̃_{[N]}(x), ζ̃) as follows:

    (T̃_{[N]}(x), ζ̃) = arg min_{y∈ℝⁿ, ζ′∈ℝ} ⟨∇f(x), y − x⟩ + (1/2)∥y − x∥²_W + (1/2)(ζ′ + 1)²     (275)
                       s.t. Ψ(y) − ζ ≤ ζ′, ζ′ ≤ Ψ̄ − ζ.

Note that ζ̃ = ζ′′ − ζ, that we can express z⁺ = [T̃_{[N]}(x)ᵀ ζ̃ + ζ]ᵀ, and that:

    ∥∇⁺F̃(z)∥_{W̃} = ∥[x − T̃_{[N]}(x); −ζ̃]∥_{W̃}.

From (206) and (226), we can write ∇⁺F(x) = x − T_{[N]}(x) and recall that T_{[N]}(x) can be expressed as:

    T_{[N]}(x) = arg min_{y∈ℝⁿ} ⟨∇f(x), y − x⟩ + (1/2)∥y − x∥²_W + Ψ(y) − Ψ(x).

Thus, we can consider that T_{[N]}(x) belongs to a pair (T_{[N]}(x), ζ̂) which is the optimal solution of the following problem:

    (T_{[N]}(x), ζ̂) = arg min_{y∈ℝⁿ, ζ′∈ℝ} ⟨∇f(x), y − x⟩ + (1/2)∥y − x∥²_W + ζ′     (276)
                      s.t. Ψ(y) − Ψ(x) ≤ ζ′.


Following the same reasoning as in problem (267), note that ζ̂ = Ψ(T_{[N]}(x)) − Ψ(x). Through Fermat's rule [64] and problem (276), we establish that (T_{[N]}(x), ζ̂) can also be expressed as:

    (T_{[N]}(x), ζ̂) = arg min_{y∈ℝⁿ, ζ′∈ℝ} ⟨∇f(x) + W(T_{[N]}(x) − x), y − x⟩ + ζ′     (277)
                      s.t. Ψ(y) − Ψ(x) ≤ ζ′.

Therefore, since (T_{[N]}(x), ζ̂) is optimal for the problem above, we get the inequality:

    ⟨∇f(x) + W(T_{[N]}(x) − x), T_{[N]}(x) − x⟩ + ζ̂ ≤ ⟨∇f(x) + W(T_{[N]}(x) − x), T̃_{[N]}(x) − x⟩ + ζ̃.     (278)

Furthermore, since the pair (T̃_{[N]}(x), ζ̃) is optimal for problem (275), we can derive:

    ⟨∇f(x), T̃_{[N]}(x) − x⟩ + (1/2)∥T̃_{[N]}(x) − x∥²_W + (1/2)(ζ̃ + 1)²     (279)
    ≤ ⟨∇f(x), T_{[N]}(x) − x⟩ + (1/2)∥T_{[N]}(x) − x∥²_W + (1/2)(ζ̂ + 1)².

By adding up (278) and (279) we get the following relation:

    ∥T_{[N]}(x) − x∥²_W + (1/2)∥T̃_{[N]}(x) − x∥²_W + (1/2)(ζ̃ + 1)² + ζ̂
    ≤ (1/2)∥T_{[N]}(x) − x∥²_W + ⟨W(T_{[N]}(x) − x), T̃_{[N]}(x) − x⟩ + (1/2)(ζ̂ + 1)² + ζ̃.

If we further simplify this inequality we obtain:

    (1/2)∥T_{[N]}(x) − x∥²_W + (1/2)∥T̃_{[N]}(x) − x∥²_W − ⟨W(T_{[N]}(x) − x), T̃_{[N]}(x) − x⟩ + (1/2)ζ̃² ≤ (1/2)ζ̂².

Combining the first three terms in the left-hand side under one norm and multiplying both sides by 2, the inequality becomes:

    ∥(T̃_{[N]}(x) − x) − (T_{[N]}(x) − x)∥²_W + ζ̃² ≤ ζ̂².

From this, we derive the following two inequalities:

    ζ̃² ≤ ζ̂²   and   ∥(T̃_{[N]}(x) − x) − (T_{[N]}(x) − x)∥²_W ≤ ζ̂².

If we take square roots in both of these inequalities, and apply the triangle inequality to the second, we obtain:

    |ζ̃| ≤ |ζ̂|   and   ∥T̃_{[N]}(x) − x∥_W − ∥T_{[N]}(x) − x∥_W ≤ |ζ̂|.     (280)

Recall that ζ̂ = Ψ(T_{[N]}(x)) − Ψ(x); through the Lipschitz continuity of Ψ, we have from the first inequality of (280) that:

    |ζ̃| ≤ |ζ̂| = |Ψ(T_{[N]}(x)) − Ψ(x)| ≤ L_Ψ∥T_{[N]}(x) − x∥_W.


Furthermore, from the second inequality of (280) we obtain:

    ∥T̃_{[N]}(x) − x∥_W ≤ (L_Ψ + 1)∥T_{[N]}(x) − x∥_W.

From these, we arrive at the following upper bound on ∥∇⁺F̃(z)∥_{W̃}:

    ∥∇⁺F̃(z)∥_{W̃} = ∥[x − T̃_{[N]}(x); −ζ̃]∥_{W̃} ≤ ∥T̃_{[N]}(x) − x∥_W + |ζ̃|     (281)
                  ≤ (2L_Ψ + 1)∥T_{[N]}(x) − x∥_W = (2L_Ψ + 1)∥∇⁺F(x)∥_W.

Finally, from (270), (273) and (281) we obtain the following error bound property for problem (266):

    ∥x − x̄∥_W ≤ (κ₁ + κ₂∥x − x̄∥²_W)∥∇⁺F(x)∥_W,

where κ₁ = (κ̃₁ + 2κ′₁²κ̃₂)(2L_Ψ + 1) and κ₂ = 2κ̃₂(2L_Ψ + 1)(2L²_Ψ + 1).

Case 4: dual formulation
Consider now linearly constrained problems:

    min_{u∈ℝᵐ} g(u)   s.t. Au ≤ b,     (282)

where A ∈ ℝ^{n×m}. In many applications, however, the dual formulation is used, since the dual structure of the problem is easier; see e.g. applications such as network optimization [57] or network control [40]. Now, for proving the generalized error bound property, we require that g satisfies the following assumption:

Assumption 3.45. We consider that g is σ_g-strongly convex and has L_g-Lipschitz continuous gradient w.r.t. the Euclidean norm, and that there exists ū such that Aū < b.

Denoting by g^* the convex conjugate of the function g, it follows from Assumption 3.45 that g^* is (1/L_g)-strongly convex and has a (1/σ_g)-Lipschitz gradient [64]. Moreover, from Aū < b it follows that the set of optimal Lagrange multipliers is compact [64]. In conclusion, the primal problem (282) is equivalent to the following dual problem:

    max_{x∈ℝⁿ} −g^*(−Aᵀx) − ⟨x, b⟩ − Ψ(x),     (283)

where Ψ(x) = I_{ℝⁿ₊}(x) is the indicator function of the nonnegative orthant ℝⁿ₊. From the previous discussion with P = −Aᵀ, it follows that the dual problem (283) satisfies our generalized error bound property from Definition 3.28.
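As a concrete instance of this dual construction, the following sketch (ours) evaluates the dual objective (283) for a strongly convex quadratic g, whose conjugate is available in closed form; the problem data here are random illustrative assumptions.

    import numpy as np

    # Sketch (ours): for g(u) = 0.5*u'Qu + q'u with Q positive definite, the
    # conjugate is g*(v) = 0.5*(v-q)'Q^{-1}(v-q), so the dual (283) of
    # min g(u) s.t. Au <= b is max_{x>=0} -g*(-A'x) - b'x, which fits the
    # structure (248) with P = -A' and Psi the indicator of R^n_+.
    rng = np.random.default_rng(2)
    m, n = 5, 3
    M = rng.standard_normal((m, m)); Q = M @ M.T + np.eye(m)   # sigma_g > 0
    q, b = rng.standard_normal(m), rng.standard_normal(n)
    A = rng.standard_normal((n, m))

    def dual_value(x):
        v = -A.T @ x
        gstar = 0.5 * (v - q) @ np.linalg.solve(Q, v - q)
        return -gstar - b @ x

    print(dual_value(np.abs(rng.standard_normal(n))))  # a feasible x >= 0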

3.2.7 Numerical simulations

We now present some preliminary numerical results on solving constrained lasso problems of the form (200). The individual constraint sets X_i ⊆ ℝ^{n_i} are box constraints, i.e. X_i = {x_i : lb_i ≤ x_i ≤ ub_i}. The regularization parameters λ_i were chosen uniform over all components, i.e. λ_i = λ for all i. The numerical experiments were done for two instances of the regularization parameter, λ = 1 and λ = 10, and were conducted on a machine with 2 Intel(R) Xeon(R) E5410 quad-core CPUs @ 2.33 GHz and 8 GB of RAM. The matrices A ∈ ℝ^{m×n} were randomly generated in Matlab and have a high degree of sparsity (i.e. the measures of partial separability satisfy ω, ω̄ ≪ n). We solve a single randomly generated constrained lasso problem with a matrix A of dimension m = 0.99·10⁶ and n = N = 10⁶. In this case the two measures of separability have the values ω̄ = 37 and ω = 35. The problem was solved on τ = 1, 2, 4 and 7 cores in parallel using MPI, for λ = 10. From Figure 9 we can observe that for each τ our algorithm needs almost the same number of normalized coordinate updates τk/N to solve the problem. On the other hand, as Figure 10 shows, increasing the number of cores substantially reduces the number of normalized iterations k/N.

Figure 9: Evolution of F(x^k) − F^* along the normalized coordinate updates τk/N, for τ = 1, 2, 4, 7. [plot not reproduced]

Figure 10: Evolution of F(x^k) − F^* along the normalized iterations k/N, for τ = 1, 2, 4, 7. [plot not reproduced]
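For completeness, the per-coordinate operation behind these constrained lasso experiments admits a simple closed form; the sketch below (ours, not the project's MPI code) shows the W-weighted prox for Ψ_i(x_i) = λ|x_i| + indicator of the box [lb_i, ub_i], which is soft-thresholding followed by clipping.

    import numpy as np

    # Sketch (ours) of the per-coordinate prox for the constrained lasso:
    # Psi_i(x_i) = lam*|x_i| + I_{[lb_i, ub_i]}(x_i), W = diag(w).
    def prox_box_lasso(u, lam, w, lb, ub):
        v = np.sign(u) * np.maximum(np.abs(u) - lam / w, 0.0)  # soft-threshold
        return np.clip(v, lb, ub)                              # project on box

    # e.g. one (P-RCD) block update over coordinates J for f = 0.5*||Ax-b||^2:
    # x[J] = prox_box_lasso(x[J] - grad_J / w[J], lam, w[J], lb[J], ub[J])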


4 Papers having acknowledgement the MoCOBiDS project

4.1 Papers published in ISI journals

• I. Necoara, D. Clipici, Parallel random coordinate descent methods for composite minimization: convergence analysis and error bounds, SIAM Journal on Optimization, 26(1): 197–226, 2016.

• I. Necoara, A. Patrascu, Iteration complexity analysis of dual first order methods for conic convex programming, Optimization Methods and Software, 31(3): 645–678, 2016.

• N. Nguyen, S. Olaru, P. Rodriguez-Ayerbe, M. Hovd, I. Necoara, Constructive solution of inverse parametric linear/quadratic programming problems, Journal of Optimization Theory and Applications, DOI: 10.1007/s10957-016-0968-0, 2016.

• A. Patrascu, I. Necoara, Q. Tran-Dinh, Adaptive inexact fast augmented Lagrangian methods for constrained convex optimization, Optimization Letters, DOI: 10.1007/s11590-016-1024-6: 1–18, 2016.

• I. Necoara, Yu. Nesterov, F. Glineur, Random block coordinate descent for linearly-constrained optimization over networks, Journal of Optimization Theory and Applications, to appear, 2016.

4.2 Journal papers under review/in progress

• A. Patrascu, I. Necoara, On the convergence of inexact projection first order methods for convex minimization, submitted to IEEE Transactions on Automatic Control, November 2016.

• I. Necoara, A. Patrascu, F. Glineur, Complexity of first order Lagrangian and penalty methods for conic convex programming, submitted to Optimization Methods and Software, September 2016.

• I. Necoara, Yu. Nesterov, F. Glineur, Linear convergence of first order methods for non-strongly convex optimization, submitted to Mathematical Programming, July 2016.

• I. Necoara, A. Patrascu, P. Richtarik, Randomized projection methods for convex feasibility problems, in preparation, 2016.

• A. Patrascu, I. Necoara, Randomized proximal methods for convex minimization problems, in preparation, 2016.

• I. Necoara, A. Patrascu, DuQuad: a dual first order algorithm for quadratic programming, in preparation, 2016.

• I. Necoara, A. Patrascu, Iteration complexity analysis of coordinate descent methods for l0 regularized convex problems, in preparation, 2016.

4.3 Books

• I. Necoara, A. Patrascu, Decomposition Methods for Large Scale Mathematical Optimization, John Wiley & Sons, in preparation, 2017.


4.4 Book chapters

• I. Necoara, Coordinate gradient descent methods, chapter in: Big Data and Computational Intelligence in Networking, Y. Wu et al. (Eds.), Taylor and Francis LLC-CRC Press, 2016.

• I. Necoara, A. Patrascu, A. Nedich, Complexity certifications of first order inexact Lagrangian methods for general convex programming, chapter in: Developments in Model-Based Optimization and Control, S. Olaru et al. (Eds.), Springer, 2015.

4.5 Papers accepted/submitted in conferences

• I. Necoara, A. Patrascu, P. Richtarik, Randomized projection methods for convex feasibility problems, submitted to SIAM Conference on Optimization, 2017.

• I. Necoara, V. Nedelcu, D. Clipici, L. Toma, On fully distributed dual first order methods for convex network optimization, submitted to IFAC World Congress, 2017.

• T. Ionescu, I. Necoara, A scale-free moment matching-based model reduction technique of linear networks, submitted to IFAC World Congress, 2017.

• A. Patrascu, I. Necoara, Inexact projection primal first order methods for strongly convex minimization, submitted to IFAC World Congress, 2017.

• A. Patrascu, I. Necoara, Complexity certifications of inexact projection primal gradient method for convex problems: application to embedded MPC, Proceedings of Mediterranean Conference on Control and Automation, 2016.

• I. Necoara, L. Toma, V. Nedelcu, Optimal voltage control for loss minimization based on sequential convex programming, Proceedings of IEEE Conference Innovative Smart Grid Technologies Europe, 2016.

• I. Necoara, Yu. Nesterov, F. Glineur, Linear convergence of first order methods for nonstrongly convex optimization, invited paper in session: Recent advances on convergence rates of first-order methods, International Conference on Continuous Optimization, 2016.

• I. Necoara, A. Patrascu, F. Glineur, Complexity of first order inexact Lagrangian and penalty methods for conic convex programming, invited paper in session: First order methods for convex optimization problems, European Conference on Operational Research, 2016.


References

[1] B.D. Anderson, Y. Liu, Controller reduction: concepts and approaches, IEEE Transactions on Automatic Control, 34(8): 802–812, 1989.

[2] A.C. Antoulas, Approximation of Large-Scale Dynamical Systems, SIAM, Philadelphia, 2005.

[3] A. Astolfi, Model reduction by moment matching for linear and nonlinear systems, IEEE Transactions on Automatic Control, 55(10): 2321–2336, 2010.

[4] A. Astolfi, Model reduction by moment matching, steady-state response and projections, Proceedings of IEEE Conference on Decision and Control: 5344–5349, 2010.

[5] S. Bahmani, B. Raj, P.T. Boufounos, Greedy sparsity-constrained optimization, Journal of Machine Learning Research, 14(3): 807–841, 2013.

[6] A. Beck, A. Nedic, A. Ozdaglar, M. Teboulle, Optimal distributed gradient methods for network resource allocation problems, IEEE Transactions on Control of Network Systems, 1(1): 64–73, 2014.

[7] E. Birgin, J. Martinez, M. Raydan, Inexact spectral projected gradient methods on convex sets, IMA Journal of Numerical Analysis, 23: 539–559, 2003.

[8] C. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006.

[9] Y. Chalaoui, D. Lemonnier, A. Vanderdorpe, P. Van Dooren, Second-order balanced truncation, Linear Algebra and its Applications, 415(2-3): 1192–1214, 2004.

[10] E.J. Candes, T. Tao, Near-optimal signal recovery from random projections: universal encoding strategies, IEEE Transactions on Information Theory, 52(12): 5406–5425, 2004.

[11] M. Castells, The Rise of the Network Society, Blackwell Publishing, 2000.

[12] A. Conejo, J. Aguado, Multi-area coordinated decentralized DC optimal power flow, IEEE Transactions on Power Systems, 13(4): 1272–1278, 1998.

[13] M. Chiang, S. Low, A. Calderbank, J. Doyle, Layering as optimization decomposition: a mathematical theory of network architectures, Proceedings of the IEEE, 91(1): 255–312, 2007.

[14] X. Chen, M.K. Ng, C. Zhang, Non-Lipschitz ℓp-regularization and box constrained model for image restoration, IEEE Transactions on Image Processing, 21(12): 4709–4721, 2012.

[15] E. de Souza, S.P. Bhattacharyya, Controllability, observability and the solution of AX − XB = C, Linear Algebra and its Applications, 39: 167–188, 1981.

[16] F. Deutsch, H. Hundal, The rate of convergence of Dykstra's cyclic projections algorithm: the polyhedral case, Numerical Functional Analysis and Optimization, 15(5-6), 1994.

[17] A. Domahidi, A. Zgraggen, M. Zeilinger, C. Jones, Efficient interior point methods for multistage problems arising in receding horizon control, Proceedings of IEEE Conference on Decision and Control: 668–674, 2012.


[18] H. Ferreau, C. Kirches, A. Potschka, H. Bock, M. Diehl, qpOASES: a parametric active-set algorithm for quadratic programming, Mathematical Programming Computation, 6(4): 327–363, 2014.

[19] R.W. Freund, SPRIM: structure-preserving reduced-order interconnect macromodeling, Proceedings of Conference on Computer-Aided Design: 80–87, 2004.

[20] K. Gallivan, A. Vandendorpe, P. Van Dooren, Model reduction of MIMO systems via tangential interpolation, SIAM Journal on Matrix Analysis and Applications, 26(2): 328–349, 2004.

[21] J. Gauvin, A necessary and sufficient regularity condition to have bounded multipliers in nonconvex programming, Mathematical Programming, 12: 136–138, 1977.

[22] K. Gallivan, A. Vandendorpe, P. Van Dooren, Sylvester equations and projection based model reduction, Journal of Computational and Applied Mathematics, 162: 213–229, 2004.

[23] M. Hong, X. Wang, M. Razaviyayn, Z.Q. Luo, Iteration complexity analysis of block coordinate descent methods, http://arxiv.org/abs/1310.6957, 2013.

[24] T.C. Ionescu, A. Astolfi, P. Colaneri, Families of moment matching based low order approximations for linear systems, Systems & Control Letters, 64: 47–56, 2014.

[25] T.C. Ionescu, J.M.A. Scherpen, O.V. Iftime, A. Astolfi, Balancing as a moment matching problem, International Symposium on Mathematical Theory of Networks and Systems, 2012.

[26] J.L. Jerez, P. Goulart, S. Richter, G. Constantinides, E. Kerrigan, M. Morari, Embedded online optimization for model predictive control at megahertz rates, IEEE Transactions on Automatic Control, 59(12): 3238–3251, 2014.

[27] M. Journee, Y. Nesterov, P. Richtarik, R. Sepulchre, Generalized power method for sparse principal component analysis, Journal of Machine Learning Research, 11: 517–553, 2010.

[28] D. Klatte, G. Thiere, Error bounds for solutions of linear equations and inequalities, Mathematical Methods of Operations Research, 41: 191–214, 1995.

[29] V. Kekatos, G. Giannakis, Distributed robust power system state estimation, IEEE Transactions on Power Systems, 28(2): 1617–1626, 2013.

[30] D. Leventhal, A.S. Lewis, Randomized methods for linear constraints: convergence rates and conditioning, Mathematics of Operations Research, 35(3): 641–654, 2010.

[31] A. Lutowska, Model order reduction for coupled systems using low-rank approximations, PhD thesis, Eindhoven University of Technology, 2012.

[32] G. Lan, R. Monteiro, Iteration-complexity of first-order augmented Lagrangian methods for convex programming, Mathematical Programming, 155(1): 511–547, 2016.

[33] J. Liu, S. Wright, C. Re, V. Bittorf, S. Sridhar, An asynchronous parallel stochastic coordinate descent algorithm, Journal of Machine Learning Research, 16(1): 285–322, 2015.

[34] Z. Lu, Iterative hard thresholding methods for ℓ0 regularized convex cone programming, Mathematical Programming, DOI: 10.1007/s10107-013-0714-4, 2014.


[35] Z. Lu, L. Xiao, On the complexity analysis of randomized block-coordinate descent methods, Mathematical Programming, 152(1-2): 615–642, 2015.

[36] M. Meinel, M. Ulbrich, S. Albrecht, A class of distributed optimization methods with event-triggered communication, Computational Optimization and Applications, 2014.

[37] D.G. Meyer, S. Srinivasan, Balancing and model reduction for second-order form linear systems, IEEE Transactions on Automatic Control, 41(11), 1996.

[38] W. Murray, T. Tinoco De Rubira, A. Wigington, A robust and informative method for solving large-scale power flow problems, Computational Optimization and Applications, 62: 431–475, 2015.

[39] M. Nagahara, D.E. Quevedo, J. Ostergaard, Sparse packetized predictive control for networked control over erasure channels, IEEE Transactions on Automatic Control, 59(7): 1899–1905, 2014.

[40] I. Necoara, J. Suykens, Application of a smoothing technique to decomposition in convex optimization, IEEE Transactions on Automatic Control, 53(11): 2674–2679, 2008.

[41] I. Necoara, V. Nedelcu, Rate analysis of inexact dual first order methods: application to dual decomposition, IEEE Transactions on Automatic Control, 59(5): 1232–1243, 2014.

[42] I. Necoara, D. Clipici, S. Olaru, Distributed model predictive control of leader-follower systems using an interior point method with efficient computations, Proceedings of American Control Conference: 1697–1702, 2013.

[43] I. Necoara, A. Patrascu, Iteration complexity analysis of dual first order methods for conic convex programming, Optimization Methods and Software, 31(3): 645–678, 2016.

[44] I. Necoara, V. Nedelcu, I. Dumitrache, Parallel and distributed optimization methods for estimation and control in networks, Journal of Process Control, 21(5): 756–766, 2011.

[45] I. Necoara, V. Nedelcu, On linear convergence of a distributed dual gradient algorithm for linearly constrained separable convex problems, Automatica, 55(5): 209–216, 2015.

[46] I. Necoara, C. Savorgnan, Q. Tran-Dinh, J. Suykens, M. Diehl, Distributed nonlinear optimal control using sequential convex programming and smoothing techniques, IEEE Conference on Decision and Control: 543–548, 2009.

[47] I. Necoara, Random coordinate descent algorithms for multi-agent convex optimization over networks, IEEE Transactions on Automatic Control, 58(8): 2001–2012, 2013.

[48] I. Necoara, D. Clipici, Parallel random coordinate descent methods for composite minimization: convergence analysis and error bounds, SIAM Journal on Optimization, 26(1): 197–226, 2016.

[49] I. Necoara, A. Patrascu, A random coordinate descent algorithm for optimization problems with composite objective function and linear coupled constraints, Computational Optimization and Applications, 57(2): 307–337, 2014.

[50] A. Nedic, A. Ozdaglar, Approximate primal solutions and rate analysis for dual subgradient methods, SIAM Journal on Optimization, 19(4): 1757–1780, 2009.


[51] Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM Journal on Optimization, 22(2): 341–362, 2012.

[52] Yu. Nesterov, Gradient methods for minimizing composite objective functions, Mathematical Programming, 140(1): 125–161, 2013.

[53] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer, 2004.

[54] A. Nemirovski, A. Juditsky, G. Lan, A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 19: 1574–1609, 2009.

[55] N. Ozay, M. Sznaier, C.M. Lagoa, O.I. Camps, A sparsification approach to set membership identification of switched affine systems, IEEE Transactions on Automatic Control, 57(3): 634–648, 2012.

[56] A. Patrascu, I. Necoara, Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization, Journal of Global Optimization, 61(1): 19–46, 2015.

[57] S. Sundhar Ram, A. Nedic, V. Veeravalli, Incremental stochastic subgradient algorithms for convex optimization, SIAM Journal on Optimization, 20(2): 691–717, 2009.

[58] S. Ryali, K. Supekar, D.A. Abrams, V. Menone, Sparse logistic regression for whole-brain classification of fMRI data, NeuroImage, 51(2): 752–764, 2010.

[59] T. Reis, T. Stykel, Stability analysis and model order reduction for coupled systems, Mathematical and Computer Modelling of Dynamical Systems, 13(5): 413–436, 2007.

[60] M. Razaviyayn, Successive convex approximation: analysis and applications, PhD thesis, University of Minnesota, 2014.

[61] P. Richtarik, M. Takac, Parallel coordinate descent methods for big data optimization, Mathematical Programming, 2015.

[62] P. Richtarik, M. Takac, Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, Mathematical Programming, 144(1-2): 1–38, 2014.

[63] S.M. Robinson, Bounds for error in the solution set of a perturbed linear program, Linear Algebra and its Applications, 6: 69–81, 1973.

[64] R. Rockafellar, R. Wets, Variational Analysis, Springer-Verlag, New York, 1998.

[65] J. Rawlings, D. Mayne, Model Predictive Control: Theory and Design, Nob Hill Publishing, 2009.

[66] M. Schmidt, N. Le Roux, F. Bach, Convergence rates of inexact proximal-gradient methods for convex optimization, Advances in Neural Information Processing Systems (NIPS), 2011.

[67] G. Stathopoulos, A. Szucs, Y. Pu, C. Jones, Splitting methods in control, European Control Conference, 2014.

[68] B.M. Sanandaji, T.L. Vincent, M.B. Wakin, Concentration of measure for compressive Toeplitz matrices with applications to detection and system identification, IEEE Conference on Decision and Control: 2922–2929, 2010.


[69] H. Sandberg, An extension to balanced truncation with application to structured model reduction, IEEE Transactions on Automatic Control, 55(4): 1038–1043, 2010.

[70] H. Sandberg, R.M. Murray, Model reduction of interconnected linear systems, Optimal Control Applications and Methods, 30(3): 225–245, 2009.

[71] G. Schelfourt, B. De Moor, A note on closed-loop balanced truncation, IEEE Transactions on Automatic Control, 41(10): 1498–1500, 1996.

[72] P. Tseng, S. Yun, A coordinate gradient descent method for nonsmooth separable minimization, Mathematical Programming, 117: 387–423, 2009.

[73] A. Vandendorpe, P. Van Dooren, Model reduction of interconnected systems, chapter in: Model Order Reduction: Theory, Research Aspects and Applications: 305–321, 2008.

[74] A. Varga, B. Anderson, Accuracy-enhancing methods for balancing-related frequency-weighted model and controller reduction, Automatica, 39(5): 919–927, 2003.

[75] L. Xiao, S. Boyd, Optimal scaling of a gradient method for distributed resource allocation, Journal of Optimization Theory and Applications, 129(3), 2006.

[76] R. Zimmerman, C. Murillo-Sanchez, R. Thomas, MATPOWER: steady-state operations, planning, and analysis tools for power systems research and education, IEEE Transactions on Power Systems, 26(1): 12–19, 2011.

[77] Q.C. Zhong, Robust Control of Time-Delay Systems, Springer, 2006.

[78] H. Zhang, L. Cheng, Restricted strong convexity and its applications to convergence analysis of gradient type methods in convex optimization, Optimization Letters, 9: 961–979, 2015.

[79] P.W. Wang, C.J. Lin, Iteration complexity of feasible descent methods for convex optimization, Journal of Machine Learning Research, 15: 1523–1548, 2014.

[80] S. Wright, Coordinate descent algorithms, Mathematical Programming, 151(1): 3–34, 2015.
