
  • Parallel and Distributed Sparse Optimization Algorithms, Part I

    Ruoyu Li

    Department of Computer Science and Engineering, University of Texas at Arlington

    March 19, 2015


  • Outline

    1 Background and Problem
        Backgrounds
        Related Works
        Problem Formulation

    2 Algorithms
        Distributed Algorithm of Proximal Linear Algorithm
        Parallel Greedy Coordinate Descent Method



  • Paper We Present

    The paper we present today:
    Title: "Parallel and distributed sparse optimization."
    Authors: Zhimin Peng, Ming Yan, and Wotao Yin.
    Published in: Signals, Systems and Computers, 2013 Asilomar Conference on. IEEE, 2013.
    Affiliation: Dept. of Mathematics, UCLA.


  • Motivations

    1 Data is big: often too big to process on a single workstation.

    2 Distributed data. Because data are usually collected and stored separately, it is natural to process them separately and locally.

    3 Current algorithms are not scalable enough and do not substantially reduce the total processing time.



  • Related Works: Single-Thread Algorithms for Sparse Optimization

    Subgradient descent: LP, SOCP [1], SDP.

    Smoothing: approximate the $\ell_1$ norm by the smoothed norm $\sum_i \sqrt{x_i^2 + \epsilon}$ or by the Huber norm [2], then apply gradient descent or quasi-Newton methods.

    Splitting: split the smooth and non-smooth terms into several simple subproblems.
        operator splitting: GPSR / FPC / SpaRSA / FISTA / SPGL1 / ...
        dual operator splitting: Bregman / ADMM / split Bregman [3] / ...

    Greedy algorithms: OMP [4], CoSaMP, ...

    [1] Second-Order Cone Programming
    [2] http://www.seas.ucla.edu/~vandenbe/236C/lectures/smoothing.pdf
    [3] Setzer, Simon, Gabriele Steidl, and Tanja Teuber. "Deblurring Poissonian images by split Bregman techniques." Journal of Visual Communication and Image Representation 21.3 (2010): 193-199.
    [4] Orthogonal Matching Pursuit

  • Related Works: State-of-the-Art Parallel Sparse Optimization Algorithms

    Distributed ADMM. The total time complexity is not actually reduced, because the iteration count grows with the number of distributed blocks. Ref [9-10]

    Parallel Coordinate Descent. Randomly, cyclically, or greedily select a subset of blocks to update at each iteration. Ref [11-13]

    The proposed algorithms have a clear advantage over distributed ADMM, and are more efficient than distributed coordinate descent under certain coordinate-orthogonality assumptions.


  • Contributions

    In this paper, the authors propose two parallel algorithms.

    One is motivated by separable objective functions,

    $$f(x) = \sum_{s=1}^{S} f_s(x_s), \quad \text{or} \quad f(x) = \sum_{s=1}^{S} f_s(x) \;\; \text{(partially separable)}, \qquad (1)$$

    and it parallelizes existing proximal-linear algorithms, e.g. ISTA, FISTA, FPC.

    The other is based on data orthogonality. Both the blocks and the variables within each block are selected greedily. We argue that greedy coordinate-block selection can also be fast.



  • Problem Definition

    In this talk, we consider sparse optimization problems with some underlying structure in the solution. The objective function is

    $$\min_{x \in \mathbb{R}^n} F(x) = \lambda R(x) + L(Ax, b), \qquad (2)$$

    where R(x) is a non-smooth regularizer and L(Ax, b) is the data-fidelity (loss) function, which is usually smooth.

    Notice

    Whatever algorithm we use, many Ax and $A^T y$ computations are necessary. Accelerating these matrix operations is a big win for overall speed.


  • The Separability of Objective Function

    1 Many regularization terms R(x), e.g. $\ell_1$, $\ell_{1,2}$, the Huber function, and the elastic net, are separable: $R(x) = \sum_{b=1}^{B} R(x_b)$.

    2 Loss functions L(Ax, b), e.g. the square, logistic, and hinge losses, can also be separated: $L(Ax, b) = \sum_{b=1}^{B} L(A_b x, b_b)$.

    3 Usually, we only require one of them to be separable. (A small numerical check of this separability is sketched below.)

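    As a quick sanity check (illustrative numpy code, not from the paper; the block sizes are arbitrary), the following verifies that the $\ell_1$ regularizer splits across coordinate blocks and the squared loss splits across row blocks of A and b:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))
b = rng.standard_normal(8)
x = rng.standard_normal(6)

# R(x) = ||x||_1 splits across coordinate blocks x_1, ..., x_B.
x_blocks = np.split(x, 3)                      # three blocks of two coordinates
r_total = np.abs(x).sum()
r_split = sum(np.abs(xb).sum() for xb in x_blocks)

# L(Ax, b) = 0.5 * ||Ax - b||_2^2 splits across row blocks (A_b, b_b).
row_blocks = zip(np.split(A, 4), np.split(b, 4))
l_total = 0.5 * np.sum((A @ x - b) ** 2)
l_split = sum(0.5 * np.sum((Ab @ x - bb) ** 2) for Ab, bb in row_blocks)

print(np.isclose(r_total, r_split), np.isclose(l_total, l_split))  # True True
```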

  • Data Distribution Scenarios

    Figure: Data distribution scenarios for matrix A: (a) row blocks, (b) column blocks, and (c) general blocks.

    Benefits:

    1 When A is very large, each local machine only needs a small amount of storage.

    2 Sometimes A is collected or generated separately, so it is easier to process and update it locally.


  • Examples

    In scenario a, computing Ax and $A^T y$:

    $$Ax = \begin{bmatrix} A_{(1)}x \\ A_{(2)}x \\ \vdots \\ A_{(M)}x \end{bmatrix} \qquad (3)$$

    and $A^T y = \sum_{i=1}^{M} A_{(i)}^T y_i$, where $y_i$ is the i-th block of y.

    For computing Ax, each block of A can be stored and processed independently on its own node, but x must be broadcast to every node and updated at every iteration.

    For computing $A^T y$, a reduce operation sums the partial results from all nodes.

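    A minimal numpy sketch of scenario a (hypothetical sizes; the M nodes are simulated by a Python loop, whereas a real deployment would use MPI broadcast and reduce operations):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, M = 12, 5, 3                       # rows, columns, number of row blocks (nodes)
A = rng.standard_normal((m, n))
A_blocks = np.split(A, M)                # node i stores A_(i) locally
x = rng.standard_normal(n)
y = rng.standard_normal(m)
y_blocks = np.split(y, M)                # y_i is the i-th block of y

# Ax: broadcast x to every node, each computes A_(i) x; stacking the results gives Ax.
Ax = np.concatenate([Ai @ x for Ai in A_blocks])

# A^T y: each node computes A_(i)^T y_i; a reduce (sum) over nodes assembles A^T y.
ATy = sum(Ai.T @ yi for Ai, yi in zip(A_blocks, y_blocks))

print(np.allclose(Ax, A @ x), np.allclose(ATy, A.T @ y))  # True True
```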

  • Examples-continued

    In scenario b, computing Ax and $A^T y$:

    $$A^T y = \begin{bmatrix} A_{(1)}^T y \\ A_{(2)}^T y \\ \vdots \\ A_{(N)}^T y \end{bmatrix} \qquad (4)$$

    and $Ax = \sum_{i=1}^{N} A_{(i)} x_i$, where $x_i$ is the i-th block of x.

    For computing $A^T y$, y must be broadcast to every node and updated at every iteration.

    For computing Ax, a reduce operation sums the partial results from all nodes.


  • Examples-continued

    In scenario c, computing either $A^T y$ or Ax requires a mixed use of broadcast and reduce operations. Take Ax as an example:

    $$Ax = \begin{bmatrix} A_{(1)}x \\ A_{(2)}x \\ \vdots \\ A_{(M)}x \end{bmatrix} = \sum_{j=1}^{N} \begin{bmatrix} A_{1,j} x_j \\ A_{2,j} x_j \\ \vdots \\ A_{M,j} x_j \end{bmatrix} \qquad (5)$$

    1 Broadcast $x_j$ to nodes $(1, j), (2, j), \dots, (M, j)$.
    2 In parallel, compute all $A_{i,j} x_j$ for $i = 1, \dots, M$ and $j = 1, \dots, N$.
    3 Apply a reduce over nodes $(i, 1), (i, 2), \dots, (i, N)$ to obtain $A_{(i)}x = \sum_{j=1}^{N} A_{i,j} x_j$. (These steps are sketched below.)

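    A minimal numpy sketch of the three steps above for scenario c (hypothetical grid partition; the M × N nodes are simulated by nested loops):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, M, N = 12, 10, 3, 2                  # A is split over a grid of M x N nodes
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# Node (i, j) stores the sub-block A_{i,j}.
row_cuts = np.split(np.arange(m), M)
col_cuts = np.split(np.arange(n), N)
A_grid = [[A[np.ix_(r, c)] for c in col_cuts] for r in row_cuts]
x_blocks = [x[c] for c in col_cuts]        # step 1: x_j is broadcast down column j

# Step 2: every node computes its local product A_{i,j} x_j in parallel.
partial = [[A_grid[i][j] @ x_blocks[j] for j in range(N)] for i in range(M)]

# Step 3: reduce across each row of nodes: A_(i) x = sum_j A_{i,j} x_j.
Ax = np.concatenate([sum(partial[i][j] for j in range(N)) for i in range(M)])

print(np.allclose(Ax, A @ x))              # True
```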


  • Distributed Algorithm of Proximal Linear Algorithm

    Proximal-linear algorithms, e.g. ISTA, FISTA, and FPC, are iterations of a gradient step and a proximal operator. When computing the gradient,

    $$\nabla_x L(Ax^k, b) = A^T \nabla L(Ax^k, b), \qquad (6)$$

    we can see that it can be accelerated by distributed computation of Ax and $A^T y$.

    When solving the proximal operator, we can exploit the separability of the regularizer R(x), and its solution is often easy to obtain, e.g. $|x|_1$ has a closed-form soft-thresholding solution.


  • Distributed Algorithm of Proximal Linear Algorithm

    Using a Taylor expansion, we obtain a quadratic approximation of L(Ax, b); applying the majorization-minimization (MM) method gives the following subproblem at each iteration:

    $$x^{k+1} \leftarrow \arg\min_x \; \lambda R(x) + \langle x, A^T \nabla L(Ax^k, b) \rangle + \frac{1}{2\delta_k} \|x - x^k\|_2^2, \qquad (7)$$

    whose solution can be written with the proximal operator:

    $$x^{k+1} = \mathrm{prox}_{\delta_k \lambda R}\left(x^k - \delta_k A^T \nabla L(Ax^k, b)\right), \qquad (8)$$

    where $\mathrm{prox}_{\lambda R}(t) = \arg\min_x \; \lambda R(x) + \frac{1}{2}\|x - t\|_2^2$.

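    For $R = |\cdot|_1$, the proximal operator is the soft-thresholding function and Eq (8) becomes the familiar ISTA step. A minimal sketch, assuming the squared loss of the LASSO (Eq (9) on a later slide) and an illustrative step size:

```python
import numpy as np

def soft_threshold(t, tau):
    """prox_{tau*|.|_1}(t) = argmin_x tau*||x||_1 + 0.5*||x - t||_2^2."""
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def prox_linear_step(x, A, b, lam, delta):
    """One iteration of Eq (8) with L(Ax, b) = 0.5*||Ax - b||_2^2."""
    grad = A.T @ (A @ x - b)               # A^T grad L(Ax, b)
    return soft_threshold(x - delta * grad, delta * lam)

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
x = np.zeros(50)
delta = 1.0 / np.linalg.norm(A, 2) ** 2    # step size <= 1 / ||A||_2^2
for _ in range(100):
    x = prox_linear_step(x, A, b, lam=0.1, delta=delta)
print("non-zeros:", np.count_nonzero(x))
```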

  • Distributed Algorithm of Proximal Linear Algorithm

    In scenario a,

    1 Every node i keeps $A_{(i)}$, $b_i$, and the current $x^k$.

    2 In parallel, every node i first computes $A_{(i)} x^k$ and then $\nabla L_i(A_{(i)} x^k, b_i)$. A reduce operation then assembles $A^T \nabla L(Ax^k, b) = \sum_{i=1}^{M} A_{(i)}^T \nabla L_i(A_{(i)} x^k, b_i)$.

    3 With this gradient, Eq (8) is solved easily. (A sketch of one such iteration follows below.)

    To distribute the computation this way, the loss function L(Ax, b) must be separable.

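    A minimal sketch of one such distributed iteration in scenario a, assuming the squared loss; the nodes are simulated by a list and a Python `sum`, which a real implementation would replace with an MPI Allreduce:

```python
import numpy as np

def soft_threshold(t, tau):
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def distributed_prox_linear_step(x, A_blocks, b_blocks, lam, delta):
    """Scenario a: node i holds (A_(i), b_i) and a copy of the current x."""
    # Step 2: every node computes its local gradient contribution in parallel.
    local = [Ai.T @ (Ai @ x - bi) for Ai, bi in zip(A_blocks, b_blocks)]
    grad = sum(local)                      # reduce (e.g. Allreduce) over the M nodes
    # Step 3: every node applies the same prox update to its copy of x.
    return soft_threshold(x - delta * grad, delta * lam)

rng = np.random.default_rng(4)
A = rng.standard_normal((30, 40))
b = rng.standard_normal(30)
A_blocks, b_blocks = np.split(A, 3), np.split(b, 3)   # M = 3 nodes
x = np.zeros(40)
delta = 1.0 / np.linalg.norm(A, 2) ** 2
for _ in range(200):
    x = distributed_prox_linear_step(x, A_blocks, b_blocks, lam=0.1, delta=delta)
print("non-zeros:", np.count_nonzero(x))
```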

  • Two Examples

    Distributed LASSO:

    $$\min_x \; \lambda |x|_1 + \frac{1}{2}\|Ax - b\|_2^2 \qquad (9)$$

    In this case, both R and L are separable.

    Distributed sparse logistic regression:

    $$\min_{w, c} \; \lambda \|w\|_1 + \frac{1}{m}\sum_{i=1}^{m} \log\left(1 + \exp\left(-b_i (w^T a_i + c)\right)\right) \qquad (10)$$

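    As a companion to Eq (10), here is a small sketch of the sparse logistic regression objective and the gradients of its smooth part (illustrative data; the sum over samples is what makes this loss separable across row blocks):

```python
import numpy as np

def logistic_objective_and_grads(w, c, A, b, lam):
    """Eq (10): lam*||w||_1 + (1/m) * sum_i log(1 + exp(-b_i (a_i^T w + c))).
    Returns the objective value and the gradients of the smooth part w.r.t. w and c."""
    m = A.shape[0]
    z = -b * (A @ w + c)                        # z_i = -b_i (a_i^T w + c)
    loss = np.mean(np.logaddexp(0.0, z))        # numerically stable log(1 + e^z)
    sigma = np.exp(z - np.logaddexp(0.0, z))    # e^z / (1 + e^z), stable
    grad_w = A.T @ (-b * sigma) / m
    grad_c = np.mean(-b * sigma)
    return lam * np.abs(w).sum() + loss, grad_w, grad_c

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 20))              # rows are the samples a_i^T
b = rng.choice([-1.0, 1.0], size=100)           # labels in {-1, +1}
w, c = np.zeros(20), 0.0
obj, grad_w, grad_c = logistic_objective_and_grads(w, c, A, b, lam=0.05)
print(obj, np.linalg.norm(grad_w), grad_c)
```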

  • Parallel FISTA

    Figure: Algorithm-1, p-FISTA for scenario b

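    The algorithm figure itself is not reproduced in this transcript. The following is a rough, non-authoritative reconstruction of a p-FISTA-style iteration for scenario b, based only on the surrounding slides (column blocks $x_j$ kept on separate nodes, Ax assembled by a reduce, block-wise soft-thresholding, FISTA momentum); the block partition, step size, and momentum bookkeeping are assumptions, not the authors' exact pseudocode:

```python
import numpy as np

def soft_threshold(t, tau):
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def p_fista_lasso(A_blocks, b, lam, delta, iters=300):
    """Block-parallel FISTA sketch for min lam*||x||_1 + 0.5*||Ax - b||_2^2,
    with A stored as column blocks A_j (scenario b); nodes simulated by a loop."""
    x = [np.zeros(Aj.shape[1]) for Aj in A_blocks]       # node j keeps its block x_j
    y = [xj.copy() for xj in x]                          # FISTA auxiliary point y_j
    t = 1.0
    for _ in range(iters):
        Ay = sum(Aj @ yj for Aj, yj in zip(A_blocks, y)) # reduce: Ay = sum_j A_j y_j
        r = Ay - b                                       # residual, shared with all nodes
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        x_new = []
        for j, (Aj, yj) in enumerate(zip(A_blocks, y)):  # in parallel on each node j
            xj = soft_threshold(yj - delta * (Aj.T @ r), delta * lam)
            y[j] = xj + (t - 1.0) / t_new * (xj - x[j])  # local momentum update
            x_new.append(xj)
        x, t = x_new, t_new
    return np.concatenate(x)

rng = np.random.default_rng(6)
A = rng.standard_normal((40, 60))
x_true = np.zeros(60); x_true[:5] = 3.0
b = A @ x_true + 0.01 * rng.standard_normal(40)
A_blocks = np.split(A, 3, axis=1)                        # N = 3 column blocks
x_hat = p_fista_lasso(A_blocks, b, lam=0.5, delta=1.0 / np.linalg.norm(A, 2) ** 2)
print("non-zeros in estimate:", np.count_nonzero(x_hat))
```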

  • Distributed Logistic Regression

    Figure: Algorithm-2, distributed logistic regression for scenario a



  • Parallel Greedy Coordinate Descent Method

    Coordinate Descent (CD): the objective function is optimized with respect to a chosen coordinate or block of coordinates while the rest stay fixed.

    How to choose the coordinates to update:

    Cyclic CD. Cycle through all coordinates in order.

    Random CD. Select coordinates at random; the analysis is in expectation over the objective.

    Greedy CD. Choose the coordinates with the best merit value (largest decrease of the objective).

    Mixed CD.


  • Greedy CD Is Better for Sparse Problems

    Under certain orthogonality conditions, such as RIP and incoherence, greedy selection rules are guaranteed to pick the coordinates that are non-zero in the final solution. The greedy selection of coordinates works as follows (a small sketch is given below):

    1 Recall Eq (7) and focus on a single coordinate i. Define the potential of this coordinate as

    $$d_i = \arg\min_d \; \lambda r(x_i + d) + g_i d + \frac{\beta}{2} d^2, \qquad (11)$$

    where $R(x) = \sum_{i=1}^{n} r(x_i)$, n is the number of coordinates, and $g_i$ is the i-th entry of the gradient $A^T \nabla L(Ax^k, b)$.

    2 Out of the N blocks, the P blocks whose best coordinates have the highest potential are selected.

    3 Update the best coordinate of each selected block.

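    A small sketch of this selection rule for $r = |\cdot|$, where Eq (11) has the closed-form solution $d_i = \mathrm{soft}(x_i - g_i/\beta,\; \lambda/\beta) - x_i$. Using $|d_i|$ as the merit value, the squared loss for the gradient, and $\beta = \|A\|_2^2$ are assumptions of this sketch:

```python
import numpy as np
from itertools import accumulate

def soft_threshold(t, tau):
    return np.sign(t) * np.maximum(np.abs(t) - tau, 0.0)

def coordinate_potentials(x, A, b, lam, beta):
    """d_i from Eq (11) for r = |.|: d_i = soft(x_i - g_i/beta, lam/beta) - x_i,
    where g = A^T (Ax - b) is the gradient of the squared loss."""
    g = A.T @ (A @ x - b)
    return soft_threshold(x - g / beta, lam / beta) - x

def greedy_block_selection(d, block_sizes, P):
    """From each block pick its best coordinate (largest |d_i|), then keep the
    P blocks whose best coordinates have the highest merit."""
    ends = list(accumulate(block_sizes))
    starts = [0] + ends[:-1]
    best = []                                      # (merit, global coordinate index)
    for s, e in zip(starts, ends):
        i = s + int(np.argmax(np.abs(d[s:e])))
        best.append((abs(d[i]), i))
    best.sort(reverse=True)
    return [i for _, i in best[:P]]                # coordinates to update this iteration

rng = np.random.default_rng(7)
A = rng.standard_normal((30, 40))
b = rng.standard_normal(30)
x = np.zeros(40)
d = coordinate_potentials(x, A, b, lam=0.1, beta=np.linalg.norm(A, 2) ** 2)
selected = greedy_block_selection(d, block_sizes=[10, 10, 10, 10], P=2)
x[selected] += d[selected]                         # update only the selected coordinates
print("selected coordinates:", selected)
```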

  • GRock for scenario b

    Figure: Algorithm-3, GRock for scenario b


  • Demonstration of GRock Divergence

    Figure: The contours are not aligned with the coordinate axes. Left: P = 1, optimize one coordinate per iteration; Right: P = 3.


  • Convergence

    This part mainly follows the derivations and theorems proposed in [5]. A block spectral radius is defined (a brute-force check is sketched below):

    $$\rho_P = \max_{M \in \mathcal{M}} \rho(M), \qquad (12)$$

    where $\mathcal{M}$ is the set of all P × P submatrices of $A^T A$ obtained by selecting one column from each of the P selected blocks, and $\rho(\cdot)$ is simply the maximal eigenvalue (spectral radius).

    A small ρ(M) means the selected P columns of A (scenario b) are nearly orthogonal to each other.

    The larger P is, the more likely ρ(M) is to be large.

    [5] Scherrer, Chad, et al. "Feature clustering for accelerating parallel coordinate descent." Advances in Neural Information Processing Systems. 2012.

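    For small problems, $\rho_P$ from Eq (12) can be checked by brute force (illustrative code; the column blocks are a hypothetical partition):

```python
import numpy as np
from itertools import combinations, product

def block_spectral_radius(A, block_cols, P):
    """Brute-force rho_P of Eq (12): over every choice of P blocks and one column
    from each chosen block, take the largest spectral radius of the P x P
    submatrix of A^T A."""
    G = A.T @ A
    rho = 0.0
    for blocks in combinations(block_cols, P):       # choose P of the blocks
        for cols in product(*blocks):                # one column index per block
            sub = G[np.ix_(cols, cols)]
            rho = max(rho, float(np.abs(np.linalg.eigvalsh(sub)).max()))
    return rho

rng = np.random.default_rng(8)
A = rng.standard_normal((20, 8))
A /= np.linalg.norm(A, axis=0)                       # normalize columns
block_cols = [range(0, 4), range(4, 8)]              # two blocks of four columns each
print(block_spectral_radius(A, block_cols, P=2))     # grows as columns become correlated
```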


  • GRock for Sparse Optimization

    CD algorithms usually need more time to converge because each iteration updates only a few coordinates.

    In sparse optimization, since most entries of the final solution are zero, we only need to focus on the non-zero coordinates.

    GRock selects coordinates corresponding to non-zero entries of the final solution, so it needs fewer iterations than other CD and proximal-linear algorithms.


  • GRock for Sparse Optimization

    Figure: A comparison of different coordinate descent methods and FISTA on a large dataset with 0.4% non-zero entries.


  • Experiments on Cluster

    Dataset I: 1024×2028; dataset II: 2048×4096.

    Problem to solve: LASSO, Eq(9).

    Figure: (a) Dataset I: cores vs. iteration number; (b) Dataset II: cores vs. iteration number.

  • Experiments on Cluster-continued

    Figure: (a) Dataset I: cores vs. time; (b) Dataset II: cores vs. time.


  • Complexity Analysis

    Each iteration of the distributed prox-linear algorithms (Algorithms 1 and 2) takes O(mn/N) computation, dominated by two matrix-vector multiplications, plus O(n log N) communication for the MPI Allreduce that assembles $\sum_{i=1}^{N} A_i x_i$ from all nodes.

    Each iteration of GRock takes O(mn/N + Pm) computation, breaking down into O(mn/N) for one matrix-vector multiplication and O(Pm) for updating the residual Ax − b, since only P coordinates are updated.


  • Summary

    Decomposing the matrix A leads to three scenarios, and this decomposition is the key to distributing proximal-linear algorithms across nodes.

    Greedy coordinate descent works very well for sparse optimization, provided the columns of A corresponding to the selected coordinates are strongly orthogonal.

    Based on this argument, a prerequisite condition for GRock to converge is proposed.

    GRock is straightforward to run in a parallel manner.

