Accelerated, Parallel and PROXimal coordinate descent (APPROX)
Peter Richtárik
IPAM, February 2014
(Joint work with Olivier Fercoq - arXiv:1312.5799)
Contributions
Variants of Randomized Coordinate Descent Methods
• Block: can operate on “blocks” of coordinates, as opposed to just individual coordinates
• General: applies to “general” (= smooth convex) functions, as opposed to special ones such as quadratics
• Proximal: admits a “nonsmooth regularizer” that is kept intact in solving subproblems; the regularizer is neither smoothed nor approximated
• Parallel: operates on multiple blocks / coordinates in parallel, as opposed to just 1 block / coordinate at a time
• Accelerated: achieves O(1/k^2) convergence rate for convex functions, as opposed to O(1/k)
• Efficient: avoids adding two full feature vectors
Brief History of Randomized Coordinate Descent Methods
+ new long stepsizes
Introduction
I. Block Structure
II. Block Sampling
III. Proximal Setup
IV. Fast or Normal?
I. Block Structure
N = # coordinates (variables)
n = # blocks
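As a rough illustration of the notation above, the sketch below splits N coordinates into n blocks and reads off one block of a vector. The contiguous, near-equal partition and the helper names `make_blocks` and `block_of` are assumptions made for this example; the slides do not prescribe how coordinates are grouped.

```python
def make_blocks(N, n):
    """Partition coordinate indices 0..N-1 into n contiguous blocks (hypothetical helper)."""
    if not 1 <= n <= N:
        raise ValueError("need 1 <= n <= N")
    base, extra = divmod(N, n)
    blocks, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # the first `extra` blocks get one more coordinate
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


def block_of(x, blocks, i):
    """Return the coordinates of x that belong to block i."""
    return [x[j] for j in blocks[i]]


blocks = make_blocks(N=7, n=3)   # N = # coordinates, n = # blocks
x = [10, 11, 12, 13, 14, 15, 16]
print(blocks)
print(block_of(x, blocks, 1))
```

With N = n every block has size 1 and the method reduces to plain coordinate descent.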
II. Block Sampling
Block sampling
Average # blocks selected by the sampling
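A block sampling can be pictured as a random subset of the n blocks. The sketch below shows two common choices and estimates the average number of blocks each one selects. The function names are hypothetical, and the choice of these two samplings (a "tau-nice" one, drawing exactly tau blocks uniformly, and an independent one) is an assumption for illustration rather than something recovered from the slide.

```python
import random


def tau_nice_sampling(n, tau, rng):
    """Draw a set of exactly tau distinct blocks out of n, uniformly at random."""
    return set(rng.sample(range(n), tau))


def independent_sampling(probs, rng):
    """Include block i independently with probability probs[i]."""
    return {i for i, p in enumerate(probs) if rng.random() < p}


rng = random.Random(0)          # fixed seed so the run is reproducible
n, tau, trials = 10, 3, 20000

# tau-nice: every draw has exactly tau blocks, so the average is tau.
sizes = [len(tau_nice_sampling(n, tau, rng)) for _ in range(trials)]
avg_nice = sum(sizes) / trials

# independent: the expected number of selected blocks is sum(probs).
probs = [0.3] * n
avg_indep = sum(len(independent_sampling(probs, rng)) for _ in range(trials)) / trials

print(avg_nice, avg_indep)
```

Both samplings select 3 blocks on average here, but only the first does so in every single iteration.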
III. Proximal Setup
Loss: Convex & Smooth
Regularizer: Convex & Nonsmooth
Loss Functions: Examples
• Quadratic loss
• L-infinity
• L1 regression
• Exponential loss
• Logistic loss
• Square hinge loss
(BKBG’11, RT’11b, TBRS’13, RT’13a, FR’13)
Regularizers: Examples
• No regularizer
• Weighted L1 norm (e.g., LASSO)
• Weighted L2 norm
• Box constraints (e.g., SVM dual)
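For blocks of size 1, "keeping the regularizer intact" amounts to applying its proximal operator to a single number. The sketch below gives the standard closed forms for three of the regularizers listed above. The function names and the `step` argument are assumptions for illustration; the weighted L2 case is left out because the slide does not say whether the norm is squared.

```python
def prox_none(z, step):
    """No regularizer: the proximal step leaves z unchanged."""
    return z


def prox_weighted_l1(z, step, lam):
    """Regularizer lam * |u|: soft-thresholding at level step * lam."""
    t = step * lam
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def prox_box(z, step, lo, hi):
    """Indicator of the interval [lo, hi]: projection, independent of step."""
    return min(max(z, lo), hi)


print(prox_none(2.5, 0.1))
print(prox_weighted_l1(3.0, 0.5, 2.0), prox_weighted_l1(-0.4, 0.5, 2.0))
print(prox_box(1.7, 1.0, 0.0, 1.0))
```

Each operator costs O(1) per coordinate, which is why the regularizer does not need to be smoothed or approximated.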
The Algorithm
APPROX
Olivier Fercoq and P.R. Accelerated, parallel and proximal coordinate descent, arXiv:1312.5799, December 2013
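The algorithm itself is not preserved in the transcript. The sketch below is a naive rendering of the iteration as recalled from the cited paper, specialized to LASSO with blocks of size 1; it should be checked against arXiv:1312.5799 before being relied on. Assumptions made here: tau blocks are drawn uniformly without replacement, the stepsize parameters are the conservative choice v_i = beta * L_i with beta = 1 + (omega - 1)(tau - 1) / max(1, n - 1) in the style of the PCDM paper, and A has no zero columns. All function names are hypothetical. The code forms full vectors in every iteration, so it does not have the "Efficient" property claimed for the actual method.

```python
import math
import random


def soft_threshold(z, t):
    """Proximal operator of t * |.| applied to the scalar z."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def lasso_objective(A, b, x, lam):
    """0.5 * ||Ax - b||^2 + lam * ||x||_1 for a dense list-of-rows matrix A."""
    r = [sum(a * xi for a, xi in zip(row, x)) - bj for row, bj in zip(A, b)]
    return 0.5 * sum(rj * rj for rj in r) + lam * sum(abs(xi) for xi in x)


def approx_lasso(A, b, lam, tau, iters, seed=0):
    """Naive accelerated parallel proximal coordinate descent sketch for LASSO."""
    m, n = len(A), len(A[0])
    rng = random.Random(seed)
    # Assumed stepsize parameters: v_i = beta * L_i, with L_i = ||A[:, i]||^2.
    L = [sum(A[j][i] ** 2 for j in range(m)) for i in range(n)]
    omega = max(sum(1 for a in row if a != 0.0) for row in A)  # max nonzeros per row
    beta = 1.0 + (omega - 1) * (tau - 1) / max(1, n - 1)
    v = [beta * Li for Li in L]

    x = [0.0] * n
    z = x[:]
    theta = tau / n
    for _ in range(iters):
        # Extrapolated point y = (1 - theta) x + theta z.
        y = [(1 - theta) * xi + theta * zi for xi, zi in zip(x, z)]
        r = [sum(a * yi for a, yi in zip(row, y)) - bj for row, bj in zip(A, b)]
        S = rng.sample(range(n), tau)          # tau coordinates, uniformly at random
        z_new = z[:]
        for i in S:
            g = sum(A[j][i] * r[j] for j in range(m))   # partial derivative at y
            c = n * theta * v[i] / tau                  # curvature of the proximal step
            z_new[i] = soft_threshold(z[i] - g / c, lam / c)
        x = [yi + (n / tau) * theta * (zn - zi) for yi, zn, zi in zip(y, z_new, z)]
        z = z_new
        theta = (math.sqrt(theta ** 4 + 4 * theta ** 2) - theta ** 2) / 2
    return x


rng = random.Random(1)
m, n = 30, 10
A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
x_true = [1.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0]
b = [sum(a * xi for a, xi in zip(row, x_true)) for row in A]
lam = 0.1
x_hat = approx_lasso(A, b, lam, tau=3, iters=500)
print(lasso_objective(A, b, [0.0] * n, lam), lasso_objective(A, b, x_hat, lam))
```

The two printed numbers are the objective at the starting point and after 500 iterations; the iterates are not guaranteed to decrease monotonically, which is typical of accelerated schemes.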
Part B: GRADIENT METHODS
B1 GRADIENT DESCENT
B2 PROJECTED GRADIENT DESCENT
B3 PROXIMAL GRADIENT DESCENT (ISTA)
B4 FAST PROXIMAL GRADIENT DESCENT (FISTA)
Part C: RANDOMIZED COORDINATE DESCENT
C1 PROXIMAL COORDINATE DESCENT
C2 PARALLEL COORDINATE DESCENT
C3 DISTRIBUTED COORDINATE DESCENT
C4 FAST PARALLEL COORDINATE DESCENT (new)
PCDM
P.R. and Martin Takac. Parallel coordinate descent methods for big data optimization, arXiv:1212.0873, December 2012
IMA Fox Prize in Numerical Analysis, 2013
2D Example
Convergence Rate
Theorem [Fercoq & R. 12/2013]
Quantities in the bound: number of blocks, number of iterations, and average number of coordinates updated per iteration.
The bound implies an iteration-complexity guarantee.
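The formula itself did not survive in the transcript. As recalled from the cited paper (arXiv:1312.5799), and to be checked against it, the statement has the following form. The symbols are assumptions of this note: n is the number of blocks, k the iteration counter, τ the average number of blocks updated per iteration, F the objective (loss plus regularizer) with minimizer x* and optimal value F*, and ‖·‖_v the norm weighted by the stepsize parameters v.

$$
\mathbb{E}\big[F(x_k) - F^*\big] \;\le\; \frac{4n^2}{\big((k-1)\tau + 2n\big)^2}\, C,
\qquad
C = \Big(1 - \frac{\tau}{n}\Big)\big(F(x_0) - F^*\big) + \frac{1}{2}\,\|x_0 - x^*\|_v^2 .
$$

Solving the right-hand side for a target accuracy ε > 0 gives the iteration count:

$$
k \;\ge\; \frac{2n}{\tau}\left(\sqrt{\frac{C}{\varepsilon}} - 1\right) + 1
\quad\Longrightarrow\quad
\mathbb{E}\big[F(x_k) - F^*\big] \le \varepsilon .
$$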
Special Case: Fully Parallel Variant
All blocks are updated in each iteration.
Quantities in the bound: normalized weights (summing to n) and number of iterations.
The bound again implies an iteration-complexity guarantee.
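A short worked derivation of this special case, under the assumption that the general bound of the cited paper reads E[F(x_k) − F*] ≤ 4n² C / ((k−1)τ + 2n)² with C = (1 − τ/n)(F(x_0) − F*) + ½‖x_0 − x*‖_v² (recalled from arXiv:1312.5799, not recovered from the slide). Setting τ = n removes the first term of C and simplifies the leading factor:

$$
\frac{4n^2}{\big((k-1)n + 2n\big)^2} = \frac{4}{(k+1)^2},
\qquad
C = \frac{1}{2}\,\|x_0 - x^*\|_v^2
\quad\Longrightarrow\quad
F(x_k) - F^* \;\le\; \frac{2\,\|x_0 - x^*\|_v^2}{(k+1)^2} .
$$

Hence accuracy ε is reached once

$$
k \;\ge\; \sqrt{\frac{2}{\varepsilon}}\;\|x_0 - x^*\|_v - 1 .
$$

The expectation is dropped because updating all blocks in every iteration leaves nothing random.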
New Stepsizes
Expected Separable Overapproximation (ESO): How to Choose Block Stepsizes?
P.R. and Martin Takac. Parallel coordinate descent methods for big data optimization, arXiv:1212.0873, December 2012
Olivier Fercoq and P.R. Smooth minimization of nonsmooth functions by parallel coordinate descent methods, arXiv:1309.5885, September 2013 (SPCDM)
P.R. and Martin Takac. Distributed coordinate descent methods for learning with big data, arXiv:1310.2059, October 2013
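For reference, the ESO inequality is usually stated as below in the cited papers; this is recalled from them rather than recovered from the slide, so the notation is an assumption. Here Ŝ is the random set of blocks, h_[Ŝ] is the vector h with all blocks outside Ŝ set to zero, and v = (v_1, ..., v_n) are positive parameters that play the role of inverse block stepsizes.

$$
\mathbb{E}\Big[f\big(x + h_{[\hat S]}\big)\Big]
\;\le\;
f(x) + \frac{\mathbb{E}|\hat S|}{n}\Big(\langle \nabla f(x), h\rangle + \frac{1}{2}\,\|h\|_v^2\Big),
\qquad
\|h\|_v^2 = \sum_{i=1}^{n} v_i\,\|h^{(i)}\|^2 .
$$

Smaller admissible values of v_i mean longer steps, which is the sense in which new stepsizes can be "long".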
Assumptions: Function f
Example:
(a)
(b)
(c)
Visualizing Assumption (c)
New ESO
Theorem (Fercoq & R. 12/2013)
(i)
(ii)
Comparison with Other Stepsizes for Parallel Coordinate Descent Methods
Example:
Complexity for New Stepsizes
Average degree of separability
“Average” of the Lipschitz constants
With the new stepsizes, we have:
Work in 1 Iteration
Cost of 1 Iteration of APPROX
Assume N = n (all blocks are of size 1) and that the loss is built from a sparse matrix A and scalar functions whose derivatives cost O(1) to evaluate.
Then the average cost of 1 iteration of APPROX, in arithmetic ops, is expressed in terms of the average # nonzeros in a column of A.
Bottleneck: Computation of Partial Derivatives
maintained
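One common way to keep partial derivatives cheap is to maintain a residual vector and update it only along the nonzeros of the touched column. The sketch below illustrates this for the quadratic loss 0.5 * ||Ax - b||^2 with a column-wise sparse storage. The loss, the storage format and the function names are assumptions for illustration; the slide only indicates that some quantity is "maintained".

```python
def partial_derivative(col, r):
    """d/dx_i of 0.5 * ||Ax - b||^2, given column i of A as (row, value) pairs and r = Ax - b."""
    return sum(a * r[j] for j, a in col)


def update_coordinate(col, r, x, i, delta):
    """Apply x_i += delta and keep r = Ax - b in sync, touching only the nonzeros of column i."""
    x[i] += delta
    for j, a in col:
        r[j] += a * delta


# A = [[1, 0], [2, 3], [0, 4]] stored by columns as (row, value) pairs.
cols = [[(0, 1.0), (1, 2.0)], [(1, 3.0), (2, 4.0)]]
b = [1.0, 2.0, 3.0]
x = [0.0, 0.0]
r = [-bj for bj in b]                              # residual at x = 0

g0 = partial_derivative(cols[0], r)                # reads 2 nonzeros, not the whole matrix
update_coordinate(cols[0], r, x, 0, -g0 / 5.0)     # 5.0 = squared norm of column 0
print(x, r)
```

Both the derivative and the update cost a number of operations proportional to the nonzeros in one column, which matches the per-iteration cost measure used above.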
Preliminary Experiments
L1 Regularized L1 Regression
Dorothea dataset:
Gradient Method
Nesterov’s Accelerated Gradient Method
SPCDM
APPROX
L1 Regularized Least Squares (LASSO)
KDDB dataset:
PCDM
APPROX
Training Linear SVMs
Malicious URL dataset:
Importance Sampling
with Importance Sampling
Zheng Qu and P.R. Accelerated coordinate descent with importance sampling, Manuscript 2014
P.R. and Martin Takac. On optimal probabilities in stochastic coordinate descent methods, arXiv:1310.3438, 2013
Convergence Rate
Theorem [Qu & R. 2014]
Serial Case: Optimal Probabilities
Nonuniform serial sampling:
Optimal Probabilities vs. Uniform Probabilities
Extra 40 Slides