![Page 1: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/1.jpg)
Peter Richtárik (joint work with Martin Takáč)
Distributed Coordinate Descent Method
AmpLab All Hands Meeting - Berkeley - October 29, 2013
![Page 2: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/2.jpg)
Randomized Coordinate Descent
in 2D
![Page 3: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/3.jpg)
2D Optimization
Contours of a function
Goal: Find the minimizer
a2 =b2
![Page 4: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/4.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
[Figure: contour plot with compass directions N, S, E, W]
![Page 5: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/5.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
[Figure: contour plot with compass directions N, S, E, W; step 1 marked]
![Page 6: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/6.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
[Figure: contour plot with compass directions N, S, E, W; steps 1 to 2 marked]
![Page 7: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/7.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
[Figure: contour plot with compass directions N, S, E, W; steps 1 to 3 marked]
![Page 8: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/8.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
[Figure: contour plot with compass directions N, S, E, W; steps 1 to 4 marked]
![Page 9: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/9.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
[Figure: contour plot with compass directions N, S, E, W; steps 1 to 5 marked]
![Page 10: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/10.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
[Figure: contour plot with compass directions N, S, E, W; steps 1 to 6 marked]
![Page 11: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/11.jpg)
Randomized Coordinate Descent in 2D
a2 =b2
[Figure: contour plot with compass directions N, S, E, W; steps 1 to 7 marked]
SOLVED!
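The stepping pattern in these frames can be sketched in a few lines of Python. This is a minimal illustration, not the function drawn on the slides: the quadratic `f`, the starting point, and the helper names `minimize_along` and `randomized_cd` are all assumptions chosen so that each one-dimensional minimization has a closed form.

```python
import random

def f(x):
    # Hypothetical convex quadratic with elliptical contours (not the slide's function).
    return 2.0 * x[0] ** 2 + x[1] ** 2 + x[0] * x[1] - 4.0 * x[0] - 3.0 * x[1]

def minimize_along(x, i):
    """Exactly minimize f along coordinate i, keeping the other coordinate fixed."""
    x = list(x)
    if i == 0:
        # Partial derivative in x0: 4*x0 + x1 - 4 = 0
        x[0] = (4.0 - x[1]) / 4.0
    else:
        # Partial derivative in x1: 2*x1 + x0 - 3 = 0
        x[1] = (3.0 - x[0]) / 2.0
    return x

def randomized_cd(x0, steps, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    history = [f(x)]
    for _ in range(steps):
        i = rng.randrange(2)  # pick the E-W (0) or N-S (1) direction uniformly at random
        x = minimize_along(x, i)
        history.append(f(x))
    return x, history

x_final, history = randomized_cd([-3.0, 4.0], steps=60)
print(x_final, history[-1])
```

Each step moves only east-west or north-south, and the function value never increases. For this particular quadratic the iterates approach the minimizer (5/7, 8/7).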
![Page 12: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/12.jpg)
Convergence of Randomized Coordinate Descent
Strongly convex f
Smooth or ‘simple’ nonsmooth f
‘difficult’ nonsmooth f
Focus on d
(big data = big d)
![Page 13: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/13.jpg)
Parallelization Dream
Serial Parallel
In reality we get something in between
![Page 14: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/14.jpg)
How (not) to Parallelize Coordinate Descent
![Page 15: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/15.jpg)
“Naive” parallelization
Do the same thing as before, but with more or all coordinates
and add up the updates
![Page 16: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/16.jpg)
Failure of naive parallelization
[Figure: points labeled 0, 1a, 1b]
![Page 17: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/17.jpg)
Failure of naive parallelization
[Figure: points labeled 0, 1a, 1b, 1]
![Page 18: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/18.jpg)
Failure of naive parallelization
[Figure: points labeled 1, 2a, 2b]
![Page 19: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/19.jpg)
Failure of naive parallelization
[Figure: points labeled 1, 2a, 2b, 2]
![Page 20: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/20.jpg)
Failure of naive parallelization
[Figure: point labeled 2]
OOPS!
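A small numeric example makes the failure concrete. The function below is an assumption, not necessarily the one pictured on the slides: it couples the two coordinates completely, so each coordinate's exact update already solves the problem on its own, and adding both updates overshoots.

```python
def f(x):
    # Hypothetical non-separable example: the two coordinates are fully coupled.
    return (x[0] + x[1] - 1.0) ** 2

def coordinate_updates(x):
    """Exact minimizing step along each coordinate, both computed from the same point x."""
    gap = 1.0 - x[0] - x[1]
    return [gap, gap]  # step for coordinate 0, step for coordinate 1

def naive_parallel_step(x):
    h = coordinate_updates(x)
    return [x[0] + h[0], x[1] + h[1]]  # add up both updates

x = [0.0, 0.0]
values = [f(x)]
for _ in range(4):
    x = naive_parallel_step(x)
    values.append(f(x))

print(values)
```

The iterates bounce between (0, 0) and (1, 1), and the function value stays at 1 forever.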
![Page 21: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/21.jpg)
Idea: averaging updates may help
[Figure: points labeled 0, 1a, 1b, 1]
SOLVED!
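Averaging the per-coordinate updates, rather than adding them, can be sketched as follows. The coupled function `f` is again an assumed example rather than the slide's own; on it, a single averaged step lands exactly on a minimizer.

```python
def f(x):
    # Hypothetical non-separable example: the two coordinates are fully coupled.
    return (x[0] + x[1] - 1.0) ** 2

def coordinate_updates(x):
    """Exact minimizing step along each coordinate, both computed from the same point x."""
    gap = 1.0 - x[0] - x[1]
    return [gap, gap]

def averaged_parallel_step(x):
    h = coordinate_updates(x)
    n = len(h)
    return [x[i] + h[i] / n for i in range(n)]  # average instead of sum

x1 = averaged_parallel_step([0.0, 0.0])
print(x1, f(x1))
```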
![Page 22: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/22.jpg)
Averaging can be too conservative
[Figure: points labeled 0, 1a, 1b, 1, 2a, 2b, 2]
and so on...
![Page 23: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/23.jpg)
Averaging may be too conservative
WANT
BAD!!!
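The opposite extreme can be illustrated with a fully separable function (an assumed example, not taken from the slides). Here adding the updates solves the problem in one step, while averaging only halves the remaining distance at every iteration.

```python
def f(x):
    # Hypothetical separable example: each coordinate can be minimized independently.
    return (x[0] - 1.0) ** 2 + (x[1] - 1.0) ** 2

def coordinate_updates(x):
    """Exact minimizing step along each coordinate, both computed from the same point x."""
    return [1.0 - x[0], 1.0 - x[1]]

def step(x, scale):
    h = coordinate_updates(x)
    return [x[i] + scale * h[i] for i in range(len(x))]

# Summing (scale = 1) reaches the minimizer (1, 1) immediately.
summed = step([0.0, 0.0], scale=1.0)

# Averaging (scale = 1/2) creeps towards it: 0.5, 0.75, 0.875, and so on.
x = [0.0, 0.0]
averaged_path = []
for _ in range(3):
    x = step(x, scale=0.5)
    averaged_path.append(x[0])

print(summed, averaged_path)
```

Neither fixed rule works for both examples, which suggests the right amount of damping depends on how strongly the coordinates are coupled.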
![Page 24: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/24.jpg)
Minimizing Regularized Loss
![Page 25: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/25.jpg)
Minimizing Regularized Loss
Loss: convex (smooth)
Regularizer: convex (smooth or nonsmooth)
- separable
- allowed to take the value +∞
![Page 26: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/26.jpg)
Regularizer: examples
No regularizer
Weighted L1 norm (e.g., LASSO)
Weighted L2 norm
Box constraints (e.g., SVM dual)
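Because the regularizer is separable, a coordinate update only needs a one-dimensional proximal step. The sketch below gives the closed forms for the four examples under stated assumptions: the function names are hypothetical, `step` is the step length, `lam` is the weight of the coordinate, and "weighted L2 norm" is read here as the squared term `(lam / 2) * t ** 2`.

```python
import math

def prox_none(v, step):
    """No regularizer: the proximal step is the identity."""
    return v

def prox_weighted_l1(v, step, lam):
    """Regularizer lam * |t|: soft-thresholding by step * lam."""
    return math.copysign(max(abs(v) - step * lam, 0.0), v)

def prox_weighted_l2_squared(v, step, lam):
    """Regularizer (lam / 2) * t**2 (assumed squared form): shrink towards zero."""
    return v / (1.0 + step * lam)

def prox_box(v, step, lower, upper):
    """Indicator of the interval [lower, upper]: project onto the interval."""
    return min(max(v, lower), upper)

print(prox_weighted_l1(3.0, 1.0, 1.0))          # 2.0
print(prox_weighted_l2_squared(3.0, 1.0, 2.0))  # 1.0
print(prox_box(2.0, 1.0, 0.0, 1.0))             # 1.0
```

The box case is where an infinite-valued regularizer is needed: the indicator is zero inside the interval and +∞ outside it.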
![Page 27: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/27.jpg)
Structure of f
Considered in [BKBG, ICML 2011]
![Page 28: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/28.jpg)
Loss: examples
Quadratic loss
L-infinity
L1 regression
Exponential loss
Logistic loss
Square hinge loss
BKBG’11, RT’11b, TBRS’13, RT’13a, FR’13
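For orientation, the listed losses can be written out in their common textbook forms. The exact scaling and sign conventions used on the slide are not given, so every definition below is an assumption: `A` is a list of data rows, `b` holds targets (or ±1 labels for the classification losses), and the helper names are hypothetical.

```python
import math

def residuals(A, x, b):
    """Regression residuals a_j^T x - b_j for every row a_j of A."""
    return [sum(a * xi for a, xi in zip(row, x)) - bj for row, bj in zip(A, b)]

def margins(A, x, b):
    """Classification margins b_j * a_j^T x for labels b_j in {-1, +1}."""
    return [bj * sum(a * xi for a, xi in zip(row, x)) for row, bj in zip(A, b)]

def quadratic_loss(A, x, b):
    return 0.5 * sum(r * r for r in residuals(A, x, b))

def l_infinity_loss(A, x, b):
    return max(abs(r) for r in residuals(A, x, b))

def l1_regression_loss(A, x, b):
    return sum(abs(r) for r in residuals(A, x, b))

def exponential_loss(A, x, b):
    return sum(math.exp(-m) for m in margins(A, x, b))

def logistic_loss(A, x, b):
    return sum(math.log(1.0 + math.exp(-m)) for m in margins(A, x, b))

def square_hinge_loss(A, x, b):
    return 0.5 * sum(max(0.0, 1.0 - m) ** 2 for m in margins(A, x, b))

A = [[1.0, 0.0], [0.0, 1.0]]
b = [1.0, -1.0]
x = [0.0, 0.0]
print(quadratic_loss(A, x, b), l1_regression_loss(A, x, b), logistic_loss(A, x, b))
```

Of these, the L-infinity and L1 regression losses are nonsmooth, while the others are smooth.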
![Page 29: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/29.jpg)
Distributed Coordinate Descent Method
![Page 30: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/30.jpg)
I. Distribution of Data
d = # features / variables / coordinates
Data matrix
![Page 31: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/31.jpg)
II. Choice of Coordinates
![Page 32: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/32.jpg)
II. Choice of Coordinates
Random set of coordinates (‘sampling’)
![Page 33: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/33.jpg)
III. Computing Updates to Selected Coordinates
Random set of coordinates (‘sampling’)
Current iterate New iterate
Update to i-th coordinate
All nodes need to be able to compute this (communication)
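Steps I to III can be simulated in a single process to see how they fit together. The sketch below is not the authors' implementation. It assumes a LASSO objective, an equal contiguous split of the d coordinates over c nodes, and a deliberately conservative stepsize parameter `beta = c * tau` (the number of coordinates updated per iteration). That value is safe by convexity but is not the tuned value from the theorem, and all function names are hypothetical.

```python
import random

def soft_threshold(v, t):
    """Proximal operator of t * |.| (closed-form L1 shrinkage)."""
    return max(v - t, 0.0) - max(-v - t, 0.0)

def residual(A, b, x):
    return [sum(a * xi for a, xi in zip(row, x)) - bj for row, bj in zip(A, b)]

def lasso_objective(A, b, x, lam):
    r = residual(A, b, x)
    return 0.5 * sum(ri * ri for ri in r) + lam * sum(abs(xi) for xi in x)

def distributed_cd(A, b, lam, c, tau, iters, seed=0):
    n, d = len(A), len(A[0])
    rng = random.Random(seed)
    # I. Distribution of data: node k owns one block of coordinates (columns of A).
    block = d // c  # assumes c divides d
    partitions = [list(range(k * block, (k + 1) * block)) for k in range(c)]
    L = [sum(A[j][i] ** 2 for j in range(n)) for i in range(d)]  # coordinate curvatures
    beta = float(c * tau)  # conservative assumption, see the lead-in
    x = [0.0] * d
    r = residual(A, b, x)  # shared quantity every node needs (the communication step)
    history = [lasso_objective(A, b, x, lam)]
    for _ in range(iters):
        # II. Choice of coordinates: each node samples tau of its own coordinates.
        chosen = [i for part in partitions for i in rng.sample(part, tau)]
        # III. Updates are all computed from the same current iterate, then applied.
        updates = {}
        for i in chosen:
            if L[i] == 0.0:
                continue
            grad_i = sum(A[j][i] * r[j] for j in range(n))
            step = 1.0 / (beta * L[i])
            updates[i] = soft_threshold(x[i] - step * grad_i, step * lam) - x[i]
        for i, h in updates.items():
            x[i] += h
            for j in range(n):
                r[j] += A[j][i] * h
        history.append(lasso_objective(A, b, x, lam))
    return x, history

# Small synthetic problem (hypothetical data, for illustration only).
data_rng = random.Random(1)
n, d = 8, 6
A = [[data_rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
x_true = [1.0, -2.0, 0.0, 0.0, 0.0, 0.0]
b = [sum(a * xi for a, xi in zip(row, x_true)) for row in A]
x_hat, history = distributed_cd(A, b, lam=0.1, c=2, tau=2, iters=200)
print(history[0], history[-1])
```

With this conservative `beta` the objective cannot increase from one iteration to the next. A smaller, data-dependent `beta` allows longer steps, which is the gap the theory aims to close.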
![Page 34: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/34.jpg)
Iteration Complexity
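Theorem [RT’13]
[Formula: a lower bound on the number of iterations implies an accuracy guarantee]
Annotated quantities: # coordinates, # nodes, # coordinates updated / node, strong convexity constant of the regularizer, strong convexity constant of the loss f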
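The formula itself is not part of the slide text. A plausible reconstruction from the annotated quantities is sketched below and should be checked against [RT’13] before being relied on. All symbols are assumptions: d is the number of coordinates, c the number of nodes, τ the number of coordinates updated per node, μ_Ω and μ_f the strong convexity constants of the regularizer and the loss, β the stepsize parameter, F the regularized objective, ε the target accuracy, and ρ the allowed failure probability.

```latex
k \;\ge\; \frac{d}{c\,\tau}\cdot\frac{\beta+\mu_\Omega}{\mu_f+\mu_\Omega}
      \cdot\log\!\left(\frac{F(x_0)-F^*}{\epsilon\,\rho}\right)
\quad\Longrightarrow\quad
\mathbb{P}\bigl(F(x_k)-F^*\le\epsilon\bigr)\;\ge\;1-\rho
```

Under this reading, the leading factor d / (cτ) is the number of iterations needed to touch every coordinate once on average, so doubling the number of nodes halves that factor as long as β does not grow.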
![Page 35: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/35.jpg)
Bad partitioning at most doubles # of iterations
spectral norm of the “partitioning”
Theorem [RT’13]
![Page 36: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/36.jpg)
Experiment 1
1 node (c = 1)
LASSO problem: n = 2 billion, d = 1 billion
![Page 37: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/37.jpg)
Coordinate Updates
![Page 38: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/38.jpg)
Iterations
![Page 39: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/39.jpg)
Wall Time
![Page 40: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/40.jpg)
Experiment 2
128 nodes (c = 512, 4096 cores)
LASSO problem: n = 1 billion, d = 0.5 billion
data size = 3 TB
![Page 41: Peter Richtárik (joint work with Martin Takáč) Distributed Coordinate Descent Method AmpLab All Hands Meeting - Berkeley - October 29, 2013](https://reader038.vdocuments.site/reader038/viewer/2022103022/56649d8b5503460f94a71917/html5/thumbnails/41.jpg)
LASSO: 3TB data + 128 nodes