Consensus Optimization and Machine Learning
Stephen Boyd and Steven Diamond
EE & CS Departments
Stanford University
H2O World, 11/10/2015
Outline
Convex optimization
Model fitting via convex optimization
Consensus optimization and model fitting
Outline
Convex optimization
Model fitting via convex optimization
Consensus optimization and model fitting
Convex optimization problem
convex optimization problem:

    minimize    f_0(x)
    subject to  f_i(x) ≤ 0,  i = 1, . . . , m
                Ax = b

• variable x ∈ R^n
• equality constraints are linear
• f_0, . . . , f_m are convex: for θ ∈ [0, 1],

    f_i(θx + (1 − θ)y) ≤ θ f_i(x) + (1 − θ) f_i(y)

  i.e., the f_i have nonnegative (upward) curvature
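The defining inequality can be spot-checked numerically for a candidate function. A minimal pure-Python sketch (sampling can only refute convexity, never prove it), using f(x) = |x| + x² as the example:

```python
# Spot-check the convexity inequality
#   f(theta*x + (1-theta)*y) <= theta*f(x) + (1-theta)*f(y)
# at a handful of sampled points.

def f(x):
    return abs(x) + x * x  # a sum of convex functions, hence convex

def inequality_holds(x, y, theta, tol=1e-12):
    lhs = f(theta * x + (1 - theta) * y)
    rhs = theta * f(x) + (1 - theta) * f(y)
    return lhs <= rhs + tol  # small slack for floating point

checks = [inequality_holds(x, y, t)
          for x in (-2.0, 0.5, 3.0)
          for y in (-1.0, 2.0)
          for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(all(checks))  # True
```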
Why convex optimization?
• we can solve convex optimization problems effectively
• there are lots of applications
Application areas
• machine learning, statistics
• finance
• supply chain, revenue management, advertising
• control
• signal and image processing, vision
• networking
• circuit design
• and many others . . .
Convex optimization solvers
• medium scale (1000s–10,000s of variables and constraints): interior-point methods on a single machine
• large scale (100k–1B variables and constraints): custom (often problem-specific) methods, e.g., SGD
• lots of ongoing research
• growing list of open-source solvers
Convex optimization modeling languages
• (new) high-level language support for convex optimization
  – describe the problem in a high-level language
  – the problem is compiled to standard form and solved
• implementations:
  – YALMIP, CVX (Matlab)
  – CVXPY (Python)
  – Convex.jl (Julia)
CVXPY
(Diamond & Boyd, 2013)
    minimize    ‖Ax − b‖₂² + γ‖x‖₁
    subject to  ‖x‖∞ ≤ 1

from cvxpy import *
import numpy

# problem data (illustrative values; any conforming A, b, gamma work)
m, n = 30, 20
numpy.random.seed(1)
A = numpy.random.randn(m, n)
b = numpy.random.randn(m)
gamma = 0.1

x = Variable(n)
cost = sum_squares(A*x - b) + gamma*norm(x, 1)
prob = Problem(Minimize(cost),
               [norm(x, "inf") <= 1])
opt_val = prob.solve()
solution = x.value
Example: Image in-painting
• guess pixel values in obscured/corrupted parts of the image
• total variation in-painting: choose pixel values x_ij ∈ R^3 to minimize the total variation

    TV(x) = Σ_ij ‖ ( x_{i+1,j} − x_{ij},  x_{i,j+1} − x_{ij} ) ‖₂

• a convex problem
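To make the objective concrete, here is a pure-Python evaluation of TV for a grayscale image (one channel instead of x_ij ∈ R^3; an illustrative sketch of the objective, not the in-painting solver itself):

```python
import math

def total_variation(x):
    """Isotropic total variation of a 2-D grayscale image x (list of lists):
    TV(x) = sum_ij || (x[i+1][j] - x[i][j], x[i][j+1] - x[i][j]) ||_2,
    summed over indices where both finite differences exist."""
    rows, cols = len(x), len(x[0])
    tv = 0.0
    for i in range(rows - 1):
        for j in range(cols - 1):
            dv = x[i + 1][j] - x[i][j]  # vertical difference
            dh = x[i][j + 1] - x[i][j]  # horizontal difference
            tv += math.hypot(dv, dh)
    return tv

flat = [[1.0, 1.0], [1.0, 1.0]]
edge = [[0.0, 1.0], [0.0, 1.0]]
print(total_variation(flat))  # constant image: 0.0
print(total_variation(edge))  # one unit jump: 1.0
```

Minimizing TV subject to matching the known pixels favors reconstructions that are smooth except along a small set of edges.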
Example
512 × 512 color image (n ≈ 800,000 variables)
[figure: original and corrupted images]
Example
[figure: original and recovered images]
Example
80% of pixels removed
[figure: original and corrupted images]
Example
80% of pixels removed
[figure: original and recovered images]
Outline
Convex optimization
Model fitting via convex optimization
Consensus optimization and model fitting
Predictor
• given data (x_i, y_i), i = 1, . . . , m
• x is the feature vector, y is the outcome or label
• find a predictor ψ so that

    y ≈ ŷ = ψ(x)  for data (x, y) that you haven't seen

• ψ is a regression model for y ∈ R
• ψ is a classifier for y ∈ {−1, 1}
Loss minimization predictor
• predictor parametrized by θ ∈ R^n
• loss function L(x_i, y_i, θ) gives the misfit for data point (x_i, y_i)
• for given θ, the predictor is

    ψ(x) = argmin_y L(x, y, θ)

• how do we choose the parameter θ?
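For squared loss L(x, y, θ) = (θᵀx − y)², the argmin over y is just θᵀx, which a grid search makes visible (a pure-Python sketch; the values and grid are hypothetical):

```python
def loss(x, y, theta):
    """Squared loss L(x, y, theta) = (theta^T x - y)^2."""
    pred = sum(t * xi for t, xi in zip(theta, x))
    return (pred - y) ** 2

def psi(x, theta, grid):
    """Loss-minimization predictor: argmin over candidate outcomes y."""
    return min(grid, key=lambda y: loss(x, y, theta))

theta = [0.5, -1.0]
x = [2.0, 1.0]
grid = [i / 100 for i in range(-300, 301)]  # candidate y values
# theta^T x = 0.5*2 - 1*1 = 0.0, so the predictor returns 0.0
print(psi(x, theta, grid))
```

For a classifier the same argmin over y ∈ {−1, 1} yields sign(θᵀx).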
Model fitting via regularized loss minimization
• choose θ by minimizing the regularized loss

    (1/m) Σ_{i=1}^m L(x_i, y_i, θ) + λ r(θ)

• regularization r(θ) penalizes model complexity, enforces constraints, or represents a prior
• λ > 0 scales the regularization
• for many useful cases, this is a convex problem
Examples
predictor             L(x, y, θ)              ψ(x)         r(θ)
---------             ----------              ----         ----
least-squares         (θᵀx − y)²              θᵀx          0
ridge regression      (θᵀx − y)²              θᵀx          ‖θ‖₂²
lasso                 (θᵀx − y)²              θᵀx          ‖θ‖₁
logistic classifier   log(1 + exp(−yθᵀx))     sign(θᵀx)    0
SVM                   (1 − yθᵀx)₊             sign(θᵀx)    ‖θ‖₂²

• can mix and match, e.g., r(θ) = ‖θ‖₁ sparsifies
• all lead to convex fitting problems
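The loss columns of the table translate directly to code; a pure-Python evaluation at one data point (the example values of θ and x are made up):

```python
import math

def dot(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def square_loss(x, y, theta):        # least-squares / ridge / lasso
    return (dot(theta, x) - y) ** 2

def logistic_loss(x, y, theta):      # logistic classifier, y in {-1, +1}
    return math.log(1 + math.exp(-y * dot(theta, x)))

def hinge_loss(x, y, theta):         # SVM: (1 - y theta^T x)_+
    return max(0.0, 1 - y * dot(theta, x))

theta, x = [1.0, -2.0], [3.0, 1.0]   # theta^T x = 1.0
print(square_loss(x, 2.0, theta))    # (1 - 2)^2 = 1.0
print(hinge_loss(x, 1.0, theta))     # (1 - 1)_+ = 0.0
print(hinge_loss(x, -1.0, theta))    # (1 + 1)_+ = 2.0
```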
Robust (Huber) regression
• loss L(x, y, θ) = φhub(θᵀx − y)
• φhub is the Huber function (with threshold M > 0):

    φhub(u) = { u²,            |u| ≤ M
              { 2M|u| − M²,    |u| > M

• same as least-squares for small residuals, but allows (some) large residuals
• and so, robust to outliers
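The Huber penalty is one line of code; a minimal sketch with M = 1:

```python
def huber(u, M=1.0):
    """Huber penalty: u^2 for |u| <= M, linear 2*M*|u| - M^2 beyond."""
    return u * u if abs(u) <= M else 2 * M * abs(u) - M * M

print(huber(0.5))   # quadratic region: 0.25
print(huber(3.0))   # linear region: 2*1*3 - 1 = 5.0
print(huber(-3.0))  # symmetric: 5.0
```

Because the penalty grows only linearly beyond M, a single outlier shifts the fit by a bounded amount, unlike the quadratic loss.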
Example
• m = 450 measurements, n = 300 regressors
• choose θ^true; x_i ∼ N(0, I)
• set y_i = (θ^true)ᵀ x_i + ε_i, with ε_i ∼ N(0, 1)
• with probability p, replace y_i with −y_i
• data has a fraction p of (non-obvious) wrong measurements
• the distributions of 'good' and 'bad' y_i are the same
• try to recover θ^true ∈ R^n from the measurements y ∈ R^m
• 'prescient' version: we know which measurements are wrong
Example
50 problem instances, p varying from 0 to 0.15
Example
Quantile regression
• quantile regression: use the tilted ℓ1 loss

    L(x, y, θ) = τ(r)₊ + (1 − τ)(r)₋,   with r = θᵀx − y, τ ∈ (0, 1)

• τ = 0.5: equal penalty for over- and under-estimating
• τ = 0.1: 9× more penalty for under-estimating
• τ = 0.9: 9× more penalty for over-estimating
• the τ-quantile of the residuals is zero
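The tilted ℓ1 loss, and a numerical check of the quantile property: minimizing the summed loss over a constant prediction picks out (approximately) the τ-quantile of the data. A pure-Python sketch on made-up data:

```python
def tilted_l1(r, tau):
    """Tilted l1 penalty: tau * (r)_+ + (1 - tau) * (r)_-."""
    return tau * max(r, 0.0) + (1 - tau) * max(-r, 0.0)

data = [float(v) for v in range(1, 10)]   # 1.0 .. 9.0
grid = [i / 10 for i in range(0, 101)]    # candidate constant predictions

def best_constant(tau):
    """Constant c minimizing the summed tilted loss of residuals y - c."""
    return min(grid, key=lambda c: sum(tilted_l1(y - c, tau) for y in data))

print(best_constant(0.5))  # 5.0, the median
print(best_constant(0.9))  # 9.0, the 0.9-quantile
```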
Example
• time series x_t, t = 0, 1, 2, . . .
• auto-regressive predictor:

    x̂_{t+1} = θᵀ(1, x_t, . . . , x_{t−M})

• M = 10 is the memory of the predictor
• use quantile regression for τ = 0.1, 0.5, 0.9
• at each time t, gives three one-step-ahead predictions:

    x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9}
Example
time series x_t
Example
x_t and predictions x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9} (training set, t = 0, . . . , 399)
Example
x_t and predictions x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9} (test set, t = 400, . . . , 449)
Example
residual distributions for τ = 0.9, 0.5, and 0.1 (training set)
Example
residual distributions for τ = 0.9, 0.5, and 0.1 (test set)
Outline
Convex optimization
Model fitting via convex optimization
Consensus optimization and model fitting
Consensus optimization
• want to solve a problem with N objective terms

    minimize  Σ_{i=1}^N f_i(x)

  e.g., f_i is the loss function for the ith block of training data
• consensus form:

    minimize    Σ_{i=1}^N f_i(x_i)
    subject to  x_i − z = 0

  – x_i are local variables
  – z is the global variable
  – x_i − z = 0 are the consistency or consensus constraints
Consensus optimization via ADMM
with x̄^k = (1/N) Σ_{i=1}^N x_i^k (the average over the local variables):

    x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + (ρ/2) ‖x_i − x̄^k + u_i^k‖₂² )
    u_i^{k+1} := u_i^k + (x_i^{k+1} − x̄^{k+1})

• get the global minimum, under very general conditions
• u^k is a running sum of inconsistencies (PI control)
• minimizations carried out independently and in parallel
• coordination is via averaging of the local variables x_i
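The two updates above can be run on a toy problem in pure Python. Here f_i(x) = (x − a_i)², so the x-update has a closed form and the global minimizer of Σ_i f_i is the mean of the a_i (the values of a_i, ρ, and the iteration count are illustrative choices; in general the x-update is a small local optimization rather than a formula):

```python
# Consensus ADMM on f_i(x) = (x - a_i)^2, one scalar "data block" per term.
a = [1.0, 4.0, 7.0, 10.0]
N = len(a)
rho = 1.0

x = [0.0] * N      # local variables x_i
u = [0.0] * N      # scaled dual variables u_i
xbar = 0.0         # average of the local variables

for k in range(100):
    # local x-updates, each independent and parallelizable:
    # argmin_xi (xi - a_i)^2 + (rho/2)(xi - xbar + u_i)^2
    x = [(2 * a[i] + rho * (xbar - u[i])) / (2 + rho) for i in range(N)]
    xbar = sum(x) / N                      # coordination: averaging
    u = [u[i] + x[i] - xbar for i in range(N)]

print(xbar)  # converges to mean(a) = 5.5
```

Each x-update sees only its own a_i; the blocks communicate only through the average x̄, mirroring the distributed setting.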
Consensus model fitting
• variable is θ, the parameter in the predictor
• f_i(θ_i) is the loss + (share of the) regularizer for the ith data block
• θ_i^{k+1} minimizes the local loss + an additional quadratic term
• local parameters converge to consensus, the same as if the whole data set were handled together
• privacy preserving: agents don't reveal their data to each other
Example
• SVM:
  – hinge loss l(u) = (1 − u)₊
  – sum-square regularization r(θ) = ‖θ‖₂²
• baby problem with n = 2, m = 400 to illustrate
• examples split into N = 20 groups, in the worst possible way: each group contains only positive or only negative examples
Iteration 1
[figure omitted: scatter plot; axes x ∈ [−3, 3], y ∈ [−10, 10]]
Iteration 5
[figure omitted: scatter plot; axes x ∈ [−3, 3], y ∈ [−10, 10]]
Iteration 40
[figure omitted: scatter plot; axes x ∈ [−3, 3], y ∈ [−10, 10]]
CVXPY implementation
(Steven Diamond)
• N = 10^5 samples, n = 10^3 (dense) features
• hinge (SVM) loss with ℓ1 regularization
• data split into 100 chunks
• 100 processes on 32 cores
• 26 sec per ADMM iteration
• 100 iterations for the objective to converge
• 10 iterations (5 minutes) to get a good model
CVXPY implementation
H2O implementation
(Tomas Nykodym)
• click-through data derived from a Kaggle data set
• 20,000 features, 20M examples
• logistic loss, elastic net regularization
• examples divided into 100 chunks (of different sizes)
• run on 100 H2O instances
• 5 iterations to get a good global model
H2O implementation
ROC, iteration 1
H2O implementation
ROC, iteration 2
H2O implementation
ROC, iteration 3
H2O implementation
ROC, iteration 5
H2O implementation
ROC, iteration 10
Summary
ADMM consensus
• can do machine learning across distributed data sources
• the data never moves
• get the same model as if you had collected all the data in one place
Resources
many researchers have worked on the topics covered
• Convex Optimization
• Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
• EE364a (course slides, videos, code, homework, . . . )
• software: CVX, CVXPY, Convex.jl

all available online