
Page 1: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Consensus Optimization and Machine Learning

Stephen Boyd and Steven Diamond

EE & CS Departments

Stanford University

H2O World, 11/10/2015


Page 2: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Outline

Convex optimization

Model fitting via convex optimization

Consensus optimization and model fitting


Page 3: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Outline

Convex optimization

Model fitting via convex optimization

Consensus optimization and model fitting


Page 4: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Convex optimization problem

convex optimization problem:

minimize    f_0(x)
subject to  f_i(x) ≤ 0,  i = 1, ..., m
            Ax = b

- variable x ∈ R^n

- equality constraints are linear

- f_0, ..., f_m are convex: for θ ∈ [0, 1],

  f_i(θx + (1 − θ)y) ≤ θ f_i(x) + (1 − θ) f_i(y)

  i.e., the f_i have nonnegative (upward) curvature


Page 5: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Why convex optimization?

- we can solve convex optimization problems effectively

- there are lots of applications



Page 8: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Application areas

- machine learning, statistics

- finance

- supply chain, revenue management, advertising

- control

- signal and image processing, vision

- networking

- circuit design

- and many others ...


Page 9: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Convex optimization solvers

- medium scale (1000s–10,000s of variables, constraints): interior-point methods on a single machine

- large scale (100k–1B variables, constraints): custom (often problem-specific) methods, e.g., SGD

- lots of ongoing research

- growing list of open-source solvers


Page 10: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Convex optimization modeling languages

- (new) high-level language support for convex optimization
  - describe the problem in a high-level language
  - problem is compiled to standard form and solved

- implementations:
  - YALMIP, CVX (Matlab)
  - CVXPY (Python)
  - Convex.jl (Julia)


Page 11: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

CVXPY

(Diamond & Boyd, 2013)

minimize    ‖Ax − b‖_2^2 + γ‖x‖_1
subject to  ‖x‖_∞ ≤ 1

from cvxpy import *

x = Variable(n)                                  # optimization variable
cost = sum_squares(A*x - b) + gamma*norm(x, 1)   # least squares + l1 penalty
prob = Problem(Minimize(cost), [norm(x, "inf") <= 1])
opt_val = prob.solve()                           # optimal objective value
solution = x.value                               # optimal point


Page 12: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example: Image in-painting

- guess pixel values in obscured/corrupted parts of the image

- total variation in-painting: choose pixel values x_ij ∈ R^3 to minimize the total variation

  TV(x) = Σ_ij ‖ (x_{i+1,j} − x_{i,j},  x_{i,j+1} − x_{i,j}) ‖_2

- a convex problem

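A minimal CVXPY sketch of TV in-painting (not the talk's actual code): it uses a small synthetic grayscale image rather than the slides' 512 × 512 color image, and the mask and sizes below are illustrative assumptions.

import numpy as np
import cvxpy as cp

# synthetic stand-in: small grayscale image with ~80% of pixels removed
rows, cols = 64, 64
rng = np.random.default_rng(0)
original = rng.random((rows, cols))
known = (rng.random((rows, cols)) > 0.8).astype(float)  # 1 where a pixel survives
corrupted = known * original

x = cp.Variable((rows, cols))
# minimize total variation subject to matching the known pixels
constraints = [cp.multiply(known, x) == cp.multiply(known, corrupted)]
cp.Problem(cp.Minimize(cp.tv(x)), constraints).solve()
recovered = x.value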

Page 13: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

512 × 512 color image (n ≈ 800,000 variables)

[figures: Original (left), Corrupted (right)]


Page 14: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

[figures: Original (left), Recovered (right)]


Page 15: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

80% of pixels removed

[figures: Original (left), Corrupted (right)]


Page 16: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

80% of pixels removed

[figures: Original (left), Recovered (right)]


Page 17: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Outline

Convex optimization

Model fitting via convex optimization

Consensus optimization and model fitting


Page 18: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Predictor

- given data (x_i, y_i), i = 1, ..., m

- x is the feature vector, y is the outcome or label

- find a predictor ψ so that

  y ≈ ŷ = ψ(x)  for data (x, y) that you haven't seen

- ψ is a regression model for y ∈ R

- ψ is a classifier for y ∈ {−1, 1}


Page 19: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Loss minimization predictor

- predictor parametrized by θ ∈ R^n

- loss function L(x_i, y_i, θ) gives the misfit for data point (x_i, y_i)

- for given θ, the predictor is

  ψ(x) = argmin_y L(x, y, θ)

- how do we choose the parameter θ?


Page 20: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Model fitting via regularized loss minimization

- choose θ by minimizing the regularized loss

  (1/m) Σ_{i=1}^m L(x_i, y_i, θ) + λ r(θ)

- regularization r(θ) penalizes model complexity, enforces constraints, or represents a prior

- λ > 0 scales the regularization

- for many useful cases, this is a convex problem

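As one concrete instance, a short CVXPY sketch of regularized loss minimization with logistic loss and ℓ1 regularization; the data X, y, the problem sizes, and the value of λ are illustrative assumptions.

import numpy as np
import cvxpy as cp

m, n = 200, 20                               # assumed sizes
rng = np.random.default_rng(0)
X = rng.standard_normal((m, n))
y = np.sign(rng.standard_normal(m))          # labels in {−1, +1}

theta = cp.Variable(n)
lam = 0.1                                    # assumed regularization weight
# (1/m) Σ_i log(1 + exp(−y_i θ^T x_i)) + λ ‖θ‖_1
loss = cp.sum(cp.logistic(-cp.multiply(y, X @ theta))) / m
cp.Problem(cp.Minimize(loss + lam * cp.norm(theta, 1))).solve()
theta_hat = theta.value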


Page 22: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Examples

predictor            L(x, y, θ)               ψ(x)          r(θ)
least-squares        (θ^T x − y)^2            θ^T x         0
ridge regression     (θ^T x − y)^2            θ^T x         ‖θ‖_2^2
lasso                (θ^T x − y)^2            θ^T x         ‖θ‖_1
logistic classifier  log(1 + exp(−y θ^T x))   sign(θ^T x)   0
SVM                  (1 − y θ^T x)_+          sign(θ^T x)   ‖θ‖_2^2

- can mix and match, e.g., r(θ) = ‖θ‖_1 sparsifies

- all lead to convex fitting problems


Page 23: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Robust (Huber) regression

- loss L(x, y, θ) = φ_hub(θ^T x − y)

- φ_hub is the Huber function (with threshold M > 0):

  φ_hub(u) = u^2            for |u| ≤ M
             2M|u| − M^2    for |u| > M

- same as least squares for small residuals, but allows (some) large residuals

- and so, robust to outliers

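A brief CVXPY sketch of Huber regression using the built-in cp.huber atom; the synthetic data mirrors the example on the next slide, but the threshold M and the corruption fraction here are assumptions for illustration.

import numpy as np
import cvxpy as cp

m, n = 450, 300
rng = np.random.default_rng(0)
X = rng.standard_normal((m, n))
theta_true = rng.standard_normal(n)
y = X @ theta_true + rng.standard_normal(m)
flip = rng.random(m) < 0.1            # corrupt a fraction p = 0.1 of measurements
y[flip] = -y[flip]                    # replace y_i with −y_i, as in the example

theta = cp.Variable(n)
M = 1.0                               # Huber threshold (assumed)
cp.Problem(cp.Minimize(cp.sum(cp.huber(X @ theta - y, M)))).solve()
theta_hat = theta.value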

Page 24: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

- m = 450 measurements, n = 300 regressors

- choose θ^true; x_i ∼ N(0, I)

- set y_i = (θ^true)^T x_i + ε_i, ε_i ∼ N(0, 1)

- with probability p, replace y_i with −y_i

- data has a fraction p of (non-obvious) wrong measurements

- distributions of ‘good’ and ‘bad’ y_i are the same

- try to recover θ^true ∈ R^n from the measurements y ∈ R^m

- ‘prescient’ version: we know which measurements are wrong


Page 25: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

50 problem instances, p varying from 0 to 0.15


Page 26: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example


Page 27: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Quantile regression

- quantile regression: use the tilted ℓ1 loss

  L(x, y, θ) = τ (r)_+ + (1 − τ)(r)_−,  with r = θ^T x − y, τ ∈ (0, 1)

- τ = 0.5: equal penalty for over- and under-estimating

- τ = 0.1: 9× more penalty for under-estimating

- τ = 0.9: 9× more penalty for over-estimating

- the τ-quantile of the residuals is zero

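The tilted ℓ1 loss can be rewritten as max(τ r, (τ − 1) r), which makes it one line in CVXPY; a sketch, with the data shapes and helper name assumed for illustration:

import numpy as np
import cvxpy as cp

def quantile_regression(X, y, tau):
    # fit theta by minimizing the tilted l1 loss at quantile tau
    theta = cp.Variable(X.shape[1])
    r = X @ theta - y                             # residuals
    # tau*(r)_+ + (1-tau)*(r)_-  ==  max(tau*r, (tau-1)*r)
    loss = cp.sum(cp.maximum(tau * r, (tau - 1) * r))
    cp.Problem(cp.Minimize(loss)).solve()
    return theta.value

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                 # assumed sizes
y = X @ rng.standard_normal(5) + rng.standard_normal(100)
theta_01 = quantile_regression(X, y, 0.1)         # penalizes under-estimates 9x more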

Page 28: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

- time series x_t, t = 0, 1, 2, ...

- auto-regressive predictor:

  x̂_{t+1} = θ^T (1, x_t, ..., x_{t−M})

- M = 10 is the memory of the predictor

- use quantile regression for τ = 0.1, 0.5, 0.9

- at each time t, this gives three one-step-ahead predictions:

  x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9}

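A sketch of how the auto-regressive features could be assembled and fed to the quantile_regression helper from the previous sketch; M = 10 is from the slides, while the series itself and the helper names are assumptions.

import numpy as np

def ar_features(x, M):
    # rows (1, x_t, ..., x_{t-M}) paired with targets x_{t+1}
    rows, targets = [], []
    for t in range(M, len(x) - 1):
        rows.append(np.concatenate(([1.0], x[t::-1][:M + 1])))
        targets.append(x[t + 1])
    return np.array(rows), np.array(targets)

rng = np.random.default_rng(0)
x = np.sin(np.arange(450) / 10.0) + 0.1 * rng.standard_normal(450)  # assumed series
X, y = ar_features(x, M=10)
# one-step-ahead predictions at the three quantiles
preds = {tau: X @ quantile_regression(X, y, tau) for tau in (0.1, 0.5, 0.9)}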

Page 29: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

time series x_t


Page 30: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

x_t and predictions x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9} (training set, t = 0, ..., 399)


Page 31: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

x_t and predictions x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9} (test set, t = 400, ..., 449)


Page 32: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

residual distributions for τ = 0.9, 0.5, and 0.1 (training set)


Page 33: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

residual distributions for τ = 0.9, 0.5, and 0.1 (test set)


Page 34: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Outline

Convex optimization

Model fitting via convex optimization

Consensus optimization and model fitting


Page 35: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Consensus optimization

- want to solve a problem with N objective terms:

  minimize  Σ_{i=1}^N f_i(x)

  e.g., f_i is the loss function for the ith block of training data

- consensus form:

  minimize    Σ_{i=1}^N f_i(x_i)
  subject to  x_i − z = 0,  i = 1, ..., N

- the x_i are local variables

- z is the global variable

- x_i − z = 0 are the consistency or consensus constraints


Page 36: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Consensus optimization via ADMM

with x̄^k = (1/N) Σ_{i=1}^N x_i^k (the average over local variables):

  x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + (ρ/2) ‖x_i − x̄^k + u_i^k‖_2^2 )

  u_i^{k+1} := u_i^k + (x_i^{k+1} − x̄^{k+1})

- get the global minimum, under very general conditions

- u^k is a running sum of the inconsistencies (PI control)

- minimizations are carried out independently and in parallel

- coordination is via averaging of the local variables x_i

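A serial toy sketch of the consensus ADMM iteration above, applied to lasso with the data split into N blocks; ρ, λ, the block sizes, and the use of CVXPY for the x_i-updates are illustrative assumptions (in a real deployment each x_i-update would run on its own worker).

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n, N = 400, 20, 4                       # samples, features, blocks (assumed)
X = rng.standard_normal((m, n))
y = X @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)
Xs, ys = np.split(X, N), np.split(y, N)    # one (X_i, y_i) per block

rho, lam = 1.0, 0.1                        # assumed penalty and regularization
x = np.zeros((N, n))                       # local variables x_i
u = np.zeros((N, n))                       # scaled dual variables u_i

for k in range(30):
    xbar = x.mean(axis=0)                  # coordination: average the local variables
    for i in range(N):                     # independent; would run in parallel
        xi = cp.Variable(n)
        # f_i = block loss + share of the regularizer
        f_i = cp.sum_squares(Xs[i] @ xi - ys[i]) + (lam / N) * cp.norm(xi, 1)
        quad = (rho / 2) * cp.sum_squares(xi - xbar + u[i])
        cp.Problem(cp.Minimize(f_i + quad)).solve()
        x[i] = xi.value
    u += x - x.mean(axis=0)                # running sum of inconsistencies

theta = x.mean(axis=0)                     # consensus model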

Page 37: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Consensus model fitting

- variable is θ, the parameter in the predictor

- f_i(θ_i) is the loss + (share of) the regularizer for the ith data block

- θ_i^{k+1} minimizes the local loss + an additional quadratic term

- local parameters converge to consensus, the same as if the whole data set were handled together

- privacy preserving: agents don't reveal their data to each other


Page 38: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

- SVM:
  - hinge loss l(u) = (1 − u)_+
  - sum-square regularization r(θ) = ‖θ‖_2^2

- baby problem with n = 2, m = 400 to illustrate

- examples split into N = 20 groups, in the worst possible way: each group contains only positive or only negative examples


Page 39: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Iteration 1

[figure: the n = 2 example, axes [−3, 3] × [−10, 10]]


Page 40: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Iteration 5

[figure: the n = 2 example, axes [−3, 3] × [−10, 10]]


Page 41: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Iteration 40

[figure: the n = 2 example, axes [−3, 3] × [−10, 10]]


Page 42: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

CVXPY implementation

(Steven Diamond)

- N = 10^5 samples, n = 10^3 (dense) features

- hinge (SVM) loss with ℓ1 regularization

- data split into 100 chunks

- 100 processes on 32 cores

- 26 sec per ADMM iteration

- 100 iterations for the objective to converge

- 10 iterations (5 minutes) to get a good model


Page 43: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

CVXPY implementation


Page 44: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

H2O implementation

(Tomas Nykodym)

- click-through data derived from a Kaggle data set

- 20,000 features, 20M examples

- logistic loss, elastic net regularization

- examples divided into 100 chunks (of different sizes)

- run on 100 H2O instances

- 5 iterations to get a good global model


Page 45: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

H2O implementation

ROC, iteration 1


Page 46: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

H2O implementation

ROC, iteration 2


Page 47: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

H2O implementation

ROC, iteration 3


Page 48: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

H2O implementation

ROC, iteration 5


Page 49: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

H2O implementation

ROC, iteration 10


Page 50: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Summary

ADMM consensus

- can do machine learning across distributed data sources

- the data never moves

- get the same model as if you had collected all the data in one place


Page 51: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Resources

many researchers have worked on the topics covered

- Convex Optimization

- Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers

- EE364a (course slides, videos, code, homework, ...)

- software: CVX, CVXPY, Convex.jl

all available online
