
Page 1: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Consensus Optimization and Machine Learning

Stephen Boyd and Steven Diamond

EE & CS Departments

Stanford University

H2O World, 11/10/2015


Page 2: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Outline

Convex optimization

Model fitting via convex optimization

Consensus optimization and model fitting


Page 3: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Outline

Convex optimization

Model fitting via convex optimization

Consensus optimization and model fitting


Page 4: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Convex optimization problem

convex optimization problem:

minimize    f_0(x)
subject to  f_i(x) ≤ 0,  i = 1, ..., m
            Ax = b

- variable x ∈ R^n

- equality constraints are linear

- f_0, ..., f_m are convex: for θ ∈ [0, 1],

  f_i(θx + (1 − θ)y) ≤ θ f_i(x) + (1 − θ) f_i(y)

  i.e., the f_i have nonnegative (upward) curvature


Page 5: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Why convex optimization?

- we can solve convex optimization problems effectively

- there are lots of applications



Page 8: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Application areas

- machine learning, statistics

- finance

- supply chain, revenue management, advertising

- control

- signal and image processing, vision

- networking

- circuit design

- and many others ...


Page 9: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Convex optimization solvers

- medium scale (1000s–10,000s of variables, constraints): interior-point methods on a single machine

- large scale (100k–1B variables, constraints): custom (often problem-specific) methods, e.g., SGD

- lots of ongoing research

- growing list of open-source solvers


Page 10: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Convex optimization modeling languages

- (new) high-level language support for convex optimization
  - describe the problem in a high-level language
  - problem is compiled to standard form and solved

- implementations:
  - YALMIP, CVX (Matlab)
  - CVXPY (Python)
  - Convex.jl (Julia)


Page 11: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

CVXPY

(Diamond & Boyd, 2013)

minimize    ‖Ax − b‖_2^2 + γ‖x‖_1
subject to  ‖x‖_∞ ≤ 1

from cvxpy import *

x = Variable(n)                                  # optimization variable
cost = sum_squares(A*x - b) + gamma*norm(x, 1)   # least squares + l1 penalty
prob = Problem(Minimize(cost), [norm(x, "inf") <= 1])
opt_val = prob.solve()                           # optimal objective value
solution = x.value                               # optimal point


Page 12: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example: Image in-painting

- guess pixel values in obscured/corrupted parts of the image

- total variation in-painting: choose pixel values x_ij ∈ R^3 to minimize the total variation

  TV(x) = Σ_ij ‖ (x_{i+1,j} − x_{i,j},  x_{i,j+1} − x_{i,j}) ‖_2

- a convex problem

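A minimal CVXPY sketch of TV in-painting (not the talk's actual code): it uses a small synthetic grayscale image rather than the slides' 512 × 512 color image, and the mask and sizes below are illustrative assumptions.

import numpy as np
import cvxpy as cp

# synthetic stand-in: small grayscale image with ~80% of pixels removed
rows, cols = 64, 64
rng = np.random.default_rng(0)
original = rng.random((rows, cols))
known = (rng.random((rows, cols)) > 0.8).astype(float)  # 1 where a pixel survives
corrupted = known * original

x = cp.Variable((rows, cols))
# minimize total variation subject to matching the known pixels
constraints = [cp.multiply(known, x) == cp.multiply(known, corrupted)]
cp.Problem(cp.Minimize(cp.tv(x)), constraints).solve()
recovered = x.value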

Page 13: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

512 × 512 color image (n ≈ 800,000 variables)

[figures: Original (left), Corrupted (right)]


Page 14: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

[figures: Original (left), Recovered (right)]


Page 15: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

80% of pixels removed

[figures: Original (left), Corrupted (right)]


Page 16: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

80% of pixels removed

[figures: Original (left), Recovered (right)]


Page 17: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Outline

Convex optimization

Model fitting via convex optimization

Consensus optimization and model fitting


Page 18: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Predictor

- given data (x_i, y_i), i = 1, ..., m

- x is the feature vector, y is the outcome or label

- find a predictor ψ so that

  y ≈ ŷ = ψ(x)  for data (x, y) that you haven't seen

- ψ is a regression model for y ∈ R

- ψ is a classifier for y ∈ {−1, 1}


Page 19: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Loss minimization predictor

- predictor parametrized by θ ∈ R^n

- loss function L(x_i, y_i, θ) gives the misfit for data point (x_i, y_i)

- for given θ, the predictor is

  ψ(x) = argmin_y L(x, y, θ)

- how do we choose the parameter θ?


Page 20: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Model fitting via regularized loss minimization

- choose θ by minimizing the regularized loss

  (1/m) Σ_{i=1}^m L(x_i, y_i, θ) + λ r(θ)

- regularization r(θ) penalizes model complexity, enforces constraints, or represents a prior

- λ > 0 scales the regularization

- for many useful cases, this is a convex problem

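As one concrete instance, a short CVXPY sketch of regularized loss minimization with logistic loss and ℓ1 regularization; the data X, y, the problem sizes, and the value of λ are illustrative assumptions.

import numpy as np
import cvxpy as cp

m, n = 200, 20                               # assumed sizes
rng = np.random.default_rng(0)
X = rng.standard_normal((m, n))
y = np.sign(rng.standard_normal(m))          # labels in {−1, +1}

theta = cp.Variable(n)
lam = 0.1                                    # assumed regularization weight
# (1/m) Σ_i log(1 + exp(−y_i θ^T x_i)) + λ ‖θ‖_1
loss = cp.sum(cp.logistic(-cp.multiply(y, X @ theta))) / m
cp.Problem(cp.Minimize(loss + lam * cp.norm(theta, 1))).solve()
theta_hat = theta.value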


Page 22: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Examples

predictor            L(x, y, θ)               ψ(x)          r(θ)
least-squares        (θ^T x − y)^2            θ^T x         0
ridge regression     (θ^T x − y)^2            θ^T x         ‖θ‖_2^2
lasso                (θ^T x − y)^2            θ^T x         ‖θ‖_1
logistic classifier  log(1 + exp(−y θ^T x))   sign(θ^T x)   0
SVM                  (1 − y θ^T x)_+          sign(θ^T x)   ‖θ‖_2^2

- can mix and match, e.g., r(θ) = ‖θ‖_1 sparsifies

- all lead to convex fitting problems


Page 23: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Robust (Huber) regression

- loss L(x, y, θ) = φ_hub(θ^T x − y)

- φ_hub is the Huber function (with threshold M > 0):

  φ_hub(u) = u^2            for |u| ≤ M
             2M|u| − M^2    for |u| > M

- same as least squares for small residuals, but allows (some) large residuals

- and so, robust to outliers

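A brief CVXPY sketch of Huber regression using the built-in cp.huber atom; the synthetic data mirrors the example on the next slide, but the threshold M and the corruption fraction here are assumptions for illustration.

import numpy as np
import cvxpy as cp

m, n = 450, 300
rng = np.random.default_rng(0)
X = rng.standard_normal((m, n))
theta_true = rng.standard_normal(n)
y = X @ theta_true + rng.standard_normal(m)
flip = rng.random(m) < 0.1            # corrupt a fraction p = 0.1 of measurements
y[flip] = -y[flip]                    # replace y_i with −y_i, as in the example

theta = cp.Variable(n)
M = 1.0                               # Huber threshold (assumed)
cp.Problem(cp.Minimize(cp.sum(cp.huber(X @ theta - y, M)))).solve()
theta_hat = theta.value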

Page 24: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

- m = 450 measurements, n = 300 regressors

- choose θ^true; x_i ∼ N(0, I)

- set y_i = (θ^true)^T x_i + ε_i, ε_i ∼ N(0, 1)

- with probability p, replace y_i with −y_i

- data has a fraction p of (non-obvious) wrong measurements

- distributions of ‘good’ and ‘bad’ y_i are the same

- try to recover θ^true ∈ R^n from the measurements y ∈ R^m

- ‘prescient’ version: we know which measurements are wrong


Page 25: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

50 problem instances, p varying from 0 to 0.15


Page 26: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example


Page 27: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Quantile regression

- quantile regression: use the tilted ℓ1 loss

  L(x, y, θ) = τ (r)_+ + (1 − τ)(r)_−,  with r = θ^T x − y, τ ∈ (0, 1)

- τ = 0.5: equal penalty for over- and under-estimating

- τ = 0.1: 9× more penalty for under-estimating

- τ = 0.9: 9× more penalty for over-estimating

- the τ-quantile of the residuals is zero

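The tilted ℓ1 loss can be rewritten as max(τ r, (τ − 1) r), which makes it one line in CVXPY; a sketch, with the data shapes and helper name assumed for illustration:

import numpy as np
import cvxpy as cp

def quantile_regression(X, y, tau):
    # fit theta by minimizing the tilted l1 loss at quantile tau
    theta = cp.Variable(X.shape[1])
    r = X @ theta - y                             # residuals
    # tau*(r)_+ + (1-tau)*(r)_-  ==  max(tau*r, (tau-1)*r)
    loss = cp.sum(cp.maximum(tau * r, (tau - 1) * r))
    cp.Problem(cp.Minimize(loss)).solve()
    return theta.value

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                 # assumed sizes
y = X @ rng.standard_normal(5) + rng.standard_normal(100)
theta_01 = quantile_regression(X, y, 0.1)         # penalizes under-estimates 9x more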

Page 28: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

- time series x_t, t = 0, 1, 2, ...

- auto-regressive predictor:

  x̂_{t+1} = θ^T (1, x_t, ..., x_{t−M})

- M = 10 is the memory of the predictor

- use quantile regression for τ = 0.1, 0.5, 0.9

- at each time t, this gives three one-step-ahead predictions:

  x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9}

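A sketch of how the auto-regressive features could be assembled and fed to the quantile_regression helper from the previous sketch; M = 10 is from the slides, while the series itself and the helper names are assumptions.

import numpy as np

def ar_features(x, M):
    # rows (1, x_t, ..., x_{t-M}) paired with targets x_{t+1}
    rows, targets = [], []
    for t in range(M, len(x) - 1):
        rows.append(np.concatenate(([1.0], x[t::-1][:M + 1])))
        targets.append(x[t + 1])
    return np.array(rows), np.array(targets)

rng = np.random.default_rng(0)
x = np.sin(np.arange(450) / 10.0) + 0.1 * rng.standard_normal(450)  # assumed series
X, y = ar_features(x, M=10)
# one-step-ahead predictions at the three quantiles
preds = {tau: X @ quantile_regression(X, y, tau) for tau in (0.1, 0.5, 0.9)}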

Page 29: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

time series x_t


Page 30: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

x_t and predictions x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9} (training set, t = 0, ..., 399)


Page 31: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

x_t and predictions x̂_{t+1}^{0.1}, x̂_{t+1}^{0.5}, x̂_{t+1}^{0.9} (test set, t = 400, ..., 449)


Page 32: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

residual distributions for τ = 0.9, 0.5, and 0.1 (training set)


Page 33: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

residual distributions for τ = 0.9, 0.5, and 0.1 (test set)


Page 34: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Outline

Convex optimization

Model fitting via convex optimization

Consensus optimization and model fitting


Page 35: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Consensus optimization

- want to solve a problem with N objective terms:

  minimize  Σ_{i=1}^N f_i(x)

  e.g., f_i is the loss function for the ith block of training data

- consensus form:

  minimize    Σ_{i=1}^N f_i(x_i)
  subject to  x_i − z = 0,  i = 1, ..., N

- the x_i are local variables

- z is the global variable

- x_i − z = 0 are the consistency or consensus constraints


Page 36: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Consensus optimization via ADMM

with x̄^k = (1/N) Σ_{i=1}^N x_i^k (the average over local variables):

  x_i^{k+1} := argmin_{x_i} ( f_i(x_i) + (ρ/2) ‖x_i − x̄^k + u_i^k‖_2^2 )

  u_i^{k+1} := u_i^k + (x_i^{k+1} − x̄^{k+1})

- get the global minimum, under very general conditions

- u^k is a running sum of the inconsistencies (PI control)

- minimizations are carried out independently and in parallel

- coordination is via averaging of the local variables x_i

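A serial toy sketch of the consensus ADMM iteration above, applied to lasso with the data split into N blocks; ρ, λ, the block sizes, and the use of CVXPY for the x_i-updates are illustrative assumptions (in a real deployment each x_i-update would run on its own worker).

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n, N = 400, 20, 4                       # samples, features, blocks (assumed)
X = rng.standard_normal((m, n))
y = X @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)
Xs, ys = np.split(X, N), np.split(y, N)    # one (X_i, y_i) per block

rho, lam = 1.0, 0.1                        # assumed penalty and regularization
x = np.zeros((N, n))                       # local variables x_i
u = np.zeros((N, n))                       # scaled dual variables u_i

for k in range(30):
    xbar = x.mean(axis=0)                  # coordination: average the local variables
    for i in range(N):                     # independent; would run in parallel
        xi = cp.Variable(n)
        # f_i = block loss + share of the regularizer
        f_i = cp.sum_squares(Xs[i] @ xi - ys[i]) + (lam / N) * cp.norm(xi, 1)
        quad = (rho / 2) * cp.sum_squares(xi - xbar + u[i])
        cp.Problem(cp.Minimize(f_i + quad)).solve()
        x[i] = xi.value
    u += x - x.mean(axis=0)                # running sum of inconsistencies

theta = x.mean(axis=0)                     # consensus model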

Page 37: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Consensus model fitting

- variable is θ, the parameter in the predictor

- f_i(θ_i) is the loss + (share of) the regularizer for the ith data block

- θ_i^{k+1} minimizes the local loss + an additional quadratic term

- local parameters converge to consensus, the same as if the whole data set were handled together

- privacy preserving: agents don't reveal their data to each other


Page 38: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Example

- SVM:
  - hinge loss l(u) = (1 − u)_+
  - sum-square regularization r(θ) = ‖θ‖_2^2

- baby problem with n = 2, m = 400 to illustrate

- examples split into N = 20 groups, in the worst possible way: each group contains only positive or only negative examples


Page 39: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Iteration 1

[figure: the n = 2 example, axes [−3, 3] × [−10, 10]]


Page 40: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Iteration 5

[figure: the n = 2 example, axes [−3, 3] × [−10, 10]]


Page 41: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Iteration 40

[figure: the n = 2 example, axes [−3, 3] × [−10, 10]]


Page 42: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

CVXPY implementation

(Steven Diamond)

- N = 10^5 samples, n = 10^3 (dense) features

- hinge (SVM) loss with ℓ1 regularization

- data split into 100 chunks

- 100 processes on 32 cores

- 26 sec per ADMM iteration

- 100 iterations for the objective to converge

- 10 iterations (5 minutes) to get a good model


Page 43: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

CVXPY implementation


Page 44: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

H2O implementation

(Tomas Nykodym)

- click-through data derived from a Kaggle data set

- 20,000 features, 20M examples

- logistic loss, elastic net regularization

- examples divided into 100 chunks (of different sizes)

- run on 100 H2O instances

- 5 iterations to get a good global model


Page 45: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

H2O implementation

ROC, iteration 1


Page 46: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

H2O implementation

ROC, iteration 2


Page 47: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

H2O implementation

ROC, iteration 3


Page 48: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

H2O implementation

ROC, iteration 5


Page 49: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

H2O implementation

ROC, iteration 10


Page 50: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Summary

ADMM consensus

- can do machine learning across distributed data sources

- the data never moves

- get the same model as if you had collected all the data in one place


Page 51: H2O World - Consensus Optimization and Machine Learning - Stephen Boyd

Resources

many researchers have worked on the topics covered

- Convex Optimization

- Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers

- EE364a (course slides, videos, code, homework, ...)

- software: CVX, CVXPY, Convex.jl

all available online
