
Stochastic Optimization
Introduction + Sparse regularization + Convex analysis

†‡ Taiji Suzuki

†Tokyo Institute of Technology
Graduate School of Information Science and Engineering
Department of Mathematical and Computing Sciences
‡JST, PRESTO

Intensive course @ Nagoya University

1 / 55

Outline

1 Introduction

2 Short course to convex analysis
  Convexity and related concepts
  Duality
  Smoothness and strong convexity

2 / 55

Lecture plan

Day 1:
  Convex analysis
  First order method
  “Online” stochastic optimization method: SGD, SRDA

Day 2:
  AdaGrad, acceleration of SGD
  “Batch” stochastic optimization method: SDCA, SVRG, SAG
  Distributed optimization (if possible)

3 / 55

Outline

1 Introduction

2 Short course to convex analysis
  Convexity and related concepts
  Duality
  Smoothness and strong convexity

4 / 55

Machine learning as optimization

Machine learning is a methodology for dealing with large amounts of uncertain data.

Generalization error minimization:
  min_{θ∈Θ} E_Z[ℓ_θ(Z)]

Empirical approximation:
  min_{θ∈Θ} (1/n) Σ_{i=1}^n ℓ_θ(z_i)

Stochastic optimization is an intersection of learning and optimization.

5 / 55

[Figure: massive data with new data input x1, x2, x3, x4, ... arriving]

Recently, stochastic optimization has been used to handle huge datasets:

  (1/n) Σ_{i=1}^n ℓ_θ(z_i)  [a sum over a huge n]  + ψ(θ)

How can we optimize this efficiently?
Do we need to go through the whole dataset at every iteration?

6 / 55

History of stochastic optimization for ML

1951        Robbins and Monro             Stochastic approximation for the root finding problem
1957        Rosenblatt                    Perceptron
1978, 1983  Nemirovskii and Yudin         Robustification for non-smooth objectives, and optimality
1988, 1992  Ruppert; Polyak and Juditsky  Robust step size policy and averaging for smooth objectives
1998, 2004  Bottou; Bottou and LeCun      Online stochastic optimization for large scale ML tasks
2009-2012   Singer and Duchi; Duchi et al.; Xiao                          FOBOS, AdaGrad, RDA
2012-2013   Le Roux et al.; Shalev-Shwartz and Zhang; Johnson and Zhang   Linear convergence on batch data (SAG, SDCA, SVRG)

7 / 55

Overview of stochastic optimization

  min_x f(x)

Stochastic approximation (SA):
  Optimization for systems with uncertainty, e.g., machine control, traffic management, social science, and so on.
  g_t = ∇f(x^(t)) + ξ_t is observed, where ξ_t is noise (typically i.i.d.).

Stochastic approximation for machine learning and statistics:
  Typically generalization error minimization:
    min_x f(x) = min_x E_Z[ℓ(Z, x)].
  ℓ(z, x) is a loss function: e.g., the logistic loss ℓ((w, y), x) = log(1 + exp(−y w^⊤x)) for z = (w, y) ∈ R^p × {±1}.
  g_t = ∇ℓ(z_t, x^(t)) is observed, where z_t ∼ P(Z) is i.i.d. data.
  Used for huge datasets.
  We do not need exact optimization: optimization to a certain precision (typically O(1/n)) is sufficient.

8 / 55

Two types of stochastic optimization

Online type stochastic optimization:
  We observe data sequentially.
  Each observation is used just once (basically).

    min_x E_Z[ℓ(Z, x)]

Batch type stochastic optimization:
  The whole sample has already been observed.
  We may use each training example multiple times.

    min_x (1/n) Σ_{i=1}^n ℓ(z_i, x)

9 / 55
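To make the two regimes concrete, here is a minimal sketch of one online pass of plain SGD in Python. It is illustrative only: the step-size schedule and the grad_loss callback are my assumptions, not part of the lecture.

import numpy as np

def sgd_one_pass(samples, grad_loss, x0, eta0=1.0):
    """One online pass: each sample z_t is used exactly once.
    grad_loss(z, x) must return a (sub)gradient of ell(z, x) w.r.t. x."""
    x = x0.copy()
    for t, z in enumerate(samples, start=1):
        g = grad_loss(z, x)              # stochastic gradient g_t = grad ell(z_t, x^(t))
        x = x - (eta0 / np.sqrt(t)) * g  # decaying step size eta_t = eta0 / sqrt(t), a typical choice
    return x

A batch method, by contrast, keeps all n samples and revisits them over many epochs; SAG, SDCA, and SVRG (Day 2) add extra bookkeeping on top of such passes to obtain linear convergence.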

Summary of convergence rates

Online methods (expected risk minimization):
  GR/√T                          (non-smooth, non-strongly convex)
  G²/(µT)                        (non-smooth, strongly convex)
  σR/√T + R²L/T²                 (smooth, non-strongly convex)
  σ²/(µT) + exp(−√(µ/L) T)       (smooth, strongly convex)

Batch methods (empirical risk minimization):
  exp(−T/(n + L/µ))              (smooth loss, strongly convex reg)
  exp(−T/(n + √(nL/µ)))          (smooth loss, strongly convex reg, with acceleration)

G: upper bound on the norm of the gradient, R: diameter of the domain,
L: smoothness, µ: strong convexity, σ: variance of the gradient

10 / 55

Example of empirical risk minimization: High dimensional data analysis

Redundant information deteriorates the estimation accuracy.

[Figures: bio-informatics, text data, image data]

11 / 55


Sparse estimation

Cut off redundant information → sparsity

R. Tibshirani (1996). Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., Vol. 58, No. 1, pages 267–288.

12 / 55

Variable selection (linear regression)

Design matrix X = (X_ij) ∈ R^{n×p}.
p (dimension) ≫ n (number of samples).
The true vector β* ∈ R^p has at most d non-zero elements (sparse).

Linear model: Y = Xβ* + ξ.

Estimate β* from (Y, X).
The number of parameters that we actually need to estimate is d → variable selection.

AIC:
  β̂_AIC = argmin_{β∈R^p} ∥Y − Xβ∥² + 2σ²∥β∥₀,
where ∥β∥₀ = |{j : β_j ≠ 0}|.
→ 2^p candidates, NP-hard → convex approximation.

13 / 55


Lasso estimator

Lasso [L1 regularization]:

  β̂_Lasso = argmin_{β∈R^p} ∥Y − Xβ∥² + λ∥β∥₁,

where ∥β∥₁ = Σ_{j=1}^p |β_j|.

→ Convex optimization!
The L1-norm is the convex hull of the L0-norm on [−1, 1]^p (the largest convex function that supports it from below).
The L1-norm is the Lovász extension of the cardinality function.

More generally, for a loss function ℓ (logistic loss, hinge loss, ...):

  min_x { Σ_{i=1}^n ℓ(z_i, x) + λ∥x∥₁ }

14 / 55


Sparsity of Lasso estimator

Suppose p = n and X = I.

  β̂_Lasso = argmin_{β∈R^p} (1/2)∥Y − β∥² + C∥β∥₁

⇒ β̂_{Lasso,i} = argmin_{b∈R} (1/2)(y_i − b)² + C|b|
              = sign(y_i)(|y_i| − C)   (|y_i| > C)
              = 0                      (|y_i| ≤ C).

Small signals are shrunk to 0 → sparse!

15 / 55
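The closed form above is the soft-thresholding operator; a minimal NumPy sketch (the function name is mine):

import numpy as np

def soft_threshold(y, C):
    """Elementwise prox of C*|.|: sign(y) * max(|y| - C, 0)."""
    return np.sign(y) * np.maximum(np.abs(y) - C, 0.0)

# With p = n and X = I, the lasso solution is obtained coordinatewise:
# beta_lasso = soft_threshold(Y, C)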

Sparsity of Lasso estimator (fig)

  β̂ = argmin_{β∈R^p} (1/n)∥Xβ − Y∥₂² + λ_n Σ_{j=1}^p |β_j|.

16 / 55

Example

Y = Xβ + ϵ.

n = 1,000, p = 10,000, d = 500.

[Figure: estimated coefficients over the 10,000 coordinates; legend: True, Lasso, Least Squares]

17 / 55


Benefit of sparsity

  β̂ = argmin_{β∈R^p} (1/n)∥Xβ − Y∥₂² + λ_n Σ_{j=1}^p |β_j|.

Theorem (Lasso's convergence rate)

Under some conditions, there exists a constant C such that

  ∥β̂ − β*∥₂² ≤ C d log(p) / n.

※ The overall dimension p enters only through O(log(p))!
The actual dimension d is dominant.

18 / 55

Extensions of sparse regularization

  β̂ = argmin_{β∈R^p} (1/n)∥Xβ − Y∥₂² + λ_n Σ_{j=1}^p |β_j|

  β̂ = argmin_{β∈R^p} (1/n) Σ_{i=1}^n ℓ(y_i, x_i^⊤β) + ψ(β)

19 / 55
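When ψ has an easy proximal operator, a standard solver for this kind of objective is proximal gradient descent (ISTA). The following rough sketch handles the lasso case with a fixed step size 1/L; it is my illustration under that assumption, not an algorithm taken from the slides.

import numpy as np

def lasso_ista(X, Y, lam, n_iter=500):
    """Proximal gradient descent for (1/n)*||X beta - Y||^2 + lam*||beta||_1."""
    n, p = X.shape
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the gradient of the smooth part
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ beta - Y)                   # gradient of (1/n)||X beta - Y||^2
        z = beta - grad / L                                       # gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L)*||.||_1 (soft-thresholding)
    return beta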

Examples

Overlapped group lasso:

  ψ(β) = C Σ_{g∈G} ∥β_g∥

The groups may overlap.
More aggressive sparsity.

Genome Wide Association Study (GWAS)
(Balding ‘06, McCarthy et al. ‘08)

20 / 55
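For non-overlapping groups, the proximal operator of the group penalty C Σ_{g∈G} ∥β_g∥ is blockwise soft-thresholding: each block is shrunk toward zero and set exactly to zero when its norm falls below the threshold. A small sketch (my own; it assumes the groups are disjoint index sets, so it does not cover the overlapping case above):

import numpy as np

def prox_group_lasso(v, groups, thresh):
    """Blockwise soft-thresholding: v_g -> max(0, 1 - thresh/||v_g||) * v_g."""
    out = v.copy()
    for g in groups:                       # g: array/list of indices of one (disjoint) group
        norm_g = np.linalg.norm(v[g])
        scale = max(0.0, 1.0 - thresh / norm_g) if norm_g > 0 else 0.0
        out[g] = scale * v[g]
    return out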

Application of group reg. (1)

Multi-task learning (Lounici et al., 2009)

Estimate simultaneously across T tasks:

  y_i^(t) = x_i^(t)⊤ β^(t) + ϵ_i^(t)   (i = 1, ..., n^(t), t = 1, ..., T).

  min_{β^(1),...,β^(T)} Σ_{t=1}^T Σ_{i=1}^{n^(t)} (y_i^(t) − x_i^(t)⊤ β^(t))² + C Σ_{k=1}^p ∥(β_k^(1), ..., β_k^(T))∥

(the last term is the group regularization).

[Figure: the coefficient vectors β^(1), β^(2), ..., β^(T) share a common sparsity pattern]

Select non-zero elements across tasks.

21 / 55


Application of group reg. (2)

Sentence regularization for text classification (Yogatama and Smith, 2014)

The words occurring in the same sentence are grouped:

  ψ(β) = Σ_{d=1}^D Σ_{s=1}^{S_d} λ_{d,s} ∥β_{(d,s)}∥₂,

(d indexes a document, s indexes a sentence).

22 / 55

Trace norm regularization

W: M × N matrix.

  ∥W∥_Tr = Tr[(WW^⊤)^{1/2}] = Σ_{j=1}^{min{M,N}} σ_j(W)

σ_j(W) is the j-th singular value of W (non-negative).

Sum of singular values = L1-regularization on the singular values
→ the singular values are sparse.

Sparse singular values = Low rank

23 / 55
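Both the trace norm and the singular value soft-thresholding used when it acts as a regularizer are one SVD away; a minimal sketch (my own, not from the slides):

import numpy as np

def trace_norm(W):
    """Sum of the singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum()

def prox_trace_norm(W, thresh):
    """Prox of thresh*||.||_Tr: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - thresh, 0.0)) @ Vt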

Application of trace norm reg.: Recommendation system

Assuming the rank is 1.

          Movie A   Movie B   Movie C   · · ·   Movie X
User 1       4         8         *      · · ·      2
User 2       2         *         2      · · ·      *
User 3       2         4         *      · · ·      *
  ...

(e.g., Srebro et al. (2005), NetFlix Bennett and Lanning (2007))

24 / 55

Application of trace norm reg.: Recommendation system

Assuming the rank is 1.

          Movie A   Movie B   Movie C   · · ·   Movie X
User 1       4         8         4      · · ·      2
User 2       2         4         2      · · ·      1
User 3       2         4         2      · · ·      1
  ...

(e.g., Srebro et al. (2005), NetFlix Bennett and Lanning (2007))

24 / 55

Application of trace norm reg.: Recommendation system

[Figure: the user × movie rating matrix with missing entries]

→ Low rank matrix completion:

Rademacher complexity of low rank matrices: Srebro et al. (2005).
Compressed sensing: Candes and Tao (2009), Candes and Recht (2009).

25 / 55

Example: Reduced rank regression

Reduced rank regression (Anderson, 1951, Burket, 1964, Izenman, 1975)
Multi-task learning (Argyriou et al., 2008)

[Figure: reduced rank regression Y = X W* + noise, with W* a low-rank matrix]

W* is low rank.

26 / 55

(Generalized) Fused Lasso

  ψ(β) = C Σ_{(i,j)∈E} |β_i − β_j|.

(Tibshirani et al. (2005), Jacob et al. (2009))

[Figures: genome data analysis by fused lasso (Tibshirani and Taylor ‘11); TV-denoising (Chambolle ‘04)]

27 / 55

Sparse covariance selection

x_k ∼ N(0, Σ) (i.i.d., Σ ∈ R^{p×p}),   Σ̂ = (1/n) Σ_{k=1}^n x_k x_k^⊤.

  Ŝ = argmin_{S⪰O} { −log(det(S)) + Tr[SΣ̂] + λ Σ_{i,j=1}^p |S_{i,j}| }.

(Meinshausen and Bühlmann, 2006, Yuan and Lin, 2007, Banerjee et al., 2008)

Estimating the inverse S of Σ.
S_{i,j} = 0 ⇔ X_(i) and X_(j) are conditionally independent.
The Gaussian graphical model can be estimated by convex optimization.

28 / 55
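In practice this estimator (the graphical lasso) is available in standard libraries. The sketch below uses scikit-learn's GraphicalLasso on synthetic data; the data, the value of alpha, and the thresholding at the end are placeholders of mine, and older scikit-learn releases expose the same estimator under the name GraphLasso.

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))        # n = 200 samples, p = 20 variables (synthetic)

model = GraphicalLasso(alpha=0.1).fit(X)  # alpha plays the role of lambda above
S_hat = model.precision_                  # sparse estimate of the inverse covariance S
print((np.abs(S_hat) > 1e-8).sum(), "non-zero entries in S_hat")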

Covariance selection on the stock data of 50 randomly selected companies in the NASDAQ list from 4 January 2011 to 31 December 2014.

(Lie Michael, Bachelor thesis)

29 / 55

Other examples

Robust PCA (Candes et al. 2009).

Low rank tensor estimation (Signoretto et al., 2010; Tomioka et al., 2011).

Dictionary learning (Kasiviswanathan et al., 2012; Rakotomamonjy, 2013).

30 / 55

Outline

1 Introduction

2 Short course to convex analysis
  Convexity and related concepts
  Duality
  Smoothness and strong convexity

31 / 55

Regularized empirical risk minimization

Basically, we want to solve:

Empirical risk minimization:

  min_{x∈R^p} (1/n) Σ_{i=1}^n ℓ(z_i, x).

Regularized empirical risk minimization:

  min_{x∈R^p} (1/n) Σ_{i=1}^n ℓ(z_i, x) + ψ(x).

In this lecture, we assume ℓ and ψ are convex
→ convex analysis to exploit the properties of convex functions.

32 / 55

Outline

1 Introduction

2 Short course to convex analysis
  Convexity and related concepts
  Duality
  Smoothness and strong convexity

33 / 55

Convex set

Definition (Convex set)

A convex set is a set that contains the segment connecting any two points in the set:

  x1, x2 ∈ C =⇒ θx1 + (1 − θ)x2 ∈ C   (θ ∈ [0, 1]).

[Figures: a convex set and two non-convex sets]

34 / 55

Epigraph and domain

Let R̄ := R ∪ {∞}.

Definition (Epigraph and domain)

The epigraph of a function f : R^p → R̄ is given by

  epi(f) := {(x, µ) ∈ R^{p+1} : f(x) ≤ µ}.

The domain of a function f : R^p → R̄ is given by

  dom(f) := {x ∈ R^p : f(x) < ∞}.

[Figure: epigraph and domain of a function]

35 / 55

Convex function

Let R̄ := R ∪ {∞}.

Definition (Convex function)

A function f : R^p → R̄ is a convex function if f satisfies

  θf(x) + (1 − θ)f(y) ≥ f(θx + (1 − θ)y)   (∀x, y ∈ R^p, θ ∈ [0, 1]),

where ∞ + ∞ = ∞ and ∞ ≤ ∞.

[Figures: a convex and a non-convex function]

f is convex ⇔ epi(f) is a convex set.

36 / 55

Proper and closed convex function

If the domain of a function f is not empty (dom(f) ≠ ∅), f is called proper.

If the epigraph of a convex function f is a closed set, then f is called closed.
(We are interested only in proper closed functions in this lecture.)

Even if f is closed, its domain is not necessarily closed (even in one dimension).

“f is closed” does not imply “f is continuous.”

A closed convex function is continuous on a segment in its domain.

A closed function is lower semicontinuous.

37 / 55

Convex loss functions (regression)

All commonly used loss functions are (closed) convex. The following are convex w.r.t. u for a fixed label y ∈ R.

Squared loss: ℓ(y, u) = (1/2)(y − u)².

τ-quantile loss: ℓ(y, u) = (1 − τ)max{u − y, 0} + τ max{y − u, 0} for some τ ∈ (0, 1). Used for quantile regression.

ϵ-sensitive loss: ℓ(y, u) = max{|y − u| − ϵ, 0} for some ϵ > 0. Used for support vector regression.

[Figure: squared, τ-quantile, ϵ-sensitive, and Huber losses as functions of f − y]

38 / 55

Convex surrogate loss (classification)

y ∈ {±1}

Logistic loss: ℓ(y, u) = log(1 + exp(−yu)).

Hinge loss: ℓ(y, u) = max{1 − yu, 0}.

Exponential loss: ℓ(y, u) = exp(−yu).

Smoothed hinge loss:

  ℓ(y, u) = 0                 (yu ≥ 1),
            1/2 − yu          (yu < 0),
            (1/2)(1 − yu)²    (otherwise).

[Figure: 0-1, logistic, exponential, hinge, and smoothed hinge losses as functions of yf]

39 / 55
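Written as functions of the margin m = yu, the four surrogates are only a few lines of NumPy (a sketch of mine; the function names are not from the slides):

import numpy as np

def logistic_loss(m):
    return np.log1p(np.exp(-m))

def hinge_loss(m):
    return np.maximum(1.0 - m, 0.0)

def exponential_loss(m):
    return np.exp(-m)

def smoothed_hinge_loss(m):
    # 0 if m >= 1,  1/2 - m if m < 0,  (1/2)(1 - m)^2 otherwise
    return np.where(m >= 1.0, 0.0,
                    np.where(m < 0.0, 0.5 - m, 0.5 * (1.0 - m) ** 2))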

Convex regularization functions

Ridge regularization: R(x) = ∥x∥₂² := Σ_{j=1}^p x_j².

L1 regularization: R(x) = ∥x∥₁ := Σ_{j=1}^p |x_j|.

Trace norm regularization: R(X) = ∥X∥_tr = Σ_{j=1}^{min{q,r}} σ_j(X), where σ_j(X) ≥ 0 is the j-th singular value.

[Figure: unit balls of the bridge (q = 0.5), L1, ridge, and elastic net regularizers]

(1/n) Σ_{i=1}^n (y_i − z_i^⊤x)² + λ∥x∥₁: Lasso

(1/n) Σ_{i=1}^n log(1 + exp(−y_i z_i^⊤x)) + λ∥X∥_tr: low rank matrix recovery

40 / 55

Other definitions of sets

Convex hull: conv(C) is the smallest convex set that contains a set C ⊆ R^p.

Affine set: A set A is an affine set if and only if, for all x, y ∈ A, the line that passes through x and y lies in A: λx + (1 − λ)y ∈ A, ∀λ ∈ R.

Affine hull: The smallest affine set that contains a set C ⊆ R^p.

Relative interior: ri(C). Let A be the affine hull of a convex set C ⊆ R^p. ri(C) is the set of interior points of C with respect to the relative topology induced by the affine hull A.

[Figures: convex hull, affine hull, relative interior]

41 / 55

Continuity of a closed convex function

Theorem

For a (possibly non-convex) function f : R^p → R̄, the following three conditions are equivalent to each other.

1 f is lower semi-continuous.

2 For any convergent sequence {x_n}_{n=1}^∞ ⊆ R^p with x_∞ = lim_n x_n, we have lim inf_n f(x_n) ≥ f(x_∞).

3 f is closed.

Remark: Any convex function f is continuous in ri(dom(f)). The continuity could be broken on the boundary of the domain.

42 / 55

Outline

1 Introduction

2 Short course to convex analysis
  Convexity and related concepts
  Duality
  Smoothness and strong convexity

43 / 55

Subgradient

We want to deal with non-differentiable functions such as the L1 regularization. To do so, we need to define something like a gradient.

Definition (Subdifferential, subgradient)

For a proper convex function f : R^p → R̄, the subdifferential of f at x ∈ dom(f) is defined by

  ∂f(x) := {g ∈ R^p | ⟨x′ − x, g⟩ + f(x) ≤ f(x′) (∀x′ ∈ R^p)}.

An element of the subdifferential is called a subgradient.

[Figure: subgradients as slopes of lines supporting f at x]

44 / 55
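A standard worked example (my addition, not on the slide), obtained directly from the definition:

\[
\partial |x| =
\begin{cases}
\{\operatorname{sign}(x)\} & (x \neq 0),\\
[-1,\,1] & (x = 0),
\end{cases}
\]

since g is a subgradient of |·| at 0 iff g x′ ≤ |x′| for all x′, i.e. |g| ≤ 1. Applied coordinatewise, this gives the subdifferential of the L1 regularizer ∥x∥₁.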

Properties of subgradient

A subgradient does not necessarily exist (∂f(x) could be empty):
f(x) = x log(x) (x ≥ 0) is proper convex but not subdifferentiable at x = 0.

A subgradient always exists on ri(dom(f)).

If f is differentiable at x, its gradient is the unique element of the subdifferential:

  ∂f(x) = {∇f(x)}.

If ri(dom(f)) ∩ ri(dom(h)) ≠ ∅, then

  ∂(f + h)(x) = ∂f(x) + ∂h(x) = {g + g′ | g ∈ ∂f(x), g′ ∈ ∂h(x)}   (∀x ∈ dom(f) ∩ dom(h)).

For all g ∈ ∂f(x) and all g′ ∈ ∂f(x′) (x, x′ ∈ dom(f)),

  ⟨g − g′, x − x′⟩ ≥ 0.

45 / 55


Legendre transform

It gives another representation of a function on the dual space (the space of gradients).

Definition (Legendre transform)

Let f be a (possibly non-convex) function f : R^p → R̄ s.t. dom(f) ≠ ∅. Its convex conjugate is given by

  f*(y) := sup_{x∈R^p} {⟨x, y⟩ − f(x)}.

The map from f to f* is called the Legendre transform.

[Figure: f*(y) is the largest gap between the line through the origin with gradient y and f(x), attained at x*]

46 / 55

Examples

f(x) and its conjugate f*(y):

Squared loss: f(x) = (1/2)x²,  f*(y) = (1/2)y².

Hinge loss: f(x) = max{1 − x, 0},  f*(y) = y (−1 ≤ y ≤ 0), ∞ (otherwise).

Logistic loss: f(x) = log(1 + exp(−x)),  f*(y) = (−y)log(−y) + (1 + y)log(1 + y) (−1 ≤ y ≤ 0), ∞ (otherwise).

L1 regularization: f(x) = ∥x∥₁,  f*(y) = 0 (max_j |y_j| ≤ 1), ∞ (otherwise).

Lp regularization (p > 1): f(x) = Σ_{j=1}^d |x_j|^p,  f*(y) = Σ_{j=1}^d (p − 1) p^{−p/(p−1)} |y_j|^{p/(p−1)}.

[Figures: the logistic loss and its dual; the L1-norm and its dual]

47 / 55
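As a sanity check of the first row (my addition), the conjugate of the squared loss follows directly from the definition:

\[
f(x)=\tfrac12 x^2:\qquad
f^*(y)=\sup_{x\in\mathbb{R}}\{xy-\tfrac12 x^2\}=\tfrac12 y^2,
\]

with the supremum attained at x = y (set the derivative y − x to zero). The other rows follow the same recipe; for the hinge and logistic losses the constraint −1 ≤ y ≤ 0 appears because the supremum is +∞ outside that range.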

Properties of Legendre transform

f* is a convex function even if f is not.

f** is the closure of the convex hull of f:

  f** = cl(conv(f)).

Corollary

The Legendre transform is a bijection from the set of proper closed convex functions onto the set of proper closed convex functions defined on the dual space.

  f (proper closed convex) ⇔ f* (proper closed convex)

[Figure: a non-convex f(x) and its closed convex hull cl(conv(f)) = f**]

48 / 55

Connection to subgradient

Lemma

  y ∈ ∂f(x) ⇔ f(x) + f*(y) = ⟨x, y⟩ ⇔ x ∈ ∂f*(y).

∵ y ∈ ∂f(x) ⇒ x = argmax_{x′∈R^p} {⟨x′, y⟩ − f(x′)}
  (take the “derivative” of ⟨x′, y⟩ − f(x′))
  ⇒ f*(y) = ⟨x, y⟩ − f(x).

Remark: By definition, we always have

  f(x) + f*(y) ≥ ⟨x, y⟩

→ the Young-Fenchel inequality.

49 / 55

⋆ Fenchel’s duality theorem

Theorem (Fenchel’s duality theorem)

Let f : R^p → R̄, g : R^q → R̄ be proper closed convex, and A ∈ R^{q×p}. Suppose that either condition (a) or (b) below is satisfied. Then it holds that

  inf_{x∈R^p} {f(x) + g(Ax)} = sup_{y∈R^q} {−f*(A^⊤y) − g*(−y)}.

(a) ∃x ∈ R^p s.t. x ∈ ri(dom(f)) and Ax ∈ ri(dom(g)).
(b) ∃y ∈ R^q s.t. A^⊤y ∈ ri(dom(f*)) and −y ∈ ri(dom(g*)).

If (a) is satisfied, there exists y* ∈ R^q that attains the sup of the RHS.
If (b) is satisfied, there exists x* ∈ R^p that attains the inf of the LHS.
Under (a) and (b), x*, y* are the optimal solutions of each side iff

  A^⊤y* ∈ ∂f(x*),  Ax* ∈ ∂g*(−y*)

→ the Karush-Kuhn-Tucker condition.

50 / 55

Equivalence to the separation theorem

[Figure: a convex function and a concave function separated by an affine function (hyperplane)]

51 / 55

Applying Fenchel’s duality theorem to RERM

RERM (Regularized Empirical Risk Minimization):
Let ℓ_i(z_i^⊤x) = ℓ(y_i, z_i^⊤x), where (z_i, y_i) is the input-output pair of the i-th observation.

(Primal)   inf_{x∈R^p} { Σ_{i=1}^n ℓ_i(z_i^⊤x) + ψ(x) },   where f(Zx) := Σ_{i=1}^n ℓ_i(z_i^⊤x).

[Fenchel’s duality theorem]

  inf_{x∈R^p} {f(Zx) + ψ(x)} = −inf_{y∈R^n} {f*(y) + ψ*(−Z^⊤y)}

(Dual)   sup_{y∈R^n} { −Σ_{i=1}^n ℓ_i*(y_i) − ψ*(−Z^⊤y) }

This fact will be used to derive the dual coordinate descent algorithm.

52 / 55
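A concrete instance (my own worked example, not from the slides): take the ridge regularizer ψ(x) = (λ/2)∥x∥², whose conjugate is ψ*(u) = ∥u∥²/(2λ). The dual then reads

\[
\sup_{y\in\mathbb{R}^n}\Big\{-\sum_{i=1}^n \ell_i^*(y_i)-\frac{1}{2\lambda}\|Z^\top y\|^2\Big\},
\]

which (up to scaling conventions) is the objective that stochastic dual coordinate ascent maximizes one coordinate at a time: each dual variable y_i is tied to the single sample z_i, so a coordinate update touches only one data point.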

Outline

1 Introduction

2 Short course to convex analysis
  Convexity and related concepts
  Duality
  Smoothness and strong convexity

53 / 55

Smoothness and strong convexity

Definition

Smoothness: the gradient is Lipschitz continuous:

  ∥∇f(x) − ∇f(x′)∥ ≤ L∥x − x′∥.

Strong convexity: ∀θ ∈ (0, 1), ∀x, y ∈ dom(f),

  (µ/2)θ(1 − θ)∥x − y∥² + f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y).

[Figures: smooth but not strongly convex; smooth and strongly convex; strongly convex but not smooth]

54 / 55
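Two quick examples (my additions) that can be checked directly from these definitions:

\[
f(x)=\tfrac12\|x\|^2:\qquad \|\nabla f(x)-\nabla f(x')\| = \|x-x'\|,
\]

so f is 1-smooth, and the strong convexity inequality holds with equality for µ = 1, so it is also 1-strongly convex. In contrast, the logistic loss f(u) = log(1 + e^{−u}) has

\[
f''(u)=\sigma(u)\,\sigma(-u)\le \tfrac14,\qquad \sigma(u)=\frac{1}{1+e^{-u}},
\]

so it is (1/4)-smooth, but f''(u) → 0 as |u| → ∞, so it is not strongly convex for any µ > 0.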

Duality between smoothness and strong convexity

Smoothness and strong convexity are in a relation of duality.

Theorem

Let f : R^p → R̄ be proper closed convex.

  f is L-smooth ⇐⇒ f* is 1/L-strongly convex.

[Figure: the logistic loss (smooth but not strongly convex) and its dual function (strongly convex but not smooth; gradient → ∞ at the boundary of its domain)]

55 / 55
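A one-dimensional instance of the theorem (my addition):

\[
f(x)=\frac{L}{2}x^2 \;\Longrightarrow\; f^*(y)=\sup_x\Big\{xy-\frac{L}{2}x^2\Big\}=\frac{y^2}{2L},
\]

so an L-smooth quadratic has a (1/L)-strongly convex conjugate. The figure's logistic example sits at the boundary of this trade-off: f is smooth but not strongly convex, and accordingly f* is strongly convex but not smooth (its gradient blows up at the boundary of its domain).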

References

T. Anderson. Estimating linear restrictions on regression coefficients for multivariate normal distributions. Annals of Mathematical Statistics, 22:327–351, 1951.

A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A spectral regularization framework for multi-task structure learning. In Advances in Neural Information Processing Systems 20, pages 25–32. MIT Press, Cambridge, MA, 2008.

O. Banerjee, L. E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.

J. Bennett and S. Lanning. The Netflix prize. In Proceedings of KDD Cup and Workshop 2007, 2007.

L. Bottou. Online algorithms and stochastic approximations. 1998. URL http://leon.bottou.org/papers/bottou-98x. Revised, Oct 2012.

L. Bottou and Y. LeCun. Large scale online learning. In Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004. URL http://leon.bottou.org/papers/bottou-lecun-2004.

G. R. Burket. A study of reduced-rank models for multiple prediction, volume 12 of Psychometric Monographs. Psychometric Society, 1964.

E. Candes and T. Tao. The power of convex relaxations: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56:2053–2080, 2009.

E. J. Candes and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

A. J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, pages 248–264, 1975.

L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th International Conference on Machine Learning, 2009.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf.

N. Le Roux, M. Schmidt, and F. R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2663–2671. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4633-a-stochastic-gradient-method-with-an-exponential-convergence-_rate-for-finite-training-sets.pdf.

K. Lounici, A. Tsybakov, M. Pontil, and S. van de Geer. Taking advantage of sparsity in multi-task learning. 2009.

N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.

A. Nemirovskii and D. Yudin. On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions. Soviet Mathematics Doklady, 19(2):576–601, 1978.

A. Nemirovsky and D. Yudin. Problem complexity and method efficiency in optimization. John Wiley, New York, 1983.

B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.

H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

F. Rosenblatt. The perceptron: A perceiving and recognizing automaton. Technical Report 85-460-1, Project PARA, Cornell Aeronautical Lab., 1957.

D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.

S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.

Y. Singer and J. C. Duchi. Efficient learning using forward-backward splitting. In Advances in Neural Information Processing Systems, pages 495–503, 2009.

N. Srebro, N. Alon, and T. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances in Neural Information Processing Systems (NIPS) 17, 2005.

R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: B, 67(1):91–108, 2005.

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23. 2009.

D. Yogatama and N. A. Smith. Making the most of bag of words: Sentence regularization with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 656–664, 2014.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.

55 / 55