
Page 1: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Higher Order Fused Regularization for Supervised Learning with Grouped Parameters Koh Takeuchi1, Yoshinobu Kawahara2, Tomoharu Iwata1 1NTT Communication Science Laboratories, Kyoto, Japan 2Osaka University, Osaka, Japan

Page 2: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


This talk: Contributions

- A new regularizer that incorporates smoothness into overlapping groups of parameters for supervised learning.

- An efficient network flow algorithm for minimizing our regularizer.

- Empirical improvements in predictive performance over existing regularizers.

Page 3: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Task: Regularized Supervised Learning

The goal of regularized supervised learning is

β* = argmin_β L(β),

where L(β) is the sum of a convex empirical loss, whose gradient is Lipschitz continuous, and a non-smooth regularizer. The target values, observed vectors, and parameters are y_n ∈ Y, x_n ∈ R^d (n = 1, …, N), and β ∈ R^d.
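To make the setup concrete, the following is a minimal sketch (not from the slides) of such an objective with a squared loss in NumPy; the names `objective`, `penalty`, and `lam` are illustrative assumptions, and `penalty` stands in for any of the regularizers discussed below.

```python
import numpy as np

def objective(beta, X, y, penalty, lam=1.0):
    """L(beta): smooth empirical loss plus a non-smooth regularizer.

    X: (N, d) observed vectors, y: (N,) targets, beta: (d,) parameters.
    `penalty` is a callable Omega(beta); `lam` trades loss against penalty.
    """
    loss = 0.5 * np.sum((X @ beta - y) ** 2) / len(y)  # convex, Lipschitz-continuous gradient
    return loss + lam * penalty(beta)                  # non-smooth part
```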

Page 4: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Problem: Overfitting in supervised learning

When the number of samples (N) is small, parameter estimates result in poor predictive performance. Regularizers prevent overfitting by incorporating prior knowledge to impose sparseness or smoothness on the estimates.

[Figure: training and test error versus # of iterations, with a sparse or smooth regularizer.]

Page 5: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Existing sparsity-inducing regularizers (with l1-norm)

Lasso (Tibshirani+, 1996)

Ω(β) = Σ_{i=1}^d ‖β_i‖_1

+ Ineffective parameters are set to zero
− Cannot utilize structures

Group Lasso (Yuan+, 2006)

Ω(β) = Σ_{k=1}^K ‖β_{(g_k)}‖_2, where β_{(g_k)} is the k-th group of parameters

+ Ineffective groups of parameters are set to zero
− Cannot impose smoothness
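A small sketch of these two penalties in NumPy (the list-of-index-arrays input `groups` is a hypothetical representation, not from the slides):

```python
import numpy as np

def lasso_penalty(beta):
    """Omega(beta) = sum_i |beta_i| (the l1 norm)."""
    return np.sum(np.abs(beta))

def group_lasso_penalty(beta, groups):
    """Omega(beta) = sum_k ||beta_{(g_k)}||_2 for index arrays `groups`."""
    return sum(np.linalg.norm(beta[np.asarray(g)]) for g in groups)

# Example: group_lasso_penalty(np.array([1.0, -2.0, 0.0, 3.0]), [[0, 1], [2, 3]])
```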

Page 6: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Existing smoothness-inducing regularizers (with l1-norm)

Generalized Fused Lasso (Tibshirani+, 2005)

Ω(β) = Σ_{(i,j)∈E} w_{i,j} ‖β_i − β_j‖_1

+ Pairs of parameters take the same value
− Cannot utilize groups

[Figure: a small graph over β1, β2, β3 with edges E = {(1,2), (1,3)} and weights w_{1,2}, w_{1,3}.]

Existing regularizers cannot utilize group structures to impose smoothness on parameters.
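A corresponding sketch of the pairwise penalty over an edge list (the `edges`/`weights` representation is an assumption for illustration):

```python
import numpy as np

def generalized_fused_lasso_penalty(beta, edges, weights):
    """Omega(beta) = sum over (i, j) in E of w_ij * |beta_i - beta_j|."""
    return sum(w * abs(beta[i] - beta[j]) for (i, j), w in zip(edges, weights))

# The toy graph from the slide, using 0-based indices for beta_1, beta_2, beta_3:
beta = np.array([0.5, 0.4, -0.3])
print(generalized_fused_lasso_penalty(beta, [(0, 1), (0, 2)], [1.0, 1.0]))
```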

Page 7: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Lasso and Laplace distribution

The penalty term of the Lasso corresponds to a Laplace prior (Park, T. and Casella, G., "The Bayesian Lasso," Journal of the American Statistical Association, 103(482): 681–686, 2008).

p(β | σ²) = Π_{i=1}^d (1 / (2√σ²)) exp( −‖β_i‖_1 / √σ² )
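The correspondence can be checked directly: up to an additive constant, the negative log of this prior is the l1 penalty scaled by 1/√σ². A minimal sketch, assuming the density above:

```python
import numpy as np

def laplace_log_prior(beta, sigma2):
    """log p(beta | sigma^2) for the i.i.d. Laplace prior above."""
    scale = np.sqrt(sigma2)
    return -np.sum(np.abs(beta)) / scale - len(beta) * np.log(2.0 * scale)

# Maximizing this log prior is minimizing the l1 penalty (up to scale and a constant),
# which is why MAP estimation under this prior matches the Lasso penalty term.
```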

Page 8: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Idea: Group structures in parameters

・Words expressing the same meaning (e.g., Tesco, Sainsbury's)
・Music pieces in the same genre
・Books released in the same year
・Parameters of linear or quadratic terms in polynomial regression:
  Σ_i β_i x_i,   Σ_{i,j} β_{i,j} x_i x_j

Page 9: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Idea: Overlapping groups of parameters

Suppose there are five parameters corresponding to five objects.

[Figure: the five objects, covered by an overlapping "Fruit" group and "Orange Color" group.]

A group of parameters would provide similar functionality.

Page 10: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Idea: Assumption on overlapping groups of parameters

[Figure: parameters β1, β2, β3, β4, β5 covered by two overlapping groups, GROUP 1 and GROUP 2.]

A group of parameters would take similar values (not equal to 0) in a supervised learning problem.

Page 11: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Idea: Make the regularizer robust to incomplete groups

Groups do not always match the problem. Utilize only the effective part of a group.

[Figure: β1, β2, β3 in Group 1, with the effective part of the group for the problem highlighted.]

Page 12: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Recap:

- Given overlapping groups of parameters
- Groups may not perfectly fit a problem
- Demands for our regularizer:
  ・Utilize overlapping groups to impose smoothness on parameter estimations
  ・Utilize the effective part of a group that matches the problem (robustness)

Page 13: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Robust Pn Potts model (Kohli+, 2005)

A cut function for image segmentation that enforces label consistency on overlapping groups.
+ Controls consistency within a group (robust)
− Discrete function

f_ho(S) = Σ_{k=1}^K min{ θ^k_0 + c^k_0(V ∖ S),  θ^k_1 + c^k_1(S),  θ^k_max },

where c^k_0 and c^k_1 are weights for the parameters in the k-th group and θ^k_max is a hyperparameter for controlling consistency.
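A direct transcription of this cut function as a set-function evaluator; the dictionary-based group representation (`idx`, `c0`, `c1`, `theta0`, `theta1`, `theta_max`, with `idx`, `c0`, `c1` as aligned NumPy arrays) is an assumption made for illustration:

```python
import numpy as np

def robust_pn_potts(S, groups):
    """f_ho(S) = sum_k min(theta0_k + c0_k(V \\ S), theta1_k + c1_k(S), theta_max_k)."""
    total = 0.0
    for g in groups:
        in_S = np.array([i in S for i in g["idx"]])
        c1_S = float(np.sum(g["c1"][in_S]))      # weight of the group's elements inside S
        c0_out = float(np.sum(g["c0"][~in_S]))   # weight of the group's elements outside S
        total += min(g["theta0"] + c0_out, g["theta1"] + c1_S, g["theta_max"])
    return total
```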

Page 14: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


An example of Robust Pn Potts model (Kohli+, 2005)

Potential with K = 1.

[Figure: penalty (y-axis) versus pixels (x-axis, indices 1, …, s, …, t, …, d); the penalty follows θ_1 + c_1(U_i) on one side and θ_0 + c_0(V ∖ U_i) on the other, truncated at θ_max, with θ_1 = θ_0 in this example.]

Page 15: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Continuous relaxation via the Lovász extension

- A continuous relaxation can be obtained via the Lovász extension (Lovász, 1983).
- The Robust Pn Potts model is a submodular function (Edmonds, 1970).
- A Lovász extension is convex if and only if its set function is submodular.

Submodular ⇒ Convex

Page 16: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Higher Order Fused regularizer [Proposition 1]

✓ Parameters are sorted as β_{j_1} ≥ β_{j_2} ≥ ⋯ ≥ β_{j_d}.
✓ This regularizer consists of three parts.
✓ Check the example of estimated parameters with K = 1 on the next slide.

Ω_ho(β) = Σ_{k=1}^K ( Σ_{i ∈ {j_1,…,j_{s−1}}} (β_i − β_{j_s}) c^k_{1,i} + β_{j_s}(θ^k_max − θ^k_1)
          + β_{j_t}(θ^k_0 − θ^k_max) + Σ_{i ∈ {j_{t+1},…,j_d}} (β_{j_t} − β_i) c^k_{0,i} ),

where β_{j_1} ≥ β_{j_2} ≥ ⋯ ≥ β_{j_d} are the sorted parameters of the k-th group.
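A sketch of evaluating one group's term of Ω_ho, taking the breakpoint positions s and t as given (the slides note they are set by the hyperparameters; that selection step is omitted here). The weight arrays `c1` and `c0` are assumed to be aligned with the descending sort order:

```python
import numpy as np

def hof_group_penalty(beta_g, c1, c0, theta0, theta1, theta_max, s, t):
    """One group's term of Omega_ho(beta), following the formula above.

    beta_g: the group's parameters; s, t: 1-based breakpoint positions (s <= t).
    Omega_ho(beta) would sum this quantity over the K groups.
    """
    b = np.sort(beta_g)[::-1]                            # beta_{j_1} >= ... >= beta_{j_d}
    top = np.sum((b[: s - 1] - b[s - 1]) * c1[: s - 1])  # i in {j_1, ..., j_{s-1}}
    mid = b[s - 1] * (theta_max - theta1) + b[t - 1] * (theta0 - theta_max)
    bot = np.sum((b[t - 1] - b[t:]) * c0[t:])            # i in {j_{t+1}, ..., j_d}
    return float(top + mid + bot)
```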

Page 17: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Higher Order Fused regularizer

This property allows parameters whose values are larger or smaller than the thresholds to take different values.

[Figure: sorted parameter values β_{j_i} plotted against index i = 1, …, s, …, t, …, d (values roughly −1.0 to 3.0), with the thresholds β_{j_s} and β_{j_t} marked. Annotations: if i < s then β_i = β_{j_s}; if i > t then β_i = β_{j_t}; parameters with i between s and t take the same value; s and t are set by the hyperparameters.]

Page 18: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Optimization: Minimization of the HOF regularizer

Ω_ho(β) is a non-smooth convex function, and thus L(β) is difficult to minimize directly. We utilize the proximity operator (Moreau, 1962) of Ω_ho(β):

prox_{γΩ_ho}(β̂) = argmin_β Ω_ho(β) + (1/(2γ)) ‖β − β̂‖²_2

Remark: with a sufficiently small γ (e.g., γ < 0.001), the minimizer obtained through the proximity operator matches that of the original function.
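For intuition about the operator itself, the proximity operator of the plain l1 norm has a well-known closed form (soft-thresholding). This is only an illustration of the definition above, not the prox of Ω_ho, which is computed with the flow algorithm on the next slide:

```python
import numpy as np

def prox_l1(beta_hat, gamma):
    """prox of gamma * ||.||_1: the closed-form minimizer of
    ||beta||_1 + (1 / (2 * gamma)) * ||beta - beta_hat||_2^2 (soft-thresholding)."""
    return np.sign(beta_hat) * np.maximum(np.abs(beta_hat) - gamma, 0.0)
```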

Page 19: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Solve the proximity operator via a maximum network flow algorithm

The proximity operator can be solved by the minimum-norm-point algorithm (Fujishige, 06).
− The fastest such algorithm runs in O(d^5·EO + d^6).
We propose a maximum flow algorithm that runs in O(d|E| log(d²/|E|)) [Theorem 1].

Page 20: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Minimize the loss function L(β)

The forward-backward splitting algorithm attains the optimum of L(β) with an O(1/n²) convergence rate (Nesterov's acceleration).
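A minimal sketch of accelerated forward-backward splitting (FISTA-style), assuming a gradient oracle `grad` for the smooth loss, a step size 1/Lip from its Lipschitz constant, and a proximity operator `prox(v, gamma)` for the regularizer (for the HOF regularizer this would be the flow-based prox from the previous slide):

```python
import numpy as np

def forward_backward(grad, prox, beta0, Lip, n_iter=200):
    """Accelerated proximal gradient (Nesterov momentum), O(1/n^2) on L(beta)."""
    beta = np.array(beta0, dtype=float)
    z = beta.copy()                       # extrapolated point
    tau = 1.0
    for _ in range(n_iter):
        beta_next = prox(z - grad(z) / Lip, 1.0 / Lip)   # forward (gradient) then backward (prox)
        tau_next = (1.0 + np.sqrt(1.0 + 4.0 * tau ** 2)) / 2.0
        z = beta_next + ((tau - 1.0) / tau_next) * (beta_next - beta)  # momentum
        beta, tau = beta_next, tau_next
    return beta

# Example with the l1 prox from the previous slide:
#   forward_backward(lambda b: X.T @ (X @ b - y) / len(y), prox_l1, np.zeros(d), Lip)
```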

Page 21: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Experiments: Linear regression problem

- Compare predictive performance by Root Mean Squared Error (RMSE) on the test data set.

- Linear regression experiments with
  ・Synthetic data sets (d = 100): non-overlapping and overlapping groups
  ・Real-world data sets

RMSE = √( (1/N) Σ_{n=1}^N ‖y_n − ŷ_n‖²_2 )
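The metric in code, assuming arrays of true and predicted targets:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error over the N test samples."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))
```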

Page 22: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Synthetic data experiment: Linear regression with five groups

a)  non-overlapping groups setting (d=100)

b)  overlapping groups setting (d=100)

Page 23: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Estimated parameters with non-overlapping groups (d = 100, N = 30)

[Figure: estimated parameters over indices 0–100 for Ordinary Least Squares, Lasso, Sparse Group Lasso, Generalized Fused Lasso, and Proposed.]
・Blue lines are true parameters ・Circles are estimated parameters

Page 24: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Estimated parameters with overlapping groups (d=100, N=30)

[Figure: estimated parameters over indices 0–100 for Ordinary Least Squares, Lasso, Sparse Group Lasso, Generalized Fused Lasso, and Proposed, with overlapping groups.]
・Blue lines and circles are true and estimated parameters, respectively.

Page 25: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Real-world data experiment: MovieLens100k

Average rating prediction for each movie from the users who watched it (d = 942). 31 user groups (8 age, 2 gender, and 21 occupation groups).


Page 26: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Real-world data experiment: Yelp

Rating value prediction from review text (bag-of-words) with d = 1,000. We employed 52 word groups (50 semantic groups plus a positive group and a negative group).

Page 27: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Estimated parameters of four groups

[Figure: blue = positive, red = negative, size = value.]
Robustness enables the regularizer to utilize effective parts of groups to impose smoothness.

Page 28: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Conclusion

- An l1-norm-based regularizer that incorporates smoothness into overlapping groups of parameters.

- An efficient flow algorithm for calculating the exact minimizer of the regularizer.

- Empirical improvements in predictive performance over existing regularizers in supervised learning.

Page 29: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Future work:

- Non-linear regression and classification with the HOF regularizer

- Apply the HOF regularizer to matrix/tensor completion problems
  ・Recommendation (Netflix)
  ・Spatio-temporal data prediction

- Extend the HOF regularizer to
  ・Hierarchical groups
  ・Hypergraphs

Page 30: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Higher Order Fused Regularization for Supervised Learning with Grouped Parameters Koh Takeuchi1, Yoshinobu Kawahara2, Tomoharu Iwata1 1NTT Communication Science Laboratories, Kyoto, Japan 2Osaka University, Osaka, Japan

Page 31: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Proof of the Lovász extension

Ω_ho(β) coincides with the Lovász extension of the robust higher-order energy function (proof omitted).
→ The proximal step can be converted into a submodular function minimization problem (proof omitted).
The robust higher-order energy function is a pixel-clustering technique proposed in Computer Vision:

f_ho(S) = Σ_{k=1}^K min{ θ^k_0 + c^k_0(V ∖ S),  θ^k_1 + c^k_1(S),  θ^k_max }

[Figure: the same penalty-versus-pixels plot as on page 14, with segments θ_1 + c_1(U_i), θ_max, and θ_0 + c_0(V ∖ U_i) over indices 1, …, s, …, t, …, d.]

Page 32: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Sort the parameters within a group in descending order and evaluate how closely their values agree.

1st term: parameters larger than β_{j_s} are fused to β_{j_s}:

Σ_{i ∈ {j_1,…,j_{s−1}}} (β_i − β_{j_s}) c^k_{1,i}

2nd and 3rd terms: values between β_{j_s} and β_{j_t} are fused:

(β_{j_s} − β_{j_t}) θ^k_max − β_{j_s} θ^k_1 + β_{j_t} θ^k_0

4th term: parameters smaller than β_{j_t} are fused to β_{j_t}.

θ^k_0, θ^k_1, θ^k_max are hyperparameters.

Page 33: Higher Order Fused Regularization for Supervised Learning with Grouped Parameters


Ω_ho(β) coincides with the Lovász extension of the Robust Higher Order Potential.

The Robust Higher Order Potential is a pixel-clustering technique from Computer Vision:

f_ho(S) = Σ_{k=1}^K min{ θ^k_0 + c^k_0(V ∖ S),  θ^k_1 + c^k_1(S),  θ^k_max }

cf. the Lovász extension: for β_{j_1} ≥ β_{j_2} ≥ ⋯ ≥ β_{j_d},

f̂(β) = Σ_{i=1}^d β_{j_i} ( f({j_1, …, j_i}) − f({j_1, …, j_{i−1}}) )
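The Lovász extension can be evaluated directly from any set-function oracle. A minimal sketch, assuming `f` accepts a Python set and f(∅) = 0:

```python
import numpy as np

def lovasz_extension(f, beta):
    """hat{f}(beta) = sum_i beta_{j_i} * (f({j_1..j_i}) - f({j_1..j_{i-1}})),
    where j_1..j_d sorts beta in descending order."""
    beta = np.asarray(beta, dtype=float)
    order = np.argsort(-beta)
    value, prev, chain = 0.0, f(set()), set()
    for j in order:
        chain = chain | {int(j)}          # grow the chain of the largest entries
        cur = f(chain)
        value += beta[j] * (cur - prev)
        prev = cur
    return value

# With f(S) = robust_pn_potts(S, groups) from the earlier sketch, this evaluates
# the continuous relaxation that the slides identify with Omega_ho(beta).
```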