A unified perspective on convex structured sparsity
Guillaume Obozinski
Laboratoire d’Informatique Gaspard Monge
Ecole des Ponts ParisTech
Joint work with Francis Bach
Conference Mascot-Num
École Centrale de Nantes, 21 March 2018
Guillaume Obozinski Unified perspective on convex structured sparsity 1/39
Structured Sparsity
The support is not only sparse; in addition, we have prior information about its structure.
Examples
The variables should be selected in groups.
The variables lie in a hierarchy.
The variables lie on a graph or network, and the support should be localized or densely connected on the graph.
Applications: Difficult inverse problem in Brain Imaging
[Figure: brain activation maps (slices y = −84, x = 17, z = −13), color scale ±5.00e−02, Scale 6, Fold 9]
Jenatton et al. (2011b)
Convex relaxation for classical sparsity
Empirical risk: for w ∈ R^d,

    L(w) = (1/2n) ∑_{i=1}^n (y_i − x_i^T w)^2

Support of the model:

    Supp(w) = { i | w_i ≠ 0 },  with |Supp(w)| = ∑_{i=1}^d 1_{w_i ≠ 0}

Penalization for variable selection:

    min_{w ∈ R^d} L(w) + λ |Supp(w)|

Lasso:

    min_{w ∈ R^d} L(w) + λ ‖w‖_1
Example of a sparse vector: w = (0, 0, 1, −1, 0, 1, −1), with |Supp(w)| = 4.
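The ℓ_1-penalized problem is typically solved with proximal-gradient methods, whose key ingredient is the proximal operator of λ‖·‖_1, i.e. coordinate-wise soft-thresholding. A minimal NumPy sketch (an illustration, not code from the talk):

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of lam * ||.||_1: shrink each coordinate toward 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([3.0, -0.5, 2.0])
print(soft_threshold(w, 1.0))  # coordinates below lam in magnitude are zeroed
```

Soft-thresholding is what makes the relaxed problem produce exact zeros, i.e. a sparse support.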
Formulation with combinatorial functions
Let V = {1, …, d}.
Let L be some empirical risk, such as L(w) = (1/2n) ∑_{i=1}^n (y_i − x_i^T w)^2.
Given a set function F : 2^V → R_+, consider

    min_{w ∈ R^d} L(w) + F(Supp(w))

Examples of combinatorial functions
Use recursivity or counts of structures (e.g. trees) with dynamic programming
Block-coding (Huang et al., 2011):

    G(A) = min_{B_1, …, B_k} F(B_1) + … + F(B_k)  s.t.  B_1 ∪ … ∪ B_k ⊃ A
Submodular functions
Block-coding (Huang, Zhang and Metaxas (2009))
[Figure: a set A covered by blocks B_i]

F_+ : 2^V → R_+ a positive set function.

    F_∪(A) = min_S ∑_{B ∈ S} F_+(B)  s.t.  A ⊂ ⋃_{B ∈ S} B.

→ a minimal weighted set cover problem.
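F_∪ can be evaluated exactly on toy examples by enumerating subfamilies of G (exponential in the number of groups, but enough to see the definition at work). A brute-force sketch, purely for illustration:

```python
from itertools import combinations

def min_cover(A, groups, weights):
    """F_union(A): minimal total weight of a subfamily of `groups`
    whose union contains A. Exhaustive search, for toy examples only."""
    A = frozenset(A)
    best = float("inf")
    idx = range(len(groups))
    for r in range(len(groups) + 1):
        for S in combinations(idx, r):
            covered = set().union(*[groups[i] for i in S]) if S else set()
            if A <= covered:
                best = min(best, sum(weights[i] for i in S))
    return best

groups = [{1, 2}, {2, 3}]
weights = [1.0, 1.0]
print(min_cover({1, 3}, groups, weights))  # needs both groups -> 2.0
print(min_cover({2}, groups, weights))     # one group suffices -> 1.0
```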
A relaxation for F ...?
How to solve

    min_{w ∈ R^d} L(w) + F(Supp(w))  ?

→ Greedy algorithms
→ Non-convex methods
→ Relaxation

    |A|                        F(A)
    L(w) + λ |Supp(w)|         L(w) + λ F(Supp(w))
          ↓                          ↓ ?
    L(w) + λ ‖w‖_1             L(w) + λ ...?...
Penalizing and regularizing...
Given a function F : 2^V → R_+, consider for ν, µ > 0 the combined penalty:

    pen(w) = µ F(Supp(w)) + ν ‖w‖_p^p.
Motivations
Compromise between variable selection and smooth regularization
Required for functions F allowing large supports
Interpretable as a description length for the parameters w.
A convex and homogeneous relaxation
Looking for a convex relaxation of pen(w).
Require as well that it is positively homogeneous → scale invariance.
Definition (Homogeneous extension of a function g)
    g_h : x ↦ inf_{λ>0} (1/λ) g(λx).
Proposition
The tightest convex positively homogeneous lower bound of a function g is the convex envelope of g_h.
Leads us to consider:
    pen_h(w) = inf_{λ>0} (1/λ) ( µ F(Supp(λw)) + ν ‖λw‖_p^p ) ∝ Θ(w) := ‖w‖_p F(Supp(w))^{1/q},

with 1/p + 1/q = 1.
Envelope of the homogeneous penalty Θ
Consider Ω_p with dual norm

    Ω*_p(s) = max_{A ⊂ V, A ≠ ∅} ‖s_A‖_q / F(A)^{1/q}.
Proposition
The norm Ω_p is the convex envelope (tightest convex lower bound) of the function w ↦ ‖w‖_p F(Supp(w))^{1/q}.
Proof.
Denote Θ(w) = ‖w‖_p F(Supp(w))^{1/q}:

    Θ*(s) = max_{w ∈ R^d} w^T s − ‖w‖_p F(Supp(w))^{1/q}
          = max_{A ⊂ V} max_{w_A ∈ R^A} w_A^T s_A − ‖w_A‖_p F(A)^{1/q}
          = max_{A ⊂ V} ι_{‖s_A‖_q ≤ F(A)^{1/q}} = ι_{Ω*_p(s) ≤ 1}
Graphs of the different penalties
[Plots of F(Supp(w)) and of pen(w) = µ F(Supp(w)) + ν ‖w‖_2^2]
Graphs of the different penalties
[Plots of Θ(w) = √(F(Supp(w))) ‖w‖_2 and of Ω_F(w)]
A large latent group Lasso (Jacob et al., 2009)
    V = { v = (v^A)_{A ⊂ V} ∈ (R^V)^{2^V}  s.t.  Supp(v^A) ⊂ A }

    Ω_p(w) = min_{v ∈ V} ∑_{A ⊂ V} F(A)^{1/q} ‖v^A‖_p  s.t.  w = ∑_{A ⊂ V} v^A,
[Figure: decomposition w = v^{1} + v^{2} + v^{1,2} + v^{1,2,3,4} + … into latent components supported on the groups]
Some simple examples
    F(A)                                         Ω_p(w)
    |A|                                          ‖w‖_1
    1_{A ≠ ∅}                                    ‖w‖_p
    ∑_{B ∈ G} 1_{A∩B ≠ ∅}  (G a partition)       ∑_{B ∈ G} ‖w_B‖_p
    ∑_{B ∈ G} 1_{A∩B ≠ ∅}  (G not a partition)   new: Overlap count Lasso
Combinatorial norms as atomic norms
[Unit balls of Θ^F_2(w) and Ω^F_2(w), for F(A) = |A|^{1/2} and for F(A) = 1_{A∩{1,2,3}≠∅} + 1_{A∩{2,3}≠∅} + 1_{A∩{3}≠∅}]
Relation between combinatorial functions and norms
    Name                     F(A)                                        Norm Ω_p
    cardinality              |A|                                         Lasso (ℓ_1)
    nb of groups             ∑_{B ∈ G} 1_{A∩B ≠ ∅}                       Group Lasso (ℓ_1/ℓ_p)
                             δ_A if A ∈ G, +∞ else                       Latent group Lasso
    max. nb of el./group     max_{B ∈ G} |A ∩ B|                         Exclusive Lasso (ℓ_p/ℓ_1)
    constant                 1_{A ≠ ∅}                                   ℓ_p-norm
    func. of cardinality     h(|A|), h sublinear
                             1_{A ≠ ∅} ∨ |A|/k                           k-support norm (p = 2)
    func. of cardinality     h(|A|), h concave                           OWL (for p = ∞)
                             λ_1 |A| + λ_2 [ C(d,k) − C(d−|A|,k) ]       OSCAR (p = ∞, k = 2)
                             ∑_{i=1}^{|A|} Φ^{-1}(1 − qi/(2d))           SLOPE (p = ∞)
    chain length             h(max(A))                                   wedge penalty
Is the relaxation “faithful” to the original function?
Consider V = {1, …, p} and the function

    F(A) = range(A) = max(A) − min(A) + 1.

→ Leads to the selection of interval patterns.

What is its convex relaxation?

Easy to show that |A| must have the same relaxation.

    ⇒ Ω^F_p(w) = ‖w‖_1
The relaxation fails
⇒ What are the good functions F?
→ Good functions are Lower Combinatorial Envelopes (LCE)
Submodular functions are LCEs!
Min-cover vs Overlap count functions
Given a collection of sets G with weights (d_B)_{B ∈ G}, two natural functions to consider:
Min-cover
    F_∪(A) := inf_{S ⊂ G} { ∑_{B ∈ S} d_B | A ⊂ ⋃_{B ∈ S} B }

F_{∪,−} is the corresponding fractional min-cover value.
Overlap count
    F_∩(A) = ∑_{B ∈ G} d_B 1_{A∩B ≠ ∅},

counting the (weighted) number of sets of G intersected: a “maximal cover” by elements of G.
F_∩ is a submodular function (as a sum of submodular functions).
Latent group Lasso vs Overlap count Lasso vs `1/`p
G = {{1, 2}, {2, 3}}.

[Unit balls of Ω^{F_∪}_2(w) ≤ 1, Ω^{F_∩}_2(w) ≤ 1, and ‖w_{1,2}‖_2 + ‖w_{2,3}‖_2 ≤ 1]

    F_∩(A) = 1_{A∩{1,2} ≠ ∅} + 1_{A∩{2,3} ≠ ∅},
    F_∪(A) = min_{δ,δ′} { δ + δ′ | 1_A ≤ δ 1_{{1,2}} + δ′ 1_{{2,3}} }.
Hierarchical sparsity
Consider a DAG, with A_i, D_i the sets of ancestors/descendants of i, including i itself.

Significant literature: Zhao et al. (2009); Yuan et al. (2009); Jenatton et al. (2011c); Mairal et al. (2011); Bien et al. (2013); Yan and Bien (2015), and many others.

E.g., formulations with ℓ_1/ℓ_p-norms (Zhao et al., 2009; Jenatton et al., 2011c):

    Ω(w) = ∑_{i ∈ V} ‖w_{D(i)}‖_2,

with efficient algorithms for tree-structured groups.

[Figure: a rooted tree on nodes 1–6]
Combinatorial functions for strong hierarchical sparsity
Consider a DAG, with A_i, D_i the sets of ancestors/descendants of i, including i itself.

Strong hierarchical sparsity:
“A node can be selected only if all its ancestors are selected.”

Overlap count with D_i:

    F_∩(B) := ∑_{i ∈ V} d_i 1_{B∩D_i ≠ ∅} = ∑_{i ∈ A_B} d_i,

vs min-cover with A_i:

    F_∪(B) := inf_{I ⊂ V} { ∑_{i ∈ I} f_i | B ⊂ ⋃_{i ∈ I} A_i }.

[Figure: a rooted tree on nodes 1–6]
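With unit weights d_i, F_∩(B) is just the number of ancestors of B (including the nodes of B themselves), since node i contributes exactly when B meets its descendant set D_i. A toy check on a hypothetical 6-node tree (the exact edges below are an assumption for illustration):

```python
def descendants(tree, i):
    """All descendants of node i in `tree` (child-list dict), including i."""
    out = {i}
    for c in tree.get(i, []):
        out |= descendants(tree, c)
    return out

def tree_nodes(tree):
    nodes = set(tree)
    for cs in tree.values():
        nodes |= set(cs)
    return nodes

def F_cap(B, tree, d):
    """Overlap count F_cap(B) = sum of d_i over nodes i with B meeting D_i."""
    return sum(d[i] for i in tree_nodes(tree) if B & descendants(tree, i))

tree = {1: [2, 3], 2: [4, 5], 3: [6]}   # assumed small rooted tree, nodes 1..6
d = {i: 1 for i in range(1, 7)}         # unit weights
print(F_cap({4}, tree, d))     # ancestors of 4 incl. itself: {1, 2, 4} -> 3
print(F_cap({4, 6}, tree, d))  # ancestors of {4, 6}: {1, 2, 4, 3, 6} -> 5
```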
Results for different types of graphs
Chains
Families F∩ and F∪ are equivalent
Norms and prox can be computed using algorithms for isotonic regression.
Trees
Families F∩ and F∪ are different
Norms and prox for F∩ can be computed using a decomposition algorithm.
No efficient algorithm known for F∪.
DAGs
Norms and prox for F∩ can be computed using a general connection with isotonic regression on DAGs.
No efficient algorithm known for F∪.
Sublinear functions of the cardinality
    F(A) = ∑_{k=1}^d f_k 1_{|A|=k},

and F_− must be sublinear.

Let |s|_(1) ≥ … ≥ |s|_(d) be the reverse order statistics of the entries of s. Then

    Ω*_p(s) = max_{1 ≤ j ≤ d} (1 / f_j^{1/q}) [ ∑_{i=1}^j |s|_(i)^q ]^{1/q}

First example

    F_+(A) = 1 if |A| = k, ∞ otherwise

recovers the k-support norm of Argyriou et al. (2012) (p = 2).
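The closed form above is easy to implement: sort the |s_i|, then scan the prefix sums. A sketch (illustration only; the weight vector f = (f_1, …, f_d) is passed explicitly). With constant f ≡ 1, i.e. F = 1_{A≠∅}, it reduces to the ℓ_q norm, as expected from the table of examples:

```python
def dual_card_norm(s, f, q=2.0):
    """Omega*_p(s) = max_j f_j^{-1/q} (sum of the j largest |s_i|^q)^{1/q},
    for F(A) = f_{|A|} a function of the cardinality only."""
    a = sorted((abs(x) for x in s), reverse=True)
    best = 0.0
    acc = 0.0
    for j, v in enumerate(a, start=1):
        acc += v ** q
        best = max(best, acc ** (1.0 / q) / f[j - 1] ** (1.0 / q))
    return best

s = [3.0, -4.0]
print(dual_card_norm(s, f=[1, 1]))  # constant F: dual of the l_p norm is l_q -> 5.0
```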
Concave functions of the cardinality
If k ↦ f_k is concave then we have

    Ω_∞(w) = ∑_{i=1}^d (f_i − f_{i−1}) |w|_(i).

Ordered weighted Lasso (OWL) (Figueiredo and Nowak, 2014)
Examples
OSCAR (Bondell and Reich, 2008): λ_1 ‖w‖_1 + λ_2 Ω(w) with

    Ω(w) = ∑_{i<j} max(|w_i|, |w_j|),

obtained with f_k = C(d,2) − C(d−k,2).

SLOPE (Bogdan et al., 2015): f_k = ∑_{i=1}^k Φ^{-1}(1 − qi/(2d))
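Ω_∞ is a weighted sum of the sorted absolute values, so it can be computed in O(d log d). A sketch (an illustration, not the authors' code); with constant f it gives the ℓ_∞ norm, and with f_k = k the ℓ_1 norm, matching the earlier table:

```python
def owl_norm(w, f):
    """Omega_inf(w) = sum_i (f_i - f_{i-1}) |w|_(i), with f_0 = 0 and
    |w|_(1) >= |w|_(2) >= ... the sorted absolute values of w."""
    a = sorted((abs(x) for x in w), reverse=True)
    prev = 0.0
    total = 0.0
    for fi, ai in zip(f, a):
        total += (fi - prev) * ai
        prev = fi
    return total

w = [3.0, -1.0, 2.0]
print(owl_norm(w, [1, 1, 1]))  # constant f: only the top coefficient counts -> 3.0
print(owl_norm(w, [1, 2, 3]))  # f_k = k: all increments are 1 -> 6.0 (the l1 norm)
```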
Computations and extensions of OWL
Since F is submodular, Ω^F_∞ is a linear function of |w| when the order of the coefficients is fixed. The computational problem can therefore be reduced to the case of the chain.
Proposition (Figueiredo and Nowak, 2014)
In the p = ∞ case the proximal operator can be computed efficiently via isotonic regression and PAVA.
Proposition (`p-OWL norms)
Norm definitions and efficient computations of norms and proximal operators can be naturally extended to Ω^F_p via isotonic regression and PAVA.
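The pool adjacent violators algorithm (PAVA) referenced in both propositions can be sketched in a few lines: scan left to right, merging adjacent blocks whenever their means violate monotonicity. A minimal unweighted version (an illustration, not the implementation used in the experiments):

```python
def pava(y):
    """Least-squares projection of y onto the nondecreasing cone (unit weights)."""
    blocks = []  # list of [mean, count]; block means are kept strictly increasing
    for v in y:
        blocks.append([float(v), 1])
        # merge while the last two block means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] >= blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            n = n1 + n2
            blocks.append([(m1 * n1 + m2 * n2) / n, n])
    out = []
    for m, n in blocks:
        out.extend([m] * n)
    return out

print(pava([3, 1, 2]))  # -> [2.0, 2.0, 2.0]
print(pava([1, 3, 2]))  # -> [1.0, 2.5, 2.5]
```

Each element enters and leaves the block stack at most once, so the scan runs in linear time after the initial pass.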
An example: penalizing the range
Structured prior on support (Jenatton et al., 2011a): the support is an interval of {1, …, p}.

Natural associated penalization:

    F(A) = range(A) = i_max(A) − i_min(A) + 1.

→ F is not submodular… its LCE is G(A) = |A|.

But F(A) := d − 1 + range(A) is submodular!

In fact F(A) = ∑_{B ∈ B} 1_{A∩B ≠ ∅} for B of the form: [figure omitted]

Jenatton et al. (2011a) considered Ω(w) = ∑_{B ∈ B} ‖w_B ∘ d_B‖_2.
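Both claims (range alone is not submodular, but d − 1 + range is) can be verified by brute force on a small ground set, checking F(A) + F(B) ≥ F(A∪B) + F(A∩B) over all pairs of subsets. A sketch for d = 4, with the convention F(∅) = 0:

```python
from itertools import combinations

def subsets(V):
    return [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]

def is_submodular(F, V):
    """Exhaustively check F(A) + F(B) >= F(A | B) + F(A & B) on 2^V."""
    S = subsets(V)
    return all(F(A) + F(B) >= F(A | B) + F(A & B) for A in S for B in S)

d = 4
V = range(1, d + 1)

def rng(A):          # range(A), with F(empty) = 0
    return (max(A) - min(A) + 1) if A else 0

def rng_sub(A):      # d - 1 + range(A), with F(empty) = 0
    return (d - 1 + max(A) - min(A) + 1) if A else 0

print(is_submodular(rng, V))      # False: e.g. A={1}, B={3} violates the inequality
print(is_submodular(rng_sub, V))  # True
```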
Experiments
Figure: the five test signals on {1, …, 256}.

S1: constant
S2: triangular shape
S3: x ↦ |sin(x) sin(5x)|
S4: a slope pattern
S5: i.i.d. Gaussian pattern
Compare:
Lasso
Elastic Net
Naive `2 group-Lasso
Ω2 for F (A) = d − 1 + range(A)
Ω∞ for F (A) = d − 1 + range(A)
The weighted ℓ_2 group-Lasso of Jenatton et al. (2011a).
Constant signal
[Figure: best Hamming distance vs. sample size n, comparing EN, GL+w, GL, L1, Sub p=∞ and Sub p=2; d = 256, k = 160, σ = 0.5]
Triangular signal
[Figure: best Hamming distance vs. sample size n (S2, cov = id), comparing EN, GL+w, GL, L1, Sub p=∞ and Sub p=2; d = 256, k = 160, σ = 0.5]
(x1, x2) 7→ | sin(x1) sin(5x1) sin(x2) sin(5x2)| signal in 2D
[Figure: best Hamming distance vs. sample size n, comparing EN, GL+w, GL, L1, Sub p=∞ and Sub p=2; d = 256, k = 160, σ = 1.0]
i.i.d. random signal in 2D
[Figure: best Hamming distance vs. sample size n, comparing EN, GL+w, GL, L1, Sub p=∞ and Sub p=2; d = 256, k = 160, σ = 1.0]
Summary
A convex relaxation for functions penalizing
(a) the support via a general set function,
(b) the ℓ_p norm of the parameter vector w.

Retrieves a large fraction of the norms used in practice (Lasso, group Lasso, Exclusive Lasso, OSCAR, OWL, SLOPE, etc.).

Generic efficient algorithms for the overlap count Lasso on chains/trees/graphs.

Open: efficient prox computation on trees/DAGs for F_∪. Alternative: the fast column generation / FCFW algorithm (Vinyes and Obozinski, 2017).

Did not talk about general support recovery and fast rates of convergence, which can be obtained based on generalizations of the irrepresentability condition / restricted eigenvalue condition.
References I
Argyriou, A., Foygel, R., and Srebro, N. (2012). Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems 25, pages 1466–1474.
Bickel, P., Ritov, Y., and Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732.
Bien, J., Taylor, J., Tibshirani, R., et al. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3):1111–1141.
Bogdan, M., van den Berg, E., Sabatti, C., Su, W., and Candes, E. J. (2015). SLOPE: adaptive variable selection via convex optimization. Annals of Applied Statistics, 9(3):1103–1140.
Bondell, H. D. and Reich, B. J. (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115–123.
Figueiredo, M. and Nowak, R. D. (2014). Sparse estimation with strongly correlated variables using ordered weighted ℓ1 regularization. Technical Report 1409.4005, arXiv.
Groenevelt, H. (1991). Two algorithms for maximizing a separable concave function over a polymatroid feasible region. Eur. J. Oper. Res., 54(2):227–236.
Huang, J., Zhang, T., and Metaxas, D. (2011). Learning with structured sparsity. JMLR, 12:3371–3412.
Jacob, L., Obozinski, G., and Vert, J. (2009). Group lasso with overlap and graph lasso. In ICML.
References II
Jenatton, R., Audibert, J., and Bach, F. (2011a). Structured variable selection with sparsity-inducing norms. JMLR, 12:2777–2824.
Jenatton, R., Gramfort, A., Michel, V., Obozinski, G., Eger, E., Bach, F., and Thirion, B. (2011b). Multi-scale mining of fMRI data with hierarchical structured sparsity. arXiv preprint arXiv:1105.0363.
Jenatton, R., Mairal, J., Obozinski, G., and Bach, F. (2011c). Proximal methods for hierarchical sparse coding. JMLR, 12:2297–2334.
Mairal, J., Jenatton, R., Obozinski, G., and Bach, F. (2011). Convex and network flow optimization for structured sparsity. JMLR, 12:2681–2720.
Vinyes, M. and Obozinski, G. (2017). Fast column generation for atomic norm regularization. In Artificial Intelligence and Statistics.
Wainwright, M. J. (2009). Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming. IEEE Transactions on Information Theory, 55:2183–2202.
Yan, X. and Bien, J. (2015). Hierarchical sparse modeling: A choice of two regularizers. arXiv preprint arXiv:1512.01631.
Yuan, M., Joseph, V. R., and Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4):1738–1757.
Zhao, P., Rocha, G., and Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, pages 3468–3497.