A unified perspective on convex structured sparsity
Guillaume Obozinski
Laboratoire d’Informatique Gaspard Monge
Ecole des Ponts ParisTech
Joint work with Francis Bach
Conference Mascot-Num
École Centrale de Nantes, 21 March 2018
Guillaume Obozinski Unified perspective on convex structured sparsity 1/39
Structured Sparsity
The support is not only sparse; in addition, we have prior information about its structure.
Examples
The variables should be selected in groups.
The variables lie in a hierarchy.
The variables lie on a graph or network, and the support should be localized or densely connected on the graph.
Applications: Difficult inverse problem in Brain Imaging
[Figure: brain activation maps (slices y = −84, x = 17, z = −13), color scale ±5.00e−02, Scale 6, Fold 9]
Jenatton et al. (2011b)
Convex relaxation for classical sparsity
Empirical risk: for w ∈ R^d,

    L(w) = (1/2n) ∑_{i=1}^n (y_i − x_i^T w)^2

Support of the model:

    Supp(w) = { i | w_i ≠ 0 },  with |Supp(w)| = ∑_{i=1}^d 1_{w_i ≠ 0}

Penalization for variable selection:

    min_{w ∈ R^d} L(w) + λ |Supp(w)|

Lasso:

    min_{w ∈ R^d} L(w) + λ ‖w‖_1
Example of a sparse vector: w = (0, 0, 1, −1, 0, 1, −1), with |Supp(w)| = 4.
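The ℓ_1-penalized problem is typically solved with proximal-gradient methods, whose key ingredient is the proximal operator of λ‖·‖_1, i.e. coordinate-wise soft-thresholding. A minimal NumPy sketch (an illustration, not code from the talk):

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of lam * ||.||_1: shrink each coordinate toward 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([3.0, -0.5, 2.0])
print(soft_threshold(w, 1.0))  # coordinates below lam in magnitude are zeroed
```

Soft-thresholding is what makes the relaxed problem produce exact zeros, i.e. a sparse support.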
Formulation with combinatorial functions
Let V = {1, …, d}.
Let L be some empirical risk, such as L(w) = (1/2n) ∑_{i=1}^n (y_i − x_i^T w)^2.
Given a set function F : 2^V → R_+, consider

    min_{w ∈ R^d} L(w) + F(Supp(w))

Examples of combinatorial functions
Use recursivity or counts of structures (e.g. trees) with dynamic programming
Block-coding (Huang et al., 2011):

    G(A) = min_{B_1, …, B_k} F(B_1) + … + F(B_k)  s.t.  B_1 ∪ … ∪ B_k ⊃ A
Submodular functions
Block-coding (Huang, Zhang and Metaxas (2009))
[Figure: a set A covered by blocks B_i]

F_+ : 2^V → R_+ a positive set function.

    F_∪(A) = min_S ∑_{B ∈ S} F_+(B)  s.t.  A ⊂ ⋃_{B ∈ S} B.

→ a minimal weighted set cover problem.
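F_∪ can be evaluated exactly on toy examples by enumerating subfamilies of G (exponential in the number of groups, but enough to see the definition at work). A brute-force sketch, purely for illustration:

```python
from itertools import combinations

def min_cover(A, groups, weights):
    """F_union(A): minimal total weight of a subfamily of `groups`
    whose union contains A. Exhaustive search, for toy examples only."""
    A = frozenset(A)
    best = float("inf")
    idx = range(len(groups))
    for r in range(len(groups) + 1):
        for S in combinations(idx, r):
            covered = set().union(*[groups[i] for i in S]) if S else set()
            if A <= covered:
                best = min(best, sum(weights[i] for i in S))
    return best

groups = [{1, 2}, {2, 3}]
weights = [1.0, 1.0]
print(min_cover({1, 3}, groups, weights))  # needs both groups -> 2.0
print(min_cover({2}, groups, weights))     # one group suffices -> 1.0
```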
A relaxation for F ...?
How to solve

    min_{w ∈ R^d} L(w) + F(Supp(w))  ?

→ Greedy algorithms
→ Non-convex methods
→ Relaxation

    |A|                        F(A)
    L(w) + λ |Supp(w)|         L(w) + λ F(Supp(w))
          ↓                          ↓ ?
    L(w) + λ ‖w‖_1             L(w) + λ ...?...
Penalizing and regularizing...
Given a function F : 2^V → R_+, consider for ν, µ > 0 the combined penalty:

    pen(w) = µ F(Supp(w)) + ν ‖w‖_p^p.
Motivations
Compromise between variable selection and smooth regularization
Required for functions F allowing large supports
Interpretable as a description length for the parameters w.
A convex and homogeneous relaxation
Looking for a convex relaxation of pen(w).
Require as well that it is positively homogeneous → scale invariance.
Definition (Homogeneous extension of a function g)
    g_h : x ↦ inf_{λ>0} (1/λ) g(λx).
Proposition
The tightest convex positively homogeneous lower bound of a function g is the convex envelope of g_h.
Leads us to consider:
    pen_h(w) = inf_{λ>0} (1/λ) ( µ F(Supp(λw)) + ν ‖λw‖_p^p ) ∝ Θ(w) := ‖w‖_p F(Supp(w))^{1/q},

with 1/p + 1/q = 1.
Envelope of the homogeneous penalty Θ
Consider Ω_p with dual norm

    Ω*_p(s) = max_{A ⊂ V, A ≠ ∅} ‖s_A‖_q / F(A)^{1/q}.
Proposition
The norm Ω_p is the convex envelope (tightest convex lower bound) of the function w ↦ ‖w‖_p F(Supp(w))^{1/q}.
Proof.
Denote Θ(w) = ‖w‖_p F(Supp(w))^{1/q}:

    Θ*(s) = max_{w ∈ R^d} w^T s − ‖w‖_p F(Supp(w))^{1/q}
          = max_{A ⊂ V} max_{w_A ∈ R^A} w_A^T s_A − ‖w_A‖_p F(A)^{1/q}
          = max_{A ⊂ V} ι_{‖s_A‖_q ≤ F(A)^{1/q}} = ι_{Ω*_p(s) ≤ 1}
Graphs of the different penalties
[Plots of F(Supp(w)) and of pen(w) = µ F(Supp(w)) + ν ‖w‖_2^2]
Graphs of the different penalties
[Plots of Θ(w) = √(F(Supp(w))) ‖w‖_2 and of Ω_F(w)]
A large latent group Lasso (Jacob et al., 2009)
    V = { v = (v^A)_{A ⊂ V} ∈ (R^V)^{2^V}  s.t.  Supp(v^A) ⊂ A }

    Ω_p(w) = min_{v ∈ V} ∑_{A ⊂ V} F(A)^{1/q} ‖v^A‖_p  s.t.  w = ∑_{A ⊂ V} v^A,
[Figure: decomposition w = v^{1} + v^{2} + v^{1,2} + v^{1,2,3,4} + … into latent components supported on the groups]
Some simple examples
    F(A)                                         Ω_p(w)
    |A|                                          ‖w‖_1
    1_{A ≠ ∅}                                    ‖w‖_p
    ∑_{B ∈ G} 1_{A∩B ≠ ∅}  (G a partition)       ∑_{B ∈ G} ‖w_B‖_p
    ∑_{B ∈ G} 1_{A∩B ≠ ∅}  (G not a partition)   new: Overlap count Lasso
Combinatorial norms as atomic norms
[Unit balls of Θ^F_2(w) and Ω^F_2(w), for F(A) = |A|^{1/2} and for F(A) = 1_{A∩{1,2,3}≠∅} + 1_{A∩{2,3}≠∅} + 1_{A∩{3}≠∅}]
Relation between combinatorial functions and norms
    Name                     F(A)                                        Norm Ω_p
    cardinality              |A|                                         Lasso (ℓ_1)
    nb of groups             ∑_{B ∈ G} 1_{A∩B ≠ ∅}                       Group Lasso (ℓ_1/ℓ_p)
                             δ_A if A ∈ G, +∞ else                       Latent group Lasso
    max. nb of el./group     max_{B ∈ G} |A ∩ B|                         Exclusive Lasso (ℓ_p/ℓ_1)
    constant                 1_{A ≠ ∅}                                   ℓ_p-norm
    func. of cardinality     h(|A|), h sublinear
                             1_{A ≠ ∅} ∨ |A|/k                           k-support norm (p = 2)
    func. of cardinality     h(|A|), h concave                           OWL (for p = ∞)
                             λ_1 |A| + λ_2 [ C(d,k) − C(d−|A|,k) ]       OSCAR (p = ∞, k = 2)
                             ∑_{i=1}^{|A|} Φ^{-1}(1 − qi/(2d))           SLOPE (p = ∞)
    chain length             h(max(A))                                   wedge penalty
Is the relaxation “faithful” to the original function?
Consider V = {1, …, p} and the function

    F(A) = range(A) = max(A) − min(A) + 1.

→ Leads to the selection of interval patterns.

What is its convex relaxation?

Easy to show that |A| must have the same relaxation.

    ⇒ Ω^F_p(w) = ‖w‖_1
The relaxation fails
⇒ What are the good functions F?
→ Good functions are Lower Combinatorial Envelopes (LCE)
Submodular functions are LCEs!
Min-cover vs Overlap count functions
Given a collection of sets G with weights (d_B)_{B ∈ G}, two natural functions to consider:
Min-cover
    F_∪(A) := inf_{S ⊂ G} { ∑_{B ∈ S} d_B | A ⊂ ⋃_{B ∈ S} B }

F_{∪,−} is the corresponding fractional min-cover value.
Overlap count
    F_∩(A) = ∑_{B ∈ G} d_B 1_{A∩B ≠ ∅},

counting the (weighted) number of sets of G intersected: a “maximal cover” by elements of G.
F_∩ is a submodular function (as a sum of submodular functions).
Latent group Lasso vs Overlap count Lasso vs `1/`p
G = {{1, 2}, {2, 3}}.

[Unit balls of Ω^{F_∪}_2(w) ≤ 1, Ω^{F_∩}_2(w) ≤ 1, and ‖w_{1,2}‖_2 + ‖w_{2,3}‖_2 ≤ 1]

    F_∩(A) = 1_{A∩{1,2} ≠ ∅} + 1_{A∩{2,3} ≠ ∅},
    F_∪(A) = min_{δ,δ′} { δ + δ′ | 1_A ≤ δ 1_{{1,2}} + δ′ 1_{{2,3}} }.
Hierarchical sparsity
Consider a DAG, with A_i, D_i the sets of ancestors/descendants of i, including i itself.

Significant literature: Zhao et al. (2009); Yuan et al. (2009); Jenatton et al. (2011c); Mairal et al. (2011); Bien et al. (2013); Yan and Bien (2015), and many others.

E.g., formulations with ℓ_1/ℓ_p-norms (Zhao et al., 2009; Jenatton et al., 2011c):

    Ω(w) = ∑_{i ∈ V} ‖w_{D(i)}‖_2,

with efficient algorithms for tree-structured groups.

[Figure: a rooted tree on nodes 1–6]
Combinatorial functions for strong hierarchical sparsity
Consider a DAG, with A_i, D_i the sets of ancestors/descendants of i, including i itself.

Strong hierarchical sparsity:
“A node can be selected only if all its ancestors are selected.”

Overlap count with D_i:

    F_∩(B) := ∑_{i ∈ V} d_i 1_{B∩D_i ≠ ∅} = ∑_{i ∈ A_B} d_i,

vs min-cover with A_i:

    F_∪(B) := inf_{I ⊂ V} { ∑_{i ∈ I} f_i | B ⊂ ⋃_{i ∈ I} A_i }.

[Figure: a rooted tree on nodes 1–6]
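With unit weights d_i, F_∩(B) is just the number of ancestors of B (including the nodes of B themselves), since node i contributes exactly when B meets its descendant set D_i. A toy check on a hypothetical 6-node tree (the exact edges below are an assumption for illustration):

```python
def descendants(tree, i):
    """All descendants of node i in `tree` (child-list dict), including i."""
    out = {i}
    for c in tree.get(i, []):
        out |= descendants(tree, c)
    return out

def tree_nodes(tree):
    nodes = set(tree)
    for cs in tree.values():
        nodes |= set(cs)
    return nodes

def F_cap(B, tree, d):
    """Overlap count F_cap(B) = sum of d_i over nodes i with B meeting D_i."""
    return sum(d[i] for i in tree_nodes(tree) if B & descendants(tree, i))

tree = {1: [2, 3], 2: [4, 5], 3: [6]}   # assumed small rooted tree, nodes 1..6
d = {i: 1 for i in range(1, 7)}         # unit weights
print(F_cap({4}, tree, d))     # ancestors of 4 incl. itself: {1, 2, 4} -> 3
print(F_cap({4, 6}, tree, d))  # ancestors of {4, 6}: {1, 2, 4, 3, 6} -> 5
```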
Results for different types of graphs
Chains
Families F∩ and F∪ are equivalent
Norms and prox can be computed using algorithms for isotonic regression.
Trees
Families F∩ and F∪ are different
Norms and prox for F∩ can be computed using a decomposition algorithm.
No efficient algorithm known for F∪.
DAGs
Norms and prox for F∩ can be computed using a general connection with isotonic regression on DAGs.
No efficient algorithm known for F∪.
Sublinear functions of the cardinality
    F(A) = ∑_{k=1}^d f_k 1_{|A|=k},

and F_− must be sublinear.

Let |s|_(1) ≥ … ≥ |s|_(d) be the reverse order statistics of the entries of s. Then

    Ω*_p(s) = max_{1 ≤ j ≤ d} (1 / f_j^{1/q}) [ ∑_{i=1}^j |s|_(i)^q ]^{1/q}

First example

    F_+(A) = 1 if |A| = k, ∞ otherwise

recovers the k-support norm of Argyriou et al. (2012) (p = 2).
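The closed form above is easy to implement: sort the |s_i|, then scan the prefix sums. A sketch (illustration only; the weight vector f = (f_1, …, f_d) is passed explicitly). With constant f ≡ 1, i.e. F = 1_{A≠∅}, it reduces to the ℓ_q norm, as expected from the table of examples:

```python
def dual_card_norm(s, f, q=2.0):
    """Omega*_p(s) = max_j f_j^{-1/q} (sum of the j largest |s_i|^q)^{1/q},
    for F(A) = f_{|A|} a function of the cardinality only."""
    a = sorted((abs(x) for x in s), reverse=True)
    best = 0.0
    acc = 0.0
    for j, v in enumerate(a, start=1):
        acc += v ** q
        best = max(best, acc ** (1.0 / q) / f[j - 1] ** (1.0 / q))
    return best

s = [3.0, -4.0]
print(dual_card_norm(s, f=[1, 1]))  # constant F: dual of the l_p norm is l_q -> 5.0
```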
Concave functions of the cardinality
If k ↦ f_k is concave then we have

    Ω_∞(w) = ∑_{i=1}^d (f_i − f_{i−1}) |w|_(i).

Ordered weighted Lasso (OWL) (Figueiredo and Nowak, 2014)
Examples
OSCAR (Bondell and Reich, 2008): λ_1 ‖w‖_1 + λ_2 Ω(w) with

    Ω(w) = ∑_{i<j} max(|w_i|, |w_j|),

obtained with f_k = C(d,2) − C(d−k,2).

SLOPE (Bogdan et al., 2015): f_k = ∑_{i=1}^k Φ^{-1}(1 − qi/(2d))
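Ω_∞ is a weighted sum of the sorted absolute values, so it can be computed in O(d log d). A sketch (an illustration, not the authors' code); with constant f it gives the ℓ_∞ norm, and with f_k = k the ℓ_1 norm, matching the earlier table:

```python
def owl_norm(w, f):
    """Omega_inf(w) = sum_i (f_i - f_{i-1}) |w|_(i), with f_0 = 0 and
    |w|_(1) >= |w|_(2) >= ... the sorted absolute values of w."""
    a = sorted((abs(x) for x in w), reverse=True)
    prev = 0.0
    total = 0.0
    for fi, ai in zip(f, a):
        total += (fi - prev) * ai
        prev = fi
    return total

w = [3.0, -1.0, 2.0]
print(owl_norm(w, [1, 1, 1]))  # constant f: only the top coefficient counts -> 3.0
print(owl_norm(w, [1, 2, 3]))  # f_k = k: all increments are 1 -> 6.0 (the l1 norm)
```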
Computations and extensions of OWL
Since F is submodular, Ω^F_∞ is a linear function of |w| when the order of the coefficients is fixed. The computational problem can therefore be reduced to the case of the chain.
Proposition (Figueiredo and Nowak, 2014)
In the p = ∞ case the proximal operator can be computed efficiently via isotonic regression and PAVA.
Proposition (`p-OWL norms)
Norm definitions and efficient computations of norms and proximal operators can be naturally extended to Ω^F_p via isotonic regression and PAVA.
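The pool adjacent violators algorithm (PAVA) referenced in both propositions can be sketched in a few lines: scan left to right, merging adjacent blocks whenever their means violate monotonicity. A minimal unweighted version (an illustration, not the implementation used in the experiments):

```python
def pava(y):
    """Least-squares projection of y onto the nondecreasing cone (unit weights)."""
    blocks = []  # list of [mean, count]; block means are kept strictly increasing
    for v in y:
        blocks.append([float(v), 1])
        # merge while the last two block means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] >= blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            n = n1 + n2
            blocks.append([(m1 * n1 + m2 * n2) / n, n])
    out = []
    for m, n in blocks:
        out.extend([m] * n)
    return out

print(pava([3, 1, 2]))  # -> [2.0, 2.0, 2.0]
print(pava([1, 3, 2]))  # -> [1.0, 2.5, 2.5]
```

Each element enters and leaves the block stack at most once, so the scan runs in linear time after the initial pass.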
An example: penalizing the range
Structured prior on support (Jenatton et al., 2011a): the support is an interval of {1, …, p}.

Natural associated penalization:

    F(A) = range(A) = i_max(A) − i_min(A) + 1.

→ F is not submodular… its LCE is G(A) = |A|.

But F(A) := d − 1 + range(A) is submodular!

In fact F(A) = ∑_{B ∈ B} 1_{A∩B ≠ ∅} for B of the form: [figure omitted]

Jenatton et al. (2011a) considered Ω(w) = ∑_{B ∈ B} ‖w_B ∘ d_B‖_2.
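Both claims (range alone is not submodular, but d − 1 + range is) can be verified by brute force on a small ground set, checking F(A) + F(B) ≥ F(A∪B) + F(A∩B) over all pairs of subsets. A sketch for d = 4, with the convention F(∅) = 0:

```python
from itertools import combinations

def subsets(V):
    return [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]

def is_submodular(F, V):
    """Exhaustively check F(A) + F(B) >= F(A | B) + F(A & B) on 2^V."""
    S = subsets(V)
    return all(F(A) + F(B) >= F(A | B) + F(A & B) for A in S for B in S)

d = 4
V = range(1, d + 1)

def rng(A):          # range(A), with F(empty) = 0
    return (max(A) - min(A) + 1) if A else 0

def rng_sub(A):      # d - 1 + range(A), with F(empty) = 0
    return (d - 1 + max(A) - min(A) + 1) if A else 0

print(is_submodular(rng, V))      # False: e.g. A={1}, B={3} violates the inequality
print(is_submodular(rng_sub, V))  # True
```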
Experiments
Figure: the five test signals on {1, …, 256}.

S1: constant
S2: triangular shape
S3: x ↦ |sin(x) sin(5x)|
S4: a slope pattern
S5: i.i.d. Gaussian pattern
Compare:
Lasso
Elastic Net
Naive `2 group-Lasso
Ω2 for F (A) = d − 1 + range(A)
Ω∞ for F (A) = d − 1 + range(A)
The weighted ℓ_2 group-Lasso of Jenatton et al. (2011a).
Constant signal
[Figure: best Hamming distance vs. sample size n, comparing EN, GL+w, GL, L1, Sub p=∞ and Sub p=2; d = 256, k = 160, σ = 0.5]
Triangular signal
[Figure: best Hamming distance vs. sample size n (S2, cov = id), comparing EN, GL+w, GL, L1, Sub p=∞ and Sub p=2; d = 256, k = 160, σ = 0.5]
(x1, x2) 7→ | sin(x1) sin(5x1) sin(x2) sin(5x2)| signal in 2D
[Figure: best Hamming distance vs. sample size n, comparing EN, GL+w, GL, L1, Sub p=∞ and Sub p=2; d = 256, k = 160, σ = 1.0]
i.i.d. random signal in 2D
[Figure: best Hamming distance vs. sample size n, comparing EN, GL+w, GL, L1, Sub p=∞ and Sub p=2; d = 256, k = 160, σ = 1.0]
Summary
A convex relaxation for functions penalizing
(a) the support via a general set function,
(b) the ℓ_p norm of the parameter vector w.

Retrieves a large fraction of the norms used in practice (Lasso, group Lasso, Exclusive Lasso, OSCAR, OWL, SLOPE, etc.).

Generic efficient algorithms for the overlap count Lasso on chains/trees/graphs.

Open: efficient prox computation on trees/DAGs for F_∪. Alternative: the fast column generation / FCFW algorithm (Vinyes and Obozinski, 2017).

Did not talk about general support recovery and fast rates of convergence, which can be obtained based on generalizations of the irrepresentability condition / restricted eigenvalue condition.
References I
Argyriou, A., Foygel, R., and Srebro, N. (2012). Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems 25, pages 1466–1474.
Bickel, P., Ritov, Y., and Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732.
Bien, J., Taylor, J., Tibshirani, R., et al. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3):1111–1141.
Bogdan, M., van den Berg, E., Sabatti, C., Su, W., and Candes, E. J. (2015). SLOPE: adaptive variable selection via convex optimization. Annals of Applied Statistics, 9(3):1103–1140.
Bondell, H. D. and Reich, B. J. (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115–123.
Figueiredo, M. and Nowak, R. D. (2014). Sparse estimation with strongly correlated variables using ordered weighted ℓ1 regularization. Technical Report 1409.4005, arXiv.
Groenevelt, H. (1991). Two algorithms for maximizing a separable concave function over a polymatroid feasible region. Eur. J. Oper. Res., 54(2):227–236.
Huang, J., Zhang, T., and Metaxas, D. (2011). Learning with structured sparsity. JMLR, 12:3371–3412.
Jacob, L., Obozinski, G., and Vert, J. (2009). Group lasso with overlap and graph lasso. In ICML.
References II
Jenatton, R., Audibert, J., and Bach, F. (2011a). Structured variable selection with sparsity-inducing norms. JMLR, 12:2777–2824.
Jenatton, R., Gramfort, A., Michel, V., Obozinski, G., Eger, E., Bach, F., and Thirion, B. (2011b). Multi-scale mining of fMRI data with hierarchical structured sparsity. arXiv preprint arXiv:1105.0363.
Jenatton, R., Mairal, J., Obozinski, G., and Bach, F. (2011c). Proximal methods for hierarchical sparse coding. JMLR, 12:2297–2334.
Mairal, J., Jenatton, R., Obozinski, G., and Bach, F. (2011). Convex and network flow optimization for structured sparsity. JMLR, 12:2681–2720.
Vinyes, M. and Obozinski, G. (2017). Fast column generation for atomic norm regularization. In Artificial Intelligence and Statistics.
Wainwright, M. J. (2009). Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming. IEEE Transactions on Information Theory, 55:2183–2202.
Yan, X. and Bien, J. (2015). Hierarchical sparse modeling: A choice of two regularizers. arXiv preprint arXiv:1512.01631.
Yuan, M., Joseph, V. R., and Zou, H. (2009). Structured variable selection and estimation. The Annals of Applied Statistics, 3(4):1738–1757.
Zhao, P., Rocha, G., and Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, pages 3468–3497.