TRANSCRIPT
INFORMATION-THEORETIC SPARSE MODEL LEARNING AND SELECTION
Ignacio Ramírez ([email protected])
Facultad de Ingeniería
Outline
- I. Introduction to Sparse Models
- II. Universal sparsity-inducing regularizers
  - I. Ramírez and G. Sapiro, "Universal regularizers for robust sparse coding and modeling," to appear in IEEE Trans. IP. Preprint: arXiv:1003.2941v2, 2012.
- III. MDL-based sparse modeling
  - I. Ramírez and G. Sapiro, "An MDL framework for sparse coding and dictionary learning," IEEE Trans. SP, 60(6), 2012. DOI: 10.1109/TSP.2012.2187203. Preprint: arXiv:1110.2436.
  - I. Ramírez and G. Sapiro, "Low-rank data modeling via the Minimum Description Length principle," ICASSP 2012. Preprint: arXiv:1109.6297.
- IV. Conclusions and future work
I. Introduction to Sparse Models
Sparse Models
Why sparsity?
- Statistics: model selection
  - control model complexity
  - bias-variance tradeoff
- Signal processing
  - robust estimation
  - lossy compression (JPEG, MPEG, etc.)
- Machine learning
  - subspace modeling, reducing the curse of dimensionality
  - manifold learning

The concept has been around for over 100 years: PCA (Pearson, 1901). Only recently has it been studied formally as a general modeling principle.
Traditional Sparsity-Promoting Models
$$a = Dx, \quad x = D^{-1}a, \qquad x, a \in \mathbb{R}^m,\; D \in \mathbb{R}^{m\times m}$$
- Pre-designed: Fourier, DCT, wavelets
- Learned: KLT/PCA
Properties
- $x - D a_{1\ldots k} \approx 0$ for $k \ll m$
- Fast to compute (Fourier, DCT, wavelets)

Limitations
- Non-adaptive; square D only
Linear sparse coding
General problem statement
Given a dictionary/basis D, find a with few non-zeros such that x ≈ Da.
The first requirement is enforced by a regularizer g(·), the second by a fitting function f(·). Three possible forms:
$$a = \arg\min_u \tfrac{1}{2} f(x - Du) \;\;\text{s.t.}\;\; g(u) \le \tau \qquad \text{(pursuit-type)}$$
$$a = \arg\min_u \tfrac{1}{2} f(x - Du) + \tau\, g(u) \qquad \text{(Lagrangian)}$$
$$a = \arg\min_u g(u) \;\;\text{s.t.}\;\; f(x - Du) \le \tau \qquad \text{(denoising-type)}$$
Examples:
- $f = -\log p(\cdot)$, $g = \|a\|_0$ (AIC, BIC)
- $f = \|\cdot\|^2$, $g = \|a\|_1$ (LASSO)
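The Lagrangian form with $f$ the squared error and $g = \ell_1$ is the workhorse in what follows. As a concrete reference point, here is a minimal ISTA (iterative shrinkage-thresholding) sketch for solving it; the function names and toy problem sizes are illustrative, not from the talk.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: elementwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, D, tau, n_iter=500):
    """Solve a = argmin_u 0.5 ||x - D u||_2^2 + tau ||u||_1 by ISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)           # gradient of the quadratic fit term
        a = soft_threshold(a - grad / L, tau / L)
    return a

# Toy usage: overcomplete random dictionary, sparse ground truth.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128)); D /= np.linalg.norm(D, axis=0)
a0 = np.zeros(128); a0[rng.choice(128, 5, replace=False)] = 1.0
x = D @ a0 + 0.01 * rng.standard_normal(64)
a_hat = ista(x, D, tau=0.05)
```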
Learned Sparse Linear Models
Coding seeks sparse data representation
$$A^* = \arg\min_A \sum_{j=1}^n \tfrac{1}{2}\|x_j - Da_j\|_2^2 \;\;\text{s.t.}\;\; \|a_j\|_r \le \tau, \;\forall j$$
Learning: find a dictionary that encourages sparse representations
$$(A^*, D^*) = \arg\min_{A,D} \sum_{j=1}^n \tfrac{1}{2}\|x_j - Da_j\|_2^2 \;\;\text{s.t.}\;\; \|a_j\|_r \le \tau\;\forall j, \quad \|d_k\|_2 \le 1\;\forall k$$
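A minimal alternating-minimization sketch of the learning problem above, reusing the `ista` coder from the previous snippet: code all samples with D fixed, then update D by least squares and project its columns onto the $\|d_k\|_2 \le 1$ constraint. This illustrates the formulation only; it is not the authors' algorithm (theirs appears in Part III).

```python
import numpy as np

def learn_dictionary(X, p, tau, n_outer=20):
    """Alternate sparse coding (ista, defined above) and dictionary updates."""
    rng = np.random.default_rng(0)
    D = rng.standard_normal((X.shape[0], p))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_outer):
        A = np.column_stack([ista(x, D, tau) for x in X.T])  # coding step
        D = X @ np.linalg.pinv(A)                            # least-squares update
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)      # enforce ||d_k||_2 <= 1
    return D, A
```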
Burning questions
- Q1: What is the best sparse model for given data? How large should D be, how sparse A?
- Q2: How do we add robustness to the coding/learning process?
- Q3: How does one incorporate more prior information?
Information-theoretic sparse modeling
Idea: sparse models as compression tools
- The best model compresses the most: more regularity captured from the data
- Codelength: a universal metric to compare different models
- Beyond Bayesian: can include different model formulations
- A unifying framework for understanding sparse models, in all their variants, from a common perspective
II. Universal sparsity-inducing regularizers
Universal sparsity-inducing priors

Goal: bypass the critical choice of the regularization parameter

$$a^* = \arg\min_a \left\{ \tfrac{1}{2}\|x - Da\|^2 + \lambda\|a\|_1 \right\}$$

Reinterpreted as a codelength minimization problem:

$$a^* = \arg\min_a \Big\{ \underbrace{-\log P(x \mid a, \sigma^2)}_{L(x\mid a)} \; \underbrace{-\,\log P(a \mid \theta)}_{L(a)} \Big\},
\qquad P(x \mid a, \sigma^2) \propto e^{-\frac{1}{2\sigma^2}\|x - Da\|_2^2}, \quad P(a \mid \theta) \propto e^{-\theta\|a\|_1}$$

with $\lambda = 2\sigma^2\theta$ after factorization; $\sigma^2$ known, $\theta$ unknown.

Universal probability models: replace $P(a|\theta)$ by a single $Q(a)$ such that

$$-\log Q(a) \approx \min_\theta \{-\log P(a|\theta)\}, \quad \forall a$$
Universal probability models
- Model class: $\mathcal{M} = \{P(a|\theta),\, \theta \in \Theta\}$. Our case: $\mathcal{M}$ = all Laplacians with parameter $\theta \in (0, +\infty)$
- Codelength regret of a model $Q(a)$:
$$R(a, Q) := -\log Q(a) + \log P(a \mid \hat\theta(a))$$
- $Q(\cdot)$ is universal if $\frac{1}{n}\max_{a\in\mathbb{R}^p}\{R(a, Q)\} \to 0$: $Q(a)$ is asymptotically as good as any model in $\mathcal{M}$ for describing any $a$
- Our choice (among the possibilities dictated by theory): the convex mixture
$$Q(a) = \int_\Theta P(a|\theta)\, w(\theta)\, d\theta, \qquad w(\theta) \ge 0, \quad \int_\Theta w(\theta)\, d\theta = 1$$
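A quick numeric sanity check of the mixture construction (illustrative, using SciPy): integrating the Laplacian against a Gamma weight $w(\theta)$ reproduces the MOE closed form given on the next slide.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gamma

kappa, beta, a = 3.0, 1.0, 0.7
# Integrand: Laplacian(a | theta) times the Gamma(kappa, rate=beta) weight.
f = lambda th: 0.5 * th * np.exp(-th * abs(a)) * gamma.pdf(th, kappa, scale=1.0 / beta)
q_mix, _ = quad(f, 0.0, np.inf)
q_moe = 0.5 * kappa * beta ** kappa * (abs(a) + beta) ** (-(kappa + 1))
print(q_mix, q_moe)   # the two values agree
```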
Universal sparse modeling
Example: $w(\theta) = \Gamma(\theta; \kappa, \beta)$ (the conjugate prior of the Laplacian)

$$Q(a; \kappa, \beta) = \tfrac{1}{2}\kappa\beta^\kappa (|a| + \beta)^{-(\kappa+1)} \qquad \text{(MOE)}$$

- MOE = symmetrized Pareto Type II distribution
- $(\kappa, \beta)$ are non-informative parameters
- Other notable choices include the Jeffreys prior and its variants

Derived sparse coding problem:

$$a_j^* = \arg\min_a \frac{1}{2\sigma^2}\|x - Da\|^2 + \underbrace{(\kappa+1)\sum_{k=1}^p \log(|a_k| + \beta)}_{-\log Q(a)\ \text{(non-convex!)}} \qquad \text{(USM)}$$

Results: significant improvements over the LASSO formulation
- 3x sparse recovery success rate
- Up to 4 dB improvement in image denoising
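For reference, the USM objective is straightforward to evaluate; a small helper (names are illustrative, not from the paper):

```python
import numpy as np

def usm_objective(a, x, D, sigma2, kappa, beta):
    """Fit term plus the non-convex MOE-derived log penalty (USM)."""
    fit = 0.5 / sigma2 * np.sum((x - D @ a) ** 2)
    penalty = (kappa + 1.0) * np.sum(np.log(np.abs(a) + beta))
    return fit + penalty
```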
Optimization

Optimization technique
- Local Linear Approximation (LLA) [Ramírez and Sapiro, 2010] (independently proposed by Zou and Li [2008])
- Iteratively replace the non-convex $g(\cdot)$ by its linear (convex) approximation $\tilde g(\cdot)$ about the previous iterate $a^{t-1}$, and solve for $a^t$ (see the sketch after this list)
- $a^t$ converges to a stationary point of the original problem as $t \to \infty$
- About 5 iterations are usually enough

Performance
- Expected approximation error [Ramírez and Sapiro, 2010]:
$$\zeta_{\mathrm{MOE}}(a_k, \kappa, \beta) = \mathbb{E}_{\nu\sim\mathrm{MOE}(\kappa,\beta)}\left[\tilde g_k(\nu) - g(\nu)\right] = \log(a_k + \beta) + \frac{1}{a_k + \beta}\left[a_k + \frac{\beta}{\kappa - 1}\right] - \log\beta - \frac{1}{\kappa}$$
- In practice, the error is about 6% of the original cost function (good!)
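A compact LLA sketch under the definitions above: each outer iteration linearizes the log penalty at the current iterate, yielding a weighted-$\ell_1$ problem solved here with a weighted variant of ISTA (reusing `soft_threshold` from the earlier snippet). Illustrative, not the paper's exact implementation.

```python
import numpy as np

def lla_usm(x, D, sigma2, kappa, beta, n_lla=5, n_ista=200):
    """Minimize the USM objective by Local Linear Approximation (LLA)."""
    L = np.linalg.norm(D, 2) ** 2 / sigma2       # Lipschitz constant of the fit term
    a = np.zeros(D.shape[1])
    for _ in range(n_lla):                       # ~5 outer iterations usually suffice
        w = (kappa + 1.0) / (np.abs(a) + beta)   # slope of log(|.| + beta) at a
        for _ in range(n_ista):                  # weighted LASSO via weighted ISTA
            grad = D.T @ (D @ a - x) / sigma2
            a = soft_threshold(a - grad / L, w / L)
    return a
```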
III. MDL-based sparse modeling
MDL-based sparse modeling
Q1: What is the best sparse model for given data?

Model selection: find the best model $M^* \in \mathcal{M}$ for the given data
- In general: cost function = goodness-of-fit + model complexity
- Traditional model selection: AIC [Akaike, 1974], BIC [Schwarz, 1978]
  - risk-based: heavy assumptions, only valid for models of the same type
- Minimum Description Length (MDL) model selection
  - compression-based: milder assumptions, can compare models of any type
MDL in a nutshell

The MDL principle for model selection [Rissanen, 1978, 1984]
- Capture regularity = compress
- Cost function = codelength L(X)

Main result:
$$\underbrace{L(X)}_{\text{description length}} = \underbrace{L(X \mid M)}_{\text{stochastic complexity}} + \underbrace{L(M)}_{\text{parametric complexity}}$$

- Stochastic complexity: depends on the particularities of the data X.
- Parametric complexity: inevitable cost due to the flexibility of the model class $\mathcal{M}$. Data-independent.
Framework components
- Quantization: L(X) must be finite and shorter than the trivial description
- Probability assignment: deal with unknown parameters
$$P(E, A, D) = P(E|A, D)\, P(A|D)\, P(D)$$
- Codelength assignment: must be efficient to compute (a toy sketch follows this list)
$$L(X) = L(E, A, D) = L(E|A, D) + L(A|D) + L(D)$$
- Algorithms:
  - efficient model search within $\mathcal{M}$
  - efficient computation of L(X)
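A toy two-part codelength in the spirit of this decomposition (the constants and the 16-bit coefficient quantization are illustrative assumptions, not the paper's actual codes):

```python
import numpy as np
from math import comb, log2

def two_part_codelength(x, D, a, sigma2, coef_bits=16):
    """Toy L(x) = L(residual | a, D) + L(support) + L(values), in bits."""
    m, p = D.shape
    e = x - D @ a
    # Ideal Shannon codelength of the residual under N(0, sigma2).
    L_res = 0.5 * m * log2(2 * np.pi * sigma2) + np.sum(e ** 2) / (2 * sigma2 * np.log(2))
    k = int(np.count_nonzero(a))
    L_sup = log2(p + 1) + log2(comb(p, k))   # send k, then the support of size k
    L_val = k * coef_bits                    # uniformly quantized nonzero values
    return L_res + L_sup + L_val
```

Scanning this codelength over candidate supports is exactly the model-search problem the algorithms below address.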
Residual model

The random error variable E is the sum of two effects:
- Random noise: Gaussian $\mathcal{N}(0, \sigma_e^2)$ ($\sigma_e$ known)
- Model deviation: Laplacian $\mathcal{L}(\theta_e)$ ($\theta_e$ unknown)

Result: the Laplacian+Gaussian convolution distribution, $E \sim \mathrm{LG}(\sigma_e, \theta_e)$
- $L(E) = -\log \mathrm{LG}(E)$: an M-type estimator
- Unknown parameters handled via a universal mixture / two-part code $Q(E)$: $L(E) = -\log Q(E)$, a non-convex M-type estimator
Coefficients model
- The coefficient vector a is quantized to precision $\delta_a$ ($\delta_a = 1$ in the figure)
- Decomposed into three parts: $a \to (z, s, v)$
  - Sparsity: Bernoulli support vector z (unknown parameter)
  - Symmetry: sign vector s
  - Laplacian magnitude vector v (unknown parameter)
Dictionary model
- Piecewise smooth → predictable
- Atom prediction residuals $B_k = W d_k$ are encoded
- A causal bilinear prediction operator W maps atoms to residuals
$$L(D) \approx \theta_d \sum_{k=1}^p \|W d_k\|_1$$
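The precise causal bilinear predictor is specified in the paper; as an illustration only, a simple stand-in W built from horizontal and vertical first-order differences on a $\sqrt{m}\times\sqrt{m}$ atom (an assumption) shows how $\|W d_k\|_1$ rewards smooth, predictable atoms:

```python
import numpy as np

def atom_prediction_cost(d, side):
    """||W d||_1 with W = first-order neighbor prediction on a side x side atom."""
    patch = d.reshape(side, side)
    dh = np.diff(patch, axis=1)    # horizontal prediction residuals
    dv = np.diff(patch, axis=0)    # vertical prediction residuals
    return np.abs(dh).sum() + np.abs(dv).sum()
```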
Collaborative sequential encoding
- Assumption: the same unknown model parameters for every column of X
- Sequential encoding: parameters are learned from past samples (see the sketch below)
  - L(e), L(v) become convex and faster to compute
  - this is also a universal model (sequential plug-in)!
- Spatial correlation: Markov dependency on the supports
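A sketch of the plug-in sequential idea for the Laplacian magnitudes (the smoothing constants are illustrative assumptions): each sample is encoded with a scale estimated only from previously seen samples, so no parameter needs to be transmitted.

```python
import numpy as np

def sequential_laplacian_bits(v):
    """Codelength (bits) of v under a plug-in sequential Laplacian model."""
    total, abs_sum, t = 0.0, 0.0, 0
    for vt in v:
        b = (abs_sum + 1.0) / (t + 1.0)   # scale estimate from past samples only
        total += np.log2(2.0 * b) + abs(vt) / (b * np.log(2))  # -log2 Laplacian(vt; b)
        abs_sum += abs(vt); t += 1
    return total
```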
Coding algorithms
- Fixed dictionary D, data sample x.
- Goal: minimize $L(x|M)$, $M \in \mathcal{M}$,
$$\mathcal{M} = \bigcup_\gamma \mathcal{M}(\gamma), \qquad \mathcal{M}(\gamma) = \{a \in \mathbb{R}^p : \|a\|_0 \le \gamma\}, \qquad \gamma = 0, \ldots, p$$
- COdelength-MInimizing Pursuit Algorithm (COMPA):
  - forward-stepwise selection: pick the direction that maximizes the codelength decrease ΔL(x|a) (a greedy sketch follows below)
- COdelength-MInimizing Continuation Algorithm (COMICA):
  - convex relaxation of L(x|a), solved using coordinate descent (CD)
  - candidate supports z estimated via homotopy/continuation
  - candidate coefficient values v estimated via OLS
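A greedy sketch in the spirit of COMPA (not the exact algorithm): grow the support one atom at a time, refitting by OLS, and stop when no atom shortens the toy two-part codelength defined earlier.

```python
import numpy as np

def compa_like(x, D, sigma2):
    """Forward selection driven by two_part_codelength (defined above)."""
    p = D.shape[1]
    support, a = [], np.zeros(p)
    best_L = two_part_codelength(x, D, a, sigma2)
    while True:
        best_k, best_a = None, None
        for k in range(p):
            if k in support:
                continue
            S = support + [k]
            a_try = np.zeros(p)
            a_try[S] = np.linalg.lstsq(D[:, S], x, rcond=None)[0]  # OLS refit
            L = two_part_codelength(x, D, a_try, sigma2)
            if L < best_L:
                best_L, best_k, best_a = L, k, a_try
        if best_k is None:        # no atom shortens the description: stop
            return a
        support.append(best_k)
        a = best_a
```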
Coding results
(Figure: evolution of COMPA for a given sample.)
- Efficient: coding speed similar to OMP/LARS/CD
- Compression: 4.1 bpp for 8 bpp grayscale images. Rivals PNG!
Learning algorithm
- For a given data set X, choose the model M = (A, D) in
$$\mathcal{M} = \bigcup_{p=0}^{+\infty} \mathcal{M}(p), \qquad \mathcal{M}(p) = \left\{(A, D) : D \in \mathbb{R}^{m\times p}\right\}$$
that minimizes $L(x|M) = L(x|a) + L(a|D) + L(D)$
- Forward selection of atoms; start with the null dictionary D = [ ].
  - new atom = principal component
- Train using alternating minimization:
  - coding stage done using COMPA/COMICA
  - dictionary update: $\ell_1$-regularized update, solved using FISTA
$$D^t = \arg\min_D L(X - DA^t \mid \theta_e^t, \sigma_e^2) + \theta_d^t \sum_{k=1}^p \|W d_k\|_1$$
- Plug-in $(\theta_e, \theta_d)$ estimated from the previous iteration (EM-like)
Denoising results
(L) Noisy "Peppers", AWGN with σ = 20. (R) Recovered image (31.5 dB). Reference: 30.8 dB (K-SVD [Aharon et al., 2006]), 29.4 dB (MDL denoising [Roos et al., 2009]).

State-of-the-art denoising. Parameter free!
Texture segmentation
- A dictionary $D_i$ for each texture.
- Classification: patch x is assigned to class
$$i^* = \arg\min_i \left\{ L(x, D_i) + L(D_i) \right\}$$

95.4% correct. Parameter free!
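The classification rule is a one-liner once a codelength is available; a hypothetical sketch reusing the toy `two_part_codelength` and `compa_like` helpers from earlier (and ignoring the $L(D_i)$ term, which is shared across equally-sized dictionaries):

```python
import numpy as np

def classify(x, dicts, sigma2):
    """Assign x to the texture whose dictionary describes it most compactly."""
    costs = [two_part_codelength(x, Di, compa_like(x, Di, sigma2), sigma2)
             for Di in dicts]
    return int(np.argmin(costs))
```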
Low-rank matrix decomposition
- MDL for low-rank model selection: choose the best approximation rank
- Candidate models $(U, \Sigma, V, E)$: RPCA + continuation on $\lambda$
$$U\Sigma V^\top = \arg\min_W \|W\|_* + \lambda\|Y - W\|_1, \qquad E = X - U\Sigma V^\top$$
- Task: dynamic background recovery in video sequences

Videos: Shopping Mall | Lobby

State-of-the-art results, parameter free.
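As a toy analogue of MDL rank selection (scanning truncated-SVD candidates instead of the RPCA continuation path used in the paper; the 16-bit factor quantization is an assumption):

```python
import numpy as np
from math import log2

def mdl_rank(X, coef_bits=16):
    """Pick the rank whose two-part description of X is shortest (toy version)."""
    m, n = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    best = (np.inf, 0)
    for r in range(1, len(s) + 1):
        E = X - (U[:, :r] * s[:r]) @ Vt[:r]
        sigma2 = max(E.var(), 1e-12)
        L_res = 0.5 * E.size * log2(2 * np.pi * np.e * sigma2)  # Gaussian residual bits
        L_mod = coef_bits * r * (m + n + 1)                     # bits for U, s, V factors
        best = min(best, (L_res + L_mod, r))
    return best[1]
```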
IV. Conclusions and future work
Conclusions and future work
Conclusions
- A new perspective for understanding sparse models
- Solves the problem of choosing critical model parameters
- State of the art (or close to it) in the tested applications
- Computationally efficient, comparable to traditional methods

Future work
- Extend the MDL framework to structured sparse models
- Enlarge the models by including other variables (patch width?)
- Extend the concept of sparsity
Bibliography
M. Aharon, M. Elad, and A. Bruckstein. The K-SVD: an algorithm for designing overcomplete dictionaries for sparse representations. IEEE Trans. SP, 54(11):4311-4322, Nov. 2006.

I. Ramírez and G. Sapiro. Low-rank data modeling via the Minimum Description Length principle. ICASSP 2012. Preprint: http://arxiv.org/abs/1109.6297.

I. Ramírez and G. Sapiro. Universal regularizers for robust sparse coding and modeling. To appear in IEEE Trans. IP. Preprint: arXiv:1003.2941v2 [cs.IT], Aug. 2010.

I. Ramírez and G. Sapiro. An MDL framework for sparse coding and dictionary learning. IEEE Trans. SP, 60(6), 2012. Preprint: http://arxiv.org/abs/1110.2436.

J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.

J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Trans. IT, 30(4):629-636, 1984.

T. Roos, P. Myllymäki, and J. Rissanen. MDL denoising revisited. IEEE Trans. SP, 57(9):3347-3360, 2009.

H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509-1533, 2008. DOI: 10.1214/009053607000000802.