TRANSCRIPT
INFORMATION-THEORETIC SPARSE MODEL LEARNING AND SELECTION
Ignacio Ramírez ([email protected])
Facultad de Ingeniería
Outline
- I. Introduction to Sparse Models
- II. Universal sparsity-inducing regularizers
  - I. Ramírez and G. Sapiro, "Universal regularizers for robust sparse coding and modeling," to appear in IEEE Trans. IP. Preprint: arXiv:1003.2941v2, 2012.
- III. MDL-based sparse modeling
  - I. Ramírez and G. Sapiro, "An MDL framework for sparse coding and dictionary learning," IEEE Trans. SP, 60(6), 2012. DOI: 10.1109/TSP.2012.2187203. Preprint: arXiv:1110.2436.
  - I. Ramírez and G. Sapiro, "Low-rank data modeling via the Minimum Description Length principle," ICASSP 2012. Preprint: arXiv:1109.6297.
- IV. Conclusions and future work
I. Introduction to Sparse Models
Sparse Models
Why sparsity?
- Statistics: model selection
  - control model complexity
  - bias-variance tradeoff
- Signal processing
  - robust estimation
  - lossy compression (JPEG, MPEG, etc.)
- Machine learning
  - subspace modeling, reducing the curse of dimensionality
  - manifold learning

The concept has been around for over 100 years: PCA (Pearson, 1901). Only recently has it been studied formally as a general modeling principle.
Traditional Sparsity-Promoting Models
$$a = Dx, \quad x = D^{-1}a, \qquad x, a \in \mathbb{R}^m,\; D \in \mathbb{R}^{m\times m}$$
- Pre-designed: Fourier, DCT, wavelets
- Learned: KLT/PCA
Properties
- $x - D a_{1\ldots k} \approx 0$ for $k \ll m$
- Fast to compute (Fourier, DCT, wavelets)

Limitations
- Non-adaptive; square D only
Linear sparse coding
General problem statement
Given a dictionary/basis D, find a with few non-zeros such that x ≈ Da.
The first requirement is enforced by a regularizer g(·), the second by a fitting function f(·). Three possible forms:
$$a = \arg\min_u \tfrac{1}{2} f(x - Du) \;\;\text{s.t.}\;\; g(u) \le \tau \qquad \text{(pursuit-type)}$$
$$a = \arg\min_u \tfrac{1}{2} f(x - Du) + \tau\, g(u) \qquad \text{(Lagrangian)}$$
$$a = \arg\min_u g(u) \;\;\text{s.t.}\;\; f(x - Du) \le \tau \qquad \text{(denoising-type)}$$
Examples:
- $f = -\log p(\cdot)$, $g = \|a\|_0$ (AIC, BIC)
- $f = \|\cdot\|^2$, $g = \|a\|_1$ (LASSO)
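The Lagrangian form with $f$ the squared error and $g = \ell_1$ is the workhorse in what follows. As a concrete reference point, here is a minimal ISTA (iterative shrinkage-thresholding) sketch for solving it; the function names and toy problem sizes are illustrative, not from the talk.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: elementwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, D, tau, n_iter=500):
    """Solve a = argmin_u 0.5 ||x - D u||_2^2 + tau ||u||_1 by ISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)           # gradient of the quadratic fit term
        a = soft_threshold(a - grad / L, tau / L)
    return a

# Toy usage: overcomplete random dictionary, sparse ground truth.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128)); D /= np.linalg.norm(D, axis=0)
a0 = np.zeros(128); a0[rng.choice(128, 5, replace=False)] = 1.0
x = D @ a0 + 0.01 * rng.standard_normal(64)
a_hat = ista(x, D, tau=0.05)
```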
Learned Sparse Linear Models
Coding seeks sparse data representation
$$A^* = \arg\min_A \sum_{j=1}^n \tfrac{1}{2}\|x_j - Da_j\|_2^2 \;\;\text{s.t.}\;\; \|a_j\|_r \le \tau, \;\forall j$$
Learning: find a dictionary that encourages sparse representations
$$(A^*, D^*) = \arg\min_{A,D} \sum_{j=1}^n \tfrac{1}{2}\|x_j - Da_j\|_2^2 \;\;\text{s.t.}\;\; \|a_j\|_r \le \tau\;\forall j, \quad \|d_k\|_2 \le 1\;\forall k$$
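A minimal alternating-minimization sketch of the learning problem above, reusing the `ista` coder from the previous snippet: code all samples with D fixed, then update D by least squares and project its columns onto the $\|d_k\|_2 \le 1$ constraint. This illustrates the formulation only; it is not the authors' algorithm (theirs appears in Part III).

```python
import numpy as np

def learn_dictionary(X, p, tau, n_outer=20):
    """Alternate sparse coding (ista, defined above) and dictionary updates."""
    rng = np.random.default_rng(0)
    D = rng.standard_normal((X.shape[0], p))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_outer):
        A = np.column_stack([ista(x, D, tau) for x in X.T])  # coding step
        D = X @ np.linalg.pinv(A)                            # least-squares update
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)      # enforce ||d_k||_2 <= 1
    return D, A
```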
Burning questions
- Q1: What is the best sparse model for given data? How large should D be, how sparse A?
- Q2: How do we add robustness to the coding/learning process?
- Q3: How does one incorporate more prior information?
Information-theoretic sparse modeling
Idea: sparse models as compression tools
- The best model compresses the most: more regularity captured from the data
- Codelength: a universal metric to compare different models
- Beyond Bayesian: can include different model formulations
- A unifying framework for understanding sparse models, in all their variants, from a common perspective
II. Universal sparsity-inducing regularizers
Universal sparsity-inducing priors

Goal: bypass the critical choice of the regularization parameter

$$a^* = \arg\min_a \left\{ \tfrac{1}{2}\|x - Da\|^2 + \lambda\|a\|_1 \right\}$$

Reinterpreted as a codelength minimization problem:

$$a^* = \arg\min_a \Big\{ \underbrace{-\log P(x \mid a, \sigma^2)}_{L(x\mid a)} \; \underbrace{-\,\log P(a \mid \theta)}_{L(a)} \Big\},
\qquad P(x \mid a, \sigma^2) \propto e^{-\frac{1}{2\sigma^2}\|x - Da\|_2^2}, \quad P(a \mid \theta) \propto e^{-\theta\|a\|_1}$$

with $\lambda = 2\sigma^2\theta$ after factorization; $\sigma^2$ known, $\theta$ unknown.

Universal probability models: replace $P(a|\theta)$ by a single $Q(a)$ such that

$$-\log Q(a) \approx \min_\theta \{-\log P(a|\theta)\}, \quad \forall a$$
Universal probability models
- Model class: $\mathcal{M} = \{P(a|\theta),\, \theta \in \Theta\}$. Our case: $\mathcal{M}$ = all Laplacians with parameter $\theta \in (0, +\infty)$
- Codelength regret of a model $Q(a)$:
$$R(a, Q) := -\log Q(a) + \log P(a \mid \hat\theta(a))$$
- $Q(\cdot)$ is universal if $\frac{1}{n}\max_{a\in\mathbb{R}^p}\{R(a, Q)\} \to 0$: $Q(a)$ is asymptotically as good as any model in $\mathcal{M}$ for describing any $a$
- Our choice (among the possibilities dictated by theory): the convex mixture
$$Q(a) = \int_\Theta P(a|\theta)\, w(\theta)\, d\theta, \qquad w(\theta) \ge 0, \quad \int_\Theta w(\theta)\, d\theta = 1$$
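A quick numeric sanity check of the mixture construction (illustrative, using SciPy): integrating the Laplacian against a Gamma weight $w(\theta)$ reproduces the MOE closed form given on the next slide.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gamma

kappa, beta, a = 3.0, 1.0, 0.7
# Integrand: Laplacian(a | theta) times the Gamma(kappa, rate=beta) weight.
f = lambda th: 0.5 * th * np.exp(-th * abs(a)) * gamma.pdf(th, kappa, scale=1.0 / beta)
q_mix, _ = quad(f, 0.0, np.inf)
q_moe = 0.5 * kappa * beta ** kappa * (abs(a) + beta) ** (-(kappa + 1))
print(q_mix, q_moe)   # the two values agree
```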
Universal sparse modeling
Example: $w(\theta) = \Gamma(\theta; \kappa, \beta)$ (the conjugate prior of the Laplacian)

$$Q(a; \kappa, \beta) = \tfrac{1}{2}\kappa\beta^\kappa (|a| + \beta)^{-(\kappa+1)} \qquad \text{(MOE)}$$

- MOE = symmetrized Pareto Type II distribution
- $(\kappa, \beta)$ are non-informative parameters
- Other notable choices include the Jeffreys prior and its variants

Derived sparse coding problem:

$$a_j^* = \arg\min_a \frac{1}{2\sigma^2}\|x - Da\|^2 + \underbrace{(\kappa+1)\sum_{k=1}^p \log(|a_k| + \beta)}_{-\log Q(a)\ \text{(non-convex!)}} \qquad \text{(USM)}$$

Results: significant improvements over the LASSO formulation
- 3x sparse recovery success rate
- Up to 4 dB improvement in image denoising
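For reference, the USM objective is straightforward to evaluate; a small helper (names are illustrative, not from the paper):

```python
import numpy as np

def usm_objective(a, x, D, sigma2, kappa, beta):
    """Fit term plus the non-convex MOE-derived log penalty (USM)."""
    fit = 0.5 / sigma2 * np.sum((x - D @ a) ** 2)
    penalty = (kappa + 1.0) * np.sum(np.log(np.abs(a) + beta))
    return fit + penalty
```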
Optimization

Optimization technique
- Local Linear Approximation (LLA) [Ramírez and Sapiro, 2010] (independently proposed by Zou and Li [2008])
- Iteratively replace the non-convex $g(\cdot)$ by its linear (convex) approximation $\tilde g(\cdot)$ about the previous iterate $a^{t-1}$, and solve for $a^t$ (see the sketch after this list)
- $a^t$ converges to a stationary point of the original problem as $t \to \infty$
- About 5 iterations are usually enough

Performance
- Expected approximation error [Ramírez and Sapiro, 2010]:
$$\zeta_{\mathrm{MOE}}(a_k, \kappa, \beta) = \mathbb{E}_{\nu\sim\mathrm{MOE}(\kappa,\beta)}\left[\tilde g_k(\nu) - g(\nu)\right] = \log(a_k + \beta) + \frac{1}{a_k + \beta}\left[a_k + \frac{\beta}{\kappa - 1}\right] - \log\beta - \frac{1}{\kappa}$$
- In practice, the error is about 6% of the original cost function (good!)
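A compact LLA sketch under the definitions above: each outer iteration linearizes the log penalty at the current iterate, yielding a weighted-$\ell_1$ problem solved here with a weighted variant of ISTA (reusing `soft_threshold` from the earlier snippet). Illustrative, not the paper's exact implementation.

```python
import numpy as np

def lla_usm(x, D, sigma2, kappa, beta, n_lla=5, n_ista=200):
    """Minimize the USM objective by Local Linear Approximation (LLA)."""
    L = np.linalg.norm(D, 2) ** 2 / sigma2       # Lipschitz constant of the fit term
    a = np.zeros(D.shape[1])
    for _ in range(n_lla):                       # ~5 outer iterations usually suffice
        w = (kappa + 1.0) / (np.abs(a) + beta)   # slope of log(|.| + beta) at a
        for _ in range(n_ista):                  # weighted LASSO via weighted ISTA
            grad = D.T @ (D @ a - x) / sigma2
            a = soft_threshold(a - grad / L, w / L)
    return a
```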
III. MDL-based sparse modeling
MDL-based sparse modeling
Q1: What is the best sparse model for given data?

Model selection: find the best model $M^* \in \mathcal{M}$ for the given data
- In general: cost function = goodness-of-fit + model complexity
- Traditional model selection: AIC [Akaike, 1974], BIC [Schwarz, 1978]
  - risk-based: heavy assumptions, only valid for models of the same type
- Minimum Description Length (MDL) model selection
  - compression-based: milder assumptions, can compare models of any type
MDL in a nutshell

The MDL principle for model selection [Rissanen, 1978, 1984]
- Capture regularity = compress
- Cost function = codelength L(X)

Main result:
$$\underbrace{L(X)}_{\text{description length}} = \underbrace{L(X \mid M)}_{\text{stochastic complexity}} + \underbrace{L(M)}_{\text{parametric complexity}}$$

- Stochastic complexity: depends on the particularities of the data X.
- Parametric complexity: inevitable cost due to the flexibility of the model class $\mathcal{M}$. Data-independent.
Framework components
- Quantization: L(X) must be finite and shorter than the trivial description
- Probability assignment: deal with unknown parameters
$$P(E, A, D) = P(E|A, D)\, P(A|D)\, P(D)$$
- Codelength assignment: must be efficient to compute (a toy sketch follows this list)
$$L(X) = L(E, A, D) = L(E|A, D) + L(A|D) + L(D)$$
- Algorithms:
  - efficient model search within $\mathcal{M}$
  - efficient computation of L(X)
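A toy two-part codelength in the spirit of this decomposition (the constants and the 16-bit coefficient quantization are illustrative assumptions, not the paper's actual codes):

```python
import numpy as np
from math import comb, log2

def two_part_codelength(x, D, a, sigma2, coef_bits=16):
    """Toy L(x) = L(residual | a, D) + L(support) + L(values), in bits."""
    m, p = D.shape
    e = x - D @ a
    # Ideal Shannon codelength of the residual under N(0, sigma2).
    L_res = 0.5 * m * log2(2 * np.pi * sigma2) + np.sum(e ** 2) / (2 * sigma2 * np.log(2))
    k = int(np.count_nonzero(a))
    L_sup = log2(p + 1) + log2(comb(p, k))   # send k, then the support of size k
    L_val = k * coef_bits                    # uniformly quantized nonzero values
    return L_res + L_sup + L_val
```

Scanning this codelength over candidate supports is exactly the model-search problem the algorithms below address.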
Residual model

The random error variable E is the sum of two effects:
- Random noise: Gaussian $\mathcal{N}(0, \sigma_e^2)$ ($\sigma_e$ known)
- Model deviation: Laplacian $\mathcal{L}(\theta_e)$ ($\theta_e$ unknown)

Result: the Laplacian+Gaussian convolution distribution, $E \sim \mathrm{LG}(\sigma_e, \theta_e)$
- $L(E) = -\log \mathrm{LG}(E)$: an M-type estimator
- Unknown parameters handled via a universal mixture / two-part code $Q(E)$: $L(E) = -\log Q(E)$, a non-convex M-type estimator
Coefficients model
- The coefficient vector a is quantized to precision $\delta_a$ ($\delta_a = 1$ in the figure)
- Decomposed into three parts: $a \to (z, s, v)$
  - Sparsity: Bernoulli support vector z (unknown parameter)
  - Symmetry: sign vector s
  - Laplacian magnitude vector v (unknown parameter)
Dictionary model
- Piecewise smooth → predictable
- Atom prediction residuals $B_k = W d_k$ are encoded
- A causal bilinear prediction operator W maps atoms to residuals
$$L(D) \approx \theta_d \sum_{k=1}^p \|W d_k\|_1$$
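The precise causal bilinear predictor is specified in the paper; as an illustration only, a simple stand-in W built from horizontal and vertical first-order differences on a $\sqrt{m}\times\sqrt{m}$ atom (an assumption) shows how $\|W d_k\|_1$ rewards smooth, predictable atoms:

```python
import numpy as np

def atom_prediction_cost(d, side):
    """||W d||_1 with W = first-order neighbor prediction on a side x side atom."""
    patch = d.reshape(side, side)
    dh = np.diff(patch, axis=1)    # horizontal prediction residuals
    dv = np.diff(patch, axis=0)    # vertical prediction residuals
    return np.abs(dh).sum() + np.abs(dv).sum()
```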
Collaborative sequential encoding
- Assumption: the same unknown model parameters for every column of X
- Sequential encoding: parameters are learned from past samples (see the sketch below)
  - L(e), L(v) become convex and faster to compute
  - this is also a universal model (sequential plug-in)!
- Spatial correlation: Markov dependency on the supports
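A sketch of the plug-in sequential idea for the Laplacian magnitudes (the smoothing constants are illustrative assumptions): each sample is encoded with a scale estimated only from previously seen samples, so no parameter needs to be transmitted.

```python
import numpy as np

def sequential_laplacian_bits(v):
    """Codelength (bits) of v under a plug-in sequential Laplacian model."""
    total, abs_sum, t = 0.0, 0.0, 0
    for vt in v:
        b = (abs_sum + 1.0) / (t + 1.0)   # scale estimate from past samples only
        total += np.log2(2.0 * b) + abs(vt) / (b * np.log(2))  # -log2 Laplacian(vt; b)
        abs_sum += abs(vt); t += 1
    return total
```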
Coding algorithms
- Fixed dictionary D, data sample x.
- Goal: minimize $L(x|M)$, $M \in \mathcal{M}$,
$$\mathcal{M} = \bigcup_\gamma \mathcal{M}(\gamma), \qquad \mathcal{M}(\gamma) = \{a \in \mathbb{R}^p : \|a\|_0 \le \gamma\}, \qquad \gamma = 0, \ldots, p$$
- COdelength-MInimizing Pursuit Algorithm (COMPA):
  - forward-stepwise selection: pick the direction that maximizes the codelength decrease ΔL(x|a) (a greedy sketch follows below)
- COdelength-MInimizing Continuation Algorithm (COMICA):
  - convex relaxation of L(x|a), solved using coordinate descent (CD)
  - candidate supports z estimated via homotopy/continuation
  - candidate coefficient values v estimated via OLS
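A greedy sketch in the spirit of COMPA (not the exact algorithm): grow the support one atom at a time, refitting by OLS, and stop when no atom shortens the toy two-part codelength defined earlier.

```python
import numpy as np

def compa_like(x, D, sigma2):
    """Forward selection driven by two_part_codelength (defined above)."""
    p = D.shape[1]
    support, a = [], np.zeros(p)
    best_L = two_part_codelength(x, D, a, sigma2)
    while True:
        best_k, best_a = None, None
        for k in range(p):
            if k in support:
                continue
            S = support + [k]
            a_try = np.zeros(p)
            a_try[S] = np.linalg.lstsq(D[:, S], x, rcond=None)[0]  # OLS refit
            L = two_part_codelength(x, D, a_try, sigma2)
            if L < best_L:
                best_L, best_k, best_a = L, k, a_try
        if best_k is None:        # no atom shortens the description: stop
            return a
        support.append(best_k)
        a = best_a
```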
Coding results
(Figure: evolution of COMPA for a given sample.)
- Efficient: coding speed similar to OMP/LARS/CD
- Compression: 4.1 bpp for 8 bpp grayscale images. Rivals PNG!
Learning algorithm
- For a given data set X, choose the model M = (A, D) in
$$\mathcal{M} = \bigcup_{p=0}^{+\infty} \mathcal{M}(p), \qquad \mathcal{M}(p) = \left\{(A, D) : D \in \mathbb{R}^{m\times p}\right\}$$
that minimizes $L(x|M) = L(x|a) + L(a|D) + L(D)$
- Forward selection of atoms; start with the null dictionary D = [ ].
  - new atom = principal component
- Train using alternating minimization:
  - coding stage done using COMPA/COMICA
  - dictionary update: $\ell_1$-regularized update, solved using FISTA
$$D^t = \arg\min_D L(X - DA^t \mid \theta_e^t, \sigma_e^2) + \theta_d^t \sum_{k=1}^p \|W d_k\|_1$$
- Plug-in $(\theta_e, \theta_d)$ estimated from the previous iteration (EM-like)
Denoising results
(L) Noisy "Peppers", AWGN with σ = 20. (R) Recovered image (31.5 dB). Reference: 30.8 dB (K-SVD [Aharon et al., 2006]), 29.4 dB (MDL denoising [Roos et al., 2009]).

State-of-the-art denoising. Parameter free!
Texture segmentation
- A dictionary $D_i$ for each texture.
- Classification: patch x is assigned to class
$$i^* = \arg\min_i \left\{ L(x, D_i) + L(D_i) \right\}$$

95.4% correct. Parameter free!
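The classification rule is a one-liner once a codelength is available; a hypothetical sketch reusing the toy `two_part_codelength` and `compa_like` helpers from earlier (and ignoring the $L(D_i)$ term, which is shared across equally-sized dictionaries):

```python
import numpy as np

def classify(x, dicts, sigma2):
    """Assign x to the texture whose dictionary describes it most compactly."""
    costs = [two_part_codelength(x, Di, compa_like(x, Di, sigma2), sigma2)
             for Di in dicts]
    return int(np.argmin(costs))
```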
Low-rank matrix decomposition
- MDL for low-rank model selection: choose the best approximation rank
- Candidate models $(U, \Sigma, V, E)$: RPCA + continuation on $\lambda$
$$U\Sigma V^\top = \arg\min_W \|W\|_* + \lambda\|Y - W\|_1, \qquad E = X - U\Sigma V^\top$$
- Task: dynamic background recovery in video sequences

Videos: Shopping Mall | Lobby

State-of-the-art results, parameter free.
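As a toy analogue of MDL rank selection (scanning truncated-SVD candidates instead of the RPCA continuation path used in the paper; the 16-bit factor quantization is an assumption):

```python
import numpy as np
from math import log2

def mdl_rank(X, coef_bits=16):
    """Pick the rank whose two-part description of X is shortest (toy version)."""
    m, n = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    best = (np.inf, 0)
    for r in range(1, len(s) + 1):
        E = X - (U[:, :r] * s[:r]) @ Vt[:r]
        sigma2 = max(E.var(), 1e-12)
        L_res = 0.5 * E.size * log2(2 * np.pi * np.e * sigma2)  # Gaussian residual bits
        L_mod = coef_bits * r * (m + n + 1)                     # bits for U, s, V factors
        best = min(best, (L_res + L_mod, r))
    return best[1]
```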
IV. Conclusions and future work
Conclusions and future work
Conclusions
- A new perspective for understanding sparse models
- Solves the problem of choosing critical model parameters
- State of the art (or close to it) in the tested applications
- Computationally efficient, comparable to traditional methods

Future work
- Extend the MDL framework to structured sparse models
- Enlarge the models by including other variables (patch width?)
- Extend the concept of sparsity
Bibliography
M. Aharon, M. Elad, and A. Bruckstein. The K-SVD: an algorithm for designing overcomplete dictionaries for sparse representations. IEEE Trans. SP, 54(11):4311-4322, Nov. 2006.

I. Ramírez and G. Sapiro. Low-rank data modeling via the Minimum Description Length principle. ICASSP 2012. Preprint: http://arxiv.org/abs/1109.6297.

I. Ramírez and G. Sapiro. Universal regularizers for robust sparse coding and modeling. To appear in IEEE Trans. IP. Preprint: arXiv:1003.2941v2 [cs.IT], Aug. 2010.

I. Ramírez and G. Sapiro. An MDL framework for sparse coding and dictionary learning. IEEE Trans. SP, 60(6), 2012. Preprint: http://arxiv.org/abs/1110.2436.

J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.

J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Trans. IT, 30(4):629-636, 1984.

T. Roos, P. Myllymäki, and J. Rissanen. MDL denoising revisited. IEEE Trans. SP, 57(9):3347-3360, 2009.

H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509-1533, 2008. DOI: 10.1214/009053607000000802.