TRANSCRIPT
-
An Introduction to Sparse Coding and Dictionary Learning
Kai Cao
January 14, 2014
-
Outline
• Introduction
• Mathematical foundation
• Sparse coding
• Dictionary learning
• Summary
-
Introduction
-
What is sparsity?
• Sparsity implies many zeros in a vector or a matrix
[Figure: a fingerprint patch and its FFT response. The FFT transform maps the patch to a sparse representation; the IFFT transform reconstructs the patch from the few large coefficients. Usage: compression, analysis, denoising, …]
-
Sparse Representation
[Figure: x ≈ Dα. Learning the dictionary D from data is the dictionary learning problem; finding the sparse code α for a given D is the sparse coding problem.]
-
Application---Denoising
[M. Elad, Springer 2010]
[Figure: source image, noisy image (PSNR = 22.1 dB), the learned dictionary, and the denoised result (PSNR = 30.829 dB).]
-
Application---Compression
[O. Bryt, M. Elad, 2008]
550 bytes per image
[Figure: face images compressed at 550 bytes per image; the bottom row gives the RMSE of each reconstruction.]
JPEG: 15.81, 14.67, 15.30
JPEG 2000: 13.89, 12.41, 12.57
PCA: 9.44, 10.66, 10.27
Dictionary based: 6.60, 5.49, 6.36
-
Mathematical foundation
-
Derivatives of vectors
• First order: $\frac{\partial (a^T x)}{\partial x} = \frac{\partial (x^T a)}{\partial x} = a$
• Second order: $\frac{\partial (x^T B x)}{\partial x} = (B + B^T)\, x$
• Exercise: $\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x - D\alpha\|_2^2 + \frac{\lambda}{2}\|\alpha\|_2^2$, where $x \in \mathbb{R}^n$, $D \in \mathbb{R}^{n \times m}$. Setting the gradient to zero gives $\alpha = (D^T D + \lambda I)^{-1} D^T x$ (checked numerically in the sketch below).
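A minimal numpy sketch of this closed form, with made-up sizes and data, that solves the normal equations and verifies that the gradient vanishes:

```python
import numpy as np

# Sketch: closed-form minimizer of (1/2)||x - D a||^2 + (lam/2)||a||^2,
# namely a = (D^T D + lam I)^{-1} D^T x. Sizes and data are illustrative.
rng = np.random.default_rng(0)
n, m, lam = 20, 10, 0.1
D = rng.standard_normal((n, m))
x = rng.standard_normal(n)

alpha = np.linalg.solve(D.T @ D + lam * np.eye(m), D.T @ x)

# The gradient -D^T (x - D alpha) + lam * alpha vanishes at the optimum.
grad = -D.T @ (x - D @ alpha) + lam * alpha
assert np.allclose(grad, 0)
```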
-
Trace of a Matrix
• Definition: $Tr(A) = \sum_{i=1}^n a_{ii}$, for $A = (a_{ij}) \in \mathbb{R}^{n \times n}$
• Properties (numerically spot-checked below):
  – $\|A\|_F^2 = \sum_{i=1}^n \sum_{j=1}^n a_{ij}^2 = Tr(A^T A)$
  – $Tr(A + B) = Tr(A) + Tr(B)$, for $B \in \mathbb{R}^{n \times n}$
  – $Tr(aA) = a\, Tr(A)$, for $a \in \mathbb{R}$
  – $Tr(A^T) = Tr(A)$
  – $Tr(AB) = Tr(BA)$
  – $Tr(ABC) = Tr(BCA) = Tr(CAB)$, for $B, C \in \mathbb{R}^{n \times n}$
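These identities are easy to sanity-check numerically; a small sketch with random matrices (illustrative only):

```python
import numpy as np

# Numerical spot-check of the trace properties listed above.
rng = np.random.default_rng(1)
n = 5
A, B, C = (rng.standard_normal((n, n)) for _ in range(3))

assert np.isclose(np.linalg.norm(A, 'fro') ** 2, np.trace(A.T @ A))
assert np.isclose(np.trace(A + B), np.trace(A) + np.trace(B))
assert np.isclose(np.trace(3.0 * A), 3.0 * np.trace(A))
assert np.isclose(np.trace(A.T), np.trace(A))
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
```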
-
Derivatives of traces
• First order: $\frac{\partial\, Tr(XA)}{\partial X} = A^T$, $\quad \frac{\partial\, Tr(X^T A)}{\partial X} = A$
• Second order: $\frac{\partial\, Tr(X^T X A)}{\partial X} = X A^T + X A$, $\quad \frac{\partial\, Tr(X^T B X)}{\partial X} = B^T X + B X$
• Exercise: $\min_{A \in \mathbb{R}^{k \times m}} \|X - DA\|_F^2 + \lambda \|A\|_F^2$, where $X \in \mathbb{R}^{n \times m}$, $D \in \mathbb{R}^{n \times k}$. Setting the gradient to zero gives $A = (D^T D + \lambda I)^{-1} D^T X$ (see the sketch below).
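The matrix-valued closed form can be checked the same way; a brief sketch with made-up sizes:

```python
import numpy as np

# Sketch: A = (D^T D + lam I)^{-1} D^T X minimizes
# ||X - D A||_F^2 + lam ||A||_F^2 (illustrative sizes).
rng = np.random.default_rng(2)
n, k, m, lam = 20, 8, 50, 0.5
D = rng.standard_normal((n, k))
X = rng.standard_normal((n, m))

A = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ X)

# Gradient 2 D^T (D A - X) + 2 lam A vanishes at the optimum.
assert np.allclose(2 * D.T @ (D @ A - X) + 2 * lam * A, 0)
```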
-
Sparse coding
-
Sparse linear model
• Let x ∈ Rn be a signal.
• Let D = [d1, d2, …, dm] ∈ Rn×m be a set of normalized "basis vectors" (the dictionary), with diT di = 1.
• Sparse representation finds a sparse vector α ∈ Rm such that x ≈ Dα; α is called the sparse code.
[Figure: the signal x expressed as the dictionary D times the sparse code α.]
-
The sparse coding model
• Objective function: $\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda \varphi(\alpha)$ (data fitting term plus regularization term)
• The regularization term φ can be, as evaluated in the snippet below:
  – the l2 norm: $\|\alpha\|_2^2 = \sum_{i=1}^m \alpha_i^2$
  – the l0 norm: $\|\alpha\|_0 = \#\{i \mid \alpha_i \neq 0\}$ (sparsity inducing)
  – the l1 norm: $\|\alpha\|_1 = \sum_{i=1}^m |\alpha_i|$ (sparsity inducing)
  – …
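For concreteness, a tiny snippet evaluating the three regularizers on a hypothetical sparse code (the values are made up for illustration):

```python
import numpy as np

alpha = np.array([0.0, 0.577, 0.0, 0.0, -0.289])  # a hypothetical sparse code
l0 = np.count_nonzero(alpha)   # ||alpha||_0 = 2 (number of nonzeros)
l1 = np.abs(alpha).sum()       # ||alpha||_1 = 0.866
l2_sq = np.sum(alpha ** 2)     # ||alpha||_2^2 ~ 0.416
```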
-
Matching pursuit
1. Initialization: α = 0, residual r = x
2. While ||α||0 < L:
   – select the atom most correlated with the residual: i = argmaxj |djT r|
   – update its coefficient: αi ← αi + diT r
   – update the residual: r ← r − (diT r) di
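A minimal Python sketch of matching pursuit, assuming the columns of D are unit-norm (the function name and loop form are illustrative choices, not from the slides):

```python
import numpy as np

def matching_pursuit(x, D, L):
    """Greedy matching pursuit (assumes unit-norm columns of D).
    The slides loop while ||alpha||_0 < L; taking L greedy steps is
    equivalent unless an atom happens to be re-selected."""
    alpha = np.zeros(D.shape[1])
    r = x.copy()
    for _ in range(L):
        c = D.T @ r                    # correlations c_i = d_i^T r
        i = int(np.argmax(np.abs(c)))  # best-matching atom
        alpha[i] += c[i]               # old coefficients are never refit
        r = r - c[i] * D[:, i]         # remove the atom's contribution
    return alpha
```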
-
An example for matching pursuit
[Figure: a patch x from a latent image and dictionary elements d1, d2, d3, d4, d5, …]
Step 1: correlations ci = diT x: c1 = -0.039, c2 = 0.577, c3 = 0.054, c4 = -0.031, c5 = -0.437. Select d2; residual r = x - 0.577 d2.
Step 2: correlations ci = diT r: c1 = -0.035, c2 = 0, c3 = 0.037, c4 = -0.046, c5 = -0.289. Select d5; residual r = x - 0.577 d2 - (-0.289) d5. The coefficient on d2 does not update!
Reconstructed patch: x̂ = 0.577 d2 + (-0.289) d5, with ||x - x̂||² = 0.763.
-
Orthogonal matching pursuit
1. Initialization: α = 0, residual r = x, active set Ω = ∅
2. While ||α||0 < L:
   – select the atom most correlated with the residual: i = argmaxj |djT r|, and add it: Ω ← Ω ∪ {i}
   – recompute all active coefficients by least squares: αΩ = argminβ ||x − DΩ β||2²
   – update the residual: r ← x − DΩ αΩ
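A corresponding OMP sketch; the difference from matching pursuit is the least-squares refit of all active coefficients at each step (the function name is illustrative):

```python
import numpy as np

def orthogonal_matching_pursuit(x, D, L):
    """OMP sketch (assumes unit-norm columns of D): after each atom is
    selected, refit ALL active coefficients by least squares, so the
    residual stays orthogonal to the selected atoms."""
    alpha = np.zeros(D.shape[1])
    r = x.copy()
    active = []                        # the active set Omega
    while len(active) < L:
        i = int(np.argmax(np.abs(D.T @ r)))
        if i in active:                # residual already orthogonal: stop
            break
        active.append(i)
        coef, *_ = np.linalg.lstsq(D[:, active], x, rcond=None)
        alpha[:] = 0.0
        alpha[active] = coef
        r = x - D[:, active] @ coef
    return alpha
```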
-
An example for orthogonal matching pursuit
Step 1: correlations ci = diT x: c1 = -0.039, c2 = 0.577, c3 = 0.054, c4 = -0.031, c5 = -0.437. Select d2; residual r = x - 0.577 d2.
Step 2: correlations ci = diT r: c1 = -0.035, c2 = 0, c3 = 0.037, c4 = -0.046, c5 = -0.289. Select d5, then refit both active coefficients by least squares: the coefficients become 0.499 (d2) and -0.309 (d5). Residual r = x - 0.499 d2 - (-0.309) d5.
Reconstructed patch: x̂ = 0.499 d2 + (-0.309) d5, with ||x - x̂||² = 0.759, slightly below matching pursuit's 0.763.
-
Why does l1-norm induce sparsity?
• Analysis in 1D (comparison with l2); both solutions are implemented in the sketch below.
l1: $\min_{\alpha \in \mathbb{R}} \frac{1}{2}(x - \alpha)^2 + \lambda|\alpha|$ $\Rightarrow$ if $x \geq \lambda$, $\alpha = x - \lambda$; if $x \leq -\lambda$, $\alpha = x + \lambda$; else $\alpha = 0$ (soft thresholding).
l2: $\min_{\alpha \in \mathbb{R}} \frac{1}{2}(x - \alpha)^2 + \lambda\alpha^2$ $\Rightarrow$ $\alpha = x/(1 + 2\lambda)$ (shrinkage, never exactly zero).
[Figure: α as a function of x. The soft-thresholding curve is flat at zero on [-λ, λ], while the l2 solution is a line of slope 1/(1 + 2λ) through the origin.]
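Both 1D solutions are one-liners; a small sketch (function names are illustrative) showing that l1 zeroes small inputs while l2 only shrinks them:

```python
import numpy as np

def soft_threshold(x, lam):
    """1D l1 solution: shrink by lam, exactly zero on [-lam, lam]."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def shrink_l2(x, lam):
    """1D l2 solution: scale by 1/(1 + 2 lam), never exactly zero."""
    return x / (1.0 + 2.0 * lam)

xs = np.array([-2.0, -0.3, 0.0, 0.3, 2.0])
print(soft_threshold(xs, 0.5))  # [-1.5  -0.    0.    0.    1.5 ]
print(shrink_l2(xs, 0.5))       # [-1.   -0.15  0.    0.15  1.  ]
```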
-
Why does l1-norm induce sparsity?
• Analysis in 2D (comparison with l2)
l1: $\min_{\alpha} \frac{1}{2}\|x - \alpha\|_2^2 + \lambda\|\alpha\|_1$ $\Leftrightarrow$ $\min_{\alpha} \frac{1}{2}\|x - \alpha\|_2^2$ s.t. $\|\alpha\|_1 \leq \mu$
l2: $\min_{\alpha} \frac{1}{2}\|x - \alpha\|_2^2 + \lambda\|\alpha\|_2^2$ $\Leftrightarrow$ $\min_{\alpha} \frac{1}{2}\|x - \alpha\|_2^2$ s.t. $\|\alpha\|_2 \leq \mu$
[Figure: in the (α1, α2) plane, the l1 ball is a diamond of radius μ with corners on the axes, so the projection of x often lands on a corner where one coordinate is exactly zero; the l2 ball is a disc, whose projections almost never lie on an axis.]
-
Optimality condition for l1-norm regularization
Objective: $\min_{\alpha \in \mathbb{R}^m} J(\alpha) = \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1$
• Directional derivative in the direction u at α: $\nabla J(\alpha, u) = \lim_{t \to 0^+} \frac{J(\alpha + tu) - J(\alpha)}{t}$
• g is a subgradient of J at α if and only if $\forall t \in \mathbb{R}^m,\; J(t) \geq J(\alpha) + g^T (t - \alpha)$
• Proposition 1: g is a subgradient $\Leftrightarrow \forall u \in \mathbb{R}^m,\; g^T u \leq \nabla J(\alpha, u)$
• Proposition 2: if J is differentiable at α, $\nabla J(\alpha, u) = \nabla J(\alpha)^T u$
• Proposition 3: α is optimal if and only if for all u, $\nabla J(\alpha, u) \geq 0$
-
Subgradient for l1-norm regularization
• Example: f(x) = |x|
Directional derivative: $\nabla f(x, u) = |u|$ if $x = 0$; $\nabla f(x, u) = \mathrm{sign}(x)\, u$ if $x \neq 0$.
[Figure: the graph of f(x) = |x| and its subgradients; the subdifferential is {sign(x)} for x ≠ 0 and the whole interval [-1, 1] at x = 0.]
-
Subgradient for l1-norm regularization
Objective: $\min_{\alpha \in \mathbb{R}^m} J(\alpha) = \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1$
Directional derivative: $\nabla J(\alpha, u) = -u^T D^T (x - D\alpha) + \lambda \sum_{i:\, \alpha_i \neq 0} \mathrm{sign}(\alpha_i)\, u_i + \lambda \sum_{i:\, \alpha_i = 0} |u_i|$
• g is a subgradient at α if and only if for all i:
  – $g_i = -d_i^T(x - D\alpha) + \lambda\, \mathrm{sign}(\alpha_i)$ if $\alpha_i \neq 0$
  – $|g_i + d_i^T(x - D\alpha)| \leq \lambda$ if $\alpha_i = 0$
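Setting g = 0 in these conditions gives a practical optimality test: |diT(x − Dα)| ≤ λ wherever αi = 0, and diT(x − Dα) = λ sign(αi) wherever αi ≠ 0. A sketch of that check (the helper function is hypothetical, not from the slides):

```python
import numpy as np

def is_l1_optimal(x, D, alpha, lam, tol=1e-6):
    """Check the zero-subgradient conditions for
    min (1/2)||x - D a||_2^2 + lam ||a||_1."""
    c = D.T @ (x - D @ alpha)          # correlations with the residual
    active = alpha != 0
    on_active = np.allclose(c[active], lam * np.sign(alpha[active]), atol=tol)
    on_inactive = np.all(np.abs(c[~active]) <= lam + tol)
    return on_active and on_inactive
```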
-
First order method for convex optimization
• Differentiable objective
  – Gradient descent: $\alpha_{t+1} = \alpha_t - \eta_t \nabla J(\alpha_t)$
  – With line search for a descent step size $\eta_t$
  – Diminishing step size: e.g., $\eta_t = (t + t_0)^{-1}$
• Non-differentiable objective
  – Subgradient descent: $\alpha_{t+1} = \alpha_t - \eta_t g_t$, where $g_t$ is a subgradient (sketched below for the l1 problem)
  – With line search
  – Diminishing step size
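A short sketch of subgradient descent with the diminishing step size applied to the l1-regularized problem; since subgradient steps are not monotone, the best iterate is kept (parameters are illustrative):

```python
import numpy as np

def lasso_subgradient_descent(x, D, lam, steps=1000, t0=10.0):
    """Subgradient descent for min (1/2)||x - D a||_2^2 + lam ||a||_1."""
    alpha = np.zeros(D.shape[1])
    best, best_val = alpha.copy(), np.inf
    for t in range(steps):
        g = -D.T @ (x - D @ alpha) + lam * np.sign(alpha)  # a subgradient
        alpha = alpha - g / (t + t0)                       # diminishing step
        val = 0.5 * np.sum((x - D @ alpha) ** 2) + lam * np.abs(alpha).sum()
        if val < best_val:
            best, best_val = alpha.copy(), val
    return best
```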
-
Reformulation as quadratic program
$\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1$
$\Leftrightarrow \min_{\alpha_+, \alpha_- \in \mathbb{R}_+^m} \frac{1}{2}\|x - D\alpha_+ + D\alpha_-\|_2^2 + \lambda\, (\mathbf{1}^T \alpha_+ + \mathbf{1}^T \alpha_-)$, with $\alpha = \alpha_+ - \alpha_-$
The result is a smooth quadratic program with nonnegativity constraints (see the sketch below).
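Since the split objective is smooth with only nonnegativity bounds, any bound-constrained solver applies; a sketch using scipy's L-BFGS-B (the split-variable helper is an illustration, not part of the slides):

```python
import numpy as np
from scipy.optimize import minimize

def lasso_via_split(x, D, lam):
    """Solve min over (a+, a-) >= 0 of
    (1/2)||x - D(a+ - a-)||_2^2 + lam * 1^T (a+ + a-),
    then recover alpha = a+ - a-."""
    m = D.shape[1]

    def fun_and_grad(z):
        ap, am = z[:m], z[m:]
        r = x - D @ (ap - am)
        f = 0.5 * r @ r + lam * z.sum()
        g = -D.T @ r     # df/da+ = g + lam, df/da- = -g + lam
        return f, np.concatenate([g, -g]) + lam

    res = minimize(fun_and_grad, np.zeros(2 * m), jac=True,
                   method="L-BFGS-B", bounds=[(0, None)] * (2 * m))
    return res.x[:m] - res.x[m:]
```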
-
Dictionary Learning
-
Dictionary selection
• Which D to use?
• A fixed set of bases:
  – steerable wavelets
  – contourlets
  – DCT basis
  – …
• A data-adaptive dictionary, learned from data:
  – K-SVD (l0-norm)
  – online dictionary learning (l1-norm)
-
The objective function for K-SVD
[Figure: X ≈ DA; each example (column of X) is a linear combination of atoms from D.]
$\min_{D, A} \|X - DA\|_F^2$ s.t. $\forall j,\; \|\alpha_j\|_0 \leq L$
Each example has a sparse representation with no more than L atoms.
www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt
-
K–SVD – An Overview
[Figure: the K-SVD loop. Initialize D, then alternate:]
1. Sparse coding: compute A with MP or OMP, keeping D fixed.
2. Dictionary update: update D column-by-column by SVD computation.
www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt
-
K–SVD: Sparse Coding Stage
$\min_A \|X - DA\|_F^2$ s.t. $\forall j,\; \|\alpha_j\|_0 \leq L$
For the jth example we solve $\min_{\alpha} \|x_j - D\alpha\|_2^2$ s.t. $\|\alpha\|_0 \leq L$: ordinary sparse coding!
www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt
-
K–SVD: Dictionary Update Stage
$\min_D \|X - DA\|_F^2$ s.t. $\forall j,\; \|\alpha_j\|_0 \leq L$
For the kth atom we solve $\min_{d_k} \|E_k - d_k \alpha_T^k\|_F^2$, where $E_k = X - \sum_{i \neq k} d_i \alpha_T^i$ (the residual) and $\alpha_T^i$ denotes the ith row of A.
Solve with SVD: $E_k = U \Lambda V^T$, $d_k = u_1$.
www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt
-
K–SVD Dictionary Update Stage
Only some of the examples use column dk!
When updating $\alpha_T^k$, only recompute the coefficients corresponding to those examples: we want to solve $\min_{d_k,\, \alpha_T^k} \|E_k - d_k \alpha_T^k\|_F^2$ restricted to those columns. Solve with SVD (a sketch follows)!
www.cs.technion.ac.il/~ronrubin/Talks/K-SVD.ppt
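A compact numpy sketch of one atom update, combining the residual computation, the restriction to the examples that use dk, and the rank-1 SVD fit (the function name and in-place updates are implementation choices, not from the slides):

```python
import numpy as np

def ksvd_atom_update(X, D, A, k):
    """One K-SVD dictionary-update step for atom k (sketch)."""
    omega = np.nonzero(A[k, :])[0]     # examples whose codes use atom k
    if omega.size == 0:
        return D, A                    # unused atom: leave it unchanged
    # Residual without d_k's contribution, restricted to those examples.
    E = X[:, omega] - D @ A[:, omega] + np.outer(D[:, k], A[k, omega])
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]                  # d_k = first left singular vector
    A[k, omega] = s[0] * Vt[0, :]      # matching row of coefficients
    return D, A
```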
-
Compare K-SVD with K-means
K-SVD: initialize the dictionary → sparse coding (MP or OMP) → dictionary update (column-by-column by SVD computation).
K-means: initialize the cluster centers → assignment of each vector to a center → cluster centers update (cluster-by-cluster).
-
Dictionary learning with l1-norm regularization
• Objective function for l1-norm regularization:
  $\min_D \frac{1}{t} \sum_{i=1}^t \frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1$,
  where $\alpha_i = \arg\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x_i - D\alpha\|_2^2 + \lambda\|\alpha\|_1$
• Advantages of online learning:
  – handles large and dynamic datasets,
  – can be much faster than batch algorithms.
-
Dictionary learning with l1-norm regularization
$F_t(D) = \frac{1}{t} \sum_{i=1}^t \left( \frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1 \right)$
$\quad\;\; = \frac{1}{t} \left( \frac{1}{2} Tr(D^T D A_t) - Tr(D^T B_t) \right) + \frac{1}{t} \sum_{i=1}^t \lambda\|\alpha_i\|_1 + \text{const}$,
where $A_t = \sum_{i=1}^t \alpha_i \alpha_i^T$, $B_t = \sum_{i=1}^t x_i \alpha_i^T$, and the constant collects the terms $\frac{1}{2}\|x_i\|_2^2$, which do not depend on D.
Gradient: $\frac{\partial F_t(D)}{\partial D} = \frac{1}{t}(D A_t - B_t)$
For a new $x_{t+1}$: $A_{t+1} = A_t + \alpha_{t+1} \alpha_{t+1}^T$, $\quad B_{t+1} = B_t + x_{t+1} \alpha_{t+1}^T$
-
On-line dictionary learning (a sketch follows)
1) Initialization: D0 ∈ Rn×m; A0 = 0; B0 = 0
2) For t = 1, …, T
3)   Draw xt from the training data set
4)   Get the sparse code: $\alpha_t = \arg\min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\|x_t - D_{t-1}\alpha\|_2^2 + \lambda\|\alpha\|_1$
5)   Aggregate sufficient statistics: $A_t = A_{t-1} + \alpha_t \alpha_t^T$, $B_t = B_{t-1} + x_t \alpha_t^T$
6)   Dictionary update: $D_t = D_{t-1} - \rho\, \frac{\partial F_t(D_{t-1})}{\partial D}$
7) End for
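A hedged end-to-end sketch of this loop. Step 4 needs some sparse solver; this sketch substitutes a few ISTA (soft-thresholding) iterations, which is an assumption not in the slides, as is the per-column renormalization of D:

```python
import numpy as np

def online_dictionary_learning(data, n_atoms, lam, rho=0.1, T=1000, seed=0):
    """Online dictionary learning sketch driven by the sufficient
    statistics A_t and B_t (data has one training sample per column)."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    D = rng.standard_normal((n, n_atoms))
    D /= np.linalg.norm(D, axis=0)                   # unit-norm atoms
    A = np.zeros((n_atoms, n_atoms))
    B = np.zeros((n, n_atoms))
    for t in range(1, T + 1):
        x = data[:, rng.integers(data.shape[1])]     # 3) draw a sample
        step = 1.0 / np.linalg.norm(D, 2) ** 2       # ISTA step size
        alpha = np.zeros(n_atoms)
        for _ in range(50):                          # 4) ISTA sparse coding
            z = alpha + step * (D.T @ (x - D @ alpha))
            alpha = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
        A += np.outer(alpha, alpha)                  # 5) sufficient statistics
        B += np.outer(x, alpha)
        D -= (rho / t) * (D @ A - B)                 # 6) gradient step on F_t
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)  # renormalize (assumption)
    return D
```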
-
Toolbox - SPAMS
• SPArse Modeling Software:
  – Sparse coding: l0-norm regularization, l1-norm regularization, …
  – Dictionary learning: K-SVD, online dictionary learning, …
• Implemented in C++ with a Matlab interface
• http://spams-devel.gforge.inria.fr/
-
Summary
• Sparsity and sparse representation
• Sparse coding with l0- and l1-norm regularization
  – matching pursuit / orthogonal matching pursuit
  – subgradients and the optimality condition
• Dictionary learning with l0- and l1-norm regularization
  – K-SVD
  – online dictionary learning
• Try it out!
-
References
• T. T. Cai and L. Wang. Orthogonal matching pursuit for sparse signal recovery with noise. IEEE Transactions on Information Theory, 57(7):4680–4688, 2011.
• B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–499, 2004.
• M. Aharon, M. Elad, and A. M. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, November 2006.
• J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the International Conference on Machine Learning (ICML), 2009.
-
Thank you for listening