On the sparse estimation
ATR Computational Neuroscience Laboratories
Masa-aki Sato
Contents
1. Why is sparse estimation necessary? Generalization ability for ill-posed problems
2. Why can sparse estimation be achieved? The role of Bayesian estimation
3. An example of sparse estimation for a real problem
Ill-posed problem
A large number of parameters must be estimated from a small number of data points.
• Maximum Likelihood (Minimum Squared Error)
– Estimate the optimal parameters that maximize the likelihood
– Overfitting: complex models with many adjustable parameters tend to fit the noise in the training data and degrade generalization ability
• Sparse estimation
– Extract relevant features and discard irrelevant features to attain good generalization ability
Function approximation by polynomial
50 data points are randomly generated from a quadratic function:
$y = x^2 - 2x + \text{noise}$
Linear parameter model:
$y = w_0 + w_1 x + w_2 x^2 + \cdots + w_N x^N = W \cdot X$
$X = \left[ 1, x, x^2, \ldots, x^N \right]$ : input (feature) variable vector
$W = \left[ w_0, w_1, w_2, \ldots, w_N \right]$ : weight parameter vector
Maximum likelihood (quadratic function)
Find the optimal $W$ by minimizing the squared error:
$\text{error} = \sum_{t=1}^{T} \left( y(t) - W \cdot X(t) \right)^2$
Maximum likelihood (20th-degree polynomial)
Overfitting: the optimal $W$ fits the noise in the training data, so generalization is poor.
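A minimal sketch of this experiment (the noise level, input range, and random seed are illustrative assumptions, not the talk's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 points from the quadratic function plus noise (noise level assumed)
T = 50
x = rng.uniform(-1.0, 1.0, size=T)
y = x**2 - 2.0 * x + 0.1 * rng.standard_normal(T)

def design_matrix(x, degree):
    """Feature vectors X(t) = [1, x, x^2, ..., x^N], stacked as rows."""
    return np.vander(x, degree + 1, increasing=True)

# Maximum likelihood = least squares: minimize sum_t (y(t) - W.X(t))^2
for degree in (2, 20):
    X = design_matrix(x, degree)
    W, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"degree {degree:2d}: training MSE = {np.mean((y - X @ W) ** 2):.4f}")
# The 20th-degree fit drives the training error lower but wiggles between
# the data points: it has fit the noise, i.e., it overfits.
```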
Regularization method (20th-degree polynomial)
Minimize: $\text{error} = \sum_{t=1}^{T} \left( y(t) - W \cdot X(t) \right)^2 + \alpha \cdot W^2$
Prior: $P(W) \propto \exp\left( -\tfrac{1}{2} \alpha \cdot W^2 \right)$
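The regularized weights have a closed form; a sketch continuing the snippet above (the value of $\alpha$ is an illustrative choice):

```python
# Ridge solution of the regularized squared error:
#   W = (X'X + alpha I)^(-1) X'y
# which is also the MAP estimate under the Gaussian prior P(W).
alpha = 1e-2                     # illustrative regularization strength
X = design_matrix(x, 20)
W_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
```

A single shared $\alpha$ shrinks all weights equally, so selecting the effective model still requires a search over $\alpha$ or the polynomial degree.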
Model selection
• Search for the model that gives the best generalization error by changing the number of parameters (polynomial degree)
• Combinatorial search is almost impossible for a large number of degrees of freedom
[Figure: training error and generalization error vs. number of parameters (polynomial degree)]
Sparse estimation by Bayesian method
• Parameters are treated as random variables
• The posterior probability is calculated for each possible parameter value
• Estimation is done by integrating over the possible values according to the posterior probability distribution
Sparse estimation (20th-degree polynomial)
• A precision parameter $\alpha_n$ is introduced for each weight component $w_n$ and estimated from the observed data
Prior: $P(w_n) \propto \exp\left( -\tfrac{1}{2} \alpha_n \cdot w_n^2 \right)$, with $\alpha_n$ undetermined
[Figure: posterior for the 2nd-order weight vs. posterior for the 3rd-order weight]
This prunes irrelevant features from the model and increases generalization ability.
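A minimal sketch of this idea, continuing the snippets above (the fixed-point update is the standard automatic-relevance-determination rule, with the noise precision assumed known; the talk's exact algorithm may differ):

```python
# One precision alpha_n per weight w_n, prior P(w_n|alpha_n) ∝ exp(-alpha_n w_n^2 / 2)
X = design_matrix(x, 20)
N = X.shape[1]
alphas = np.ones(N)               # per-weight precisions, learned from data
beta = 100.0                      # noise precision (assumed known here)

for _ in range(200):
    # Gaussian posterior over W given the current precisions
    S = np.linalg.inv(beta * X.T @ X + np.diag(alphas))
    m = beta * S @ X.T @ y
    # ARD fixed point: gamma_n measures how well-determined w_n is
    gamma = 1.0 - alphas * np.diag(S)
    alphas = np.minimum(gamma / (m**2 + 1e-12), 1e12)

print("surviving polynomial orders:", np.where(alphas < 1e6)[0])
# Irrelevant weights are driven to huge precision, pinning w_n to 0 (pruned);
# ideally only the low-order terms of the generating quadratic survive.
```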
Sparse Estimation
MAP / ML Estimation (Maximum A Posteriori / Maximum Likelihood)
• Posterior: $P(\theta \mid X) = \dfrac{P(X \mid \theta) \, P_0(\theta)}{P(X)}$ (likelihood × prior, divided by the marginal likelihood)
• Estimate the optimal parameter: $\theta_{\mathrm{MAP}} = \arg\max_{\theta} \log\left( P(X \mid \theta) \, P_0(\theta) \right)$
[Figure: posterior $P(\theta \mid X)$ for a regular model over $(\theta_1, \theta_2)$]
Full Bayesian Estimation
• Estimate the posterior parameter distribution and integrate over the parameters according to the posterior.
• Posterior: $P(\theta \mid X) = \dfrac{P(X \mid \theta) \, P_0(\theta)}{P(X)}$ (likelihood × prior, divided by the marginal likelihood)
• Marginal likelihood: $P(X) = \int d\theta \, P(X \mid \theta) \, P_0(\theta)$
• Estimated parameter: $\bar{\theta} = \int d\theta \, \theta \, P(\theta \mid X)$
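Prediction works by the same integration; for a new datum the predictive distribution (a standard identity, noted here for completeness) is $P(x_{\mathrm{new}} \mid X) = \int d\theta \, P(x_{\mathrm{new}} \mid \theta) \, P(\theta \mid X)$.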
Model reduction by Bayesian method
Mixture of Gaussians example
Redundant model (ill-posed problem): the estimation model is a mixture of two Gaussian units,
$P(x \mid \theta) = g_0 \cdot N(x \mid \theta_1) + (1 - g_0) \cdot N(x \mid \theta_2)$
while the data are assumed to be generated by a single Gaussian. The single-unit model corresponds to three cases:
$P(x \mid \theta) = N(x \mid \theta_1)$ : $g_0 = 1$, $\theta_2$ arbitrary
$P(x \mid \theta) = N(x \mid \theta_2)$ : $g_0 = 0$, $\theta_1$ arbitrary
$P(x \mid \theta) = N(x \mid \theta_1)$ : $\theta_1 = \theta_2$, $g_0$ arbitrary
Posterior distribution for the redundant model
[Figure: posterior $P(\theta \mid X)$ over $(\theta_1, \theta_2)$, with ridges along the $g_0 = 1$ and $g_0 = 0$ solutions]
The Fisher information matrix becomes singular.
Pruning of redundant parameters
[Figure: posterior distribution $P(\theta \mid X)$ over $(\theta_1, \theta_2)$ for 100 sample data generated by a single-Gaussian model, showing the MAP optimum and the reduced models at $g_0 = 0$ and $g_0 = 1$]
• MAP: complex models explain the given data better than simpler models and give a higher posterior value, so all parameters are used for prediction
• Full Bayesian: the reduced, simpler model dominates after integration over the parameters
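A quick numerical illustration of this contrast, using scikit-learn's mixture models as stand-ins (the talk's own simulation may differ in its details):

```python
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=(100, 1))   # generated by a single Gaussian

# ML/EM keeps both units alive: each retains a nontrivial mixing weight
ml = GaussianMixture(n_components=2, random_state=1).fit(data)
print("ML mixing weights:", ml.weights_)

# Variational Bayes prunes the redundant unit: one weight collapses toward 0
vb = BayesianGaussianMixture(n_components=2, random_state=1).fit(data)
print("VB mixing weights:", vb.weights_)
```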
Parameter pruning in Sparse Estimation
Prior in the sparse estimation model:
$P(w_n \mid \alpha_n) \propto \sqrt{\alpha_n} \cdot \exp\left( -\tfrac{1}{2} \alpha_n \cdot w_n^2 \right)$
$\log P(\alpha_n) = \text{const.}$ (non-informative prior)
$\alpha_n$ controls the precision (width) of the weight parameter distribution.
[Figure: prior over $(w_n, \alpha_n)$, peaked along $w_n = 0$ with $\alpha_n$ arbitrary]
Posterior parameter distribution for $(w_n, \alpha_n)$
The other parameters are integrated out, and their effects are taken into account.
[Figure: posterior for a relevant parameter vs. posterior for an irrelevant parameter, which concentrates at $w_n = 0$ with $\alpha_n$ arbitrary]
Calculation of posterior distribution
Posterior calculation is replaced by free energy maximization. The free energy measures the distance between a trial posterior $Q(J, \alpha)$ and the true posterior $P(J, \alpha \mid B)$:
$F[Q] = \ln P(B) - KL\left[ Q(J, \alpha) \,\|\, P(J, \alpha \mid B) \right]$
where $\ln P(B)$ is the log marginal likelihood (evidence).
• Posterior: $P(J, \alpha \mid B) = \dfrac{P(B \mid J) \, P_0(J \mid \alpha) \, P_0(\alpha)}{P(B)}$ (likelihood × hierarchical prior, divided by the marginal likelihood)
• Marginal likelihood: $P(B) = \int dJ \, d\alpha \, P(B, J, \alpha)$
Factorization assumption: $Q(J, \alpha) = Q_J(J) \, Q_\alpha(\alpha)$
Variational Bayesian (VB) method
Maximization of $F(Q)$ with respect to $Q_J(J)$ and maximization of $F(Q)$ with respect to $Q_\alpha(\alpha)$ are repeated until convergence. At the maximum:
• Posterior distribution: $Q_J(J) \approx P(J \mid B)$
• Log marginal likelihood: $\log P(B) \approx F(Q)$ (maximized)
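Under the factorization assumption, each maximization step takes the generic variational form (standard VB identities, written out here to make the alternation concrete):
$Q_J(J) \propto \exp\left( \left\langle \log P(B, J, \alpha) \right\rangle_{Q_\alpha} \right)$, $\qquad Q_\alpha(\alpha) \propto \exp\left( \left\langle \log P(B, J, \alpha) \right\rangle_{Q_J} \right)$
Alternating the two updates increases $F(Q)$ monotonically, so the iteration converges to a local maximum.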
Free energy after integration over the current distribution:
$F = -\tfrac{1}{2} \left[ \log\left| \Sigma_{VB} \right| + \mathrm{Tr}\left( \Sigma_{VB}^{-1} \cdot B \cdot B' \right) \right] + \cdots$
$B \cdot B' = G \cdot J \cdot J' \cdot G' + \sigma_0 \cdot I$ : MEG covariance matrix
$\Sigma_{VB} = G \cdot W \cdot \alpha^{-1} \cdot W' \cdot G' + \sigma_0 \cdot I$ : estimated MEG covariance
Optimal condition (free energy maximum), estimation gain:
$G_n = \alpha_n \cdot \left( J \cdot J' \right)_n^{VB}$
Variational Bayesian method: posterior calculation is converted to free energy maximization.
Free energy = (Likelihood) + (Model complexity)
$L = -\tfrac{1}{2} \left[ \left( Y - W \cdot X \right)^2 + \alpha \cdot W \cdot W' \right]$
$\;\; = -\tfrac{1}{2} \left[ Y \cdot Y' - Y \cdot X' \cdot \left( X \cdot X' + \alpha \right)^{-1} \cdot X \cdot Y' \right]$
The likelihood (error) term is a decreasing function of $\alpha$, while the model complexity term is an increasing function of $\alpha$.
Model complexity term:
One-dimensional case:
$H = -\tfrac{1}{2} \log\left( X \cdot X' \cdot \alpha^{-1} + 1 \right)$
$\to -\tfrac{1}{2} \log T$ for finite $\alpha$ and $T \gg 1$ (the BIC penalty)
$\to 0$ as $\alpha \to \infty$
General dimension:
$H = -\tfrac{1}{2} \log\left| X \cdot X' \cdot \alpha^{-1} + I \right|$
$\to -\tfrac{1}{2} N_{\mathrm{eff}} \cdot \log T$ for $T \gg 1$, where $N_{\mathrm{eff}}$ is the number of finite $\alpha_n$ (BIC would give $-\tfrac{1}{2} N \cdot \log T$ with all $N$ parameters retained)
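To see where the $\log T$ factor comes from: $X \cdot X' = \sum_{t=1}^{T} x(t)^2$ grows linearly in $T$, so for finite $\alpha$, $H = -\tfrac{1}{2} \log\left( \alpha^{-1} \sum_t x(t)^2 + 1 \right) \approx -\tfrac{1}{2} \log T + \text{const}$, which is the BIC penalty per parameter; a pruned parameter ($\alpha_n \to \infty$) contributes nothing to the complexity term.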
Example of sparse estimation for a real problem
I. Nambu, R. Osu, M. Sato, S. Ando, M. Kawato, E. Naito, "Single-trial reconstruction of finger-pinch forces from human motor-cortical activation measured by near-infrared spectroscopy (NIRS)," NeuroImage 47 (2009) 628–637.
Sparse estimation (Nambu et al.): estimate the pinching force from 24-channel × 21-second NIRS data.
[Figure: estimated weights by brute-force model search vs. estimated weights by sparse estimation]
[Figure: estimated force from NIRS data; force vs. time]