PhD Candidature Presentation
In other words: What I did in the last two years

Andersen Ang
Mathématique et recherche opérationnelle, UMONS, Belgium
Email: manshun.ang@umons.ac.be  Homepage: angms.science
February 26, 2019
The works
Journal paper
A.-Gillis, Accelerating Nonnegative Matrix Factorization Algorithms using Extrapolation. Neural Computation, vol. 31 (2), pp. 417-439, Feb 2019, MIT Press.
Conference paper
Leplat-A.-Gillis, Minimum-Volume Rank-Deficient Non-negative Matrix Factorizations. To be presented at IEEE ICASSP 2019, Brighton, UK, May 2019.
A.-Gillis, Volume Regularized Non-negative Matrix Factorizations. IEEE WHISPERS 2018, Amsterdam, NL, 25 Sept 2018.
Work in progress
A.-Gillis, Algorithms and Comparisons of Non-negative Matrix Factorization with Volume Regularization for Hyperspectral Unmixing. In preparation, to be submitted to IEEE JSTARS.
Leplat-A.-Gillis, β-NMF for blind audio source separation with minimum volume regularization
Cohen-A., Accelerating Non-negative Canonical Polyadic Decomposition using extrapolation. To be submitted to GRETSI 2019 (in French!).
And numerous presentations (abstracts) at conferences, workshops, doctoral schools and seminars in BE, FR, NL, DE, IT, HK, e.g. SIAM-ALA18, ISMP2018, OR2018...
Non-negative Matrix Factorization
Given X ∈ IR^{m×n} or IR^{m×n}_+, find two matrices W ∈ IR^{m×r}_+ and H ∈ IR^{r×n}_+ by solving:

    min_{W,H} f(W,H) = (1/2) ‖X − WH‖_F^2  subject to  W ≥ 0, H ≥ 0,   (1)

where ≥ is taken element-wise.
Key points (see the report for references):
• Non-convex problem.
• No closed-form solution; numerical optimization algorithms are used.
• Non-negativity makes NMF NP-hard (as opposed to PCA); there are model modifications that make the problem solvable in polynomial time.
• Many applications in machine learning, data mining, signal processing.
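As a concrete sketch (not from the slides; the function names are illustrative), the objective in (1) can be evaluated in a few lines of NumPy:

```python
import numpy as np

def nmf_objective(X, W, H):
    """f(W, H) = (1/2) * ||X - W H||_F^2, the objective in (1)."""
    R = X - W @ H
    return 0.5 * np.sum(R * R)

# A tiny example: X built from an exact nonnegative rank-2 factorization,
# so the objective at the true factors is zero (up to rounding).
rng = np.random.default_rng(0)
W_true, H_true = rng.random((4, 2)), rng.random((2, 5))
X = W_true @ H_true
residual = nmf_objective(X, W_true, H_true)
```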
Alternating minimization
The standard way to solve NMF :
· · · → update W→ update H→ update W→ update H→ . . .
with the goal – descent condition :
    f(W_k, H_k) ≤ f(W_k, H_{k−1}) ≤ f(W_{k−1}, H_{k−1}),  k ∈ IN,   (2)

where k is the iteration counter.

To achieve (2), use projected gradient descent (PGD):

    Update W:  W_{k+1} = [W_k − α_k^W ∇_W f(W_k, H_k)]_+   (3)
    Update H:  H_{k+1} = [H_k − α_k^H ∇_H f(W_{k+1}, H_k)]_+,   (4)

where α_k is the step size and [·]_+ = max{·, 0}.

Fact: the sequence {W_k, H_k}_{k∈IN} produced by the PGD scheme (3)-(4) converges to a first-order stationary point of f.
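A minimal NumPy sketch of the alternating PGD scheme (3)-(4), using the standard 1/L step sizes (L is the Lipschitz constant of each block gradient); this is an illustrative implementation, not the code from the paper:

```python
import numpy as np

def pgd_nmf(X, r, iters=100, seed=0):
    """Alternating projected gradient descent for min 0.5*||X - WH||_F^2
    with W >= 0, H >= 0, following the update chain (3)-(4)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W, H = rng.random((m, r)), rng.random((r, n))
    for _ in range(iters):
        # W-update: grad_W f = (WH - X) H^T, step size 1/||H H^T||_2
        L_W = max(np.linalg.norm(H @ H.T, 2), 1e-12)
        W = np.maximum(W - ((W @ H - X) @ H.T) / L_W, 0.0)
        # H-update: grad_H f = W^T (WH - X), step size 1/||W^T W||_2
        L_H = max(np.linalg.norm(W.T @ W, 2), 1e-12)
        H = np.maximum(H - (W.T @ (W @ H - X)) / L_H, 0.0)
    return W, H
```

Each 1/L projected gradient step does not increase f, so the descent condition (2) holds along the iterations.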
Embedding extrapolation into the update
    Update W:       W_{k+1} = [Y_k − α_k^Y ∇_W f(Y_k, H_k)]_+
    Extrapolate W:  Y_{k+1} = W_{k+1} + β_k (W_{k+1} − W_k)
    Update H:       H_{k+1} = [G_k − α_k^H ∇_H f(W_{k+1}, G_k)]_+
    Extrapolate H:  G_{k+1} = H_{k+1} + β_k (H_{k+1} − H_k),

where Y and G are the pairing variables of W and H, respectively.¹

¹For initialization, Y_0 = W_0 and G_0 = H_0.
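One pass of the extrapolated chain above, sketched in NumPy (illustrative, with the same 1/L step sizes as before). Note that the extrapolated points Y, G may have negative entries; the projection [·]_+ in the next update absorbs them:

```python
import numpy as np

def epgd_step(X, W, H, Y, G, beta):
    """One iteration: update W from the pairing variable Y, extrapolate,
    update H from the pairing variable G, extrapolate."""
    L_W = max(np.linalg.norm(H @ H.T, 2), 1e-12)
    W_next = np.maximum(Y - ((Y @ H - X) @ H.T) / L_W, 0.0)            # update W
    Y_next = W_next + beta * (W_next - W)                              # extrapolate W
    L_H = max(np.linalg.norm(W_next.T @ W_next, 2), 1e-12)
    H_next = np.maximum(G - (W_next.T @ (W_next @ G - X)) / L_H, 0.0)  # update H
    G_next = H_next + beta * (H_next - H)                              # extrapolate H
    return W_next, H_next, Y_next, G_next
```

With beta = 0 this reduces to one plain PGD pass; the initialization is Y_0 = W_0, G_0 = H_0.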
The extrapolation parameter βk
    Update W:       W_{k+1} = [Y_k − α_k^Y ∇_W f(Y_k, H_k)]_+   (5)
    Extrapolate W:  Y_{k+1} = W_{k+1} + β_k (W_{k+1} − W_k)   (6)
    Update H:       H_{k+1} = [G_k − α_k^H ∇_H f(W_{k+1}, G_k)]_+   (7)
    Extrapolate H:  G_{k+1} = H_{k+1} + β_k (H_{k+1} − H_k).   (8)

The parameter β_k is the critical part of the scheme:
• β_k is dynamically updated at each iteration k.
• β_k ∈ [0, 1].
• If β_k = 0, the scheme (5)-(8) reduces to the plain projected gradient scheme (3)-(4).
• For convex problems, Nesterov's acceleration, which is optimal in terms of convergence rate, gives an explicit closed-form formula for β_k.
• NMF is non-convex, so it is not known how to determine β_k optimally.
A numerical scheme to tune β for NMF
The idea is to update β_k based on the increase or decrease of the objective function. Let e_k = f(W_k, H_k); then

    β_{k+1} = { min{γ β_k, β̄_k}   if e_k ≤ e_{k−1}
              { β_k / η            if e_k > e_{k−1},   (9)

where γ > 1 and η > 1 are constants and β̄_0 = 1, with the ceiling parameter β̄ updated as

    β̄_{k+1} = { min{γ̄ β̄_k, 1}   if e_k ≤ e_{k−1}
              { β_k               if e_k > e_{k−1},   (10)

where γ̄ > 1 is a growth factor.
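The update rules (9)-(10) can be sketched directly; the specific values of γ, γ̄, η below are illustrative placeholders, not the ones tuned in the paper:

```python
def update_beta(beta, beta_bar, e_k, e_prev, gamma=1.05, gamma_bar=1.01, eta=1.5):
    """One application of (9)-(10).

    Error decreased: grow beta (capped by the ceiling beta_bar) and let the
    ceiling itself creep toward 1.  Error increased: shrink beta by eta and
    pull the ceiling down to the last 'bad' beta value.
    """
    if e_k <= e_prev:
        return min(gamma * beta, beta_bar), min(gamma_bar * beta_bar, 1.0)
    return beta / eta, beta
```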
The logic flow of updating βk
Case 1. The error decreases: e_k ≤ e_{k−1}.
• It means the current β value is "good".
• We can be more ambitious with the extrapolation:
  – i.e., we increase the value of β.
  – How: multiply it by a growth factor γ > 1:  β_{k+1} = β_k γ.
• Note that the growth of β cannot be indefinite:
  – i.e., we put a ceiling parameter β̄ to upper-bound the growth.
  – How: use min:  β_{k+1} = min{β_k γ, β̄_k}.
  – β̄ itself is also updated dynamically, with a growth factor γ̄ and the upper bound 1.
The logic flow of updating βk
Case 2. The error increases: e_k > e_{k−1}.
• It means the current β value is "bad" (too large).
• We become less ambitious with the extrapolation:
  – i.e., we decrease the value of β.
  – How: divide it by a decay factor η > 1:  β_{k+1} = β_k / η.
• As f is continuous and smooth, if β_k is too large, a similar value of β will very likely also be too large at iteration k + 1:
  – i.e., we have to prevent β from growing back to β_k (the "bad" value) too soon.
  – How: we set the ceiling parameter  β̄_{k+1} = β_k.
A toy example
An example showing that the extrapolated scheme (E-PGD) converges much faster than the standard scheme (PGD).

This numerical scheme is found to be effective in accelerating NMF algorithms. See the paper for more examples and for comparisons with other acceleration schemes.
Chain structure
There are variations on the chain structure of the updates, for example:
• Update W → extrapolate W → update H → extrapolate H
• Update W → extrapolate W → update H → extrapolate H → project H
• Update W → update H → extrapolate W → extrapolate H
The comparisons of these three schemes : see the paper.
Future work: to analyze why certain chain structures perform better than others.
Tensor extension
Recent attempt: extend the idea of extrapolation to the tensor case; more precisely, to the Non-negative Canonical Polyadic Decomposition (NNCPD):

    min_{U,V,W} f(U,V,W) = ‖Y − U ∗ V ∗ W‖ = ‖Y − Σ_{i=1}^{r} u_i ∗ v_i ∗ w_i‖  s.t.  U ≥ 0, V ≥ 0, W ≥ 0,

where ∗ denotes the outer product of the factor columns.

Preliminary experiments showed that the approach is very promising and is able to significantly accelerate NNCPD algorithms.

Unsolved problem: NNCPD has even higher variability in the chain structure.
Understanding the relationship between the data structure (rank size, size of each mode) and the chain structure will be crucial.
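A small NumPy sketch (illustrative, names hypothetical) of the rank-r CP model behind the NNCPD objective, where each term is the outer product of one column from each factor matrix:

```python
import numpy as np

def cp_reconstruct(U, V, W):
    """Sum over i of the outer products u_i * v_i * w_i, built with einsum.
    U is I x r, V is J x r, W is K x r; the result is an I x J x K tensor."""
    return np.einsum('ir,jr,kr->ijk', U, V, W)

def nncpd_objective(Y, U, V, W):
    """0.5 * squared Frobenius error of the rank-r CP approximation."""
    return 0.5 * np.linalg.norm(Y - cp_reconstruct(U, V, W)) ** 2

# At the true nonnegative factors the objective is zero (up to rounding).
rng = np.random.default_rng(0)
U, V, W = (rng.random((d, 2)) for d in (3, 4, 5))
Y = cp_reconstruct(U, V, W)
```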
Separable NMF relaxes NP-hardness of NMF
Geometrically, NMF describes a non-negative cone; that is, the data points are encapsulated inside a non-negative cone generated by the basis W.

Under the assumption of separability, when the basis W is also present within the data cloud, the NMF problem, which is then called Separable NMF (SNMF), becomes solvable in polynomial time.
Volume Regularized Non-negative Matrix Factorizations
Separable NMF is a well-studied problem.

The separability assumption is quite strong.

So we relax it via minimum-volume NMF, or Volume-Regularized NMF (VRNMF):

    argmin_{W,H} (1/2) ‖X − WH‖_F^2 + λ V(W)  s.t.  W ≥ 0, H ≥ 0, H^T 1_r ≤ 1_n.

Geometrically, the goal is to find the underlying generating vertices of the data by fitting a non-negative convex hull with minimum volume.
VR-NMF
Four different volume functions V are studied :
• log-determinant: log det(W^T W + δ I_r)
• determinant: det(W^T W)
• Frobenius norm: ‖W‖_F^2
• nuclear norm: ‖W‖_∗

These are all non-decreasing functions of the singular values of W, so minimizing them indirectly minimizes the "volume" of the convex hull spanned by W.

Note:
• The "true" volume function of the convex hull of W exists, but it is computationally very expensive.
• Computing the exact volume of a convex polytope from its vertices in high dimension is a long-standing hard problem.
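All four surrogates can be computed from W alone; a NumPy sketch (the value δ = 1 here is just a placeholder, not a recommended setting):

```python
import numpy as np

def vol_logdet(W, delta=1.0):
    """log det(W^T W + delta * I_r); slogdet avoids under/overflow."""
    r = W.shape[1]
    _, logabsdet = np.linalg.slogdet(W.T @ W + delta * np.eye(r))
    return logabsdet

def vol_det(W):
    return np.linalg.det(W.T @ W)

def vol_fro(W):
    return np.linalg.norm(W, 'fro') ** 2

def vol_nuclear(W):
    return np.linalg.svd(W, compute_uv=False).sum()
```

In terms of the singular values σ_i of W: logdet = Σ log(σ_i² + δ), det = Π σ_i², Frobenius = Σ σ_i², nuclear = Σ σ_i; each is non-decreasing in every σ_i, so shrinking W shrinks all four.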
VR-NMF
What has been done
• Algorithms for VR-NMF with the four volume functions
• Model comparisons between different volume functions: logdet seems to be better
• Proposed algorithms perform better than state-of-the-art algorithms
[Figure: ground truth and reconstructions (2D projection) — data points x_i, true vertices W_true, and the hulls/vertices recovered by SPA (Gillis 2014), Det (this work), and RVolMin (Fu 2016); (m, n, r) = (100, 3000, 4), p = [0.9, 0.8, 0.7, 0.6].]
Future work on VR-NMF
• Theoretical limit of VRNMF: if the data points are concentrated in the center of the convex polytope, it is impossible to recover the vertices by VRNMF. A future study will analyze this theoretical limit: to come up with a phase-transition boundary for the vertex-recovery problem.²

• Rank-deficient case: recently it was found that when the input factorization rank r is larger than the true underlying dimension of the data points, VRNMF with the log-determinant volume regularizer is still able to find the ground-truth vertices. Another research direction will be to understand why this is so.

²Such a characterization of the transition boundary exists when all the data points are equidistant from the vertices. A more general characterization, when the data points are at different distances from the vertices, is still an open problem.
Application domains : hyperspectral imaging

Examples of applications of NMF to image segmentation of hyperspectral images.
Application domains : audio source separation
Sheet music of Bach's Prelude in C Major. 13 kinds of notes: B3, C4, D4, E4, F♯4, G4, A4, C5, D5, E5, F5, G5, A5.
W, H obtained from β-NMF with the logdet regularizer and r = 16 ≥ 13. It can be observed that, for an overestimated factorization rank r, minimum-volume regularization automatically sets some components to zero (marked with the * symbol).
Last page – the works : past, present, future
Journal paper
• A.-Gillis, Accelerating Nonnegative Matrix Factorization Algorithms using Extrapolation. Neural Computation, vol. 31 (2), pp. 417-439, Feb 2019, MIT Press.

Conference paper
• Leplat-A.-Gillis, Minimum-Volume Rank-Deficient Non-negative Matrix Factorizations. To be presented at IEEE ICASSP 2019, Brighton, UK, May 2019.
• A.-Gillis, Volume Regularized Non-negative Matrix Factorizations. IEEE WHISPERS 2018, Amsterdam, NL, 25 Sept 2018.

Work in progress
• A.-Gillis, Non-negative Matrix Factorization with Volume Regularizations for Asymmetric Non-separable Data. In preparation, to be submitted to IEEE JSTARS.
• Leplat-A.-Gillis, β-NMF for blind audio source separation with minimum volume regularization.
• Cohen-A., Accelerating Non-negative Canonical Polyadic Decomposition using extrapolation. To be submitted to GRETSI 2019 (in French!).

Future working directions
• Volume related
  – Phase-transition boundary of asymmetric non-separability
  – Rank-deficient case
• Acceleration related
  – Chain structure of the acceleration scheme
  – Convergence
• Application related
  – Other applications of interest (e.g. the "translator step" in the Brain-Computer Interface)
End of Presentation