A new implementation of k-MLE for mixture modelling of Wishart distributions

Christophe Saint-Jean, Frank Nielsen

Geometric Science of Information (GSI) 2013
August 28, 2013 - Mines ParisTech
Application Context (1)
We are interested in clustering varying-length sets of multivariate observations of the same dimension p.
$$X_1 = \begin{pmatrix} 3.6 & 0.05 & -4. \\ 3.6 & 0.05 & -4. \\ 3.6 & 0.05 & -4. \end{pmatrix}, \quad \ldots, \quad X_N = \begin{pmatrix} 5.3 & -0.5 & 2.5 \\ 3.6 & 0.5 & 3.5 \\ 1.6 & -0.5 & 4.6 \\ -1.6 & 0.5 & 5.1 \\ -2.9 & -0.5 & 6.1 \end{pmatrix}$$
The sample mean is a good feature, but it is not discriminative enough.

Second-order cross-product matrices $X_i^\top X_i$ may capture some relations between the (column) variables.
Application Context (2)
The problem is now the clustering of a set of p × p PSD matrices:

$$\chi = \left\{ x_1 = X_1^\top X_1,\; x_2 = X_2^\top X_2,\; \ldots,\; x_N = X_N^\top X_N \right\}$$
Examples of applications: multispectral/DTI/radar imaging, motion retrieval systems, ...
Outline of this talk
1. MLE and Wishart Distribution
   - Exponential Family and Maximum Likelihood Estimate
   - Wishart Distribution
   - Two sub-families of the Wishart Distribution
2. Mixture modelling with k-MLE
   - Original k-MLE
   - k-MLE for Wishart distributions
   - Heuristics for the initialization
3. Application to motion retrieval
Reminder: Exponential Family (EF)
An exponential family is a set of parametric probability distributions

$$\mathcal{EF} = \left\{ p(x;\lambda) = p_F(x;\theta) = \exp\left\{ \langle t(x), \theta \rangle + k(x) - F(\theta) \right\} \;\middle|\; \theta \in \Theta \right\}$$
Terminology:

- λ: source parameters.
- θ: natural parameters.
- t(x): sufficient statistic.
- k(x): auxiliary carrier measure.
- F(θ): the log-normalizer; differentiable and strictly convex.
- $\Theta = \{\theta \in \mathbb{R}^D \mid F(\theta) < \infty\}$ is an open convex set.

Almost all commonly used distributions are EF members, with exceptions such as the uniform and Cauchy distributions.
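As a concrete sanity check (not from the slides), the exponential distribution Exp(λ) can be written in this canonical form; a minimal sketch in which the choice of family and all names are illustrative:

```python
import math

# Illustrative sketch: the exponential distribution
# p(x; lam) = lam * exp(-lam * x) as an exponential-family member, with
# natural parameter theta = -lam, sufficient statistic t(x) = x,
# carrier k(x) = 0, and log-normalizer F(theta) = -log(-theta).

def pdf_source(x, lam):
    # density in the source parametrization lambda = lam
    return lam * math.exp(-lam * x)

def pdf_natural(x, theta):
    # density written as exp(<t(x), theta> + k(x) - F(theta))
    F = -math.log(-theta)          # finite exactly when theta < 0
    return math.exp(theta * x + 0.0 - F)

lam = 2.5
for x in [0.1, 1.0, 3.0]:
    assert abs(pdf_source(x, lam) - pdf_natural(x, -lam)) < 1e-12
```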
Reminder: Maximum Likelihood Estimate (MLE)
The Maximum Likelihood Estimate principle is a very common approach for fitting the parameters of a distribution:
$$\hat\theta = \operatorname*{argmax}_{\theta} L(\theta;\chi) = \operatorname*{argmax}_{\theta} \prod_{i=1}^{N} p(x_i;\theta) = \operatorname*{argmin}_{\theta} -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i;\theta)$$

assuming a sample χ = {x₁, x₂, ..., x_N} of i.i.d. observations.
The log-density has a convenient expression for EF members:

$$\log p_F(x;\theta) = \langle t(x), \theta \rangle + k(x) - F(\theta)$$
It follows that

$$\hat\theta = \operatorname*{argmax}_{\theta} \sum_{i=1}^{N} \log p_F(x_i;\theta) = \operatorname*{argmax}_{\theta} \left( \left\langle \sum_{i=1}^{N} t(x_i), \theta \right\rangle - N F(\theta) \right)$$
MLE with EF
Since F is a strictly convex, differentiable function, the MLE exists and is unique:

$$\nabla F(\hat\theta) = \frac{1}{N} \sum_{i=1}^{N} t(x_i)$$
Ideally, we have a closed form:

$$\hat\theta = \nabla F^{-1}\left( \frac{1}{N} \sum_{i=1}^{N} t(x_i) \right)$$
Otherwise, numerical methods such as Newton-Raphson can be successfully applied.
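For intuition, here is the closed-form route on the same toy family as before (a sketch under the assumption of exponential-distribution data, not the Wishart case treated in the talk):

```python
import numpy as np

# Sketch: for F(theta) = -log(-theta) (the exponential distribution),
# grad F(theta) = -1/theta, so the moment equation
# grad F(theta_hat) = mean(t(x_i)) inverts in closed form:
# theta_hat = -1 / mean(x_i), i.e. lam_hat = 1 / sample mean.

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 3.0, size=100_000)   # true rate lam = 3

eta_bar = x.mean()            # average sufficient statistic
theta_hat = -1.0 / eta_bar    # closed-form grad F^{-1}(eta_bar)

assert abs(-1.0 / theta_hat - eta_bar) < 1e-12     # grad F(theta_hat) == eta_bar
assert abs(-theta_hat - 3.0) < 0.1                 # lam_hat close to the true rate
```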
Wishart Distribution
Definition (Central Wishart distribution)

The Wishart distribution characterizes empirical covariance matrices of zero-mean Gaussian samples:

$$\mathcal{W}_d(X; n, S) = \frac{|X|^{\frac{n-d-1}{2}} \exp\left\{ -\frac{1}{2}\operatorname{tr}(S^{-1}X) \right\}}{2^{\frac{nd}{2}} \, |S|^{\frac{n}{2}} \, \Gamma_d\!\left(\frac{n}{2}\right)}$$

where, for x > 0, $\Gamma_d(x) = \pi^{\frac{d(d-1)}{4}} \prod_{j=1}^{d} \Gamma\!\left(x - \frac{j-1}{2}\right)$ is the multivariate gamma function.

Remarks: n > d − 1 and E[X] = nS.

It is the multivariate generalization of the chi-square distribution.
Wishart Distribution as an EF
It is an exponential family:

$$\log \mathcal{W}_d(X; \theta_n, \theta_S) = \langle \theta_n, \log|X| \rangle_{\mathbb{R}} + \left\langle \theta_S, -\tfrac{1}{2}X \right\rangle_{HS} + k(X) - F(\theta_n, \theta_S)$$

with $k(X) = 0$ and

$$(\theta_n, \theta_S) = \left( \frac{n-d-1}{2},\; S^{-1} \right), \qquad t(X) = \left( \log|X|,\; -\tfrac{1}{2}X \right),$$

$$F(\theta_n, \theta_S) = \left( \theta_n + \frac{d+1}{2} \right) \left( d \log 2 - \log|\theta_S| \right) + \log \Gamma_d\!\left( \theta_n + \frac{d+1}{2} \right)$$
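This decomposition can be checked numerically against `scipy.stats.wishart` (a sanity-check sketch, not the authors' code; the scale matrix and degrees of freedom below are arbitrary choices):

```python
import numpy as np
from scipy.stats import wishart
from scipy.special import multigammaln

# Check that the Wishart log-density written in exponential-family form
# <t(X), theta> + k(X) - F(theta), with k(X) = 0, matches scipy's logpdf.

d, n = 3, 10.0
S = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.5]])               # arbitrary SPD scale matrix
X = wishart(df=n, scale=S).rvs(random_state=0)

theta_n = (n - d - 1) / 2.0                   # natural parameters
theta_S = np.linalg.inv(S)

def F(theta_n, theta_S):
    # log-normalizer of the Wishart EF
    a = theta_n + (d + 1) / 2.0
    return a * (d * np.log(2.0) - np.log(np.linalg.det(theta_S))) + multigammaln(a, d)

logpdf_ef = (theta_n * np.log(np.linalg.det(X))
             + np.trace(theta_S @ (-0.5 * X))   # <theta_S, -X/2> (Frobenius)
             - F(theta_n, theta_S))

assert abs(logpdf_ef - wishart(df=n, scale=S).logpdf(X)) < 1e-8
```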
MLE for Wishart Distribution
In the case of the Wishart distribution, a closed form would be obtained by solving the following system:

$$\hat\theta = \nabla F^{-1}\left( \frac{1}{N} \sum_{i=1}^{N} t(x_i) \right) \;\equiv\; \begin{cases} d \log 2 - \log|\theta_S| + \Psi_d\!\left( \theta_n + \frac{d+1}{2} \right) = \eta_n \\[4pt] -\left( \theta_n + \frac{d+1}{2} \right) \theta_S^{-1} = \eta_S \end{cases} \quad (1)$$

with $\eta_n$ and $\eta_S$ the expectation parameters and $\Psi_d$ the derivative of $\log \Gamma_d$.

Unfortunately, no closed-form solution is known.
Two sub-families of the Wishart Distribution (1)
Case n fixed (n = 2θₙ + d + 1):

$$F_n(\theta_S) = \frac{nd}{2} \log 2 - \frac{n}{2} \log|\theta_S| + \log \Gamma_d\!\left(\frac{n}{2}\right), \qquad k_n(X) = \frac{n-d-1}{2} \log|X|$$

Case S fixed (S = θ_S⁻¹):

$$F_S(\theta_n) = \left( \theta_n + \frac{d+1}{2} \right) \log|2S| + \log \Gamma_d\!\left( \theta_n + \frac{d+1}{2} \right), \qquad k_S(X) = -\frac{1}{2} \operatorname{tr}(S^{-1}X)$$
Two sub-families of the Wishart Distribution (2)
Both are exponential families, and their MLE equations are solvable!
Case n fixed:

$$-\frac{n}{2} \theta_S^{-1} = \frac{1}{N} \sum_{i=1}^{N} -\frac{1}{2} X_i \;\Longrightarrow\; \hat\theta_S = N n \left( \sum_{i=1}^{N} X_i \right)^{-1} \quad (2)$$

Case S fixed:

$$\hat\theta_n = \Psi_d^{-1}\!\left( \frac{1}{N} \sum_{i=1}^{N} \log|X_i| - \log|2S| \right) - \frac{d+1}{2}, \qquad \hat\theta_n > 0 \quad (3)$$

with $\Psi_d^{-1}$ the functional reciprocal of $\Psi_d$.
An iterative estimator for the Wishart Distribution
Algorithm 1: An estimator for the parameters of the Wishart

Input: A sample X₁, X₂, ..., X_N of S₊₊ᵈ
Output: Final values of θₙ and θ_S

    Initialize θₙ with some value > 0
    repeat
        Update θ_S using Eq. (2) with n = 2θₙ + d + 1
        Update θₙ using Eq. (3) with S the inverse matrix of θ_S
    until convergence of the likelihood
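A minimal sketch of Algorithm 1 in Python, under a few stated assumptions: `scipy` provides the digamma function and root finding, a fixed iteration count stands in for the likelihood-convergence test, and Ψ_d⁻¹ is obtained numerically by bracketing (Ψ_d is increasing on its domain):

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq
from scipy.stats import wishart

def Psi_d(x, d):
    # derivative of log Gamma_d: a sum of digammas
    return sum(digamma(x - j / 2.0) for j in range(d))

def Psi_d_inv(y, d):
    # invert the increasing function Psi_d by bracketing + Brent's method
    lo = (d - 1) / 2.0 + 1e-6
    hi = lo + 1.0
    while Psi_d(hi, d) < y:
        hi *= 2.0
    return brentq(lambda x: Psi_d(x, d) - y, lo, hi)

def estimate_wishart(Xs, n_iter=50):
    # alternate Eq. (2) and Eq. (3), as in Algorithm 1
    N, d = len(Xs), Xs[0].shape[0]
    sum_X = sum(Xs)
    mean_logdet = np.mean([np.log(np.linalg.det(X)) for X in Xs])
    theta_n = 1.0                                    # init theta_n > 0
    for _ in range(n_iter):
        n = 2.0 * theta_n + d + 1.0
        theta_S = N * n * np.linalg.inv(sum_X)       # Eq. (2)
        S = np.linalg.inv(theta_S)
        theta_n = Psi_d_inv(mean_logdet - np.log(np.linalg.det(2.0 * S)),
                            d) - (d + 1) / 2.0       # Eq. (3)
    return theta_n, theta_S

# recover parameters from simulated Wishart matrices
d, n_true = 3, 12.0
S_true = np.diag([1.0, 2.0, 0.5])
Xs = list(wishart(df=n_true, scale=S_true).rvs(size=500, random_state=1))
theta_n, theta_S = estimate_wishart(Xs)
n_hat = 2.0 * theta_n + d + 1.0
assert abs(n_hat - n_true) / n_true < 0.15           # degrees of freedom recovered
```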
Questions and open problems
From a sample of Wishart matrices, the distribution parameters are recovered in a few iterations.

Major question: do we have an MLE? Probably...

Minor question: what about sample size N = 1?
- Under-determined system
- Regularization by sampling around X₁
Mixture Models (MM)
An additive (finite) mixture is a flexible tool to model a more complex distribution m:

$$m(x) = \sum_{j=1}^{k} w_j \, p_j(x), \qquad 0 \le w_j \le 1, \qquad \sum_{j=1}^{k} w_j = 1$$

where the p_j are the component distributions of the mixture and the w_j the mixing proportions.
In our case, we consider each p_j as a member of some parametric family (EF):

$$m(x; \Psi) = \sum_{j=1}^{k} w_j \, p_{F_j}(x; \theta_j)$$

with Ψ = (w₁, w₂, ..., w_{k−1}, θ₁, θ₂, ..., θ_k).

Expectation-Maximization is not fast enough [5]...
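Mixture densities are typically evaluated in the log domain via a log-sum-exp for numerical stability; a small sketch with Gaussian components purely for illustration (the talk's components are Wisharts):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

# log m(x) = logsumexp_j( log w_j + log p_j(x) ), which avoids underflow
# when individual component densities are tiny.

def log_mixture_pdf(x, weights, components):
    logs = [np.log(w) + c.logpdf(x) for w, c in zip(weights, components)]
    return logsumexp(logs)

w = [0.3, 0.7]
comps = [norm(0.0, 1.0), norm(4.0, 2.0)]   # illustrative Gaussian components
x = 1.0
direct = sum(wj * c.pdf(x) for wj, c in zip(w, comps))
assert abs(np.exp(log_mixture_pdf(x, w, comps)) - direct) < 1e-12
```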
Original k-MLE (primal form.) in one slide
Algorithm 2: k-MLE

Input: A sample χ = {x₁, x₂, ..., x_N}; F₁, F₂, ..., F_k Bregman generators
Output: Estimate Ψ̂ of the mixture parameters

    A good initialization for Ψ̂ (see later)
    repeat
        repeat
            foreach xᵢ ∈ χ do zᵢ = argmax_j log w_j p_{F_j}(xᵢ; θ_j)
            foreach C_j := {xᵢ ∈ χ | zᵢ = j} do θ̂_j = MLE_{F_j}(C_j)
        until convergence of the complete likelihood
        Update mixing proportions: w_j = |C_j| / N
    until further convergence of the complete likelihood
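A compact sketch of the Lloyd-style loop above. For readability the components here are unit-variance Gaussians, whose per-cluster MLE is simply the mean, whereas the talk plugs in the Wishart MLE of Algorithm 1; the initialization and iteration counts are illustrative assumptions:

```python
import numpy as np

def k_mle(x, k, n_outer=20, n_inner=20):
    # crude deterministic init (the talk advocates k-MLE++ instead)
    theta = np.percentile(x, np.linspace(25, 75, k))
    w = np.full(k, 1.0 / k)
    for _ in range(n_outer):
        for _ in range(n_inner):
            # assignment: maximize log w_j + log p_j(x_i; theta_j)
            ll = np.log(w)[None, :] - 0.5 * (x[:, None] - theta[None, :]) ** 2
            z = ll.argmax(axis=1)
            # per-cluster MLE (here: the cluster mean)
            for j in range(k):
                if np.any(z == j):
                    theta[j] = x[z == j].mean()
        w = np.bincount(z, minlength=k) / len(x)   # update mixing proportions
    return w, theta, z

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-5, 1, 300), rng.normal(5, 1, 700)])
w, theta, z = k_mle(x, k=2)
assert abs(sorted(theta)[0] + 5) < 0.3 and abs(sorted(theta)[1] - 5) < 0.3
assert abs(max(w) - 0.7) < 0.05
```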
k-MLE’s properties
Another formulation comes from the connection between EF and Bregman divergences [3]:

$$\log p_F(x;\theta) = -B_{F^*}(t(x) : \eta) + F^*(t(x)) + k(x)$$

where the Bregman divergence B_F(· : ·) associated with a strictly convex and differentiable function F is

$$B_F(x : y) = F(x) - F(y) - \langle \nabla F(y), x - y \rangle$$
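A small sketch of the Bregman divergence with two classic generators (the helper names are illustrative): the squared norm recovers the squared Euclidean distance, and the negative Shannon entropy recovers the (generalized) KL divergence.

```python
import numpy as np

def bregman(F, gradF, x, y):
    # B_F(x : y) = F(x) - F(y) - <grad F(y), x - y>
    return F(x) - F(y) - np.dot(gradF(y), x - y)

# generator F(x) = ||x||^2 gives the squared Euclidean distance
sq = lambda x: np.dot(x, x)
grad_sq = lambda x: 2.0 * x
x = np.array([1.0, 2.0]); y = np.array([3.0, -1.0])
assert abs(bregman(sq, grad_sq, x, y) - np.sum((x - y) ** 2)) < 1e-12

# generator F(x) = sum x log x gives the generalized KL divergence
ent = lambda x: np.sum(x * np.log(x))
grad_ent = lambda x: np.log(x) + 1.0
p = np.array([0.2, 0.8]); q = np.array([0.5, 0.5])
kl = np.sum(p * np.log(p / q)) - p.sum() + q.sum()
assert abs(bregman(ent, grad_ent, p, q) - kl) < 1e-12
```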
Original k-MLE (dual form.) in one slide
Algorithm 3: k-MLE (dual formulation)

Input: A sample χ = {y₁ = t(x₁), y₂ = t(x₂), ..., y_N = t(x_N)}; F*₁, F*₂, ..., F*_k Bregman generators
Output: Ψ̂ = (w₁, w₂, ..., w_{k−1}, θ̂₁ = ∇F*(η̂₁), ..., θ̂_k = ∇F*(η̂_k))

    A good initialization for Ψ̂ (see later)
    repeat
        repeat
            foreach xᵢ ∈ χ do zᵢ = argmin_j [ B_{F*_j}(yᵢ : η_j) − log w_j ]
            foreach C_j := {xᵢ ∈ χ | zᵢ = j} do η̂_j = Σ_{xᵢ∈C_j} yᵢ / |C_j|
        until convergence of the complete likelihood
        Update mixing proportions: w_j = |C_j| / N
    until further convergence of the complete likelihood
k-MLE for Wishart distributions
Practical considerations impose modifications of the algorithm:

- During the assignment, empty clusters may appear (high-dimensional data makes this worse).
- A possible solution is to consider Hartigan and Wong's strategy [6] instead of Lloyd's strategy:
  - Optimally transfer one observation at a time.
  - Update the parameters of the involved clusters.
  - Stop when no transfer is possible.
- This should guarantee non-empty clusters [7], but it does not work when considering weighted clusters...
- Get back to an "old school" criterion: |C_{zᵢ}| > 1.
- Experimentally shown to perform better in high dimension than Lloyd's strategy.
k-MLE - Hartigan and Wong
Criterion for potential transfer (max form):

$$\frac{\log w_{z_i} \, p_{F_{z_i}}(x_i; \theta_{z_i})}{\log w_{z_i^*} \, p_{F_{z_i^*}}(x_i; \theta_{z_i^*})} < 1, \qquad \text{with } z_i^* = \operatorname*{argmax}_j \log w_j \, p_{F_j}(x_i; \theta_j)$$

Update rules:

$$\hat\theta_{z_i} = \mathrm{MLE}_{F_j}(C_{z_i} \setminus \{x_i\}), \qquad \hat\theta_{z_i^*} = \mathrm{MLE}_{F_j}(C_{z_i^*} \cup \{x_i\})$$

Or, in the dual formulation, criterion for potential transfer (min form):

$$\frac{B_{F^*}(y_i : \eta_{z_i^*}) - \log w_{z_i^*}}{B_{F^*}(y_i : \eta_{z_i}) - \log w_{z_i}} < 1, \qquad \text{with } z_i^* = \operatorname*{argmin}_j \left( B_{F^*}(y_i : \eta_j) - \log w_j \right)$$

Update rules:

$$\hat\eta_{z_i} = \frac{|C_{z_i}| \, \eta_{z_i} - y_i}{|C_{z_i}| - 1}, \qquad \hat\eta_{z_i^*} = \frac{|C_{z_i^*}| \, \eta_{z_i^*} + y_i}{|C_{z_i^*}| + 1}$$
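The η update rules are plain incremental mean updates: a point leaves one centroid and joins another without recomputing either sum. A small sketch (the cluster data and helper names are illustrative):

```python
import numpy as np

def remove_point(eta, size, y):
    # centroid after removing y from a cluster of given size
    return (size * eta - y) / (size - 1), size - 1

def add_point(eta, size, y):
    # centroid after adding y to a cluster of given size
    return (size * eta + y) / (size + 1), size + 1

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2)); B = rng.normal(size=(4, 2))
y = A[2]                                  # transfer this point from A to B

eta_A, nA = remove_point(A.mean(axis=0), len(A), y)
eta_B, nB = add_point(B.mean(axis=0), len(B), y)

# same result as recomputing the means from scratch
assert np.allclose(eta_A, np.delete(A, 2, axis=0).mean(axis=0))
assert np.allclose(eta_B, np.vstack([B, y]).mean(axis=0))
```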
Towards a good initialization...
Classical initialization methods: random centers, random partition, furthest point (2-approximation), ...

A better approach is k-means++ [8]: "sampling proportionally to the squared distance to the nearest center."
Fast and greedy approximation: Θ(kN).

Probabilistic guarantee of a good initialization:

$$OPT_F \le \text{k-means}_F \le O(\log k) \, OPT_F$$

The dual Bregman divergence B_{F*} may replace the squared distance.
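A sketch of the seeding step, with the squared Euclidean distance as the divergence for readability; any Bregman divergence B_{F*} could be passed as `div` instead (the data and all names are illustrative):

```python
import numpy as np

def seeding(x, k, div, seed=0):
    # k-means++-style seeding: each new seed is drawn with probability
    # proportional to the divergence to its closest already-chosen seed.
    rng = np.random.default_rng(seed)
    centers = [x[rng.integers(len(x))]]
    while len(centers) < k:
        d = np.min([div(x, c) for c in centers], axis=0)
        p = d / d.sum()
        centers.append(x[rng.choice(len(x), p=p)])
    return np.array(centers)

# squared Euclidean distance stands in for B_{F*}
sq = lambda x, c: np.sum((x - c) ** 2, axis=1)

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(m, 0.2, size=(100, 2)) for m in (-5.0, 0.0, 5.0)])
C = seeding(x, k=3, div=sq)

assert C.shape == (3, 2)
assert len({tuple(c) for c in C}) == 3   # seeds are distinct (divergence 0 at a seed)
```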
Heuristic to avoid fixing k
k-means imposes fixing k, the number of clusters.

We propose on-the-fly cluster creation together with k-MLE++ (inspired by DP-k-means [9]):

"Create a new cluster when some observation contributes too much to the loss function with the already selected centers."
It may overestimate the number of clusters...
Initialization with DP-k-MLE++
Algorithm 4: DP-k-MLE++

Input: A sample y₁ = t(X₁), ..., y_N = t(X_N); F; λ > 0
Output: C, a subset of {y₁, ..., y_N}; k, the number of clusters

    Choose the first seed C = {y_j}, for j uniformly random in {1, 2, ..., N}
    repeat
        foreach yᵢ do compute pᵢ = B_{F*}(yᵢ : C) / Σ_{i'=1}^{N} B_{F*}(y_{i'} : C)
            where B_{F*}(yᵢ : C) = min_{c∈C} B_{F*}(yᵢ : c)
        if ∃ pᵢ > λ then
            Choose the next seed s among y₁, y₂, ..., y_N with probability pᵢ
            Add the selected seed to C: C = C ∪ {s}
    until all pᵢ ≤ λ
    k = |C|
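A sketch of Algorithm 4 with the squared Euclidean distance standing in for B_{F*} (an assumption made for readability, as in the previous sketch); note that k comes out of the data and, as the slide warns, the heuristic may overestimate the number of clusters:

```python
import numpy as np

def dp_seeding(y, lam, seed=0):
    # add seeds while some observation still carries more than a fraction
    # lam of the total loss with respect to the current seed set
    rng = np.random.default_rng(seed)
    C = [y[rng.integers(len(y))]]
    while True:
        d = np.min([np.sum((y - c) ** 2, axis=1) for c in C], axis=0)
        total = d.sum()
        if total == 0:                    # every point is a seed
            break
        p = d / total
        if p.max() <= lam:                # all p_i <= lambda: stop
            break
        C.append(y[rng.choice(len(y), p=p)])
    return np.array(C)

# three well-separated blobs: at least one seed must land in each
rng = np.random.default_rng(2)
y = np.vstack([rng.normal(m, 0.1, size=(50, 2)) for m in (-10.0, 0.0, 10.0)])
C = dp_seeding(y, lam=0.005)
k = len(C)
assert 3 <= k <= len(y)
```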
Motion capture
Real dataset: motion capture of contemporary dancers (15 sensors in 3D).
Application to motion retrieval (1)
Motion capture data can be viewed as matrices Xᵢ with different row counts but the same column size d.

The idea is to describe each Xᵢ through the parameters Ψᵢ of one mixture model.

Mixture parameters can be viewed as a sparse representation of the local dynamics in Xᵢ.
Remark: the size of each sub-motion is known (and hence so is its θₙ).
Application to motion retrieval (2)
Comparing two movements amounts to computing a dissimilarity measure between Ψᵢ and Ψⱼ.

Remark 1: with DP-k-MLE++, the two mixtures will probably not have the same number of components.

Remark 2: when both mixtures have one component, a natural choice is

$$KL\!\left( \mathcal{W}_d(\cdot\,; \theta) \,\|\, \mathcal{W}_d(\cdot\,; \theta') \right) = B_{F^*}(\eta : \eta') = B_F(\theta' : \theta)$$

for which a closed form is always available!

No closed form exists for the KL divergence between general mixtures.
Application to motion retrieval (3)
A possible solution is to use the CS divergence [10]:

$$CS(m : m') = -\log \frac{\int m(x)\,m'(x)\,dx}{\sqrt{\int m(x)^2\,dx \int m'(x)^2\,dx}}$$

It has an analytic formula, since

$$\int m(x)\,m'(x)\,dx = \sum_{j=1}^{k} \sum_{j'=1}^{k'} w_j w'_{j'} \exp\left\{ F(\theta_j + \theta'_{j'}) - \left( F(\theta_j) + F(\theta'_{j'}) \right) \right\}$$

Note that this expression is well defined since the natural parameter space Θ = ℝ₊* × S₊₊ᵖ is a convex cone.
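A sketch of the closed-form cross term for EF mixtures, using exponential-distribution components (k(x) = 0, F(θ) = −log(−θ)) instead of Wisharts; the identity ∫ p_F(x;θ) p_F(x;θ′) dx = exp(F(θ+θ′) − F(θ) − F(θ′)) holds whenever the carrier vanishes and θ+θ′ stays in Θ:

```python
import numpy as np
from scipy.integrate import quad

F = lambda th: -np.log(-th)     # log-normalizer of the exponential distribution

def cross_term(ws, ths, wps, thps):
    # closed form for the integral of m(x) * m'(x)
    return sum(w * wp * np.exp(F(t + tp) - F(t) - F(tp))
               for w, t in zip(ws, ths) for wp, tp in zip(wps, thps))

def cs_divergence(ws, ths, wps, thps):
    num = cross_term(ws, ths, wps, thps)
    return -np.log(num / np.sqrt(cross_term(ws, ths, ws, ths)
                                 * cross_term(wps, thps, wps, thps)))

ws, ths = [0.4, 0.6], [-1.0, -3.0]      # mixture m  (rates 1 and 3)
wps, thps = [1.0], [-2.0]               # mixture m' (single rate-2 component)

# numerical check of the closed-form cross term
m  = lambda x: sum(w * (-t) * np.exp(t * x) for w, t in zip(ws, ths))
mp = lambda x: sum(w * (-t) * np.exp(t * x) for w, t in zip(wps, thps))
num, _ = quad(lambda x: m(x) * mp(x), 0.0, np.inf)
assert abs(num - cross_term(ws, ths, wps, thps)) < 1e-8
assert cs_divergence(ws, ths, wps, thps) >= 0.0   # Cauchy-Schwarz inequality
```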
Implementation
- Early specific code in Matlab.
- Today: implementation in Python (based on pyMEF [2]).
- Ongoing proof of concept (with Herranz F., Beurive A.).
Conclusions - Future works
Still some mathematical work to be done:

- Solve the MLE equations to get ∇F* = (∇F)⁻¹, then F*.
- Characterize our estimator for the full Wishart distribution.

Complete and validate the prototype system for motion retrieval.

Speeding up the algorithm: computational/numerical/algorithmic tricks.

A library for Bregman divergence learning?

Possible extensions:

- Reintroduce the mean vector in the model: Gaussian-Wishart.
- Online k-means -> online k-MLE ...
References I
[1] Nielsen, F.: k-MLE: A fast algorithm for learning statistical mixture models. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2012) pp. 869-872.

[2] Schwander, O., Nielsen, F.: pyMEF - A framework for exponential families in Python. In: Proceedings of the 2011 IEEE Workshop on Statistical Signal Processing.

[3] Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. Journal of Machine Learning Research 6 (2005) 1705-1749.

[4] Nielsen, F., Garcia, V.: Statistical exponential families: A digest with flash cards. http://arxiv.org/abs/0911.4863 (2009).

[5] Hidot, S., Saint-Jean, C.: An Expectation-Maximization algorithm for the Wishart mixture model: Application to movement clustering. Pattern Recognition Letters 31(14) (2010) 2318-2324.
References II
[6] Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics) 28(1) (1979) 100-108.

[7] Telgarsky, M., Vattani, A.: Hartigan's method: k-means clustering without Voronoi. In: Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS) (2010) pp. 820-827.

[8] Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM Symposium on Discrete Algorithms (2007) pp. 1027-1035.

[9] Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms via Bayesian nonparametrics. In: International Conference on Machine Learning (ICML) (2012).

[10] Nielsen, F.: Closed-form information-theoretic divergences for statistical mixtures. In: International Conference on Pattern Recognition (ICPR) (2012) pp. 1723-1726.