ERCIM seminar
TRANSCRIPT
Imprecise and uncertain labelling: a solution based on mixture models and belief functions.
Etienne Côme, [email protected]
Institut National de Recherche sur les Transports et leur Sécurité (INRETS). Supervisor: Patrice Aknin; advisor: Latifa Oukhellou.
Université de Technologie de Compiègne. Supervisor: Thierry Denœux.
19 June 2008
Introduction
Observation
Supervised and unsupervised frameworks do not correspond to the situations encountered in applications. Intermediate situations arise, for example:
- few labelled samples and a lot of unlabelled samples;
- difficult labelling (experts), samples cannot be assigned to a unique class;
- label noise, errors in the labels;
- ...
Aim of the study
Building a method to deal with imprecise and uncertain labels, based on mixture models.
Introduction
Soft labels examples
Set of classes: Y = {c1, ..., cK}; mi: bba, pli: plausibility.
- Unsupervised (vacuous labels): mi(Y) = 1; plik = 1, ∀k.
- Supervised (perfect labels): mi(ck) = 1; plik = 1 and plik' = 0, ∀k' ≠ k.
- Partially supervised (imprecise labels): mi(C) = 1; plik = 1 if ck ∈ C, plik = 0 if ck ∉ C.
- Soft supervision (imprecise and uncertain labels): mi arbitrary; plik ∈ [0, 1].
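To make the correspondence concrete, here is a small illustrative sketch (not from the slides; the NumPy representation and the variable names are assumptions) encoding each setting as a length-K plausibility vector, for K = 3 classes and a sample whose true class is c1:

```python
import numpy as np

K = 3  # number of classes, Y = {c1, c2, c3}

# Unsupervised (vacuous label): every class is fully plausible.
pl_unsupervised = np.ones(K)                 # [1, 1, 1]

# Supervised (perfect label on c1): only c1 is plausible.
pl_supervised = np.array([1.0, 0.0, 0.0])

# Partially supervised (imprecise label C = {c1, c2}).
pl_partial = np.array([1.0, 1.0, 0.0])

# Soft supervision (imprecise and uncertain): any values in [0, 1].
pl_soft = np.array([1.0, 0.3, 0.1])
```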
Plan
1 Introduction
2 Mixture Model
3 The different settings and their classical solutions: supervised and unsupervised learning; semi-supervised learning; partially supervised learning (imprecise labelling)
4 Learning with soft supervision (imprecise and uncertain labels): available solutions?; generative setting in the belief function framework; the criterion; optimisation
5 Experimentations: (1) from unsupervised to supervised learning; (2) label noise
6 Conclusion
Mixture Model
Mixture model
Graphical model :
label sampling :
p(Y = ck) = πk , ∀k ∈ {1, . . . ,K}. (1)
Mixture Model
Mixture model
Graphical model :
knowing the class, the observable variables are drawn :
p(x|Y = ck) = f (x; θk), ∀k ∈ {1, . . . ,K}. (2)
Mixture Model
Example: Gaussian mixture (clustering)
Fig.: Example of the probability density function of a Gaussian mixture (f(x) against x).
Mixture Model
Example : Weibull mixture
Fig.: Example of the cumulative distribution function of a Weibull mixture (F(x) against x).
The different settings and their classical solutions Supervised and unsupervised learning
Supervised learning with mixture models
Data
learning set: Xs = {(xi, yi)}_{i=1}^{N},
- measurements xi ∈ X,
- labels yi ∈ Y = {c1, ..., cK}.
Generative setting
parameters Ψ = (π, θ1, . . . , θK); log-likelihood:

L(Ψ; Xs) = ∑_{i=1}^{N} ∑_{k=1}^{K} z_ik log(πk f(xi; θk)),   (3)

with zi ∈ {0, 1}^K binary variables encoding the class membership: z_ik = 1 if yi = ck, and z_ik = 0 otherwise.
Analytical solutions exist if f is the density of a classical law.
The different settings and their classical solutions Supervised and unsupervised learning
Unsupervised learning, clustering
Data
learning set: Xns = {xi}_{i=1}^{N},
- measurements xi ∈ X.
Goal, clustering
Find a good model of p(x) with a latent discrete variable y encoding the class membership.
p(x) = ∑_y p(y) p(x | y)

Using Bayes' rule:

p(y | x) = p(y) p(x | y) / ∑_y p(y) p(x | y)
The different settings and their classical solutions Supervised and unsupervised learning
Unsupervised learning, with mixture model
Mixture model
p(x) = ∑_{k=1}^{K} πk fk(x; θk)   (4)

L(Ψ; Xns) = ∑_{i=1}^{N} ln(∑_{k=1}^{K} πk fk(xi; θk))   (5)

fk(·; θk): parametric density over X. Parameters: Ψ = (π, θ).
Optimisation
The criterion is non-convex; classical solution: the EM algorithm.
The different settings and their classical solutions Supervised and unsupervised learning
EM algorithm
Log-likelihood :
latent variable z: p(x) = p(x, z) / p(z | x)
log likelihood decomposition :
L(Ψ; Xns) = ∑_{i=1}^{N} ∑_{k=1}^{K} t_ik^(q) ln(πk f(xi; θk)) − ∑_{i=1}^{N} ∑_{k=1}^{K} t_ik^(q) ln(t_ik),   (6)

where the first term is Q(Ψ, Ψ^(q)) and the second is H(Ψ, Ψ^(q)), with:

t_ik^(q) = E_{Ψ^(q)}[z_ik | xi] = πk^(q) f(xi; θk^(q)) / ∑_{k'=1}^{K} πk'^(q) f(xi; θk'^(q)).   (7)
Property of H:
H(Ψ^(q), Ψ^(q)) − H(Ψ^(q+1), Ψ^(q)) ≥ 0
The different settings and their classical solutions Supervised and unsupervised learning
EM algorithm
⇒ Maximising Q(Ψ, Ψ^(q)) is sufficient to increase the likelihood.

EM algorithm
Initialisation: Ψ^(0)
Repeat until convergence:
- Expectation: computation of the t_ik^(q)
- Maximisation: Ψ^(q+1) = arg max_Ψ Q(Ψ, Ψ^(q))
Analytical solution for the proportions: πk^(q+1) = (1/N) ∑_{i=1}^{N} t_ik^(q).
Analytical solution for the θk for classical laws (the sum is now outside the logarithm).
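As a companion to the algorithm above, a minimal sketch of the classical EM recursion for a univariate Gaussian mixture (assuming NumPy and SciPy are available; function and variable names are illustrative, not taken from the slides):

```python
import numpy as np
from scipy.stats import norm

def em_gaussian_mixture(x, K, n_iter=100, seed=0):
    """Minimal EM for a univariate Gaussian mixture (unsupervised case)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Crude initialisation of Psi^(0)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    sigma = np.full(K, x.std())
    for _ in range(n_iter):
        # E step: posterior probabilities t_ik (Eq. 7)
        dens = np.array([pi[k] * norm.pdf(x, loc=mu[k], scale=sigma[k]) for k in range(K)]).T
        t = dens / dens.sum(axis=1, keepdims=True)
        # M step: closed-form updates of the proportions and Gaussian parameters
        Nk = t.sum(axis=0)
        pi = Nk / N
        mu = (t * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((t * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return pi, mu, sigma
```

Convergence can be monitored through the log-likelihood (5), which is non-decreasing along the iterations.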
The different settings and their classical solutions Semi-supervised learning
Semi-supervised learning
Data
A mix of labelled and unlabelled samples
Xss = {(x1, y1), . . . , (xM , yM), xM+1, . . . , xN}
For the unlabelled samples to be useful, hypotheses must be made:
- a model of p(x, y) → generative methods (mixture models [Nigam 00]);
- a decision boundary in a low-density area → discriminative methods (data-dependent regularization [Bengio 05]).
The different settings and their classical solutions Semi-supervised learning
Semi-supervised learning with mixture models
Generative setting
log-Likelihood :
L(Ψ; Xss) = ∑_{i=1}^{M} ∑_{k=1}^{K} z_ik ln(πk f(xi; θk)) + ∑_{i=M+1}^{N} ln(∑_{k=1}^{K} πk f(xi; θk)),   (8)
Optimisation
Modification of the EM algorithm :
- During the E step, the posterior probabilities are computed only for the unlabelled samples.
- During the M step, the known labels are used instead of the posterior probabilities for the labelled samples.
The different settings and their classical solutions Partially supervised learning
Partially supervised learning
Data
each label is a set of possible classes :
Xps = {(xi, Ci)}_{i=1}^{N}
where Ci ⊆ Y is the set of possible classes for sample i (it can be Y).
⇒ More general: semi-supervised learning is a special case of partially supervised learning.
Solutions in the literature:
- self-consistent logistic regression [Grandvallet 02]
- mixture models [Ambroise 00]
The different settings and their classical solutions Partially supervised learning
Partially supervised learning with mixture models
Generative setting
log-Likelihood :
L(Ψ; Xps) = ∑_{i=1}^{N} ln(∑_{k=1}^{K} l_ik πk f(xi; θk)),   (9)

with li ∈ {0, 1}^K binary variables encoding the possible classes: l_ik = 1 if ck ∈ Ci, and l_ik = 0 otherwise.
Optimisation
Modification of the EM algorithm: during the E step the posterior probabilities are computed using:

t_ik^(q) = l_ik πk^(q) f(xi; θk^(q)) / ∑_{k'=1}^{K} l_ik' πk'^(q) f(xi; θk'^(q)).   (10)
Learning with soft supervision Available solutions ?
Learning with soft supervision
Data
labels = bba mi over Y :
Xsd = {(xi, mi)}_{i=1}^{N}
⇒ All the other settings are special cases of this one, depending on the kind of bba used.
Existing solutions in the belief function framework:
- k-nearest neighbours [Denœux 95]
- belief decision trees [Elouedi 01]
- ...
Learning with soft supervision Generative setting in the belief function framework
Learning with soft supervision, mixture model and generative setting
[Vannoorenberghe 05]
Previous work: an attempt to deal with soft labels in mixture models through a modified EM algorithm.
Problems:
- no criterion;
- convergence?
- relationship with the probabilistic formulation unclear.
Learning with soft supervision The criterion
The criterion
Starting point :
Relation between likelihood, possibility and plausibility.
Natural criterion: the parameters Ψ with the highest plausibility given the observations:

Ψ̂ = arg max_Ψ plΨ(Ψ | Xsd)

plΨ(Ψ | Xsd) = plX1×...×XN(x1, . . . , xN | Ψ)   (GBT)
             = ∏_{i=1}^{N} plXi(xi | Ψ)   (conditional independence)

Development of plXi(xi | Ψ):

plXi(xi | Ψ) = ∑_{C⊆Y} mYi(C | Ψ) plXi|Yi(xi | C, Ψ)   (total plausibility theorem)

Need to calculate: mYi(C | Ψ) and plXi|Yi(xi | C, Ψ).
Learning with soft supervision The criterion
The criterion
mYi(C | Ψ): bba over the classes

Two belief functions over Yi: π and mi.
Assumptions: the labels are independent of the parameters ⇒ conjunctive combination; π is a Bayesian bba ⇒ after combination the focal sets are singletons:

mYi(ck | Ψ) = (mi ∩© π)(ck), ∀k ∈ {1, ..., K}
            = ∑_{C∩{ck}≠∅} mi(C) · π(ck)
            = plik · πk

bba over Y:

mYi(ck | Ψ) = plik · πk   (11)
Learning with soft supervision The criterion
The criterion
plXi|Yi(· | ck, Ψ): plausibility of the observations given the class

From the model, plXi|Yi(· | ck, Ψ) is the plausibility measure associated with f(·; θk).

Problem
⇒ The plausibility of a point known with infinite precision is zero.

Solution
Punctual observations? The precision is always finite → consider a small region around the point. Approximation: product of the density with an infinitesimal volume dxi1 . . . dxip:

plXi|Yi(xi | ck, Ψ) = f(xi; θk) dxi1 . . . dxip.   (12)
Learning with soft supervision The criterion
The criterion
Using all of this
plXi(xi | Ψ) = (∑_{k=1}^{K} plik πk f(xi; θk)) dxi1 . . . dxip.   (13)

Final criterion: the generalized log-likelihood

L(Ψ; Xsd) = ∑_{i=1}^{N} ln(∑_{k=1}^{K} plik πk f(xi; θk)) + Cst.   (14)
Remarks:
- natural expression;
- depends only on the plausibilities of the singletons;
- extends all the probabilistic criteria (semi-supervised, partially supervised, ...).
Learning with soft supervision Optimisation
Optimisation
Information available on the sample classes
At iteration q, information on the class label of each sample comes from three sources:
1. the label mYi;
2. the proportions π^(q) (a Bayesian bba over Y);
3. the observation xi and the current parameter estimates θ^(q), which lead to a bba over Y by the GBT: plYi|Xi({ck} | xi, Ψ) = f(xi; θk^(q)) dxi1 . . . dxip.

tik: conjunctive combination of the three sources:

t_ik^(q) = plik πk^(q) f(xi; θk^(q)) / ∑_{k'=1}^{K} plik' πk'^(q) f(xi; θk'^(q)),
Learning with soft supervision Optimisation
Optimisation
Log-likelihood decomposition (classical, as in EM):

L(Ψ; Xsd) = ∑_{i=1}^{N} ∑_{k=1}^{K} t_ik^(q) ln(plik πk f(xi; θk)) − ∑_{i=1}^{N} ∑_{k=1}^{K} t_ik^(q) ln(t_ik),

where the first term is Q(Ψ, Ψ^(q)) and the second is H(Ψ, Ψ^(q)), with:

t_ik^(q) = plik πk^(q) f(xi; θk^(q)) / ∑_{k'=1}^{K} plik' πk'^(q) f(xi; θk'^(q)).

Increasing the likelihood
H has the same form as in classical EM ⇒ optimising Q is sufficient ⇒ same algorithm, except that in the E step the tik are computed using the soft labels.
Learning with soft supervision Optimisation
Optimisation
An EM algorithm
Initialisation: Ψ^(0)
Repeat until convergence:
- Expectation: t_ik^(q) = plik πk^(q) f(xi; θk^(q)) / ∑_{k'=1}^{K} plik' πk'^(q) f(xi; θk'^(q)),
- Maximisation: Ψ^(q+1) = arg max_Ψ Q(Ψ, Ψ^(q))
Analytical solution for the proportions: πk^(q+1) = (1/N) ∑_{i=1}^{N} t_ik^(q).
Analytical solution for the θk for classical laws.
Convergence
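To illustrate how the E step changes once soft labels enter, here is a minimal sketch in the univariate Gaussian case, assuming the singleton plausibilities are stored in an N×K array pl (an illustration under these assumptions, not the authors' code); the M step is unchanged with respect to classical EM:

```python
import numpy as np
from scipy.stats import norm

def e_step_soft(x, pl, pi, mu, sigma):
    """E step with soft labels: t_ik proportional to pl_ik * pi_k * f(x_i; theta_k)."""
    K = len(pi)
    dens = np.array([pi[k] * norm.pdf(x, loc=mu[k], scale=sigma[k]) for k in range(K)]).T  # (N, K)
    weighted = pl * dens                      # combination of labels, proportions and densities
    t = weighted / weighted.sum(axis=1, keepdims=True)
    # Generalized log-likelihood (14), up to the additive constant
    log_lik = np.log(weighted.sum(axis=1)).sum()
    return t, log_lik
```

With pl set to all ones this reduces to the unsupervised E step (7), and with 0/1 plausibilities to the partially supervised E step (10).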
Experimentations Experimentation (1) : from unsupervised to supervised learning
Experimentation (1): from unsupervised to supervised learning
from unsupervised to supervised learning
Simulation of different sets of labels with varying label precision. Do more precise labels give:
- better results?
- a simpler optimisation problem?
Experimentations Experimentation (1) : from unsupervised to supervised learning
Information measure : non specificity
Average non-specificity µs of the learning-set labels:

NS(mi) = ∑_{C⊆Y} mYi(C) log(card(C))

ranging from 0 (perfect labels, supervised learning) to ln(card(Y)) (vacuous labels, unsupervised learning).
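For reference, a small sketch of this computation (the dictionary representation of a bba, mapping focal sets to masses, is my own choice, not from the slides):

```python
import numpy as np

def non_specificity(bba):
    """NS(m) = sum over focal sets C of m(C) * log(|C|); bba given as {frozenset: mass}."""
    return sum(mass * np.log(len(C)) for C, mass in bba.items() if len(C) > 0)

# Vacuous label over K = 3 classes: NS = log(3) ~ 1.0986
print(non_specificity({frozenset({0, 1, 2}): 1.0}))
# Perfect label: NS = 0
print(non_specificity({frozenset({0}): 1.0}))
```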
Experimentations Experimentation (1) : from unsupervised to supervised learning
Label simulation:
- for each sample i, a non-specificity NS(mi) is drawn uniformly in [µs − 0.05, µs + 0.05];
- a plausibility pli with NS(mi) as non-specificity is built;
- each plausibility value is assigned to a class: the true class receives the highest plausibility, the other classes are assigned at random.
⇒ No error in the labelling (this assumption is removed in the second experiment); a sketch of one possible construction is given after the data description below.
Data simulation
Mixture of two Gaussians: Σ1 = Σ2 = I, π1 = π2 = 0.5, dimension 10; distance between the two centres δ ∈ {1, 2, 4}; learning set size N = 1000.
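The label construction is only described verbally above; the following sketch shows one simple way to obtain a consonant bba with a prescribed non-specificity and no labelling error in the two-class case (this particular construction is an assumption, not necessarily the one used in the experiment):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_soft_labels(y_true, mu_s, K=2):
    """Soft labels with average non-specificity mu_s and no labelling error.

    For each sample a target non-specificity is drawn uniformly in
    [mu_s - 0.05, mu_s + 0.05]; the consonant bba m({c_true}) = 1 - a,
    m(Y) = a with a = NS / log(K) has exactly that non-specificity and
    keeps the highest plausibility on the true class.
    """
    y_true = np.asarray(y_true)
    N = len(y_true)
    ns = rng.uniform(mu_s - 0.05, mu_s + 0.05, size=N)
    a = np.clip(ns / np.log(K), 0.0, 1.0)
    pl = np.tile(a[:, None], (1, K))     # pl_ik = a for the other classes
    pl[np.arange(N), y_true] = 1.0       # plausibility of the true class is 1
    return pl
```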
Experimentations Experimentation (1) : from unsupervised to supervised learning
Influence of labelling on results quality, δ = 1
Fig.: Empirical classification error (%) on the test set as a function of the mean label non-specificity, δ = 1, N = 1000.
Experimentations Experimentation (1) : from unsupervised to supervised learning
Influence of labelling on results quality, δ = 2
Fig.: Empirical classification error (%) on the test set as a function of the mean label non-specificity, δ = 2, N = 1000.
Experimentations Experimentation (1) : from unsupervised to supervised learning
Influence of labelling on results quality, δ = 4
Fig.: Empirical classification error (%) on the test set as a function of the mean label non-specificity, δ = 4, N = 1000.
Experimentations Experimentation (1) : from unsupervised to supervised learning
Influence of label precision on the optimisation
δ = 2, N = 1000
Fig.: (a) Number of local maxima and (b) number of iterations before convergence, as a function of the mean label non-specificity.
Experimentations Experimentation (1) : from unsupervised to supervised learning
Conclusions on experiment (1)
More precise labels give:
- better classification performance when the classes overlap;
- a simpler optimisation problem: fewer local maxima and faster EM convergence.
Experimentations Experiment (2) : Label noise
Experiment (2) : Label noise
Context:
- the labelling is done by an expert;
- some mistakes are made during the labelling process;
- but a value pi ∈ [0, 1], the expert's doubt, is supplied.
How to deal with such additional information ?
Solution with “soft” labels
Discounting the hard label ck with pi:

mi(ck) = 1 − pi, mi(Y) = pi  ⇔  plik = 1, plik' = pi, ∀k' ≠ k   (15)
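A possible implementation of this discounting, assuming the expert labels and doubt values are stored as arrays (the function name is illustrative):

```python
import numpy as np

def discounted_plausibilities(y_expert, p_doubt, K):
    """Soft labels from hard expert labels discounted by the doubt p_i (Eq. 15):
    pl_ik = 1 for the expert's class, pl_ik' = p_i for the other classes."""
    y_expert = np.asarray(y_expert)
    N = len(y_expert)
    pl = np.tile(np.asarray(p_doubt, dtype=float)[:, None], (1, K))
    pl[np.arange(N), y_expert] = 1.0
    return pl

# Example: two samples labelled c0 and c2 with doubts 0.1 and 0.4 (K = 3)
print(discounted_plausibilities([0, 2], [0.1, 0.4], K=3))
# [[1.  0.1 0.1]
#  [0.4 0.4 1. ]]
```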
Experimentations Experiment (2) : Label noise
Simulations:
- data = benchmark classification problems (UCI);
- the expert's doubt pi is simulated with a Beta law;
- with probability pi, the true label is flipped.
Remark: the expectation of pi is the asymptotic labelling error rate.
Comparison :
Supervised learning, unsupervised learning, adaptive semi-supervised learning, and the label noise model.
Performance criterion: classification error.
Parameters:
- expectation of the Beta law: 0.1 → 0.4 (proportion of false labels);
- variance of the Beta law kept constant at 0.2.
Results are averaged over 100 simulated learning sets, and performances are assessed by cross-validation (10 blocks); a small simulation sketch follows.
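A sketch of this noise simulation under stated assumptions (the Beta parameterisation below only fixes the mean; the exact dispersion setting of the slides is not reproduced):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_label_noise(y_true, K, mean_doubt):
    """Draw a doubt p_i from a Beta law with the requested mean, then flip
    the true label with probability p_i (uniformly to another class)."""
    y_true = np.asarray(y_true)
    a = 2.0                                       # illustrative shape parameter
    b = a * (1.0 - mean_doubt) / mean_doubt       # gives E[p_i] = mean_doubt
    p = rng.beta(a, b, size=len(y_true))
    y_noisy = y_true.copy()
    flip = rng.random(len(y_true)) < p
    # replace flipped labels by a class drawn uniformly among the other classes
    y_noisy[flip] = (y_true[flip] + rng.integers(1, K, size=flip.sum())) % K
    return y_noisy, p
```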
Experimentations Experiment (2) : Label noise
Results Iris
Fig.: Iris. Empirical classification error (%) as a function of the asymptotic label error rate (%), comparing soft labels, supervised, unsupervised, semi-supervised learning and the label noise model.
Experimentations Experiment (2) : Label noise
Results Wine
Fig.: Wine. Empirical classification error (%) as a function of the asymptotic label error rate (%), comparing soft labels, supervised, unsupervised, semi-supervised learning and the label noise model.
Experimentations Experiment (2) : Label noise
Results Crabs
Fig.: Crabs. Empirical classification error (%) as a function of the asymptotic label error rate (%), comparing soft labels, supervised, unsupervised, semi-supervised learning and the label noise model.
Experimentations Experiment (2) : Label noise
Results Breast Cancer Wisconsin
Fig.: Breast Cancer Wisconsin. Empirical classification error (%) as a function of the asymptotic label error rate (%), comparing soft labels, supervised, unsupervised, semi-supervised learning and the label noise model.
Experimentations Experiment (2) : Label noise
Conclusions on experiment (2)
- Information on label reliability is useful in the presence of label noise.
- Our solution, based on mixture models with soft labels, can deal with such additional information efficiently.
Conclusion
Conclusion
Results:
- definition of a generalized likelihood criterion;
- proof of EM convergence when the likelihood is bounded and smooth;
- relations with the probabilistic criteria;
- validation through experiments.
Advantages of the method
- Can benefit from all the developments made in the probabilistic framework (parsimonious models, model selection criteria such as BIC, ...).
- Can deal with any type of information on the labels.
Disadvantages of the method
- Generative setting: the distributional assumptions must hold for the method to work.
Conclusion
Future work
Performance assessment:
- validation of classification results (how to extend the misclassification rate to soft labels).
Extension to other latent variable models:
- Latent Trait Analysis: continuous latent variables, discrete observations;
- Latent Profile Analysis: discrete latent variables, discrete observations;
- Independent Factor Analysis: mixed continuous and discrete latent variables, continuous observations.
Conclusion
C. Ambroise & G. Govaert. EM algorithm for partially known labels. In Proceedings of the 7th International Conference of the IFCS, pages 161–166. Springer, 2000.

Y. Bengio & Y. Grandvalet. Semi-supervised learning by entropy minimization. Advances in Neural Information Processing Systems, vol. 17, 2005.

T. Denœux. A k-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE Transactions on Systems, Man and Cybernetics, vol. 25, no. 5, pages 804–813, May 1995.

Z. Elouedi, K. Mellouli & P. Smets. Belief decision trees: theoretical foundations. International Journal of Approximate Reasoning, vol. 28, 2001.

Y. Grandvallet. Logistic regression for partial labels. In 9th International Conference IPMU'02, volume III, pages 1935–1941, 2002.

K. Nigam, A. McCallum, S. Thrun & T. Mitchell. Text classification from labelled and unlabelled documents using EM. Machine Learning, vol. 39, no. 2-3, pages 103–134, 2000.

P. Smets. Belief functions: the disjunctive rule of combination and the generalized Bayesian theorem. International Journal of Approximate Reasoning, vol. 9, no. 1, pages 1–35, August 1993.

P. Smets. Belief functions on real numbers. International Journal of Approximate Reasoning, vol. 40, no. 3, pages 181–223, 2005.

P. Vannoorenberghe & P. Smets. Partially supervised learning by a credal EM approach. Springer, 2005.