Study on Ensemble Learning
By Feng Zhou
Content
• Introduction
• A Statistical View of M3 Network
• Future Works
Introduction
• Ensemble learning:
– Combining a group of classifiers rather than designing a new one.
– The decisions of multiple hypotheses are combined to produce more accurate results.
• Problems in traditional learning algorithms
– Statistical Problem
– Computational Problem
– Representation Problem
• Related Works
– Resampling techniques: Bagging, Boosting
– Approaches for extending to multi-class problems: One-vs-One, One-vs-All.
Min-Max-Modular (M3) Network (Lu, IEEE TNN 1999)
• Steps
– Dividing training sets (Chen, IJCNN 2006; Wen, ICONIP 2005)
– Training pair-wise classifiers
– Integrating the outcomes (Zhao, IJCNN 2005)
• Min process
• Max process
[Figure: min-max integration example. Each row of module outputs passes through a MIN unit, and a MAX unit combines the row minima.]

Module outputs          MIN
0.1 0.5 0.7 0.2    →    0.1
0.4 0.3 0.5 0.6    →    0.3
0.8 0.5 0.4 0.2    →    0.2
0.5 0.9 0.7 0.3    →    0.3
                   MAX: 0.3
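A minimal sketch of this integration step in code (the function name and matrix layout are mine, following the example above): each row of pairwise-module outputs is collapsed by a MIN unit, and a MAX unit then combines the row minima.

```python
def min_max_combine(outputs):
    """MIN over each row of module outputs, then MAX over the row minima."""
    row_minima = [min(row) for row in outputs]  # Min process
    return max(row_minima)                      # Max process

# Module-output matrix from the example above.
outputs = [
    [0.1, 0.5, 0.7, 0.2],
    [0.4, 0.3, 0.5, 0.6],
    [0.8, 0.5, 0.4, 0.2],
    [0.5, 0.9, 0.7, 0.3],
]
print(min_max_combine(outputs))  # 0.3, matching the diagram
```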
A Statistical View
• Assumption
– The pair-wise classifier outputs a probabilistic value, obtained with a sigmoid function (J. C. Platt, ALMC 1999):
$$P(w \mid x) = \frac{1}{1 + e^{Ax + B}}$$
• Bayesian decision theory
$$\hat{w} = \arg\max_{w \in \{w_+,\, w_-\}} P(w \mid x), \quad \text{where} \quad P(w \mid x) = \frac{P(x \mid w)\, P(w)}{P(x \mid w_+) P(w_+) + P(x \mid w_-) P(w_-)}$$
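A small sketch of these two ingredients (function and parameter names are mine): Platt's sigmoid maps a raw classifier score to a probability, and Bayes' rule turns class-conditional densities and priors into a posterior for the decision rule.

```python
import math

def platt_sigmoid(score, A, B):
    """Platt-style calibration: P(w | x) = 1 / (1 + e^(A*score + B))."""
    return 1.0 / (1.0 + math.exp(A * score + B))

def posterior_positive(px_pos, px_neg, prior_pos=0.5):
    """Two-class Bayes rule: P(w+ | x) from class-conditional densities."""
    num = px_pos * prior_pos
    return num / (num + px_neg * (1.0 - prior_pos))

# Decision rule: pick the class with the larger posterior.
p = posterior_positive(px_pos=0.5, px_neg=0.4)
label = '+' if p >= 0.5 else '-'
```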
A Simple Discrete Example
P(w|x)
        w+      w-
x1      1/2
x2      1/2     2/5
x3              2/5
x4              1/5
A Simple Discrete Example (II)
Classifier 0 (w+ : w-):  Pc0(w+ | x = x2) = 1/3
Classifier 1 (w+ : w1-): Pc1(w+ | x = x2) = 1/2
Classifier 2 (w+ : w2-): Pc2(w+ | x = x2) = 1/2
⇒ Pc0 < min(Pc1, Pc2)
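A quick numeric check of these values (the joint masses below are my assumption: giving x2 equal mass under w+, w1-, and w2- is one choice that reproduces the slide's posteriors):

```python
# Hypothetical joint masses p(x2, class); equal masses are an assumption.
m_pos, m_neg1, m_neg2 = 1.0, 1.0, 1.0

pc1 = m_pos / (m_pos + m_neg1)           # Classifier 1 (w+ : w1-) -> 1/2
pc2 = m_pos / (m_pos + m_neg2)           # Classifier 2 (w+ : w2-) -> 1/2
pc0 = m_pos / (m_pos + m_neg1 + m_neg2)  # Classifier 0 (w+ : w-)  -> 1/3
assert pc0 < min(pc1, pc2)
```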
A More Complicated Example
• When one more classifier is considered, the evidence that x belongs to w+ shrinks.
• Pglobal(w+) < min(Ppartial(w+))
• The classifier reporting the minimum value contains the most information about w- (minimization principle).
• If Ppartial(w+) = 1, the classifier contains no information about w-.
[Diagram: Classifier 1 (w+ : w1-), Classifier 2 (w+ : w2-), ...; the information about w- increases as classifiers are added.]
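The inequality follows in one step from the joint masses (the notation $a$, $b_j$ is mine): with $a = p(x, w_+)$ and $b_j = p(x, w_j^-) \ge 0$,

$$P_{\text{global}}(w_+ \mid x) = \frac{a}{a + \sum_j b_j} \;\le\; \frac{a}{a + b_j} = P^{(j)}_{\text{partial}}(w_+ \mid x) \quad \text{for every } j,$$

so the global posterior is bounded by the smallest partial posterior; each additional classifier can only add mass $b_j$ to the denominator, which is why the evidence for w+ shrinks.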
Analysis
• For each classifier c_ij:
$$M_{ij} = P(w_i^+ \mid x,\, w_i^+ \cup w_j^-) = \frac{p(x, w_i^+)}{p(x, w_i^+) + p(x, w_j^-)}$$
• For each sub-positive class w_i+:
$$q_i = P(w_i^+ \mid x,\, w_i^+ \cup w_-) = \frac{1}{\sum_{j=1}^{n_-} \frac{1}{M_{ij}} - (n_- - 1)}$$
• For the positive class w+:
$$P(w_+ \mid x) = \frac{1}{1 + \left( \sum_{i=1}^{n_+} \frac{q_i}{1 - q_i} \right)^{-1}}$$
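A sketch of these combination formulas in code (the names are mine, and it assumes every q_i < 1 so the second sum is defined), applied to the module-output matrix from the M3 slide:

```python
def statistical_combine(M):
    """Combine pairwise posteriors M[i][j] = P(w_i+ | x, w_i+ U w_j-)
    into P(w+ | x) using the formulas above."""
    n_neg = len(M[0])
    # Statistical counterpart of the MIN unit: q_i = P(w_i+ | x, w_i+ U w-).
    q = [1.0 / (sum(1.0 / m for m in row) - (n_neg - 1)) for row in M]
    # Statistical counterpart of the MAX unit: merge the sub-positive classes.
    s = sum(qi / (1.0 - qi) for qi in q)  # assumes q_i < 1 for all i
    return s / (s + 1.0)                  # equals 1 / (1 + 1/s)

M = [
    [0.1, 0.5, 0.7, 0.2],
    [0.4, 0.3, 0.5, 0.6],
    [0.8, 0.5, 0.4, 0.2],
    [0.5, 0.9, 0.7, 0.3],
]
print(statistical_combine(M))  # ~0.40, vs. 0.3 from the plain min/max rule
```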
Analysis (II)
• Decomposition of a complex problem
• Restoration to the original resolution
Composition of Training Sets
[Diagram: grid of all pairs among the subclasses w1+ ... wn++ and w1- ... wn--, marking which pairs have been used as training sets, which form trivial and useless sets, and which have not been used yet.]
Another Way of Combination
[Diagram: the same grid of subclass pairs, now drawing on the previously unused pairs as well.]
$$M'_{ki} = P(w_i^+ \mid x,\, w_i^+ \cup w_k^+) = \frac{p(x, w_i^+)}{p(x, w_i^+) + p(x, w_k^+)}$$
$$q'_i = P(w_i^+ \mid x) = \frac{1}{\sum_{k \ne i} \frac{1}{M'_{ki}} + \sum_{j=1}^{n_-} \frac{1}{M_{ij}} - (n_+ + n_- - 2)}$$
Training and testing time: $O(n_+ \cdot n_-)$ vs. $O(n_+ + n_-)$
Experiments – Synthetic Data
Experiments – Text Categorization (20 Newsgroups corpus)
Experiment Setup
• Removing words: stemming, stop words, words occurring fewer than 30 times (see the sketch below)
• Using Naïve Bayes as the elementary classifier
• Estimating the probability with a sigmoid function
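A sketch of this setup with scikit-learn (the library and its parameters are my assumptions; the slides do not name an implementation). Stemming is omitted, `min_df=30` stands in for the "fewer than 30" rule, and Naïve Bayes' own probability estimates stand in for the sigmoid calibration step.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Assumed preprocessing: drop English stop words and rare words.
data = fetch_20newsgroups(subset='train')
vec = CountVectorizer(stop_words='english', min_df=30)
X = vec.fit_transform(data.data)

# Naive Bayes as the elementary classifier; predict_proba plays the
# role of the probabilistic output (the slides fit a sigmoid instead).
clf = MultinomialNB().fit(X, data.target)
probs = clf.predict_proba(X[:5])
```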
Future Work
• Situation with consideration of noise
– The nature of the problem: to access the underlying distribution
– Independent parameters for the model
– Constraints we get
– To obtain the best estimation
Kullback-Leibler Distance (T. Hastie, Ann. Statist. 1998)
$$\ell(p) = \sum_{i<j} n_{ij} \left[ r_{ij} \log\frac{r_{ij}}{\mu_{ij}} + (1 - r_{ij}) \log\frac{1 - r_{ij}}{1 - \mu_{ij}} \right]$$
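A small sketch of this weighted KL criterion in code (following Hastie & Tibshirani's notation: r_ij are observed pairwise probabilities, mu_ij the model's, n_ij the pair sample sizes; the dictionary layout is my own):

```python
import math

def weighted_kl(r, mu, n):
    """Weighted KL criterion: sum over pairs i < j of
    n_ij * [ r_ij*log(r_ij/mu_ij) + (1-r_ij)*log((1-r_ij)/(1-mu_ij)) ]."""
    total = 0.0
    for pair, r_ij in r.items():
        m = mu[pair]
        total += n[pair] * (r_ij * math.log(r_ij / m)
                            + (1 - r_ij) * math.log((1 - r_ij) / (1 - m)))
    return total

# Toy usage: two pairs with observed vs. model pairwise probabilities.
r = {(0, 1): 0.6, (0, 2): 0.7}
mu = {(0, 1): 0.55, (0, 2): 0.65}
n = {(0, 1): 50, (0, 2): 40}
print(weighted_kl(r, mu, n))
```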
References
[1] T. Hastie & R. Tibshirani, "Classification by pairwise coupling," Ann. Statist., 1998.
[2] J. C. Platt, "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods," Advances in Large Margin Classifiers (ALMC), 1999.
[3] B. L. Lu & M. Ito, "Task decomposition and module combination based on class relations: a modular neural network for pattern classification," IEEE Trans. Neural Networks, 1999.
[4] Y. M. Wen & B. Lu, "Equal clustering makes min-max modular support vector machines more efficient," ICONIP 2005.
[5] H. Zhao & B. Lu, "On efficient selection of binary classifiers for min-max modular classifier," IJCNN 2005.
[6] K. Chen & B. Lu, "Efficient classification of multi-label and imbalanced data using min-max modular classifiers," IJCNN 2006.