# RM World 2014: Semi-Supervised Text Classification Operator
NB – Naïve Bayes
SSL – Semi-Supervised Learning
TC – Text Classification
EM – Expectation Maximization
SVM – Support Vector Machine
As the number of training documents increases, the accuracy of text classification increases, but traditional classifiers need labeled data for training.

Labeled instances, however, are often difficult, expensive, or time-consuming to obtain, as they require the effort of experienced human annotators. Meanwhile, unlabeled data may be relatively easy to collect.

Semi-supervised learning makes use of both labeled and unlabeled documents for classification, so how to use semi-supervised learning when few labeled documents are available is a research problem.
In the field of machine learning, semi-supervised learning
(SSL) occupies the middle ground, between supervised
learning (in which all training examples are labeled) and
unsupervised learning (in which no labeled data are given).
Interest in SSL has increased in recent years, particularly
because of application domains in which unlabeled data are
plentiful, such as images, text, and bioinformatics.
Pre-processing example:

Original sentence: "India and china are joining WTO."
After tokenization: { India, and, china, are, joining, WTO }
After stop-word removal: { India, china, joining, WTO }
After stemming: { India, china, join, WTO }
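The pipeline above can be sketched in Java; the stop-word list and the "-ing" suffix rule below are simplified stand-ins for real stop-word removal and stemming (tokens are also lower-cased here):

```java
import java.util.*;
import java.util.stream.*;

// Minimal sketch of the three pre-processing steps shown above:
// tokenization, stop-word removal, and (naive) suffix stemming.
public class Preprocess {
    static final Set<String> STOP_WORDS = Set.of("and", "are", "is", "the", "of");

    static List<String> preprocess(String sentence) {
        return Arrays.stream(sentence.toLowerCase().split("[^a-z]+")) // tokenize
                .filter(t -> !t.isEmpty() && !STOP_WORDS.contains(t)) // drop stop words
                .map(t -> t.endsWith("ing") ? t.substring(0, t.length() - 3) : t) // naive stemmer
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(preprocess("India and china are joining WTO."));
        // [india, china, join, wto]
    }
}
```

A production pipeline would use a real stemmer (e.g. Porter's) instead of the suffix rule.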
1. Term Frequency: TFij = nij / di
   TF of 'chinese': in d1 = 2/3, in d2 = 2/3, in d3 = 1/2, in d4 = 1/3
2. Document Frequency: DF = nj / n
   DF of 'chinese' = 4/4 = 1
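These two formulas can be checked with a small Java sketch; the four example documents below are assumptions chosen to reproduce the TF values above (2/3, 2/3, 1/2, 1/3) and DF = 1 for 'chinese':

```java
import java.util.*;

// Sketch of the TF and DF formulas above on assumed example documents.
public class TermStats {
    static double tf(String term, List<String> doc) {
        long n = doc.stream().filter(term::equals).count();
        return (double) n / doc.size();            // TFij = nij / di
    }

    static double df(String term, List<List<String>> docs) {
        long n = docs.stream().filter(d -> d.contains(term)).count();
        return (double) n / docs.size();           // DF = nj / n
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("chinese", "beijing", "chinese"),
            List.of("chinese", "chinese", "shanghai"),
            List.of("chinese", "macao"),
            List.of("tokyo", "japan", "chinese"));
        for (List<String> d : docs) System.out.println(tf("chinese", d));
        System.out.println(df("chinese", docs)); // 1.0
    }
}
```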
Precision = TP / (TP + FP) = 1/2
Recall = TP / (TP + FN) = 1/(1+0) = 1
F1 = (2 × Precision × Recall / (Precision + Recall)) × 100 %
F1 = 0.667 × 100 % ≈ 66.7 %
For the class "Chinese":

| Doc | Words in Doc | Actual label Chinese? | Predicted label Chinese? |
|-----|--------------|-----------------------|--------------------------|
| d10 | India India India Delhi Mumbai Chinese | N | Y |
| d11 | Chinese Beijing Chinese | Y | Y |

Confusion matrix (rows: predicted, columns: actual):

|   | Y | N |
|---|---|---|
| Y | TP = 1 | FP = 1 |
| N | FN = 0 | TN = 0 |
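The computation above, as a minimal Java sketch using the same confusion-matrix counts (TP = 1, FP = 1, FN = 0, TN = 0):

```java
// Minimal sketch of the precision / recall / F1 computation from the
// confusion matrix above.
public class F1Score {
    static double precision(int tp, int fp) { return (double) tp / (tp + fp); }
    static double recall(int tp, int fn)    { return (double) tp / (tp + fn); }
    static double f1(double p, double r)    { return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        double p = precision(1, 1);                          // 1 / (1 + 1) = 0.5
        double r = recall(1, 0);                             // 1 / (1 + 0) = 1.0
        System.out.printf("F1 = %.1f%%%n", f1(p, r) * 100);  // F1 = 66.7%
    }
}
```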
Training set before and after semi-supervised learning:

| | Doc Id | Class (training set) | Class (after SSL) |
|---|---|---|---|
| Labeled documents | D1 | India | India |
| | D2 | China | China |
| Unlabeled documents | D3 | ? | India |
| | D4 | ? | China |
| | D5 | ? | India |
| | D6 | ? | India |

Test set:

| Doc Id | Class |
|---|---|
| D7 | India |
| D8 | China |
| D9 | China |
| D10 | India |
| D11 | India |
| D12 | India |
Semi-supervised learning methods:

- Low-Density Separation (SVM)
- Graph-Based Methods
- Co-Training (Multi-View Approach)
- Generative Methods
In probability and statistics, a generative model is a model for randomly generating observable data, typically given some hidden parameters.

Generative models are used in machine learning for either modeling data directly (i.e., modeling observed draws from a probability density function) or as an intermediate step to forming a conditional probability density function. A conditional distribution can be formed from a generative model through the use of Bayes' rule.
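A tiny numeric illustration of that Bayes'-rule step (the priors and likelihoods below are made-up numbers, not values from the slides):

```java
// Sketch of turning a generative model P(d|c) P(c) into the
// conditional P(c|d) via Bayes' rule, on made-up numbers.
public class BayesRule {
    static double[] posterior(double[] prior, double[] likelihood) {
        double evidence = 0;                          // P(d) = sum_c P(d|c) P(c)
        for (int c = 0; c < prior.length; c++) evidence += prior[c] * likelihood[c];
        double[] post = new double[prior.length];
        for (int c = 0; c < prior.length; c++)
            post[c] = prior[c] * likelihood[c] / evidence;   // Bayes' rule
        return post;
    }

    public static void main(String[] args) {
        double[] post = posterior(new double[]{0.5, 0.5}, new double[]{0.08, 0.02});
        System.out.printf("P(c0|d) = %.2f, P(c1|d) = %.2f%n", post[0], post[1]);
        // P(c0|d) = 0.80, P(c1|d) = 0.20
    }
}
```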
In statistics, an expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models. It is widely used for learning in the presence of unobserved variables, e.g., missing features or class labels.
Algorithm

N = number of labeled documents, U = number of unlabeled documents.

Inputs: collections Dl of labeled documents and Du of unlabeled documents.

Method:

- Build an initial naive Bayes classifier, θ̂, from the labeled documents Dl only. Use maximum a posteriori parameter estimation to find θ̂ = argmax_θ P(D|θ)P(θ).
- Loop while the classifier parameters improve lc(θ|D, z), the complete log probability of the labeled and unlabeled data:
  - (E-step) Use the current classifier, θ̂, to estimate the component membership of each unlabeled document, i.e. the probability P(cj|di; θ̂) that each mixture component (and class) generated each document.
  - (M-step) Re-estimate the classifier, θ̂, given the estimated component membership of each document. Use maximum a posteriori parameter estimation to find θ̂ = argmax_θ P(D|θ)P(θ).

Output: a classifier, θ̂, that takes an unlabeled document and predicts a class label.
The algorithm first trains a classifier with only the available
labeled documents, and assigns probabilistically-weighted
class labels to each unlabeled document by using the
classifier to calculate their expectation.
It then trains a new classifier using all the documents, both the originally labeled and the formerly unlabeled, and iterates.
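The loop just described can be sketched in plain Java on toy data. This is an illustrative reimplementation, not the RapidMiner operator's code: documents are word lists, MAP estimation is Laplace smoothing, and a fixed iteration count stands in for the convergence test on the complete log probability:

```java
import java.util.*;

// Compact sketch of EM for naive Bayes: train on labeled documents, then
// alternate E-steps (soft-label the unlabeled documents) and M-steps
// (re-estimate parameters from all documents). All data is illustrative.
public class NaiveBayesEM {
    List<String> vocab; int k;               // vocabulary, number of classes
    double[] prior; double[][] wordProb;     // P(c), P(w|c)

    NaiveBayesEM(List<String> vocab, int k) { this.vocab = vocab; this.k = k; }

    // M-step: MAP (Laplace-smoothed) estimates from (possibly soft) labels.
    void mStep(List<List<String>> docs, double[][] gamma) {
        prior = new double[k]; wordProb = new double[k][vocab.size()];
        for (int c = 0; c < k; c++) {
            double total = 0;
            for (int i = 0; i < docs.size(); i++) {
                prior[c] += gamma[i][c];
                for (String w : docs.get(i)) {
                    wordProb[c][vocab.indexOf(w)] += gamma[i][c];
                    total += gamma[i][c];
                }
            }
            prior[c] = (prior[c] + 1) / (docs.size() + k);
            for (int j = 0; j < vocab.size(); j++)
                wordProb[c][j] = (wordProb[c][j] + 1) / (total + vocab.size());
        }
    }

    // E-step: P(c|d) via Bayes' rule, computed in log space.
    double[] posterior(List<String> doc) {
        double[] logp = new double[k];
        for (int c = 0; c < k; c++) {
            logp[c] = Math.log(prior[c]);
            for (String w : doc) logp[c] += Math.log(wordProb[c][vocab.indexOf(w)]);
        }
        double max = Arrays.stream(logp).max().getAsDouble(), z = 0;
        double[] p = new double[k];
        for (int c = 0; c < k; c++) { p[c] = Math.exp(logp[c] - max); z += p[c]; }
        for (int c = 0; c < k; c++) p[c] /= z;
        return p;
    }

    int predict(List<String> doc) {
        double[] p = posterior(doc); int best = 0;
        for (int c = 1; c < k; c++) if (p[c] > p[best]) best = c;
        return best;
    }

    public static void main(String[] args) {
        List<String> vocab = List.of("india", "delhi", "chinese", "beijing");
        List<List<String>> labeled = List.of(
            List.of("india", "delhi"), List.of("chinese", "beijing"));
        int[] labels = {0, 1};
        List<List<String>> unlabeled = List.of(
            List.of("india", "india"), List.of("chinese", "chinese"));

        NaiveBayesEM nb = new NaiveBayesEM(vocab, 2);
        List<List<String>> all = new ArrayList<>(labeled); all.addAll(unlabeled);
        double[][] gamma = new double[all.size()][2];
        for (int i = 0; i < labeled.size(); i++) gamma[i][labels[i]] = 1;  // hard labels
        for (int i = labeled.size(); i < all.size(); i++) gamma[i] = new double[]{0.5, 0.5};
        nb.mStep(labeled, Arrays.copyOf(gamma, labeled.size()));  // initial classifier
        for (int iter = 0; iter < 5; iter++) {
            for (int i = labeled.size(); i < all.size(); i++)     // E-step
                gamma[i] = nb.posterior(all.get(i));
            nb.mStep(all, gamma);                                 // M-step
        }
        System.out.println(nb.predict(List.of("india")));         // 0 (the "india" class)
    }
}
```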
Implementation Setup

| No | Tool, Technology, Language | Version used |
|---|---|---|
| 1 | RapidMiner | 5.1.001 |
| 2 | Eclipse | Ganymede |
| 3 | Java | JDK 1.6 |

Dataset detail: we have used the 20 Newsgroups dataset [11].

| Training–testing split | Class 1: Religion | Class 2: Politics |
|---|---|---|
| Training set: no. of labeled documents | 10 to 600 | 10 to 600 |
| Training set: no. of unlabeled documents | 500 | 500 |
| Test set: no. of documents for testing | 100 | 100 |
1) Creating an operator:

```java
package com.rapidminer.operator.learner;

import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorDescription;

public class SemiSupervisedLarner extends Operator {

    public SemiSupervisedLarner(OperatorDescription description) {
        super(description);
    }
}
```
2) Adding ports to the operator for input and output:

```java
private InputPort labeledExampleSetInput = getInputPorts().createPort("labeled Documents");
private InputPort unLabeledExampleSetInput = getInputPorts().createPort("unLabeled Documents");
private InputPort testExampleSetInput = getInputPorts().createPort("test Documents");
private OutputPort exampleSetOutput = getOutputPorts().createPort("exampleset");
private OutputPort modelOutput = getOutputPorts().createPort("model");
```
3) Writing the logic for the implementation of the semi-supervised algorithm:

```java
public void doWork() throws OperatorException {
    ExampleSet labeledExampleSet = labeledExampleSetInput.getData();
    ExampleSet unLabeledExampleSet = unLabeledExampleSetInput.getData();
    ExampleSet testExampleSet = testExampleSetInput.getData();

    /* logic of the algorithm, i.e. call methods of the semi-supervised learning algorithm */

    modelOutput.deliver(model);
    exampleSetOutput.deliver(exampleSet);
}
```
```java
package com.rapidminer.operator.learner;

import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorDescription;

public class SemiSupervisedLarner extends Operator {

    private InputPort labeledExampleSetInput = getInputPorts().createPort("labeled Documents");
    private InputPort unLabeledExampleSetInput = getInputPorts().createPort("unLabeled Documents");
    private InputPort testExampleSetInput = getInputPorts().createPort("test Documents");
    private OutputPort exampleSetOutput = getOutputPorts().createPort("exampleset");
    private OutputPort modelOutput = getOutputPorts().createPort("model");

    public SemiSupervisedLarner(OperatorDescription description) {
        super(description);
    }

    public void doWork() throws OperatorException {
        ExampleSet labeledExampleSet = labeledExampleSetInput.getData();
        ExampleSet unLabeledExampleSet = unLabeledExampleSetInput.getData();
        ExampleSet testExampleSet = testExampleSetInput.getData();

        // logic of the algorithm, i.e. call methods of the semi-supervised learning algorithm

        modelOutput.deliver(model);
        exampleSetOutput.deliver(exampleSet);
    }
}
```
| No | Package | Class |
|---|---|---|
| 1 | com.rapidminer.operator.learner | SemiSupervisedAbstractLearner |
| 2.1 | com.rapidminer.operator.learner.bayes | SSNaiveBayes |
| 2.2 | com.rapidminer.operator.learner.bayes | SemiSupervisedDistributionModel |
| 3 | com.rapidminer.ExampleSet | ExampleSetUtils |

| No | Method Name | Description |
|---|---|---|
| 1 | doWork() | Takes input ExampleSets from the input ports, calls the learn method, and supplies output to the output ports. |
| 1 | learn() | Creates an instance of SemiSupervisedDistributionModel. Returns the learned model. |
| 2.2 | SemiSupervisedDistributionModel() | Constructor where all variables for the prior and posterior probabilities are initialized; all methods that perform semi-supervised learning are called from here. |
| 2.2 | update() | Finds the weight of each attribute (feature) in each class. |
| 2.2 | updateDistributionProperties() | Finds the posterior probability. |
| 2.2 | performPrediction() | Predicts the labels of all unlabeled documents using NB. Returns the PredictedExampleSet. |
| 2.2 | updateAfterPrediction() | Updates the weight of each attribute (feature) in each class. |
| 2.2 | updateDistributionPropertiesAfterPrediction() | Updates the posterior probability. |
| 2.2 | performTest() | Predicts the class labels of all documents in the test set and calculates the precision, recall, and accuracy of each class and the average accuracy. |
| 3 | merge() | Merges the labeled and the predicted unlabeled documents. Returns the merged ExampleSet. |
Performing Pre-processing in RapidMiner
SemiSupervised Operator Complete Classification Process
Analysis (Limitation)

The improvement in accuracy with SSL is small compared to the supervised NB learner, because some unlabeled samples are misclassified by the current classifier when the initial labeled samples are not sufficient [10], and these misclassified samples are directly considered for training.

Results show improvement in accuracy (accuracy in %):

| No of Labeled Documents | NB (Naïve Bayes) | SSL (Basic EM) |
|---|---|---|
| 20 | 0 | 22.38 |
| 40 | 47.91 | 47.91 |
| 60 | 20.00 | 44.86 |
| 80 | 40.00 | 46.38 |
| 100 | 44.44 | 40.00 |
| 200 | 48 | 47.38 |
| 400 | 49.96 | 46.79 |
| 600 | 45.23 | 60.28 |
Comparison of reference papers (NS = not specified):

| | [2] | [6] | [5] | [4] | [3] |
|---|---|---|---|---|---|
| Dataset used | 1) 20 Newsgroups 2) WebKB 3) Reuters | Chinese text | Chinese short text | Text documents from public forums on the Chinese internet | Reuters-21578 |
| Uniform distribution of dataset | NS | Yes | Yes | NS | No |
| Training, testing split | NS | NS | NS | 3/4, 1/4 | 2/3, 1/3 |
| Parameters compared for accuracy | 1) No of labeled documents vs. accuracy 2) No of unlabeled documents vs. accuracy | Times of iteration vs. Macro F1 | Times of iterations vs. Macro F1 | No of iterations vs. accuracy | Feature selection methods vs. accuracy |
| Measures of evaluation used | Accuracy | 1) Macro F1 2) New measure, IR = (IS − IL)/IL | NS | Macro F1 | Macro-average accuracy |
| Method used for initial distribution of EM | Naïve Bayesian | Naïve Bayesian | Naïve Bayesian | Random Subspace method | Naïve Bayesian |
| Feature selection method used | NS | TF-IDF in each iteration | Chi-square in each iteration | NS | DF × ICIF |
| Uses more than one classifier | No | Yes | No | Yes | No |
We have proposed an algorithm in [10] in which we consider the votes of both Naive Bayes and Support Vector Machine (SVM): only those unlabeled documents for which both NB and SVM predict the same label are considered in the next iteration, and the remaining unlabeled documents are discarded.
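The voting rule can be sketched generically in Java; the two predictors below are toy stand-ins, since any trained NB and SVM with a predict function would plug in the same way:

```java
import java.util.*;
import java.util.function.Function;

// Sketch of the agreement filter described above: an unlabeled document is
// kept for the next training iteration only when both classifiers agree on
// its label; disagreements are discarded.
public class AgreementFilter {
    static <D> Map<D, String> agreeingLabels(List<D> unlabeled,
            Function<D, String> nbPredict, Function<D, String> svmPredict) {
        Map<D, String> kept = new LinkedHashMap<>();
        for (D doc : unlabeled) {
            String nb = nbPredict.apply(doc), svm = svmPredict.apply(doc);
            if (nb.equals(svm)) kept.put(doc, nb);   // agreement: keep the label
            // disagreement: document is discarded for this iteration
        }
        return kept;
    }

    public static void main(String[] args) {
        Function<String, String> nb  = d -> d.contains("delhi") ? "India" : "China";
        Function<String, String> svm = d -> d.contains("beijing") ? "China" : "India";
        System.out.println(agreeingLabels(
            List.of("delhi news", "beijing news", "delhi beijing"), nb, svm));
        // {delhi news=India, beijing news=China}
    }
}
```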
This improved algorithm is also implemented in RapidMiner
as an extension. It gives better accuracy as compared to the
standard SSL algorithm for the same dataset [12].
Semi-supervised learning with EM can be used effectively to improve the performance of text classification when a limited number of labeled documents is available for training, and it is implemented in RapidMiner as an extension.

Our future goals are to implement in RapidMiner other variants of the SSL algorithm proposed by different researchers [7], in order to overcome the limitations of the classic EM-based SSL algorithm, and to perform experiments on real-time datasets such as SMS and e-mail.
THANK YOU
[1] Kamal Nigam, Andrew Kachites McCallum, "Text Classification from Labeled and Unlabeled Data using EM",
Machine Learning, Kluwer Academic Publishers, 2002.
[2] Xiaojin Zhu, “Semi-Supervised Learning Literature Survey”, Computer Sciences TR 1530, University of
Wisconsin – Madison, 2005.
[3] Wen Han, Xiao Nan-feng, “An Enhanced EM Method of Semi-supervised Classification Based on Naive
Bayesian”, Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 15- Sep-
2011.
[4] YueHong Cai, Qian Zhu; “Semi-Supervised Short Text Categorization based on Random Subspace”-
Computer Science and Information Technology (ICCSIT), 3rd IEEE International Conference on Page(s): 470
– 473 , 2010.
[5] Xinghua Fan, Zhiyi Guo; “A semi-supervised Text Classification Method based on Incremental EM
Algorithm”, WASE International Conference on Information Engineering, Page(s): 211 - 214, 2010.
[6] Xinghua Fan, Zhiyi Guo, Houfeng Ma. “An improved EM-based Semi-supervised Learning Method”
,International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing, page(s): 529 -
532, August - 2009.
[7] Purvi Rekh, Amit Thakkar, Amit Ganatra, “A Survey and Comparative analysis of Expectation Maximization
based Semi-Supervised Text Classification”, International Journal of Engineering and Advanced Technology,
Vol 1, Issue- 3, page(s): 141 - 146, February – 2012.
[8] Zhu, Xiaojin. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences,
University of Wisconsin-Madison, 2008.
[9] "How to Extend RapidMiner 5.0" (Approaching Vega: The Final Descent).
[10] Purvi Rekh, Amit Thakkar, Amit Ganatra, "An Improved Expectation Maximization based Semi-Supervised
Text Classification using Naïve Bayes and Support Vector Machine”, CiiT International Journal of Artificial
Intelligent Systems and Machine Learning, May -2012.
[11] Twenty News Group Data Set:
http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
[12] Purvi Rekh, Amit Thakkar, "Semi-Supervised Text Classification using Naïve Bayes and Support Vector
Machine", Second International Conference on Emerging Research in Computing, Information,
Communication and Applications, in press with Elsevier proceedings, 2014.