Comparison of Feature Selection Methods for Web Classification
TRANSCRIPT
7/31/2019
-
Comparison of Feature Selection Methods for Web
Classification
Hakan Özpalamutcu
May 2012
-
Feature Selection
Prepares data for data mining and machine learning.
Commonly used on high-dimensional data.
Studies how to select a subset or list of attributes or variables that are used to construct models describing the data.
Purposes include reducing dimensionality and
removing irrelevant and redundant features.
-
Feature Selection for Classification
Select, among a set of variables, the smallest subset that maximizes classification performance.
Given a set of predictor features and a class/category, find the
minimum feature set that achieves maximum classification performance.
-
Why Is Feature Selection Important?
May improve the performance of the classification algorithm.
The classification algorithm may not scale up to
the size of the full feature set, either in samples or in time.
Allows us to better understand the domain.
-
Comparison Steps
Choosing the data set
Preprocessing the data
Converting the data to WEKA format
Applying feature selection methods to the data
Applying classification to the data
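The steps above can be sketched end to end. This is an illustrative scikit-learn pipeline on a made-up two-class corpus, not the WEKA workflow the slides actually used:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy two-class corpus standing in for course vs. faculty pages.
docs = [
    "course syllabus homework exam schedule",
    "course lecture notes homework grading",
    "faculty research publications office hours",
    "faculty professor research group teaching",
]
labels = [0, 0, 1, 1]

# vectorize -> select features -> classify, mirroring the slide's steps
pipe = Pipeline([
    ("vectorize", TfidfVectorizer()),
    ("select", SelectKBest(chi2, k=5)),
    ("classify", MultinomialNB()),
])
pipe.fit(docs, labels)
print(pipe.predict(["new course homework due"]))
```

Each pipeline stage corresponds to one comparison step; swapping the `select` stage is how different feature selection methods would be compared.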
-
Choosing the data set

Category     # of Documents
Course          927
Department      140
Faculty        1124
Other          3761
Project         504
Staff           137
Student        1640

Category   # of Positive Documents   # of Negative Documents
Course     100                       25
Faculty    100                       25
-
Preprocessing of Data
Removal of HTML tags
Removal of punctuation characters and numeric values
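A minimal sketch of this preprocessing step, assuming a simple regex-based approach (a real HTML parser would be more robust against malformed markup):

```python
import re

def preprocess(html: str) -> str:
    """Strip HTML tags, punctuation, and numeric values from a page."""
    text = re.sub(r"<[^>]+>", " ", html)       # drop HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # drop punctuation and digits
    return re.sub(r"\s+", " ", text).strip().lower()

print(preprocess("<h1>CS 101</h1><p>Course syllabus, Fall 2011!</p>"))
# -> "cs course syllabus fall"
```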
-
Converting the data to WEKA format
The Text2arff tool is used
Stopwords are removed
Minimum frequency is 100
Frequency is calculated using the tf-idf scheme

Category   # of Initial Attributes
Course     76
Faculty    93
-
Applying feature selection methods to data

Attribute Evaluators
CfsSubsetEval
ConsistencySubsetEval
ClassifierSubsetEval

Search Methods
GeneticSearch
BestFirst
RankSearch

Attribute evaluator      Search method
CfsSubsetEval            GeneticSearch
CfsSubsetEval            BestFirst
ConsistencySubsetEval    RankSearch
ConsistencySubsetEval    BestFirst
ClassifierSubsetEval     GeneticSearch
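WEKA's evaluator/search split has rough analogues elsewhere. As one hedged example, a wrapper-style search (loosely analogous to ClassifierSubsetEval with a BestFirst-like greedy forward search, without backtracking) can be sketched with scikit-learn's `SequentialFeatureSelector`:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Greedy forward search scored by the classifier itself:
# each step adds the feature that most improves cross-validated accuracy.
selector = SequentialFeatureSelector(
    GaussianNB(), n_features_to_select=2, direction="forward", cv=3
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask over the 4 iris features
```

Filter evaluators like CfsSubsetEval instead score subsets without running a classifier, which is why they pair naturally with broader searches such as GeneticSearch.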
-
Applying feature selection methods to data

Category  Attribute evaluator    Search method  # of Features Selected  Selected Features
Course    CfsSubsetEval          GeneticSearch  18  3,6,9,13,14,19,34,40,42,43,45,48,49,52,59,65,69,70
Course    CfsSubsetEval          BestFirst      12  3,6,13,18,19,42,43,45,48,64,67,70
Course    ConsistencySubsetEval  RankSearch     12  6,14,18,19,40,42,43,45,48,64,67,70
Course    ConsistencySubsetEval  BestFirst       7  3,6,13,19,42,48,70
Course    ClassifierSubsetEval   GeneticSearch   6  2,27,31,33,73,75

Category  Attribute evaluator    Search method  # of Features Selected  Selected Features
Faculty   CfsSubsetEval          GeneticSearch  20  2,6,12,16,19,27,42,43,53,56,58,61,65,67,73,74,76,84,90,92
Faculty   CfsSubsetEval          BestFirst       3  16,43,74
Faculty   ConsistencySubsetEval  RankSearch      3  16,43,74
Faculty   ConsistencySubsetEval  BestFirst       3  16,43,74
Faculty   ClassifierSubsetEval   GeneticSearch  10  1,3,35,47,49,59,64,67,81,90
-
Applying classification to data

Classifiers
Naive Bayes (bayes): class for a Naive Bayes classifier using estimator classes
Bagging (meta): class for bagging a classifier to reduce variance; can do classification or regression, depending on the base learner
J48 (trees): class for generating a pruned or unpruned C4.5 decision tree
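The same three classifier families can be compared in scikit-learn. Note this is only an analogue: `DecisionTreeClassifier` is a CART-style tree, not an exact port of WEKA's J48/C4.5, and the toy iris data stands in for the web page features:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision tree (CART, stand-in for J48)": DecisionTreeClassifier(random_state=0),
    "Bagging (tree base learner)": BaggingClassifier(
        DecisionTreeClassifier(random_state=0), random_state=0
    ),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each classifier
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean accuracy {results[name]:.3f}")
```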
-
Results

Quality measures
CCI: correctly classified instances
F-measure
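Both measures are easy to compute from predictions. A sketch using scikit-learn on made-up labels (WEKA reports the same quantities; CCI is simply the raw count of correct predictions):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# normalize=False returns the raw count of correct predictions,
# i.e. WEKA's "correctly classified instances" (CCI).
cci = accuracy_score(y_true, y_pred, normalize=False)

# F-measure: harmonic mean of precision and recall on the positive class.
f = f1_score(y_true, y_pred)

print(cci, round(f, 3))  # -> 6 0.75
```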
-
Results

Before applying feature selection

            Naive Bayes        J48                Bagging
Category    CCI   F-Measure    CCI   F-Measure    CCI   F-Measure
Course      106   0.857        121   0.967        107   0.821

After applying feature selection

CATEGORY: COURSE
Feature Selection                          Naive Bayes        J48                Bagging
Attribute evaluator    Search method       CCI   F-Measure    CCI   F-Measure    CCI   F-Measure
CfsSubsetEval          GeneticSearch       106   0.853        120   0.959        104   0.779
CfsSubsetEval          BestFirst           108   0.867        118   0.938        112   0.884
ConsistencySubsetEval  RankSearch          100   0.801        118   0.938        112   0.881
ConsistencySubsetEval  BestFirst           105   0.840        109   0.855        110   0.863
ClassifierSubsetEval   GeneticSearch       104   0.813        119   0.947        105   0.794
-
Results

Before applying feature selection

            Naive Bayes        J48                Bagging
Category    CCI   F-Measure    CCI   F-Measure    CCI   F-Measure
Faculty     99    0.811        121   0.967        114   0.902

After applying feature selection

CATEGORY: FACULTY
Feature Selection                          Naive Bayes        J48                Bagging
Attribute evaluator    Search method       CCI   F-Measure    CCI   F-Measure    CCI   F-Measure
CfsSubsetEval          GeneticSearch       101   0.815        119   0.951        112   0.898
CfsSubsetEval          BestFirst           104   0.808        105   0.802        107   0.821
ConsistencySubsetEval  RankSearch          104   0.808        105   0.802        107   0.821
ConsistencySubsetEval  BestFirst           104   0.808        105   0.802        107   0.821
ClassifierSubsetEval   GeneticSearch       92    0.750        105   0.794        107   0.821