DATA MINING FINAL PRESENTATION
HUY VU – HIEU TRAN – THONG NGUYEN
DATASET: STACKEXCHANGE
• ~100 Q&A WEBSITES
• COLLECTIONS OF QUESTIONS POSTED ON THOSE WEBSITES
MOTIVATION
• DECIDE THE CATEGORY OF A POST BASED ON THE PREDICTED CATEGORIES OF ITS TAGS.
[Diagram: Tag 1, Tag 2, Tag 3 → Classifier → Tag 1 Pred., Tag 2 Pred., Tag 3 Pred. → Post’s Category]
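A minimal sketch of that voting step, assuming some already-trained per-tag classifier (the classifyTag stub below is a placeholder, not the team's model):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TagVoting {
    // Placeholder: replace with a trained tag classifier's prediction.
    static int classifyTag(String tag) {
        return Math.floorMod(tag.hashCode(), 2);
    }

    // Majority vote over the predicted category of each tag
    // (assumes the post has at least one tag).
    static int predictPostCategory(List<String> tags) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (String tag : tags) {
            votes.merge(classifyTag(tag), 1, Integer::sum);
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }
}
```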
GOAL
• BUILD MODELS TO CLASSIFY A QUESTION INTO 1 OF ~100 CATEGORIES
• BASED ONLY ON THE QUESTION ITSELF (TEXT, TITLE, TAGS)
APPROACH
• Huy: Logistic Regression, Neural Network, Perceptron (OVA & AVA)
• Hieu: Bayesian Network, Naïve Bayes Classifier
• Thong: Decision Tree, Random Forest
SCORING METRICS
• MODELS ARE EVALUATED BASED ON ACCURACY (THE FRACTION OF INSTANCES CLASSIFIED CORRECTLY)
• TRAIN EFFICIENCY: ACCURACY OF CLASSIFICATION ON TRAINING SET
• TEST EFFICIENCY: ACCURACY OF CLASSIFICATION ON TEST SET
PREPROCESSING
• RAW DATA IN XML FORMAT
• DATA IMPORTED INTO MONGODB IN JSON FORMAT
• DATA EXPORTED FROM MONGODB IN CSV FORMAT
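The slides route the data through MongoDB; purely as an illustration of the same extraction, here is a minimal Java (StAX) sketch that reads a StackExchange Posts.xml dump, where each post is a `<row>` element, straight into CSV. File names are assumptions, and the CSV quoting is deliberately naive:

```java
import java.io.FileInputStream;
import java.io.PrintWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PostsXmlToCsv {
    public static void main(String[] args) throws Exception {
        XMLStreamReader xml = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("Posts.xml"));
        try (PrintWriter csv = new PrintWriter("posts.csv")) {
            csv.println("Id,Tags,Title");
            while (xml.hasNext()) {
                // Every post in a StackExchange dump is one <row ...6/> element.
                if (xml.next() == XMLStreamConstants.START_ELEMENT
                        && "row".equals(xml.getLocalName())) {
                    String id = xml.getAttributeValue(null, "Id");
                    String tags = xml.getAttributeValue(null, "Tags");
                    String title = xml.getAttributeValue(null, "Title");
                    // Naive quoting: real titles may contain quotes/commas.
                    csv.printf("%s,\"%s\",\"%s\"%n",
                            id, tags == null ? "" : tags,
                            title == null ? "" : title);
                }
            }
        }
    }
}
```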
FEATURE ENGINEERING
• APPROACH 1: USE ONLY TAGS AS FEATURES
• APPROACH 2: VECTORIZE ALL TAGS, TEXT, AND TITLE
APPROACH 1: USE ONLY TAGS AS FEATURES
• FEATURE: TAG
• TARGET: WEBSITE
• EACH TAG IS GIVEN A NUMERICAL ID
DATA CLEANING STRATEGY
• MULTIPLE TAGS ON A POST? SPLIT INTO ONE RECORD PER TAG
• NORMALIZE DATA: SHIFT INPUT VALUES TO AN APPROXIMATE MEAN OF 0
PYTHON TO CLEAN DATA
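The slide above showed the team's Python cleaning script; to keep the examples in one language, here is a rough Java equivalent of the two steps above (split multi-tag posts into one record per tag, then shift tag IDs toward zero mean). The record layout is an assumption:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CleanTags {
    // Give each distinct tag a stable numerical id.
    static Map<String, Integer> tagIds = new LinkedHashMap<>();

    static int idOf(String tag) {
        return tagIds.computeIfAbsent(tag, t -> tagIds.size());
    }

    // One (tagId, websiteLabel) record per tag of a post.
    // StackExchange stores tags as "<tag1><tag2>...".
    static List<double[]> split(String rawTags, int websiteLabel) {
        List<double[]> records = new ArrayList<>();
        for (String tag : rawTags.replace(">", "").split("<")) {
            if (!tag.isEmpty()) {
                records.add(new double[] { idOf(tag), websiteLabel });
            }
        }
        return records;
    }

    // Shift tag ids so the input values have approximately mean 0,
    // as the cleaning slide describes.
    static void normalize(List<double[]> records) {
        double mean = records.stream().mapToDouble(r -> r[0]).average().orElse(0);
        for (double[] r : records) r[0] -= mean;
    }
}
```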
BINARY MODELS
• TRAIN ON ONLY A SUBSET OF THE DATA
• ONLY RECORDS ASSOCIATED WITH THE FIRST 2 WEBSITES → 2 LABELS
ADAPTIVE LOGISTIC REGRESSION
• JAVA PROGRAM UNDER MAVEN, RUNNING IN ECLIPSE, USING MAHOUT
• RATIO TRAIN : TEST ≈ 4 : 1
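A sketch of that training loop, assuming Mahout 0.9's SGD classes; the five priors in the results table below map to Mahout's L1, L2, UniformPrior, TPrior, and ElasticBandPrior. The data shapes here are illustrative only:

```java
import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
import org.apache.mahout.classifier.sgd.CrossFoldLearner;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class TagLogistic {
    public static void main(String[] args) {
        int numFeatures = 2;   // bias + normalized tag id
        // Swap L1 for L2, UniformPrior, TPrior, or ElasticBandPrior
        // to reproduce the five columns of the results table.
        AdaptiveLogisticRegression alr =
                new AdaptiveLogisticRegression(2, numFeatures, new L1());

        double[][] train = { { 0, -1.3 }, { 1, 0.7 } };  // toy rows: label, tagId
        for (double[] row : train) {
            Vector v = new RandomAccessSparseVector(numFeatures);
            v.set(0, 1.0);      // bias term
            v.set(1, row[1]);   // normalized tag id
            alr.train((int) row[0], v);
        }
        alr.close();

        // With real data (thousands of rows) getBest() returns the
        // best learner found so far; on toy data it can still be null.
        CrossFoldLearner best = alr.getBest().getPayload().getLearner();
        Vector probe = new RandomAccessSparseVector(numFeatures);
        probe.set(0, 1.0);
        probe.set(1, 0.7);
        System.out.println("P(class 1) = " + best.classifyScalar(probe));
    }
}
```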
PROBLEM: IMBALANCED DATA
• THERE ARE ~9,000 RECORDS OF CLASS 1
• THERE ARE ~54,000 RECORDS OF CLASS 2
• VERY IMBALANCED, CAUSING THE MODEL TO ALWAYS PREDICT CLASS 2
→ DATA SELECTION: RANDOMLY CHOSE ~9,000 OUT OF THE ~54,000 CLASS-2 RECORDS
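A generic sketch of that random downsampling (names illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class Downsample {
    // Keep all minority-class rows; keep only a random sample of the
    // majority class, the same size as the minority class.
    static <T> List<T> balance(List<T> minority, List<T> majority, long seed) {
        List<T> sample = new ArrayList<>(majority);
        Collections.shuffle(sample, new Random(seed));
        List<T> balanced = new ArrayList<>(minority);
        balanced.addAll(sample.subList(0, minority.size()));
        return balanced;
    }
}
```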
RESULT (FROM DIFFERENT PRIOR FUNCTIONS)

                      L1        L2        UP        TP        EBP
Train (imbalanced)    45.71%    52.86%    55.11%    56.31%    44.83%
Test (imbalanced)     46.62%    51.08%    55.12%    55.74%    49.46%
Train (balanced)      74.24%    80.28%    86.23%    82.94%    77.24%
Test (balanced)       78.46%    81.10%    81.68%    79.52%    75.10%

L1 = Laplacian, L2 = Gaussian, UP = Uniform, TP = T-distribution, EBP = Elastic Band
NEURAL NETWORK
• JAVA PROGRAM UNDER MAVEN, RUNNING IN ECLIPSE, USING MAHOUT
• “MINUS_SQUARED” AS COST FUNCTION
• 3 LAYERS: INPUT, MIDDLE, OUTPUT
• RATIO TRAIN : TEST ≈ 4 : 1
2 ACTIVATION FUNCTIONS SUPPORTED

              Train (imbal.)   Test (imbal.)   Train (bal.)   Test (bal.)
“SIGMOID”     84.75%           84.90%          96.82%         94.30%
“IDENTITY”    32.56%           30.78%          40.83%         38.92%
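A sketch of how such a network might be assembled, assuming Mahout 0.9's org.apache.mahout.classifier.mlp.MultilayerPerceptron, whose string-named squashing functions ("Sigmoid", "Identity") and "Minus_Squared" cost function match the slides. Layer sizes and the toy instance are illustrative:

```java
import org.apache.mahout.classifier.mlp.MultilayerPerceptron;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class TagMlp {
    public static void main(String[] args) {
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        // 3 layers as on the slide: input, middle, output.
        mlp.addLayer(2, false, "Sigmoid");  // swap "Identity" here to
        mlp.addLayer(4, false, "Sigmoid");  // reproduce the second row
        mlp.addLayer(1, true, "Sigmoid");   // of the table above
        mlp.setCostFunction("Minus_Squared");
        mlp.setLearningRate(0.1);

        // trainOnline expects the features followed by the label.
        mlp.trainOnline(new DenseVector(new double[] { 1.0, -1.3, 0.0 }));

        Vector out = mlp.getOutput(new DenseVector(new double[] { 1.0, -1.3 }));
        System.out.println("score = " + out.get(0));
    }
}
```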
PROBLEMS WITH MULTICLASS CLASSIFICATION
• SIGNIFICANT DROP IN EFFICIENCY WHEN TRAINING MULTICLASS MODELS
• EX: ADAPTIVE LOGISTIC REGRESSION DROPS ~20% ON THE FULL DATA, ~30% ON THE BALANCED DATA
→ USE A DIFFERENT APPROACH TO FEATURE ENGINEERING
RANDOM FOREST
• ALTERNATING DECISION TREE
• BEST FIRST DECISION TREE
ALTERNATING DECISION TREE
LEGEND: -VE = 1, +VE = 2
• TREE SIZE (TOTAL NUMBER OF NODES): 31
• LEAVES (NUMBER OF PREDICTOR NODES): 21
• CORRECTLY CLASSIFIED INSTANCES: 16,001 (99.5582%)
• INCORRECTLY CLASSIFIED INSTANCES: 71 (0.4418%)
• KAPPA STATISTIC: 0.983
• MEAN ABSOLUTE ERROR: 0.0118
• ROOT MEAN SQUARED ERROR: 0.0581
• RELATIVE ABSOLUTE ERROR: 4.5522%
• ROOT RELATIVE SQUARED ERROR: 16.158%
• TOTAL NUMBER OF INSTANCES: 16,072
BEST FIRST DECISION TREE
• SIZE OF THE TREE: 37
• NUMBER OF LEAF NODES: 19
• CORRECTLY CLASSIFIED INSTANCES: 16,049 (99.8569%)
• INCORRECTLY CLASSIFIED INSTANCES: 23 (0.1431%)
• KAPPA STATISTIC: 0.9945
• MEAN ABSOLUTE ERROR: 0.0028
• ROOT MEAN SQUARED ERROR: 0.0368
• RELATIVE ABSOLUTE ERROR: 1.0984%
• ROOT RELATIVE SQUARED ERROR: 10.2226%
• TOTAL NUMBER OF INSTANCES: 16,072
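The statistics above are Weka's standard Evaluation summary. A sketch of how they are produced, assuming a Weka 3.6-era classpath where BFTree and ADTree ship in weka.classifiers.trees (in Weka 3.7 they are add-on packages); the input file name is hypothetical:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.BFTree;   // ADTree is used the same way
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TreeEval {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("tags.arff");  // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        BFTree tree = new BFTree();
        tree.buildClassifier(data);

        // Evaluating on the training data yields the kind of summary
        // shown on the slides (correct/incorrect counts, kappa,
        // MAE, RMSE, relative errors, total instances).
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);
        System.out.println(eval.toSummaryString());
    }
}
```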
NAÏVE BAYES - BINARY CLASSIFIER (TRAINING SET RESULT)
NAÏVE BAYES - MULTICLASS CLASSIFIER
The number of correctly classified instances drops sharply!
REASON?
• IMBALANCED DATA
TEST TIME!
The category of this post should be 2.
Through the voting system, the post’s predicted category is indeed 2!
APPROACH 2: USE THE ENTIRE QUESTION AS FEATURES
• 3 SOURCES: TAGS, TEXT, TITLE
• PREPROCESSING: LOAD ALL 3 ATTRIBUTES AS TEXT AND FEED THEM INTO THE MODEL
WEKA – VECTORIZE STRING
• APPLY THE STRINGTOWORDVECTOR FILTER
• USE A STOPWORD LIST TO REMOVE COMMON WORDS
• LARGE NUMBER OF USELESS NUMERIC TOKENS → USE THE ALPHABETICTOKENIZER
• USE THE INFOGAIN ATTRIBUTE EVALUATOR TO LIMIT THE NUMBER OF ATTRIBUTES (see the sketch below)
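A sketch of that filter chain, assuming a Weka 3.6-era API (later 3.7 releases replace setUseStoplist with setStopwordsHandler); the input file name is hypothetical, and 658 is the attribute count reported later in the slides:

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.tokenizers.AlphabeticTokenizer;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class Vectorize {
    public static void main(String[] args) throws Exception {
        Instances raw = DataSource.read("questions.arff"); // hypothetical file
        raw.setClassIndex(raw.numAttributes() - 1);

        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setTokenizer(new AlphabeticTokenizer()); // drops numeric tokens
        s2wv.setUseStoplist(true);                    // remove common words
        s2wv.setInputFormat(raw);
        Instances vectorized = Filter.useFilter(raw, s2wv);

        // Keep only the most informative attributes by information gain.
        AttributeSelection select = new AttributeSelection();
        select.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(658);
        select.setSearch(ranker);
        select.setInputFormat(vectorized);
        Instances reduced = Filter.useFilter(vectorized, select);
        System.out.println(reduced.numAttributes() + " attributes kept");
    }
}
```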
OVA MODELS (MAHOUT)
• VECTORIZE TEXT, TAGS, TITLE
• # FEATURES = VECTOR SIZE (30, 50, 100)
• # PERCEPTRONS = # CATEGORIES
• TRAINED WITH 5, 10, AND 50 CLASSES (see the sketch below)
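The slides do not show the perceptron implementation itself; a minimal plain-Java sketch of the one-versus-all scheme (one binary perceptron per category, highest score wins) might look like this:

```java
public class OvaPerceptron {
    final double[][] w;   // one weight vector per category

    OvaPerceptron(int numCategories, int numFeatures) {
        w = new double[numCategories][numFeatures];
    }

    // Standard perceptron update: each binary learner treats its own
    // category as +1 and every other category as -1.
    void train(double[] x, int label, double rate) {
        for (int c = 0; c < w.length; c++) {
            int y = (c == label) ? 1 : -1;
            if (y * dot(w[c], x) <= 0) {
                for (int j = 0; j < x.length; j++) w[c][j] += rate * y * x[j];
            }
        }
    }

    // Predict the category whose perceptron scores highest.
    int predict(double[] x) {
        int best = 0;
        for (int c = 1; c < w.length; c++) {
            if (dot(w[c], x) > dot(w[best], x)) best = c;
        }
        return best;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
}
```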
IMPORTANT: STRUCTURE FOR THE NEURAL NET
• PREVIOUSLY, WITH FEW FEATURES, ONLY 1 INTERMEDIATE LAYER WAS NEEDED
• NOW, WITH 100 FEATURES, MORE INTERMEDIATE LAYERS SHOULD BE USED TO REDUCE THE DIMENSION SMOOTHLY FROM 100 (INPUT) TO 1 (OUTPUT)
OVA RESULT (TEST SET ONLY)
                5-class    10-class    50-class
30 features     71.54%     43.28%      25.09%
50 features     76.14%     46.52%      20.10%
100 features    74.29%     47.92%      26.02%
AVA PERCEPTRON MODELS
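The AVA (all-versus-all) scheme trains one binary perceptron per pair of categories and lets them vote. A minimal plain-Java sketch, reusing the dot helper from the OVA sketch above (the details are an illustration, not the team's exact code):

```java
public class AvaPerceptron {
    final double[][][] w;   // w[i][j] separates class i (+1) from class j (-1), i < j
    final int k;

    AvaPerceptron(int numCategories, int numFeatures) {
        k = numCategories;
        w = new double[k][k][numFeatures];
    }

    // Train only the k-1 pairwise learners that involve the true label.
    void train(double[] x, int label, double rate) {
        for (int other = 0; other < k; other++) {
            if (other == label) continue;
            int i = Math.min(label, other), j = Math.max(label, other);
            int y = (label == i) ? 1 : -1;
            if (y * OvaPerceptron.dot(w[i][j], x) <= 0) {
                for (int f = 0; f < x.length; f++) w[i][j][f] += rate * y * x[f];
            }
        }
    }

    // Each of the k(k-1)/2 learners casts one vote; most votes wins.
    int predict(double[] x) {
        int[] votes = new int[k];
        for (int i = 0; i < k; i++) {
            for (int j = i + 1; j < k; j++) {
                votes[OvaPerceptron.dot(w[i][j], x) > 0 ? i : j]++;
            }
        }
        int best = 0;
        for (int c = 1; c < k; c++) if (votes[c] > votes[best]) best = c;
        return best;
    }
}
```

The quadratic number of pairwise learners explains the slide's note that AVA runs longer than OVA.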
AVA RESULT (TEST SET ONLY)
                5-class    10-class    50-class
30 features     91.02%     80.82%      55.23%
50 features     95.10%     81.57%      54.88%
100 features    92.09%     81.36%      56.18%
After vectorizing:
• 658 attributes are kept
• Ready to use!
NAIVEBAYES
EVALUATION MEASURES:
• Micro-averaging: 0.89707
• Macro-averaging: 0.85
(Micro-averaging pools every instance before scoring, so large classes dominate; macro-averaging averages the per-class scores, weighting all classes equally.)
NAIVEBAYESMULTINOMIALTEXT
• ONLY AVAILABLE IN WEKA 3.7.11
• AUTOMATICALLY VECTORIZES THE TEXT AND APPLIES THE MULTINOMIAL NAIVE BAYES ALGORITHM
EVALUATION MEASURES:
• Micro-averaging: 0.9177
• Macro-averaging: 0.739
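A sketch of using that classifier, assuming Weka 3.7.11's weka.classifiers.bayes.NaiveBayesMultinomialText and a hypothetical ARFF file whose text fields are string attributes:

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomialText;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MultinomialText {
    public static void main(String[] args) throws Exception {
        // String attributes stay as-is: this classifier tokenizes
        // internally, so no StringToWordVector pass is needed.
        Instances data = DataSource.read("questions.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayesMultinomialText nb = new NaiveBayesMultinomialText();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```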
CONCLUSION
• DIFFERENT ML ALGORITHMS BEHAVE VERY DIFFERENTLY ON THE SAME DATA
• EVEN WITHIN ONE ALGORITHM, CHOOSING THE RIGHT COST FUNCTION IS CRUCIAL
• GENERALLY, WEKA PERFORMED SLIGHTLY BETTER THAN MAHOUT
• THE AVA PERCEPTRON OUTPERFORMS OVA (DESPITE A LONGER RUNNING TIME)
QUESTIONS?