Intelligent Database Systems Lab
N.Y.U.S.T. I. M.
BNS Feature Scaling: An Improved Representation
over TF·IDF for SVM Text Classification
Presenter: Lin, Shu-Han
Author: George Forman (Hewlett-Packard Labs)
Conference on Information and Knowledge Management (CIKM) (2009)
Outline
Motivation
Objective
Methodology
Experiments
Conclusion
Comments
Motivation
Multi-class classification: a 1-of-n problem, e.g., choosing a topic category among A, B, C, D.
Binary classification: a 1-of-2 problem.
So every multi-class problem can be decomposed into many binary classification problems: the positive/negative problem, with one class treated as positive and the others as negative.
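The decomposition above can be sketched as follows. This is a minimal illustration that assumes a one-vs-rest scheme (each class in turn is positive, all others negative); the function name `one_vs_rest` and the sample labels are hypothetical and do not come from the slides.

```python
def one_vs_rest(labels):
    """Map each distinct class to a binary label vector (1 = positive, 0 = negative)."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


# Hypothetical topic labels for five documents.
labels = ["A", "B", "C", "D", "A"]
binary_problems = one_vs_rest(labels)

# One binary problem per class; for class "A" documents 0 and 4 are positive.
print(binary_problems["A"])  # [1, 0, 0, 0, 1]
```

Each of the resulting label vectors defines one positive/negative problem that a binary classifier can be trained on.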
Motivation
Feature “scaling” = “weighting” = “scoring”
The ‘TF·IDF’ representation:
IDF is oblivious to the class labels
It scales some features inappropriately

Feature   Positive (100)   Negative (900)   IDF
X         80 (80%)         0 (0%)           log(1000/80) = 1.1
Y         8 (8%)           0 (0%)           log(1000/8) = 2.1

X is the stronger indicator of the positive class, yet IDF gives Y the larger weight.
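The IDF values in the table can be reproduced with a short sketch. It assumes a base-10 logarithm of (total documents / documents containing the feature), which is inferred from the numbers on the slide rather than stated there; the function name `idf` is illustrative.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency, assuming log base 10 (inferred from the slide)."""
    return math.log10(n_docs / doc_freq)


n_docs = 100 + 900          # 100 positive + 900 negative documents
idf_x = idf(n_docs, 80)     # X occurs in 80 documents, all positive
idf_y = idf(n_docs, 8)      # Y occurs in 8 documents, all positive

print(round(idf_x, 1), round(idf_y, 1))  # 1.1 2.1
```

Note that the class labels never enter the computation: only the total document frequency does, which is why the rarer feature Y ends up with the larger weight.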
Objectives
Maximize classification “performance”
Feature selection
Feature scaling: make the numeric range greater for more predictive features
Predictive: 100% positive, 0% negative; or 0% positive, 100% negative
Methodology – Feature Scoring Metrics
$F^{-1}$: the inverse normal cumulative distribution function
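As a reference point, Bi-Normal Separation (BNS) is usually defined as $|F^{-1}(tpr) - F^{-1}(fpr)|$, where tpr and fpr are the fractions of positive and negative documents that contain the feature. The sketch below follows that definition. The clipping bound `eps = 0.0005`, which keeps $F^{-1}$ finite at rates of 0 or 1, is an assumption here, as are the function and parameter names. Values from this sketch need not match the BNS column tabulated later in the deck, since the slides do not state how those numbers were computed.

```python
from statistics import NormalDist


def bns(tp, fp, n_pos, n_neg, eps=0.0005):
    """Bi-Normal Separation: |F^-1(tpr) - F^-1(fpr)|.

    tp / fp: number of positive / negative documents containing the feature.
    eps: assumed clipping bound so the inverse CDF stays finite.
    """
    inv_cdf = NormalDist().inv_cdf          # inverse of the standard normal CDF
    tpr = min(max(tp / n_pos, eps), 1 - eps)
    fpr = min(max(fp / n_neg, eps), 1 - eps)
    return abs(inv_cdf(tpr) - inv_cdf(fpr))


# A feature equally common in both classes carries no signal.
print(bns(15, 200, n_pos=30, n_neg=400))  # 0.0
# A feature present in every positive and no negative document scores highest.
print(bns(30, 0, n_pos=30, n_neg=400) > bns(2, 0, n_pos=30, n_neg=400))  # True
```

Unlike IDF, both class-conditional rates enter the score, so the weight reflects how well the feature separates the two classes.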
Feature   Positive (30)   Negative (400)   BNS    IDF    LOR     IG
Italy     30 (100%)       0                3.29   1.16   4.68    0.37
x         2 (7%)          0                0.14   2.33   1.76    0.02
patient   30 (100%)       400 (100%)       0.00   0.00   0.00    0.00
cost      0               400 (100%)       3.29   0.03   -4.68   0.37
y         15 (50%)        200 (50%)        0.00   0.30   0.00    0.00
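The IDF and IG columns of this table appear consistent with two standard definitions: IDF as a base-10 logarithm of (total documents / documents containing the feature), and IG as information gain measured in bits. Both are inferred from the numbers, not stated on the slide, so the sketch below is a check under those assumptions. The BNS and LOR columns are not recomputed because the slide does not say how they were derived. All function names are illustrative.

```python
import math

N_POS, N_NEG = 30, 400  # class sizes from the table header


def entropy(pos, neg):
    """Binary entropy in bits of a (pos, neg) count pair; 0 for an empty set."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def info_gain(tp, fp):
    """Class entropy minus entropy conditioned on feature presence/absence."""
    n = N_POS + N_NEG
    present = tp + fp
    absent = n - present
    conditional = (present / n) * entropy(tp, fp) \
        + (absent / n) * entropy(N_POS - tp, N_NEG - fp)
    return entropy(N_POS, N_NEG) - conditional


def idf(tp, fp):
    """IDF with an assumed base-10 logarithm."""
    return math.log10((N_POS + N_NEG) / (tp + fp))


# (positive documents with the feature, negative documents with the feature)
rows = {"Italy": (30, 0), "x": (2, 0), "patient": (30, 400),
        "cost": (0, 400), "y": (15, 200)}
for name, (tp, fp) in rows.items():
    print(f"{name:8s} IDF={idf(tp, fp):.2f}  IG={info_gain(tp, fp):.2f}")
```

Under these assumptions the printed values agree with the table to two decimals, for example IDF 1.16 and IG 0.37 for "Italy".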
Methodology – Feature Scoring Metrics
Feature   Positive (30)   Negative (400)   BNS    IDF    LOR    IG
Italy     30 (100%)       0                3.29   1.16   4.68   0.37
x         3 (10%)         0                0.36   2.16   1.95   0.03
Positive rate in [0%, 100%], negative rate 0%
Methodology – Feature Scoring Metrics
Feature   Positive (30)   Negative (400)   BNS    IDF    LOR     IG
Italy     0               40 (10%)         0.36   1.03   -0.82   0.01
x         0               400 (100%)       3.29   0.03   -4.68   0.37
Positive rate 0%, negative rate in [0%, 100%]
Methodology – Feature Scoring Metrics
Feature   Positive (30)   Negative (400)   BNS    IDF    LOR    IG
patient   30 (100%)       400 (100%)       0.00   0.00   0.00   0.00
y         15 (50%)        200 (50%)        0.00   0.30   0.00   0.00
Positive rate in [0%, 100%], negative rate in [0%, 100%]
Methodology – Feature Scoring Metrics
Feature   Positive (30)   Negative (400)   BNS    IDF    LOR     IG
Italy     30 (100%)       0                3.29   1.16   4.68    0.37
cost      0               400 (100%)       3.29   0.03   -4.68   0.37
Positive rate from 0% to 100%, negative rate from 100% to 0%
Experiments – Accuracy & F-measure
Experiments – Precision vs. Recall
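The plots for these two experiment slides are not part of the transcript. For readers unfamiliar with the measures named in the titles, the sketch below shows how accuracy, precision, recall and F-measure are conventionally computed from a binary confusion matrix. It assumes the F-measure is the balanced F1 score; the confusion counts are hypothetical and are not results from the paper.

```python
def evaluate(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts for a 30-positive / 400-negative test set.
accuracy, precision, recall, f1 = evaluate(tp=20, fp=5, fn=10, tn=395)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With a class split this skewed, accuracy stays high even when a third of the positives are missed, which is one reason to report F-measure alongside it.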
Experiments – The effect of class distribution
Experiments – Comparison with other scoring metrics
Experiments – Feature selection + Feature scaling
Conclusions
BNS: the difference between a feature's rate in the + class and its rate in the - class, each passed through $F^{-1}$, the inverse normal cumulative distribution function.
When feature selection is used: IG selection + BNS scaling.
Feature selection is not needed: it is better to use all features for the best performance.
It is better to simply use all binary features.
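The recommendation to use all binary features with BNS scaling can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: BNS is taken as $|F^{-1}(tpr) - F^{-1}(fpr)|$ with rates clipped to [0.0005, 0.9995], documents are modeled as sets of terms, and the names `bns`, `bns_scale` and the toy corpus are hypothetical.

```python
from statistics import NormalDist


def bns(tp, fp, n_pos, n_neg, eps=0.0005):
    """|F^-1(tpr) - F^-1(fpr)| with rates clipped to [eps, 1 - eps] (assumed bound)."""
    inv_cdf = NormalDist().inv_cdf
    tpr = min(max(tp / n_pos, eps), 1 - eps)
    fpr = min(max(fp / n_neg, eps), 1 - eps)
    return abs(inv_cdf(tpr) - inv_cdf(fpr))


def bns_scale(docs, labels):
    """Turn term sets into vectors whose nonzero entries are the term's BNS score.

    docs: list of sets of terms; labels: list of 1 (positive) / 0 (negative).
    No feature is dropped: every vocabulary term keeps a column.
    """
    vocab = sorted(set().union(*docs))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    weights = {}
    for term in vocab:
        tp = sum(1 for d, y in zip(docs, labels) if y == 1 and term in d)
        fp = sum(1 for d, y in zip(docs, labels) if y == 0 and term in d)
        weights[term] = bns(tp, fp, n_pos, n_neg)
    # Binary presence (1/0) multiplied by the term's BNS weight.
    vectors = [[weights[t] if t in d else 0.0 for t in vocab] for d in docs]
    return vocab, weights, vectors


# Toy corpus: two positive and two negative documents.
docs = [{"italy", "patient"}, {"italy", "patient"},
        {"cost", "patient"}, {"cost", "patient"}]
labels = [1, 1, 0, 0]
vocab, weights, vectors = bns_scale(docs, labels)
print(vocab)                 # ['cost', 'italy', 'patient']
print(weights["patient"])    # 0.0
```

The scaled vectors would then be passed to an SVM trainer, which is not shown here. The term "patient" occurs equally in both classes, so its column is scaled to zero without any explicit feature selection step.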
Comments
Advantage
The idea is clear: take the class distribution into account.
Drawback
Restricted to the 2-class problem.
Using all features takes time.
Application
Use in place of IDF.