Intelligent Database Systems Lab, N.Y.U.S.T. I.M.

BNS Feature Scaling: An Improved Representation over TF·IDF for SVM Text Classification

Presenter: Lin, Shu-Han
Authors: George Forman (Hewlett-Packard Labs)
Conference on Information and Knowledge Management (CIKM) (2009)


Post on 29-Dec-2015





Outline

Motivation
Objective
Methodology
Experiments
Conclusion
Comments

Motivation

Multi-class classification: a 1-of-n problem, e.g., topic category (A, B, C, D).

Binary-class classification: a 1-of-2 problem.

So every multi-class problem can be decomposed into many binary classification problems:

The positive/negative problem (one class is positive, all the others are negative).
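The decomposition into positive/negative problems can be sketched as follows. This is a minimal illustration only; the function name one_vs_rest and the sample labels are hypothetical and do not come from the slides.

```python
def one_vs_rest(labels):
    """Turn one multi-class labelling into one binary labelling per class.

    For each class c, documents labelled c become positive (1) and all
    other documents become negative (0).
    """
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


# Hypothetical topic labels for five documents.
labels = ["A", "B", "C", "D", "A"]
problems = one_vs_rest(labels)

for cls, binary in problems.items():
    print(cls, binary)
```

Each of the resulting binary problems can then be scored and classified independently.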

Motivation

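Feature "scaling" = "weighting" = "scoring"

TF·IDF representation:
IDF is oblivious to the class labels
It scales some features inappropriately

Feature   Positive (100)   Negative (900)   IDF
X         80 (80%)         0 (0%)           log10(1000/80) = 1.1
Y         8 (8%)           0 (0%)           log10(1000/8) = 2.1

IDF gives Y nearly twice the weight of X, although X occurs in 80% of the positive documents and is the stronger predictor.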
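The IDF values in this example can be checked with a short calculation. This sketch assumes a base-10 logarithm, which is what reproduces the 1.1 and 2.1 on the slide; the function name idf is illustrative.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency with a base-10 logarithm (assumed)."""
    return math.log10(n_docs / doc_freq)


n_docs = 100 + 900          # 100 positive + 900 negative documents
idf_x = idf(n_docs, 80)     # X occurs in 80 documents, all positive
idf_y = idf(n_docs, 8)      # Y occurs in 8 documents, all positive

print(f"IDF(X) = {idf_x:.1f}, IDF(Y) = {idf_y:.1f}")
```

The class labels never enter the computation, which is why the rarer feature Y ends up with the larger weight.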

Objectives

Maximize classification "performance"
Feature selection
Feature scaling: make the numeric range greater for more predictive features

Predictive: a feature that occurs in 100% of the positives and 0% of the negatives, or in 0% of the positives and 100% of the negatives

Methodology – Feature Scoring Metrics

F^-1: the inverse Normal cumulative distribution function
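A minimal sketch of how F^-1 enters the BNS score, assuming the usual definition BNS = |F^-1(tpr) - F^-1(fpr)|, where tpr and fpr are the fractions of positive and negative documents that contain the feature. The slides do not state how rates of exactly 0% or 100% are handled, so the clipping bound eps below is an assumption, and the numbers it produces may differ from the tables on the following slides.

```python
from statistics import NormalDist


def bns(tp, fp, n_pos, n_neg, eps=0.0005):
    """Bi-Normal Separation score of one feature.

    tp, fp: number of positive / negative documents containing the feature.
    eps:    assumed clipping bound that keeps F^-1 finite at rates 0 and 1.
    """
    inv_cdf = NormalDist().inv_cdf          # F^-1 of the standard Normal
    tpr = min(max(tp / n_pos, eps), 1 - eps)
    fpr = min(max(fp / n_neg, eps), 1 - eps)
    return abs(inv_cdf(tpr) - inv_cdf(fpr))


# A feature with equal rates in both classes carries no signal.
print(bns(15, 200, 30, 400))
# A feature found only in the positives scores the same as one found only in the negatives.
print(bns(30, 0, 30, 400), bns(0, 400, 30, 400))
```

Because the score depends only on the two class-conditional rates, it stays meaningful when the classes are very different in size.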

Feature   Positive (30)   Negative (400)   BNS    IDF    LOR     IG
Italy     30 (100%)       0                3.29   1.16   4.68    0.37
x         2 (7%)          0                0.14   2.33   1.76    0.02
patient   30 (100%)       400 (100%)       0.00   0.00   0.00    0.00
cost      0               400 (100%)       3.29   0.03   -4.68   0.37
y         15 (50%)        200 (50%)        0.00   0.30   0.00    0.00
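The IG column of this table can be reproduced from the counts alone. The sketch below assumes IG is the standard information gain of the class label given feature presence, measured in bits; the function names are illustrative.

```python
import math


def entropy(pos, neg):
    """Entropy in bits of a two-class set with the given counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h


def info_gain(tp, fp, n_pos, n_neg):
    """Class entropy minus the entropy left after splitting on the feature."""
    n = n_pos + n_neg
    present = tp + fp
    absent = n - present
    h_present = entropy(tp, fp)
    h_absent = entropy(n_pos - tp, n_neg - fp)
    return entropy(n_pos, n_neg) - (present / n) * h_present - (absent / n) * h_absent


N_POS, N_NEG = 30, 400
rows = {"Italy": (30, 0), "x": (2, 0), "patient": (30, 400),
        "cost": (0, 400), "y": (15, 200)}

for name, (tp, fp) in rows.items():
    print(f"{name:8s} IG = {info_gain(tp, fp, N_POS, N_NEG):.2f}")
```

Perfect predictors ("Italy", "cost") recover the full class entropy, while features with the same rate in both classes ("patient", "y") gain nothing.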

Methodology – Feature Scoring Metrics

Feature   Positive (30)   Negative (400)   BNS    IDF    LOR    IG
Italy     30 (100%)       0                3.29   1.16   4.68   0.37
x         3 (10%)         0                0.36   2.16   1.95   0.03

Positive rate: [0% ~ 100%]; negative rate: 0%

Methodology – Feature Scoring Metrics

Feature   Positive (30)   Negative (400)   BNS    IDF    LOR     IG
Italy     0               40 (10%)         0.36   1.03   -0.82   0.01
x         0               400 (100%)       3.29   0.03   -4.68   0.37

Positive rate: 0%; negative rate: [0% ~ 100%]

Methodology – Feature Scoring Metrics

Feature   Positive (30)   Negative (400)   BNS    IDF    LOR    IG
patient   30 (100%)       400 (100%)       0.00   0.00   0.00   0.00
y         15 (50%)        200 (50%)        0.00   0.30   0.00   0.00

Positive rate: [0% ~ 100%]; negative rate: [0% ~ 100%]

Methodology – Feature Scoring Metrics

Feature   Positive (30)   Negative (400)   BNS    IDF    LOR     IG
Italy     30 (100%)       0                3.29   1.16   4.68    0.37
cost      0               400 (100%)       3.29   0.03   -4.68   0.37

Positive rate: [0% ~ 100%]; negative rate: [100% ~ 0%]


Experiments – Accuracy & F-measure

Experiments – Precision vs. Recall

Experiments – The effect of class distribution

Experiments – Comparison with other scoring metrics

Experiments – Feature selection + Feature scaling


Conclusions

BNS: the difference between a feature's rate in the + class and its rate in the - class

Use IG selection + BNS scaling
No need for feature selection: it is better to use all features for the best performance
Better to simply use all binary features
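One way to read "all binary features with BNS scaling" is that every word present in a document contributes its BNS score and every absent word contributes zero, with no term-frequency factor. The sketch below follows that reading; the helper names, the clipping bound eps, and the sample counts are assumptions for illustration, not details taken from the slides.

```python
from statistics import NormalDist


def bns(tp, fp, n_pos, n_neg, eps=0.0005):
    """BNS = |F^-1(tpr) - F^-1(fpr)| with rates clipped to [eps, 1 - eps] (assumed)."""
    inv_cdf = NormalDist().inv_cdf
    tpr = min(max(tp / n_pos, eps), 1 - eps)
    fpr = min(max(fp / n_neg, eps), 1 - eps)
    return abs(inv_cdf(tpr) - inv_cdf(fpr))


def bns_vector(doc_words, counts, n_pos, n_neg):
    """Sparse BNS-scaled binary vector: present words get their BNS score."""
    return {w: bns(tp, fp, n_pos, n_neg)
            for w, (tp, fp) in counts.items() if w in doc_words}


# Hypothetical training counts: word -> (positive docs, negative docs) containing it.
counts = {"italy": (30, 0), "patient": (30, 400), "cost": (0, 400)}
vec = bns_vector({"italy", "patient"}, counts, n_pos=30, n_neg=400)
print(vec)
```

Words that occur at the same rate in both classes receive a weight of zero, so they drop out of the representation without a separate feature selection step.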


Comments

Advantage
The idea is clear: consider the class distribution

Drawback
Restricted to the 2-class problem
Using all features takes time

Application
Use in place of IDF