A Survey on Text Categorization with Machine Learning

Chikayama lab. Dai Saito

Posted on 01-Jan-2016


Introduction: Text Categorization

- Many digital texts are available
  - E-mail, online news, blogs, ...
- The need for automatic text categorization, without human effort, is increasing
  - Merits: saves time and cost
- Applications
  - Spam filter
  - Topic categorization

Introduction: Machine Learning

- Categorization rules are built automatically from features of the text
- Types of Machine Learning (ML)
  - Supervised learning: labeling
  - Unsupervised learning: clustering

Introduction: Flow of ML

1. Prepare training text data with labels
   - Feature of text
2. Learn
3. Categorize new text

(Figure: texts labeled Label1 and Label2)

Outline

- Introduction
- Text Categorization
- Feature of Text
- Learning Algorithm
- Conclusion

Number of labels

- Binary-label
  - True or false (e.g. spam or not)
  - Can be applied to the other types
- Multi-label
  - Many labels, but each text has exactly one label
- Overlapping-label
  - Each text may have several labels

(Figure: a Yes/No split for binary-label; labels L1 to L4 for multi-label and for overlapping-label)

Types of labels

- Topic categorization
  - The basic task
  - Compares individual words
- Author categorization
- Sentiment categorization
  - Ex) Reviews of products
  - Needs more linguistic information


Feature of Text

- How can a feature of a text be expressed?
- "Bag of Words"
  - Ignores the order of words
- Structure
  - Ex) "I like this car." | "I don't like this car."
  - "Bag of Words" will not work well here

Notation: d = document (text), t = term (word)
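A minimal sketch of the bag-of-words representation, applied to the two example sentences. The tokenizer (split on whitespace, strip basic punctuation, lowercase) and the function name `bag_of_words` are assumptions made for illustration only.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, strip basic punctuation, and count each word; order is discarded."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w)

positive = bag_of_words("I like this car.")
negative = bag_of_words("I don't like this car.")

# The two bags differ only by the single token "don't".
difference = negative - positive
print(difference)

# Word order is lost entirely: a reordered sentence gives the same bag.
same_bag = bag_of_words("This car I like.") == positive
print(same_bag)
```

The two sentences have opposite meanings, yet their bags share every word but one, which is why word counts alone are a weak feature for this kind of example.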

Preprocessing

- Remove stop words
  - "the", "a", "for", ...
- Stemming
  - relational -> relate, truly -> true
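The two preprocessing steps can be sketched as follows. The stop-word set and the suffix rules are toy assumptions chosen only to reproduce the examples above; a real system would use a full stop-word list and an established stemmer.

```python
STOP_WORDS = {"the", "a", "for"}  # tiny illustrative list, not a real stop-word list

# Toy suffix rewrite rules (hypothetical), chosen only to reproduce
# "relational -> relate" and "truly -> true".
SUFFIX_RULES = [("ational", "ate"), ("uly", "ue")]

def stem(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    tokens = [w.strip(".,!?").lower() for w in text.split()]
    return [stem(t) for t in tokens if t and t not in STOP_WORDS]

tokens = preprocess("The relational model is truly a model for the data.")
print(tokens)
```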

Term Weighting

- Term frequency (tf)
  - Number of occurrences of a term in a document
  - Frequent terms in a document seem to be important for categorization
- tf・idf
  - Terms appearing in many documents are not useful for categorization
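A small sketch of tf・idf weighting. The exact variant is an assumption: raw counts for tf and idf = log(N / df), where N is the number of documents and df is the number of documents containing the term. Other variants (smoothing, normalization) are common.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    ["cheap", "pills", "cheap", "offer"],
    ["meeting", "offer", "agenda"],
    ["meeting", "agenda", "offer"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "offer" appears in every document, so its weight is zero everywhere, while "cheap" is both frequent in the first document and rare across documents, so it gets the largest weight there.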

Sentiment Weighting

- For sentiment classification, weight a word as positive or negative
- Constructing a sentiment dictionary
  - WordNet [04 Kamps et al.]
    - Synonym database
    - Uses the distance from 'good' and 'bad'

(Figure: synonym graph containing good, bad, and happy)

d(good, happy) = 2
d(bad, happy) = 4
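The distance idea can be sketched with a breadth-first search over a synonym graph. The graph below is a hypothetical toy (the nodes `w1` and `w2` stand in for unnamed intermediate words), built only so that the two distances match the example; it is not WordNet data. Turning the two distances into a polarity by taking their difference is also an assumption for illustration.

```python
from collections import deque

# Hypothetical synonym links; w1 and w2 are placeholder words.
EDGES = [("good", "glad"), ("glad", "happy"),
         ("bad", "w1"), ("w1", "w2"), ("w2", "glad")]

def build_graph(edges):
    """Build an undirected adjacency map from a list of word pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def distance(graph, start, goal):
    """Shortest path length between two words via BFS; None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

graph = build_graph(EDGES)
d_good = distance(graph, "good", "happy")
d_bad = distance(graph, "bad", "happy")
polarity = d_bad - d_good  # positive means the word is closer to 'good'
print(d_good, d_bad, polarity)
```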

Dimension Reduction

- Size of the feature vectors is (#terms) * (#documents)
  - #terms ≒ size of dictionary
- High calculation cost
- Risk of overfitting
  - Best for training data ≠ best for real data
- Choose effective features to improve accuracy and calculation cost

Dimension Reduction

- df-threshold
  - Terms appearing in very few documents (e.g. only one) are not important
- Score of a term t for a category cj
  - If t and cj are independent, the score is equal to zero
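Both selection ideas can be sketched briefly. The df-threshold follows the description directly. For the score, the χ² statistic is used here as an assumption: it is one common choice with the stated property (zero when t and cj are independent), but it may not be the score intended above. The names `df_filter` and `chi_square` are illustrative.

```python
from collections import Counter

def df_filter(docs, min_df=2):
    """Keep only terms that appear in at least min_df documents."""
    df = Counter(term for doc in docs for term in set(doc))
    return {term for term, count in df.items() if count >= min_df}

def chi_square(a, b, c, d):
    """Chi-square score of term t for category cj from a 2x2 document count table.

    a: docs in cj containing t      b: docs outside cj containing t
    c: docs in cj without t         d: docs outside cj without t
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom

docs = [["cheap", "offer"], ["cheap", "pills"], ["meeting", "offer"]]
kept = df_filter(docs, min_df=2)          # "pills" and "meeting" occur once and are dropped

independent = chi_square(10, 20, 30, 60)  # t occurs in 25% of docs inside and outside cj
dependent = chi_square(40, 5, 10, 45)     # t occurs mostly inside cj
print(kept, independent, dependent)
```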


Learning Algorithm

- Many (almost all?) algorithms are used in text categorization
- Simple approaches
  - Naïve Bayes
  - k-Nearest Neighbor
- High-performance approaches
  - Boosting
  - Support Vector Machine
- Hierarchical Learning

Naïve Bayes

- Bayes rule
  - P(c|d) = P(d|c) P(c) / P(d)
- P(d|c) is hard to calculate
  - Assumption: each term occurs independently, so P(d|c) is the product of P(t|c) over the terms t in d
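A minimal sketch of a Naïve Bayes text classifier built on the independence assumption: the score of a class c for a document d is log P(c) plus the sum of log P(t|c) over the terms t in d. The multinomial model with add-one (Laplace) smoothing and the toy training data are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (tokens, label). Collects class and term counts."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def classify(tokens, model):
    """Return the label maximizing log P(c) + sum of log P(t|c)."""
    class_counts, term_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        total = sum(term_counts[label].values())
        score = math.log(class_counts[label] / n_docs)
        for t in tokens:
            # Add-one smoothing keeps unseen terms from zeroing the product.
            score += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    (["cheap", "pills", "offer"], "spam"),
    (["cheap", "offer", "now"], "spam"),
    (["meeting", "agenda", "now"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train(training)
print(classify(["cheap", "offer"], model))
print(classify(["meeting", "notes"], model))
```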

k-Nearest Neighbor

- Define a "distance" between two texts
  - Ex) Sim(d1, d2) = d1 ・ d2 / (|d1| |d2|) = cos θ
- Check the k most similar texts and categorize by majority vote
- If the training data is larger, memory and search costs are higher

(Figure: vectors d1 and d2 separated by the angle θ; a vote with k = 3)
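A short sketch of k-NN with cosine similarity and a majority vote. Documents are represented as sparse term-weight dictionaries; that representation and the toy data are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def knn_classify(query, training, k=3):
    """Rank all training texts by similarity and vote among the top k."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

training = [
    ({"cheap": 2, "offer": 1}, "spam"),
    ({"cheap": 1, "pills": 1}, "spam"),
    ({"meeting": 1, "agenda": 1}, "ham"),
    ({"meeting": 2, "notes": 1}, "ham"),
    ({"offer": 1, "meeting": 1}, "ham"),
]
query = {"cheap": 1, "offer": 1}
print(knn_classify(query, training, k=3))
```

Every query is compared against every stored training text, which is where the memory and search cost come from.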

Boosting

- BoosTexter [00 Schapire et al.]
  - AdaBoost
  - Makes many "weak learners" with different parameters
  - The K-th weak learner checks the performance of learners 1 to K-1 and tries to correctly classify the training data with the worst scores
  - BoosTexter uses decision stumps as weak learners
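The reweighting idea can be sketched with plain binary AdaBoost over decision stumps that test whether a single term is present. This is a generic textbook version under stated assumptions (labels are +1/-1, documents are sets of terms, toy data), not BoosTexter itself.

```python
import math

def stump_predict(stump, doc):
    """A stump (term, sign) predicts sign if the term is present, else -sign."""
    term, sign = stump
    return sign if term in doc else -sign

def best_stump(docs, labels, weights, vocab):
    """Pick the stump with the lowest weighted error on the training data."""
    best, best_err = None, float("inf")
    for term in sorted(vocab):
        for sign in (+1, -1):
            err = sum(w for doc, y, w in zip(docs, labels, weights)
                      if stump_predict((term, sign), doc) != y)
            if err < best_err:
                best, best_err = (term, sign), err
    return best, best_err

def adaboost(docs, labels, rounds=3):
    n = len(docs)
    weights = [1.0 / n] * n
    vocab = set().union(*docs)
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(docs, labels, weights, vocab)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified documents gain weight, so the next stump focuses on them.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, doc))
                   for doc, y, w in zip(docs, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, doc):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, doc) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

docs = [{"cheap", "offer"}, {"cheap", "pills"}, {"pills", "agenda"},
        {"offer", "meeting"}, {"meeting", "agenda"}, {"cheap", "notes"}]
labels = [+1, +1, +1, -1, -1, -1]

ensemble = adaboost(docs, labels, rounds=3)
predictions = [predict(ensemble, doc) for doc in docs]
print(predictions)
```

On this toy data no single stump separates the classes, but the weighted vote of three stumps classifies every training document correctly.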

Simple example of Boosting

(Figure: + and - training points, shown in three steps labeled 1., 2., and 3.)

Support Vector Machine

- Text Categorization with SVM [98 Joachims]
- Maximize margin
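What "margin" means can be sketched by measuring it for two candidate separating hyperplanes on toy 2-D points. This only evaluates the margin; it does not train an SVM. The points and the two candidates are hypothetical.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance y * (w . x + b) / |w| over all points."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in zip(points, labels))

points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [+1, +1, -1, -1]

# Two hyperplanes that both separate the data correctly.
margin_a = geometric_margin((1.0, 1.0), -2.0, points, labels)  # x1 + x2 = 2
margin_b = geometric_margin((1.0, 0.0), -1.5, points, labels)  # x1 = 1.5
print(margin_a, margin_b)
```

Both candidates classify every point correctly, but the first keeps the closest points farther from the boundary; an SVM searches for the separating hyperplane with the largest such margin.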

Text Categorization with SVM

- SVM works well for text categorization
  - Robust to high dimensionality
  - Robust to overfitting
- Most text categorization problems are linearly separable
  - All of OHSUMED (MEDLINE collection)
  - Most of Reuters-21578 (news collection)

Comparison of these methods

- [02 Sebastiani]
- Reuters-21578 (2 versions)
  - Difference: number of categories (90 in Ver.1, 10 in Ver.2)

Method         Ver.1 (90)   Ver.2 (10)
k-NN           .860         .823
Naïve Bayes    .795         .815
Boosting       .878         -
SVM            .870         .920

Hierarchical Learning

- TreeBoost [06 Esuli et al.]
  - Boosting algorithm for hierarchical labels
  - Training data: hierarchical labels and texts with labels
  - Applies AdaBoost recursively
  - Better classifier than 'flat' AdaBoost
    - Accuracy: 2-3% up
    - Time: training and categorization time down
- Hierarchical SVM [04 Cai et al.]

TreeBoost

(Figure: label hierarchy)

root

Level 1: L1 L2 L3 L4

Level 2: L11 L12 L41 L42 L43

Level 3: L421 L422
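Top-down categorization over a label hierarchy can be sketched as below. This is not TreeBoost: it only illustrates how each decision is restricted to the children of the current node. The parent/child links are inferred from the label names, and the keyword-overlap scorer and its keywords are hypothetical stand-ins for trained per-node classifiers.

```python
# Parent -> children links, inferred from the label names (an assumption).
TREE = {
    "root": ["L1", "L2", "L3", "L4"],
    "L1": ["L11", "L12"],
    "L4": ["L41", "L42", "L43"],
    "L42": ["L421", "L422"],
}

# Hypothetical keywords per label, standing in for a trained classifier.
KEYWORDS = {
    "L4": {"sports"},
    "L42": {"soccer"},
    "L422": {"goal"},
    "L1": {"politics"},
}

def score(doc, label):
    """Number of the label's keywords that occur in the document."""
    return len(doc & KEYWORDS.get(label, set()))

def classify_top_down(doc, tree, score_fn):
    """Walk down from the root, choosing the best-scoring child at each node."""
    node = "root"
    path = []
    while node in tree:
        node = max(tree[node], key=lambda child: score_fn(doc, child))
        path.append(node)
    return path

path = classify_top_down({"sports", "soccer", "goal"}, TREE, score)
print(path)
```

Each step compares only a handful of sibling labels instead of all leaf labels at once, which is one intuition for why hierarchical methods can reduce training and categorization time.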


Conclusion

- Overview of text categorization with machine learning
  - Feature of text
  - Learning algorithm
- Future work
  - Natural language processing with machine learning, especially in Japanese
  - Calculation cost