A comparative study of TF*IDF, LSI and multi-words for text classification

Post on 19-Mar-2016


Intelligent Database Systems Lab

Presenter : JIAN-REN CHEN

Authors : Wen Zhang, Taketoshi Yoshida, Xijin Tang

2011.ESWA

A comparative study of TF*IDF, LSI and multi-words for text classification

Outlines

• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments

Motivation

Although TF*IDF, LSI and multi-word indexing were all proposed long ago, there is no comparative study of these indexing methods, and no results have been reported on their relative classification performance.

Objectives

• A comparative study of TF*IDF, LSI and multi-words for text classification
- information retrieval
- text categorization

• Indexing term:
① semantic quality
② statistical quality

Methodology - TF*IDF

1) w_{i,j}: the weight of term i in document j
2) N: the number of documents in the collection
3) tf_{i,j}: the term frequency of term i in document j
4) df_i: the document frequency of term i in the collection

[Figure: term-document matrix, with one axis labeled "Terms (keywords) of the document collection" and the other "documents"]
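The weighting formula itself did not survive in this transcript. A minimal sketch, assuming the commonly used form w_{i,j} = tf_{i,j} × log(N / df_i) with a natural logarithm (both the exact form and the log base are assumptions, not confirmed by the slide), might look like this. The function name `tfidf_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumes w_ij = tf_ij * ln(N / df_i); the slide's own formula is not
    present in the transcript, so this form is an assumption.
    """
    n_docs = len(docs)  # N: number of documents in the collection
    # df_i: number of documents that contain term i at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: raw count of term i in document j
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy collection of three tokenized documents
docs = [
    ["crude", "oil", "price", "oil"],
    ["trade", "deficit", "price"],
    ["interest", "rate", "trade"],
]
w = tfidf_weights(docs)
```

Under this form, a term that occurs in every document gets weight 0, which is why TF*IDF favors terms concentrated in few documents.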

Methodology - LSI

Given a term-document matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$ with rank $r$, LSI decomposes $X$ using SVD as follows:

1. $X = U \Sigma V^T$
2. $X_k = U_k \Sigma_k V_k^T$

[Figure: term-document matrix, with one axis labeled "Terms (keywords) of the document collection" and the other "documents"]
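The rank-k truncation can be illustrated with NumPy's SVD. This is only a sketch under stated assumptions: the function name `lsi_rank_k`, the toy matrix, and the choice to represent documents as the columns of Σ_k V_k^T are illustrative and not taken from the slides.

```python
import numpy as np

def lsi_rank_k(X, k):
    """Return the rank-k approximation X_k and k-dimensional document coordinates.

    X is an m x n term-document matrix (rows: terms, columns: documents).
    """
    # Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    Xk = Uk @ np.diag(sk) @ Vtk      # X_k = U_k Sigma_k V_k^T
    doc_coords = np.diag(sk) @ Vtk   # one k-dimensional column per document
    return Xk, doc_coords

# Hypothetical 4-term x 3-document count matrix
X = np.array([
    [2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 2.0],
])
Xk, doc_coords = lsi_rank_k(X, k=2)
```

Keeping only the k largest singular values gives the best rank-k approximation of X in the Frobenius norm, which is the usual motivation for this truncation.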

Methodology - Multi-word

• The length of a multi-word should be between 2 and 6.

• It should occur at least twice in a document.
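The two constraints above can be sketched as a filter over candidate phrases. The slide does not say how candidates are generated, so this example assumes contiguous word n-grams, assumes that length is counted in words, and uses whitespace tokenization (Chinese text would need a word segmenter first). It is a simplified illustration, not the authors' extraction procedure; `candidate_multiwords` is a hypothetical name.

```python
from collections import Counter

def candidate_multiwords(tokens, min_len=2, max_len=6, min_freq=2):
    """Return {n-gram tuple: count} for n-grams that satisfy both constraints.

    Candidates are contiguous n-grams of min_len..max_len tokens (an
    assumption); only those occurring at least min_freq times in this
    single document are kept.
    """
    counts = Counter(
        tuple(tokens[i:i + n])
        for n in range(min_len, max_len + 1)
        for i in range(len(tokens) - n + 1)
    )
    return {gram: c for gram, c in counts.items() if c >= min_freq}

# Hypothetical one-sentence document
doc = "crude oil prices rose as crude oil supply fell".split()
found = candidate_multiwords(doc)
print(found)  # {('crude', 'oil'): 2}
```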

Experiments - Datasets

• Chinese corpus: TanCorpV1.0
- 14,150 documents, 20 categories
- Selected: 1,200 documents, 219,115 sentences, 5,468,301 individual words
- Selected categories: agriculture, history, politics, economy

• English corpus: Reuters-22173 distribution 1.0
- 22,173 documents, 135 categories
- Selected: 2,032 documents, 50,837 sentences, 281,111 individual words
- Selected categories: crude (520), agriculture (574), trade (514), interest (424)

Experiments - Evaluation

Experiments - Chinese

Experiments - English

Experiments - t-test

Comparison

Method      | Information retrieval | Text categorization | Computation complexity
TF*IDF      | Chinese               | -                   | O(nm)
LSI         | English               | best                | O(n^2 r^3)
multi-word  | -                     | -                   | O(ms^2)

Conclusions

• LSI can produce indexing with better discriminative power.

• LSI and multi-word have better semantic quality than TF*IDF, and TF*IDF has better statistical quality than the other two methods.

• The number of dimensions is still a decisive factor for indexing when different indexing methods are used for classification.

Comments

• Advantages
- Compares TF*IDF, LSI and multi-words

• Disadvantage
- Semantic quality and statistical quality are considered merely by intuition instead of theory

• Applications
- text mining