Intelligent Database Systems Lab

Presenter : JIAN-REN CHEN

Authors : Wen Zhang, Taketoshi Yoshida, Xijin Tang

2011.ESWA

A comparative study of TF*IDF, LSI and multi-words for text classification

Outlines

• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments

Motivation

Although TF*IDF, LSI and multi-word were proposed long ago, no comparative study of these indexing methods exists, and no results on their classification performance have been reported.

Objectives

• A comparative study of TF*IDF, LSI and multi-words for text classification
- information retrieval
- text categorization

• Indexing term:
① semantic quality
② statistical quality

Methodology - TF*IDF

1) $w_{i,j}$: the weight of term $i$ in document $j$
2) $N$: the number of documents in the collection
3) $tf_{i,j}$: the term frequency of term $i$ in document $j$
4) $df_i$: the document frequency of term $i$ in the collection

[Figure: term-document matrix, with axes labeled "Terms (keywords) of the document collection" and "documents"]
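The slide's weighting equation did not survive extraction. The symbols defined above match the standard form $w_{i,j} = tf_{i,j} \times \log(N / df_i)$, so the following is a minimal sketch under that assumption; the paper may use a different log base or a normalized variant. The function name `tfidf_weights`, the whitespace tokenizer and the three toy documents are illustrative only.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document.

    Assumes the standard form w_ij = tf_ij * log(N / df_i); the slide's own
    equation is missing, so this exact form is an assumption.
    """
    n_docs = len(docs)  # N: number of documents in the collection
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    # df_i: number of documents that contain term i at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # tf_ij: raw count of term i in document j
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights


docs = ["crude oil price", "oil trade", "interest rate"]  # toy collection
w = tfidf_weights(docs)
```

A term that appears in every document gets weight 0 under this form, which is why rare terms dominate the index.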

Methodology - LSI

Given a term-document matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$ of rank $r$, LSI decomposes $X$ using SVD as follows:

[Figure: term-document matrix, with axes labeled "Terms (keywords) of the document collection" and "documents"]

1. $X = U \Sigma V^{T}$
2. $X_k = U_k \Sigma_k V_k^{T}$
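The rank-$k$ form $X_k = U_k \Sigma_k V_k^{T}$ can be sketched with NumPy's `numpy.linalg.svd`. This is an illustration only: the function name `lsi_rank_k`, the toy 4x3 matrix and the choice `k=2` are assumptions, and using the columns of $\Sigma_k V_k^{T}$ as document vectors is one common convention rather than something the slide states.

```python
import numpy as np


def lsi_rank_k(X, k):
    """Return the rank-k approximation X_k and k-dimensional document vectors."""
    # Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]  # keep the k largest
    X_k = U_k @ np.diag(s_k) @ Vt_k              # X_k = U_k Sigma_k V_k^T
    doc_vectors = np.diag(s_k) @ Vt_k            # one column per document
    return X_k, doc_vectors


# Toy term-document matrix: 4 terms (rows) x 3 documents (columns)
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 1.0, 0.0]])
X_k, doc_vectors = lsi_rank_k(X, k=2)
```

The choice of `k` sets the dimensionality of the index: the documents here are represented by 2 values each instead of 4.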

Methodology - Multi-word

• The length of a multi-word should be between 2 and 6.

• It should occur at least twice in a document.
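The two rules above can be sketched as a simple candidate filter. This assumes that length is counted in word tokens and that candidates are contiguous word sequences split on whitespace; the slide states neither, and the paper's extraction procedure may apply further steps. The name `multiword_candidates` and the sample sentence are illustrative.

```python
from collections import Counter


def multiword_candidates(text, min_len=2, max_len=6, min_freq=2):
    """Return {phrase: count} for word sequences that satisfy both rules."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    counts = Counter()
    for n in range(min_len, max_len + 1):        # rule 1: length between 2 and 6
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    # rule 2: keep sequences that occur at least twice in the document
    return {" ".join(gram): c for gram, c in counts.items() if c >= min_freq}


doc = "text classification needs good features and text classification needs data"
cands = multiword_candidates(doc)
```

For this sample document the surviving candidates are "text classification", "classification needs" and "text classification needs", each with a count of 2.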

Experiments - Datasets

• Chinese corpus: TanCorpV1.0
- Full corpus: 14,150 documents, 20 categories
- Selected: 1,200 documents, 219,115 sentences, 5,468,301 individual words
- Selected categories: agriculture, history, politics, economy

• English corpus: Reuters-22173 distribution 1.0
- Full corpus: 22,173 documents, 135 categories
- Selected: 2,032 documents, 50,837 sentences, 281,111 individual words
- Selected categories: crude (520), agriculture (574), trade (514), interest (424)

Experiments - Evaluation

Experiments - Chinese

Experiments - English

Experiments - t-test

Comparison

Method | Information retrieval | Text categorization | Computation complexity
TF*IDF | Chinese | | $O(nm)$
LSI | English | best | $O(n^2 r^3)$
multi-word | | | $O(ms^2)$

Conclusions

• LSI produces indexing with better discriminative power.

• LSI and multi-word have better semantic quality than TF*IDF, while TF*IDF has better statistical quality than the other two methods.

• The number of dimensions remains a decisive factor for indexing when different indexing methods are used for classification.

Comments

• Advantages
- Compares TF*IDF, LSI and multi-words

• Disadvantage
- Semantic quality and statistical quality are judged merely by intuition instead of theory

• Applications
- text mining