A comparative study of TF*IDF, LSI and multi-words for text classification
Intelligent Database Systems Lab
Presenter : JIAN-REN CHEN
Authors : Wen Zhang, Taketoshi Yoshida, Xijin Tang
2011.ESWA
A comparative study of TF*IDF, LSI and multi-words for text classification
Outlines
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments
Motivation
Although TF*IDF, LSI and multi-words were proposed long ago, there is no comparative study of these indexing methods, and no results have been reported on their classification performance.
Objectives
• A comparative study of TF*IDF, LSI and multi-words for text classification.
  - information retrieval
  - text categorization
• Indexing term:
  ① semantic quality
  ② statistical quality
Methodology - TF*IDF
1) $w_{i,j}$: the weight of term i in document j
2) $N$: the number of documents in the collection
3) $tf_{i,j}$: the term frequency of term i in document j
4) $df_i$: the document frequency of term i in the collection
[Figure: term-document matrix, with the terms (keywords) of the document collection along one axis and the documents along the other]
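The weighting formula on this slide was an image and did not survive in the transcript. The four symbols defined above are consistent with the standard form $w_{i,j} = tf_{i,j} \times \log(N / df_i)$, so the minimal sketch below assumes that form. The natural logarithm, the function name `tfidf_weights` and the toy documents are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document, assuming w = tf * log(N / df)."""
    n_docs = len(docs)  # N: number of documents in the collection
    # df[t]: number of documents that contain term t at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf[t]: occurrences of term t in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy collection: each document is a list of tokens.
docs = [
    ["crude", "oil", "price", "oil"],
    ["trade", "price"],
    ["interest", "rate", "trade"],
]
w = tfidf_weights(docs)
```

Under this form, a term that appears in every document gets weight 0, and a term that is frequent in one document but rare in the collection gets the largest weight.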
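Methodology - LSI
Given a term-document matrix $X = [x_1, x_2, \ldots, x_n]$ with columns $x_i \in \mathbb{R}^m$, and supposing the rank of $X$ is $r$, LSI decomposes $X$ using SVD as follows:
[Figure: term-document matrix, with the terms (keywords) of the document collection along one axis and the documents along the other]
1. $X = U \Sigma V^T$
2. $X_k = U_k \Sigma_k V_k^T$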
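The truncation step can be illustrated with NumPy. This is a minimal sketch, assuming the usual LSI reading in which only the k largest singular values are kept; the helper name `lsi_truncate`, the toy matrix and the choice to embed documents as $U_k^T X$ are assumptions for illustration, not details from the slides.

```python
import numpy as np

def lsi_truncate(X, k):
    """Return U_k, the k largest singular values, and V_k^T from the SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U @ diag(s) @ Vt
    return U[:, :k], s[:k], Vt[:k, :]

# Hypothetical term-document matrix: 4 terms (rows) x 3 documents (columns).
X = np.array([
    [2.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 2.0],
])

U_k, s_k, Vt_k = lsi_truncate(X, k=2)
X_k = U_k @ np.diag(s_k) @ Vt_k  # rank-k approximation of X
docs_k = U_k.T @ X               # each column is a document in the k-dimensional space
```

Projecting with `U_k.T` gives the same coordinates as `np.diag(s_k) @ Vt_k`, so either form can serve as the reduced document representation.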
Methodology - Multi-word
• The length of a multi-word should be between 2 and 6.
• It should occur at least twice in a document.
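The two constraints above can be sketched as a filter over token n-grams. The transcript does not describe the paper's actual extraction procedure, so this is only an illustration under stated assumptions: length is counted in tokens, candidates are plain contiguous n-grams with no linguistic filtering, and the name `candidate_multiwords` and the sample sentence are hypothetical.

```python
from collections import Counter

def candidate_multiwords(tokens, min_len=2, max_len=6, min_freq=2):
    """Count contiguous n-grams of length min_len..max_len in one document
    and keep those that occur at least min_freq times."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}

# Hypothetical single document, already tokenized by whitespace.
tokens = "the crude oil price rose as the crude oil supply fell".split()
found = candidate_multiwords(tokens)
```

A real pipeline would likely add further filtering (for example by part of speech), since raw n-grams such as ("the", "crude") also pass the two thresholds.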
Experiments - Datasets
Chinese corpus: TanCorpV1.0
• 14,150 documents, 20 categories
• Selected: 1,200 documents, 219,115 sentences, 5,468,301 individual words
• Selected categories: agriculture, history, politics, economy
English corpus: Reuters-22173 distribution 1.0
• 22,173 documents, 135 categories
• Selected: 2,032 documents, 50,837 sentences, 281,111 individual words
• Selected categories: crude (520), agriculture (574), trade (514), interest (424)
Experiments - Evaluation
Experiments - Chinese
Experiments - English
Experiments - t-test
Comparison
Method     | Information retrieval | Text categorization | Computation complexity
TF*IDF     | Chinese               |                     | O(nm)
LSI        | English               | best                | O(n^2 r^3)
multi-word |                       |                     | O(ms^2)
Conclusions
• LSI produces indexing with better discriminative power.
• LSI and multi-words have better semantic quality than TF*IDF, and TF*IDF has better statistical quality than the other two methods.
• The number of dimensions remains a decisive factor for indexing, whichever indexing method is used for classification.
Comments
• Advantages
  - Compares TF*IDF, LSI and multi-words
• Disadvantage
  - Semantic quality and statistical quality are judged merely by intuition instead of theory
• Applications
  - text mining