Reporter: Shau-Shiang Hung (洪紹祥)   Adviser: Shu-Chen Cheng (鄭淑真)   Date: 99/06/15

DESCRIPTION

Machine learning (ML) automatically builds a classifier for a category by observing the characteristics of a set of documents that have been classified manually under that category. The high dimensionality of text categorization (TC) problems makes most ML-based classification algorithms infeasible: many features can be irrelevant or noisy, and only a small percentage of the words are really meaningful. Feature selection is performed to reduce the number of features and avoid overfitting.
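The feature-selection step summarized above can be sketched in a few lines of Python. This is an illustrative score-then-truncate pattern, not the paper's method: plain document frequency stands in for the scoring measures discussed in the transcript, and all names are invented for the example.

```python
from collections import Counter

def select_features(docs, k):
    """Score each word and keep only the top-k as features.

    docs: list of documents, each represented as a set of words.
    Document frequency is only a stand-in scoring measure here; the
    point is the score-then-truncate mechanics of feature selection.
    """
    df = Counter()
    for doc in docs:
        df.update(doc)                      # each word counted once per document
    # Highest-scoring words first; ties broken alphabetically for determinism.
    ranked = sorted(df, key=lambda w: (-df[w], w))
    return ranked[:k]

docs = [
    {"the", "market", "rose"},
    {"the", "market", "fell"},
    {"the", "patient", "recovered"},
]
print(select_features(docs, 2))  # → ['the', 'market']
```

Swapping in a different scoring function (IG, TF-IDF, or the rule-quality measures described later) changes only the scores, not the selection mechanics.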

TRANSCRIPT

Outline: Introduction; Document preprocessing; Scoring measures for feature selection; Classification, performance evaluation, and corpora description; Experiments (Reuters, Ohsumed, comparing the results); Conclusion.

Introduction

Machine learning (ML) automatically builds a classifier for a category by observing the characteristics of a set of documents that have been classified manually under that category. The high dimensionality of text categorization (TC) problems makes most ML-based classification algorithms infeasible: many features can be irrelevant or noisy, and only a small percentage of the words are really meaningful. Feature selection is performed to reduce the number of features and avoid overfitting.

Document preprocessing

Before performing feature selection (FS), we must transform the documents to obtain a representation suitable for computational use. Additionally, we perform two kinds of feature reduction. The first removes stop words (extremely common words such as "the," "and," and "to"), which aren't useful for classification. The second is stemming, which maps words with the same meaning to one morphological form by removing suffixes.

Scoring measures for feature selection

Information retrieval measures, such as TF-IDF, determine word relevance. Information theory measures consider a word's distribution over the different categories; information gain (IG) takes into account the word's presence or absence in a category. To define our ML measures, we associate to each pair (w, c) the rule w → c: if the word w appears in a document, then that document belongs to category c. We then apply measures that have been used to quantify the quality of the rules induced by an ML algorithm. In this way, we reduce quantifying the importance of a word w in a category c to quantifying the quality of w → c.
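Information gain for a word w and a binary category c can be computed from four document counts. The sketch below uses the standard binary formulation IG = H(c) − P(w)·H(c|w) − P(¬w)·H(c|¬w); the paper may use a slightly different variant, so treat the exact form as an assumption.

```python
from math import log2

def entropy(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def information_gain(n, n_c, n_w, n_wc):
    """IG of word w for category c, from document counts.

    n: total documents; n_c: documents in c; n_w: documents containing w;
    n_wc: documents containing w that also belong to c.
    IG = H(c) - P(w) * H(c|w) - P(not w) * H(c|not w).
    """
    p_w = n_w / n
    h_c = entropy(n_c / n)
    h_c_w = entropy(n_wc / n_w) if n_w else 0.0
    h_c_nw = entropy((n_c - n_wc) / (n - n_w)) if n - n_w else 0.0
    return h_c - p_w * h_c_w - (1 - p_w) * h_c_nw

# A word occurring in exactly the category's documents is maximally informative.
print(information_gain(4, 2, 2, 2))  # → 1.0
# A word occurring in every document tells us nothing about the category.
print(information_gain(4, 2, 4, 2))  # → 0.0
```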
The ML measures are: the Laplace measure (L), which modifies the percentage of success and takes into account the documents in which the word appears; the difference (D), which establishes a balance between the documents of category c and the documents in other categories that also contain w; and the impurity level (IL), which takes into account the number of documents of the category in which the word occurs and the distribution of the documents over the categories.

Classification

For a classifier, we chose support vector machines (SVMs) because they have shown better results than other traditional text classifiers. They perform better because they handle examples with many features well and deal well with sparse vectors. SVMs are binary classifiers that can determine linear or nonlinear threshold functions to separate the examples of documents in one category from those in other categories. Their disadvantages: they handle missing values poorly, and multiclass classification doesn't perform well.

Evaluating performance

Precision (P) quantifies the percentage of documents classified into a category that actually belong to it. Recall (R) quantifies the percentage of documents belonging to the category that are correctly classified. F1 gives the same relevance to both precision and recall.

The corpora

We used the Reuters and the Ohsumed corpora. Reuters is a set of economic news documents published by Reuters. Ohsumed is a clinically oriented MEDLINE subset consisting of 348,566 references from 270 medical journals published between 1987 and 1991.

Conclusion

The results show that our proposed measures are more dependent on certain statistical properties of the corpora, particularly the distribution of the words throughout the categories and of the documents over the categories. The ML measures exploit that dependence: at least one of these simple measures performs better than IG and TF-IDF.
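The precision, recall, and F1 definitions above translate directly into code. A minimal sketch using per-category confusion counts (names are illustrative, not from the paper):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute P, R, and F1 for one category from confusion counts.

    tp: documents correctly assigned to the category,
    fp: documents wrongly assigned to it,
    fn: category members the classifier missed.
    F1 is the harmonic mean of P and R, weighting both equally.
    """
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, a classifier that assigns 10 documents to a category, 8 of them correctly, while missing 2 true members of the category, gets P = R = F1 = 0.8.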