
Page 1

TEXT CLASSIFICATION

Using Fuzzy Self-Constructing Feature Clustering Algorithm

Under the guidance of: Mr. S.J. Prashanth, B.E., M.Tech., LMISTE, Asst. Professor, CS & E Dept. By: Chaithra K.V., 4AI08CS020

25/04/2012

Page 2

OVERVIEW
• Introduction
• Motivation & Objectives
• Feature Reduction
• Feature Clustering
• Fuzzy Feature Clustering (FFC)
• Text Classification
• An Example
• Applications
• Conclusion
• References

Page 3

INTRODUCTION
• Text Classification:
– The process of classifying documents into predefined classes.

[Diagram: documents mapped into class 1, class 2, ..., class n]

• Text Classification is also called:
– Text Categorization
– Document Classification
– Document Categorization

Page 4

Motivation and Objective

• In text classification, the dimensionality of the feature space is very high.

• The current problems with feature clustering algorithms are:
– The number of extracted features must be specified in advance.
– Variance is not considered when comparing word patterns.

• Hence the need to reduce the dimensionality and make classification run faster.

Page 5

Feature Reduction
• Purpose:
– Reduce the computational load
– Increase data consistency

• Technique:
– Eliminate redundant data
– Reduce the dimensionality of the feature set
– Find the set of vectors that best separates the patterns

• Two ways:
– Feature selection
– Feature extraction

Page 6

Feature Reduction
• Feature Selection:
– The process of selecting a subset of the relevant features.
– Improves classification accuracy by eliminating noise features from the corpus.

• Feature Extraction:
– Converts the original high-dimensional data set into a lower-dimensional representation.
– More effective than feature selection.

Page 7

Feature Clustering
• An efficient approach to feature reduction.
• Groups all features into clusters, where the features in each cluster are similar to one another.
• That is, let D be the set of all original documents with m features; we obtain D' as the set of converted documents with k features, where k < m.

Page 8

Fuzzy Feature Clustering

• Process:
– Start with a document set D of n documents d1, d2, ..., dn.
– Find the feature vector W of m words w1, w2, ..., wm.
– There are p classes c1, c2, ..., cp.
– Construct the pattern for each word in W: xi = <xi1, xi2, ..., xip>.
– Let G be a cluster containing q word patterns x1, x2, ..., xq.
– Let xj = <xj1, xj2, ..., xjp>, 1 ≤ j ≤ q.
– Find the mean and deviation of G.
– Find the fuzzy similarity of a word pattern x to cluster G, i.e., μG(x) (see the sketch below).
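A minimal sketch of these two steps, assuming (this is not on the slides) a term-count matrix as input and the product-of-Gaussians membership function of the FSFC paper for μG(x); all names and shapes are illustrative:

```python
import numpy as np

def word_patterns(counts, labels, p):
    """Build one pattern x_i = <P(c_1|w_i), ..., P(c_p|w_i)> per word.

    counts: (n, m) matrix, counts[k, i] = occurrences of word w_i in doc d_k
    labels: (n,) array, labels[k] = class index of document d_k
    """
    m = counts.shape[1]
    X = np.zeros((m, p))
    for j in range(p):
        X[:, j] = counts[labels == j].sum(axis=0)  # counts of each word in class j
    return X / np.maximum(counts.sum(axis=0), 1e-12)[:, None]

def fuzzy_similarity(x, mean, dev):
    """Membership mu_G(x) of pattern x in cluster G: one Gaussian per
    dimension, centered on the cluster mean, multiplied together."""
    return float(np.prod(np.exp(-((x - mean) / dev) ** 2)))
```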

Page 9

Fuzzy Feature Clustering

• A word pattern close to the mean of a cluster is regarded as very similar to that cluster.
• Predefine a threshold ρ, 0 ≤ ρ ≤ 1.
• Check whether μG(x) ≥ ρ. Two cases may occur:
– There is no existing fuzzy cluster on which xi has passed the similarity test: create a new cluster Gh.
– There are existing clusters on which xi has passed the test: update the existing cluster.
• Sort the patterns by their xi values.
• Perform the self-constructing algorithm (sketched below).
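A hedged sketch of that incremental loop, reusing fuzzy_similarity from the previous sketch. Recomputing each cluster's statistics from its members is a simplification (the paper maintains them with closed-form incremental updates), and σ0 keeps the deviation of a fresh one-member cluster positive:

```python
import numpy as np

def self_constructing_clustering(patterns, rho, sigma0):
    """Group word patterns into fuzzy clusters in one pass, with no preset k."""
    clusters = []
    for x in patterns:
        sims = [fuzzy_similarity(x, c["mean"], c["dev"]) for c in clusters]
        if not sims or max(sims) < rho:
            # Case 1: x passes the similarity test on no existing cluster.
            clusters.append({"members": [x], "mean": x.copy(),
                             "dev": np.full_like(x, sigma0)})
        else:
            # Case 2: add x to the most similar cluster, refresh its statistics.
            c = clusters[int(np.argmax(sims))]
            c["members"].append(x)
            pts = np.array(c["members"])
            c["mean"] = pts.mean(axis=0)
            c["dev"] = pts.std(axis=0) + sigma0
    return clusters
```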

Page 10

Fuzzy Feature Clustering
• Find the data transformation D' = DT, where T is a weighting matrix.
• Perform the weighting {hard, soft, mixed}; a sketch of all three follows the list:
– Hard: each word is allowed to belong to only one cluster, so it contributes to only one new extracted feature.
– Soft: each word is allowed to contribute to all new extracted features.
– Mixed: a combination of the hard-weighting and soft-weighting approaches.
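A sketch of the three schemes under stated assumptions: hard weighting sends a word's whole weight to its most similar cluster, soft weighting uses the fuzzy memberships directly, and mixed weighting is the linear blend γ·TH + (1 − γ)·TS (a form the numbers on Page 17 bear out). It reuses fuzzy_similarity and the cluster records from the earlier sketches:

```python
import numpy as np

def weighting_matrices(patterns, clusters, gamma):
    """Build the m x k weighting matrices T_H (hard), T_S (soft), T_M (mixed)."""
    T_S = np.array([[fuzzy_similarity(x, c["mean"], c["dev"]) for c in clusters]
                    for x in patterns])
    T_H = np.zeros_like(T_S)
    T_H[np.arange(len(patterns)), T_S.argmax(axis=1)] = 1.0  # one cluster per word
    T_M = gamma * T_H + (1 - gamma) * T_S                    # gamma user-defined
    return T_H, T_S, T_M

# Data transformation for an (n x m) document-term matrix D:
#   D_H = D @ T_H;  D_S = D @ T_S;  D_M = D @ T_M   # each is n x k
```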

Page 11

Overall Flow of Text Classification

[Flow diagram: the training set of documents goes through feature reduction, producing a training data set for each of classes 1 through p; the 1st through p-th classifiers (SVMs) are trained on them, so p classifiers are constructed. An unknown pattern passes through the same feature reduction and is fed to the p classifiers, yielding the classified documents.]
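The flow can be sketched end to end with scikit-learn. The one-binary-SVM-per-class structure follows the diagram; the linear kernel and every name here are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_classifiers(D, T, labels, p):
    """Train the 1st..p-th binary SVMs on the reduced features D' = D T."""
    D_red = D @ T
    return [LinearSVC().fit(D_red, (labels == j).astype(int)) for j in range(p)]

def classify(d, T, classifiers):
    """Reduce an unknown pattern, then pick the class whose SVM scores highest."""
    d_red = (d @ T).reshape(1, -1)
    return int(np.argmax([c.decision_function(d_red)[0] for c in classifiers]))
```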

Page 12

An Example
• Here we illustrate how the fuzzy self-constructing clustering algorithm works. Let D be a simple document set containing 9 documents d1, d2, ..., d9 of two classes c1 and c2, with 10 words in the feature vector W, as shown in Table 1. For simplicity, we denote the ten words as w1, w2, ..., w10.

Table 1: Sample Document Set

Page 13

• We calculate the ten word patterns x1, x2, ..., x10, one for each word wi, as:
xi = <xi1, xi2, ..., xip>,
i.e., xi = <P(c1|wi), P(c2|wi), ..., P(cp|wi)>.

• For example, for the above document set:
P(c2|w6) = (1×0 + 2×0 + 0×0 + 1×0 + 1×1 + 1×1 + 1×1 + 1×1 + 0×1) / (1 + 2 + 0 + 1 + 1 + 1 + 1 + 1 + 0) = 0.50.
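The arithmetic can be checked directly. Reading the counts of w6 in d1, ..., d9 off the numerator (1, 2, 0, 1, 1, 1, 1, 1, 0) and taking d5, ..., d9 as the c2 documents, a few lines reproduce the 0.50:

```python
counts_w6 = [1, 2, 0, 1, 1, 1, 1, 1, 0]  # occurrences of w6 in d1..d9
in_c2     = [0, 0, 0, 0, 1, 1, 1, 1, 1]  # 1 if the document belongs to c2

p_c2_w6 = sum(c * f for c, f in zip(counts_w6, in_c2)) / sum(counts_w6)
print(p_c2_w6)  # 0.5
```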

• The resulting word patterns are shown in Table 2. Since there are two classes involved in D, each word pattern is a two-dimensional vector.

Table 2: Word Patterns of W

Page 14

• We run the self-constructing clustering algorithm on the word patterns, setting σ0 = 0.5 (initial deviation) and ρ = 0.64 (threshold), and obtain 3 clusters G1, G2, and G3, shown in Table 3.

Table 3: Obtained clusters

• The fuzzy similarity of each word pattern to each cluster is shown in Table 4.

Table 4: Fuzzy Similarities of Word Patterns to Three Clusters

Page 15

• The weighting matrices TH, TS, and TM obtained by hard weighting, soft weighting, and mixed weighting (with γ = 0.8, a user-defined constant), respectively, are shown in Table 5.

Table 5: Weighting Matrices: Hard TH, Soft TS, and Mixed TM

Page 16

• The transformed data sets D'H, D'S, and D'M are obtained as follows:

D' = DT,

where D = [d1, d2, ..., dn]^T, D' = [d1', d2', ..., dn']^T, and

T = | t11 ... t1k |
    | t21 ... t2k |
    | ...         |
    | tm1 ... tmk |

with di = [di1 di2 ... dim], di' = [d'i1 d'i2 ... d'ik], and T a weighting matrix. These transformed data sets for the different weighting cases are shown in Table 6.

Page 17

Table 6: Transformed Data Sets: Hard D’H, Soft D’S, and Mixed D’M

• Based on D'H, D'S, or D'M, a classifier with two SVMs is built. Suppose d is an unknown document with d = <0, 1, 1, 1, 1, 1, 0, 1, 1, 1>. We first convert d to d' by d' = dT. The transformed document is obtained as d'H = dTH = <2, 4, 2>, d'S = dTS = <2.5591, 4.3478, 3.9964>, or d'M = dTM = <2.1118, 4.0696, 2.3993>. The transformed unknown document is then fed to the classifier. For this example, the classifier concludes that d belongs to c2.
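One check that can be run directly on these numbers: if the mixed matrix really is the blend γ·TH + (1 − γ)·TS with γ = 0.8, then d'M must equal 0.8·d'H + 0.2·d'S, and it does:

```python
import numpy as np

d_hard = np.array([2.0, 4.0, 2.0])
d_soft = np.array([2.5591, 4.3478, 3.9964])

print(0.8 * d_hard + 0.2 * d_soft)  # [2.11182 4.06956 2.39928] = d'M, up to rounding
```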

Page 18

Applications
• Document Organization
• Spam Filtering
• Filtering Pornographic Content
• Web Page Prediction
• Identity-Based Access & Reporting
• Mobile SMS Classification

Page 19

CONCLUSION
• The FFC algorithm is an incremental clustering approach for reducing the dimensionality of the features.
• It determines the number of extracted features automatically.
• It runs faster.
• It extracts better features than other methods.
• The word patterns in a cluster have a high degree of similarity to each other.

Page 20

References
• [1] Umarani Pappuswamy, Dumisizwe Bhembe, Pamela W. Jordan, and Kurt VanLehn, "A Supervised Clustering Method for Text Classification." Learning Research and Development Center, University of Pittsburgh, 3939 O'Hara Street, Pittsburgh, PA 15260, USA.

• [2] Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee, "Fuzzy Similarity-based Feature Clustering for Document Classification." Department of Electrical Engineering, National Sun Yat-Sen University, Taiwan. 2009 Conference on Information Technology and Applications in Outlying Islands.

• [3] Shalini Puri, "A Fuzzy Similarity Based Concept Mining Model for Text Classification." M.Tech. Student, BIT Mesra, India. (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 2, No. 11, 2011.

• [4] Jung-Yi Jiang, Ren-Jia Liou, and Shie-Jue Lee, "A Fuzzy Self-Constructing Feature Clustering Algorithm for Text Classification." IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 3, March 2011.

• [5] Antonia Kyriakopoulou, "Text Classification Aided by Clustering: a Literature Review." Athens University of Economics and Business, Greece. In Tools in Artificial Intelligence.

• [6] Thomas Lippincott and Rebecca Passonneau, "Semantic Clustering for a Functional Text Classification Task." Columbia University, Department of Computer Science, Center for Computational Learning Systems, New York.

• [7] Antonia Kyriakopoulou and Theodore Kalamboukis, "Using Clustering to Enhance Text Classification." Department of Informatics, Athens University of Economics and Business, 76 Patission St., Athens, GR 104.34. SIGIR '07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.

Page 21

Thank You!!!