
Auto-Constructing Feature Clustering Algorithm for Text Classification

Pallavi M. Deshmane, M.E. Comp. Sci., D.Y.P.I.E.T., Pimpri, Pune, India, [email protected]

Prof. S. V. Chobe, HOD of IT Dept., D.Y.P.I.E.T., Pimpri, Pune, India, [email protected]

Abstract— Feature clustering is a powerful technique for reducing the size of the feature vector in text classification. In this paper we propose text classification using a self-constructing feature clustering algorithm. The words in the feature vector are grouped into clusters automatically, and we then have one feature for each cluster. Words that are similar to each other are grouped into the same cluster; if a word is not similar to any existing cluster, a new cluster is created automatically. Each cluster is characterized by a membership function with a statistical mean and deviation. When all the words have been processed, the clusters have been created automatically.

Index Terms— Feature clustering, feature selection, feature reduction, text classification.

I. INTRODUCTION

The aim of text classification is to automatically assign a new document to one or more predefined classes based on its content. Text classification is also called text categorization, document categorization, or document classification. Two approaches are mainly used for text classification: manual classification and automatic classification. Applications of text classification include:

• Email spam filtering: a process which tries to distinguish spam email from legitimate mail.

• Helping news writers select important topics.
• Categorizing newspaper articles into topics.
• Sorting journals and abstracts into subject categories.

Clustering is one of the powerful methods for feature extraction. Word clustering is the grouping of words with a high degree of pairwise semantic relatedness into clusters, where each word cluster containing the grouped features is treated as a single feature. In this way the dimensionality of the features can be drastically reduced. The main purpose of feature reduction is to reduce the classifier's computational load and to increase data consistency. There are mainly two techniques for feature reduction: feature selection and feature extraction. Feature selection methods use techniques such as sampling to take a subset of the features, and the classifier uses only this subset instead of all the original features to perform the text classification task. Feature extraction methods convert the representation of the original documents into a new representation based on a smaller set of synthesized features. A well-known feature selection approach is based on the information gain measure, defined as the amount by which uncertainty is decreased given a piece of information. However, there are some problems associated with feature selection based methods: only a subset of the words is used for classifying the text data, so useful information may be ignored.
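As an illustration of the feature selection idea described above, the following is a minimal sketch (not part of the proposed method) that ranks words by an information-gain-style score using scikit-learn's mutual_info_classif and keeps only the k most informative words; the toy documents, labels, and the value of k are placeholders.

# Sketch: feature selection by an information-gain-style criterion (assumes scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

docs = ["cheap meds offer", "meeting agenda attached", "win cash now", "project status report"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = legitimate (placeholder data)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)                 # original m-dimensional word features

selector = SelectKBest(mutual_info_classif, k=3)   # keep only the k most informative words
X_reduced = selector.fit_transform(X, labels)

kept = [w for w, keep in zip(vectorizer.get_feature_names_out(), selector.get_support()) if keep]
print(kept)  # the subset of words the classifier would actually use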

II. LITERATURE SURVEY

Support vector machines (SVMs) are known as one of the most successful classification methods for many applications, including text classification. Even though the learning ability and computational complexity of training support vector machines may be independent of the dimension of the feature space, reducing computational complexity is an essential issue for efficiently handling a large number of terms in practical applications of text classification, and novel dimension reduction methods have been adopted to reduce the dimension of the document vectors dramatically. Decision functions exist for the centroid-based classification algorithm and for support vector classifiers. The information bottleneck approach has also been proposed, as has divisive information-theoretic feature clustering, an information-theoretic feature clustering approach that is more effective than other feature clustering methods. In these word clustering methods, each new feature is generated by combining a subset of the original words; however, difficulties are associated with these methods.
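For context, a conventional SVM text classifier of the kind referred to above can be sketched as follows. This is only an illustrative baseline (TF-IDF term vectors plus a linear SVM), not the dimension reduction or clustering methods surveyed, and the tiny corpus shown is a placeholder.

# Sketch: baseline SVM text classification over high-dimensional TF-IDF term vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_docs = ["stock markets fall", "team wins final", "election results announced", "player scores goal"]
train_labels = ["business", "sport", "politics", "sport"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())   # every term is a feature dimension
clf.fit(train_docs, train_labels)
print(clf.predict(["markets rally after results"]))   # predicted class for a new document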


Disadvantages:
1. In these feature clustering methods, each new feature is generated by combining a subset of the original words; however, difficulties are associated with these methods.
2. A word is assigned exactly to one subset, i.e., hard clustering, based on the similarity magnitudes between the word and the existing subsets, even if the differences among these magnitudes are small.

Proposed system:
1. We propose classification of text using a self-constructing feature clustering algorithm, which is an incremental clustering approach to reduce the number of words for the text classification task.
2. The words in the feature vector of a document set are represented as distributions and processed one after another.
3. Words that are similar to each other are grouped into the same cluster. Each cluster is characterized by a statistical mean and deviation.
4. If a word is not similar to any existing cluster, a new cluster is created for this word.

Advantages:
1. Text classification using the self-constructing clustering algorithm is an incremental clustering approach that reduces the dimensionality of the words in text classification.
2. The number of clusters (extracted words) is determined automatically.
3. It runs faster than other methods.
4. It produces better extracted words than other methods.

III. IMPLEMENTATION DETAILS

A. Design

Use case diagram.

Fig 1: Use case diagram

Activity diagram

Fig 2: Activity diagram

B. Modules

We divide the system into four modules.

Pre-processing

In this module we construct the word weightage pattern of the given document set. We read the document set and remove the stop words, obtain the feature vector from the given documents, and then construct the word weightage patterns. Suppose we are given a document set D of n documents d1, d2, . . . , dn, together with the feature vector W of m words w1, w2, . . . , wm and p classes c1, c2, . . . , cp; we construct one word weightage pattern for each word in W. The pre-processing flow is shown in Fig. 3.

Fig 3: Pre-processing flow
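A minimal sketch of this pre-processing step is given below. It assumes that each word's weightage pattern is its distribution over the p classes, estimated from the word's occurrence counts in the training documents; the toy documents, class labels, and stop-word list are placeholders.

# Sketch: build one word (weightage) pattern per word as a distribution over the p classes.
import numpy as np
from collections import Counter

docs = [("price of shares rises", "business"),
        ("match ends in a draw", "sport"),
        ("shares fall as match is cancelled", "business")]
classes = ["business", "sport"]
stop_words = {"of", "in", "a", "as", "is", "the"}

# count word occurrences per class, ignoring stop words
counts = {c: Counter() for c in classes}
for text, c in docs:
    for w in text.split():
        if w not in stop_words:
            counts[c][w] += 1

vocab = sorted({w for c in classes for w in counts[c]})
patterns = {}
for w in vocab:
    vec = np.array([counts[c][w] for c in classes], dtype=float)
    patterns[w] = vec / vec.sum()      # x_w = <P(c1|w), ..., P(cp|w)>

print(patterns["shares"])              # e.g. [1. 0.]: "shares" occurs only in business documents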

Automatic clustering

In this module we group the words using the automatic clustering algorithm. For each word weightage pattern, the similarity of the pattern to each existing cluster is calculated to decide whether it is combined into an existing cluster or a new cluster is created. When a new cluster is created, the corresponding membership function is initialized; on the contrary, when the word weightage pattern is combined into an existing cluster, the membership function of that cluster is updated accordingly.

Fig 4: Automatic clustering (flow: training set → input text → word patterns → self-constructing clustering → feature extraction → text classification)

Word extraction

1. Once the word weightage patterns have been grouped into clusters, the words in the feature vector W are also clustered.
2. For each cluster we have one extracted word, so with k clusters we obtain k extracted words.
3. The elements of T are derived from the obtained clusters, and feature extraction is then performed.
4. We use three weighting methods: best, better, and worst. In the worst weighting approach, each word is allowed to belong to only one cluster, and so it contributes to only one new extracted feature.
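The extraction step can be sketched as follows, under the assumption that T is an m-by-k word-to-cluster matrix and that the worst (hard) style of weighting described above is used, so each word contributes to exactly one extracted feature. The cluster assignment and the document-by-word counts shown are placeholders.

# Sketch: extract k new features from m original word features via a word-to-cluster matrix T.
import numpy as np

m, k = 5, 2                               # m original words, k clusters
cluster_of = [0, 0, 1, 1, 1]              # placeholder: cluster index of each word

T = np.zeros((m, k))
for word_idx, c in enumerate(cluster_of):
    T[word_idx, c] = 1.0                  # hard weighting: one cluster per word

D = np.array([[2, 0, 1, 0, 3],            # document-by-word count matrix (placeholder)
              [0, 1, 0, 2, 0]], dtype=float)

D_new = D @ T                             # each document now has one value per cluster
print(D_new)                              # [[2. 4.] [1. 2.]]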

Fig 5: Word extraction

Classification of text

Given a set D of training documents, text classification can be done as shown in Fig. 6.

Fig 6: Classification of text
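Since the flow in Fig. 6 is not reproduced in this transcript, the following hedged sketch indicates one plausible form of the classification step: the training documents are projected through the word-to-cluster matrix T into the k extracted features and a standard classifier is trained on the result. The use of a linear SVM and all numerical values are assumptions for illustration only.

# Sketch: classify documents in the reduced k-dimensional feature space D @ T.
import numpy as np
from sklearn.svm import LinearSVC

T = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.], [0., 1.]])   # word-to-cluster matrix (placeholder)
D_train = np.array([[2, 0, 1, 0, 3],
                    [0, 1, 0, 2, 0],
                    [3, 1, 0, 0, 1],
                    [0, 0, 2, 1, 0]], dtype=float)
y_train = ["c1", "c2", "c1", "c2"]

clf = LinearSVC()
clf.fit(D_train @ T, y_train)             # train on the extracted (cluster-level) features

d_new = np.array([[1, 2, 0, 0, 1]], dtype=float)
print(clf.predict(d_new @ T))             # class predicted from the reduced representation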

IV. OUR METHOD

Our proposed method is an incremental clustering approach. The words in the feature vector of the document set are represented as distributions and processed one after another. Initially, each word represents a cluster. Suppose we have a document set D of n documents {d1, d2, . . . , dn}, together with the feature vector W of m words {w1, w2, . . . , wm} and p classes {c1, c2, . . . , cp}. For each word a word pattern is constructed, and based on these word patterns the clusters are created. Each word pattern is the distribution of its word over the p classes, i.e., xi = <P(c1|wi), P(c2|wi), . . . , P(cp|wi)>.

It is the word pattern on which our proposed algorithm works. Principal component analysis is used to reduce each word pattern from p dimensions to two dimensions. All cluster-center coordinates should be positive and within the range of 0 to 1, since this is a fuzzy-based approach; therefore a transformation algorithm is used for this purpose, and finally we obtain the word patterns {x1, x2, . . . , xm}. Fig. 7 presents the transformation algorithm.


Fig 7: Transformation algorithm
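Because the listing in Fig. 7 is not reproduced in this transcript, the following is only a minimal sketch of the operations described in the text: the p-dimensional word patterns are projected to two dimensions with PCA and every coordinate is then rescaled into [0, 1]. Min-max rescaling is an assumption about the transformation used, and the pattern values are placeholders.

# Sketch: reduce word patterns from p dimensions to 2 and rescale coordinates into [0, 1].
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[0.7, 0.2, 0.1],            # p-dimensional word patterns (placeholder values)
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6],
              [0.6, 0.3, 0.1]])

X_2d = PCA(n_components=2).fit_transform(X)     # p dimensions -> 2 dimensions

mins, maxs = X_2d.min(axis=0), X_2d.max(axis=0)
X_scaled = (X_2d - mins) / (maxs - mins)        # all coordinates now lie in [0, 1]
print(X_scaled)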

Once the word patterns are constructed, we use the clustering algorithm to group the words into clusters, taking the word patterns as input.
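The original listing of the clustering algorithm did not survive in this transcript, so the following is a hedged sketch of a self-constructing clustering pass of the kind described: each word pattern is compared with the existing clusters through a Gaussian-like membership built from the cluster mean and deviation, and it either joins the most similar cluster (whose mean and deviation are then updated) or starts a new one. The similarity threshold rho and the initial deviation sigma0 are assumed parameters, not values from the paper.

# Sketch: one pass of similarity-based self-constructing clustering over word patterns.
import numpy as np

def membership(x, mean, dev):
    # Gaussian-like similarity of pattern x to a cluster with the given mean and deviation
    return float(np.exp(-np.sum(((x - mean) / dev) ** 2)))

def self_constructing_clustering(patterns, rho=0.5, sigma0=0.25):
    clusters = []                                        # each cluster: dict(members, mean, dev)
    for x in patterns:
        sims = [membership(x, c["mean"], c["dev"]) for c in clusters]
        if not clusters or max(sims) < rho:
            clusters.append({"members": [x], "mean": x.copy(),
                             "dev": np.full_like(x, sigma0)})   # not similar enough: new cluster
        else:
            c = clusters[int(np.argmax(sims))]           # join the most similar existing cluster
            c["members"].append(x)
            pts = np.vstack(c["members"])
            c["mean"] = pts.mean(axis=0)                 # update statistical mean
            c["dev"] = pts.std(axis=0) + sigma0          # update deviation (kept positive)
    return clusters

patterns = [np.array([0.9, 0.1]), np.array([0.85, 0.15]), np.array([0.1, 0.9])]
print(len(self_constructing_clustering(patterns)))       # expected: 2 clusters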

V. RESULTS

In this section we present experimental results to show the effectiveness of our self-constructing clustering algorithm. For this purpose we use three well-known data sets.

A. Reuters corpus volume 1

The RCV1 data set consists of 808,120 news stories produced by Reuters.

As shown in Fig. 7, the x-axis indicates the class number and the y-axis indicates the number of stories in each class.

Fig 7: Class distribution of the RCV1 data set

Fig. 8 shows the execution times of the word reduction methods on the RCV1 data.

Fig 8: Execution time of all methods on RCV1 data

B. Newsgroup data set

The Newsgroup data set contains more than 20,000 articles, evenly distributed over 20 classes. Each class contains about 1,000 articles, as shown in Fig. 9.

Fig 9: Class distribution of news group data set

Fig 10 shows the execution time of other methods on newsgroup data.


Fig 10: Execution time of other methods on the Newsgroup data

C. Cade 12

The Cade 12 data set contains a set of web pages extracted from a web directory. The web pages are classified into 12 classes, as shown in Fig. 11.

Fig 11: Class distribution of the Cade 12 data set

Fig. 12 shows the execution times of the other methods on the Cade 12 data.

Fig 12: Execution time of other methods on Cade 12

VI. CONCLUSION

The proposed method is new in the text classification field. It uses the good optimization performance of support vector machines to improve classification performance. Automatic clustering is one of the methods that has been developed in machine learning research. In this paper we apply this clustering technique to text categorization problems, and we found that when a document set is transformed into a collection of word patterns, the relevance among word patterns can be measured and the word patterns can be grouped by applying a similarity-based clustering algorithm. This method is well suited to text categorization problems due to the suitability of the distributional word clustering concept.

Many feature clustering methods for text classification have been presented by various researchers. However, those methods have unsolved limitations: each new feature is generated by combining a subset of the original words, the mean and variance of the clusters are not considered when the similarity with respect to a cluster is computed, and the number of features must be specified in advance by the user.

Future scope

This clustering method has been applied to solve text classification problems. The technique can also be applied to other problems such as web mining, image segmentation, data sampling, and fuzzy modeling.

ACKNOWLEDGMENT

I have put considerable effort into this project; however, it would not have been possible without the kind support and help of many individuals and organizations, and I would like to extend my sincere thanks to all of them. I am highly indebted to Prof. S. V. Chobe for his guidance and constant supervision, for providing the necessary information regarding the project, and for his support. I would also like to express my gratitude towards my parents and the members of the D.Y.P.I.E.T Computer Department for their kind cooperation and encouragement.

REFERENCES

[1] H. Kim, P. Howland, and H. Park, "Dimension Reduction in Text Classification with Support Vector Machines," J. Machine Learning Research, vol. 6, pp. 37-53, 2005.

[2] D.D. Lewis, “Feature Selection and Feature Extraction for Text Categorization,” Proc. Workshop Speech and Natural Language, pp. 212-217, 1992.

[3] F. Pereira, N. Tishby, and L. Lee, "Distributional Clustering of English Words," Proc. 31st Ann. Meeting of ACL, pp. 183-190, 1993.


[4] L.D. Baker and A. McCallum, "Distributional Clustering of Words for Text Classification," Proc. ACM SIGIR, pp. 96-103, 1998.

[5] I.S. Dhillon, S. Mallela, and R. Kumar, "A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification," J. Machine Learning Research, vol. 3, pp. 1265-1287, 2003.

[6] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.