[IEEE Proceedings of the Canadian Conference on Electrical and Computer Engineering, CCECE-94, Halifax]



Generating Document Clusters Using Thesauri and Neural Networks

Jennifer Farkas
Industry Canada
Centre for Information Technology Innovation (CITI)
1575 Chomedey Boulevard, Laval, Québec, H7V 2X2
e-mail: jfarkas@citi.doc.ca

June 30, 1994

Abstract

In this paper, we describe the use of thesauri and neural networks for the classification of lexically similar natural language documents. We discuss the effect of extending the usual keyword representation of documents to a weighted, thesaurally based representation using relations among keywords. We present some experimental results obtained from a neural network prototype that uses Kohonen’s self-organizing map paradigm (cf. [5]).

1 Introduction

Automatic document classification by means of clustering is an active and challenging field of research. Recently, researchers have begun to investigate to what extent the pattern recognition power of neural networks can be exploited for this purpose [2, 3, 11]. In [11], for example, a recurrent connectionist network is used to classify library books based on input of their titles. In [2, 3], it is shown that the back-propagation learning algorithm can also be used to build a full-text document-classifying neural network. Nevertheless, most applications of this technology to information processing have, until now, been concerned with problems related to information retrieval. In [6], Kwok, for example, built a neural network for probabilistic information retrieval to facilitate the processing of large collections of textual material. Its aim was to satisfy user queries by retrieving only query-relevant documents from these collections. In [7], Lin et al. used Kohonen’s feature map to construct a self-organizing semantic map for information retrieval. Others, such as MacLeod [8] and Deogun et al. [1], have attempted to break classification bottlenecks by using clustering algorithms to achieve a semantic grouping of documents.

In this paper, we go beyond the retrieval problem and describe a system that facilitates document retrieval through structured storage. In particular, we sketch the main features of a prototype of an automatic document classification system, called ProFile, that is capable of classifying full-text documents relative to a targeted document clustering on the basis of a controlled domain-specific vocabulary and thesaural relations. Retrieval is managed and controlled by mathematical similarity measures that provide a metric structure for the underlying document space (cf. [4]) and quantify the semantic overlap of different documents.

ProFile was developed using three fundamental components:

1. An ISO-type thesaurus constructed from semantic relationships between keywords to provide a context for term phrases.


2. The Kohonen self-organizing learning map to find appropriate relationships among input patterns.


3. The Euclidean distance measure to determine document similarity.

2 The Thesaurus

The thesaurus underlying the ProFile prototype is based on a structure in which each term is related to a set of broader terms, a set of narrower terms, and a set of related terms in accordance with the semantic content of the document space. The thesaurus thus approximates a semantic interpretation of the term phrases encountered in the document space. The motivation for the specific structure of the thesaurus used in the development of ProFile is the same as the purpose of thesauri in general, viz., to provide “a grouping, or classification, of terms used in a given topic area into categories known as thesaurus classes. As in the manual indexing case, thesauri can be used for language normalization purposes in order to replace an uncontrolled vocabulary by the controlled thesaurus category identifiers. When hierarchical relationships are supplied for the entries in a thesaurus in the form of broader or narrower terms, the indexing vocabulary can be expanded in various directions by adding these broader or narrower terms, or certain related terms, as may be the case.” (Salton [10])

The document space from which ProFile was developed consists of approximately 800 scientific abstracts taken from four different domains of the Inspec database. The documents were examined for their semantic content, and 92 characteristic terms were identified as preferred terms, and 122 as non-preferred terms. Using these terms, a thesaurus with broad terms (BT), narrow terms (NT), related terms (RT) and used-for terms (UF) was constructed. Relations such as USE and UF are used to standardize the keyword vocabulary. They define appropriate synonymy relations for ProFile.
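To make this structure concrete, the following sketch shows one way to encode the BT/NT/RT/UF relations and the USE/UF normalization. The paper gives no code, so this is an illustrative reading only; the ThesaurusEntry class, the sample terms, and the normalize helper are our assumptions, not taken from the ProFile implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ThesaurusEntry:
    """One preferred term with its ISO-style thesaural relations."""
    term: str
    broader: set = field(default_factory=set)   # BT relations
    narrower: set = field(default_factory=set)  # NT relations
    related: set = field(default_factory=set)   # RT relations
    used_for: set = field(default_factory=set)  # UF: non-preferred synonyms

# A two-entry fragment; the actual thesaurus holds 92 preferred terms.
thesaurus = {
    "neural networks": ThesaurusEntry(
        term="neural networks",
        broader={"machine learning"},
        narrower={"self-organizing maps"},
        related={"pattern recognition"},
        used_for={"connectionist models"},  # mapped back via USE
    ),
    "self-organizing maps": ThesaurusEntry(
        term="self-organizing maps",
        broader={"neural networks"},
        related={"clustering"},
    ),
}

def normalize(term: str) -> str:
    """Replace a non-preferred term by its preferred form (the USE relation)."""
    for entry in thesaurus.values():
        if term in entry.used_for:
            return entry.term
    return term
```

The USE/UF pair thus standardizes the keyword vocabulary before any vectors are built.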

3 The Architecture

Using the relational structure on terms provided by the thesaurus, we introduced a weighting scheme for terms and interpreted all documents as numerical vectors suitable for neural network processing. The default weights assigned to the broad terms, narrow terms and related terms were chosen to be 0.5, 0.3, and 0.2 respectively. Contrary to the usual indexing practices, we assigned greater weight to the broad terms than to the more specific narrow terms, since our objective was document clustering rather than document retrieval.

3.1 The Vector Representation

We began by indexing all terms of the thesaurus. This resulted in a list of terms t1, ..., t92. We then used this list to associate with each document D a vector v(D) with 92 entries. We think of v(D) as the sum of the (numerical) vectors representing the thesaural terms in the document. The entries of v(D) express numerically the value of a thesaural term in the document. For each thesaural term t in D, the term vector v(t) is computed in three steps: The entries of v(t) are all 0, except for the entry corresponding to the term t. That entry has as its value the frequency #(t) of occurrences of t in D.

Next we used the thesaural relationships among terms to expand the vector v(t) to a vector rel(v(t)). We identified the broad terms, narrow terms and related terms of a term t in the thesaurus and computed their weights by multiplying #(t) by the default weights for these terms, i.e., the frequencies of the broad terms are multiplied by 0.5, those of the narrow terms by 0.3, and those of the related terms by 0.2. We replaced the zero entries corresponding to these terms in the vector v(t) by these values.

Finally we added the vectors rel(v(t)) determined by all thesaural terms t in D and normalized the vector using the Euclidean norm. We took the resulting vector v(D) to represent the document D.
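The three steps above can be summarized in a short sketch. This is our illustrative reading of the construction, not the ProFile code; it assumes numpy, the ThesaurusEntry structure from the earlier listing, and the paper’s default weights of 0.5, 0.3, and 0.2.

```python
import numpy as np

W_BROAD, W_NARROW, W_RELATED = 0.5, 0.3, 0.2  # default thesaural weights

def document_vector(term_freqs, term_index, thesaurus):
    """term_freqs: {preferred term: frequency #(t) in document D}.
    term_index: {preferred term: position in the 92-entry vector}."""
    dim = len(term_index)
    v = np.zeros(dim)
    for t, freq in term_freqs.items():
        rel_vt = np.zeros(dim)
        rel_vt[term_index[t]] = freq                 # step 1: raw frequency entry
        entry = thesaurus[t]
        # step 2: expand the zero entries via thesaural relations
        for b in entry.broader:
            if b in term_index:
                rel_vt[term_index[b]] = W_BROAD * freq
        for n in entry.narrower:
            if n in term_index:
                rel_vt[term_index[n]] = W_NARROW * freq
        for r in entry.related:
            if r in term_index:
                rel_vt[term_index[r]] = W_RELATED * freq
        v += rel_vt                                  # step 3: sum over all terms in D
    norm = np.linalg.norm(v)                         # Euclidean normalization
    return v / norm if norm > 0 else v
```

The resulting unit vector v(D) is what the network sees; two documents sharing thesaurally related (not just identical) keywords therefore have nonzero overlap.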

3.2 The Kohonen Map

We designed a neural network that was able to cluster these representation vectors. We used the two-dimensional grids in the Kohonen layer of the self-organizing map and examined the degree to which an acceptable clustering could be achieved. Our work shows that the self-organizing learning map finds appropriate relationships among input patterns, with the result that each document is actually assigned to one of the neurons in the Kohonen layer. In this paradigm, documents are presented sequentially to the input layer, and learning is achieved by finding the best-matching neurons in the Kohonen layer. The pattern relationships and pattern clustering, i.e., the semantic similarities of documents, can be read off the Kohonen layer. They provide the basis for the clustering of the documents by the network.
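A minimal sketch of such a training loop follows. The paper does not report its learning-rate or neighborhood parameters, so the schedules here are assumptions chosen only to illustrate the paradigm: sequential presentation, best-matching-neuron search, and neighborhood updates on a two-dimensional grid.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(docs, grid=4, dim=92, cycles=4000, lr0=0.5, radius0=2.0):
    """docs: array of normalized 92-dimensional document vectors."""
    weights = rng.random((grid * grid, dim))          # one weight vector per neuron
    coords = np.array([(i, j) for i in range(grid) for j in range(grid)])
    for c in range(cycles):
        frac = c / cycles
        lr = lr0 * (1 - frac)                         # decaying learning rate
        radius = max(radius0 * (1 - frac), 0.5)       # shrinking neighborhood
        x = docs[rng.integers(len(docs))]             # present one document
        bmu = int(np.argmin(np.linalg.norm(weights - x, axis=1)))  # best match
        # pull the BMU and its grid neighbors toward the input,
        # scaled by a Gaussian of the grid distance
        d = np.linalg.norm(coords - coords[bmu], axis=1)
        h = np.exp(-(d ** 2) / (2 * radius ** 2))
        weights += lr * h[:, None] * (x - weights)
    return weights

def assign(docs, weights):
    """Read the clustering off the Kohonen layer: each document goes
    to its best-matching neuron."""
    return [int(np.argmin(np.linalg.norm(weights - x, axis=1))) for x in docs]
```

After training, assign() yields the neuron (cluster) index for each document, which is how the class labels C0-C15 in the tables below arise for a 4 x 4 grid.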


3.3 Performance Measures

We examined the result of using both 9 (a 3 x 3 grid) and 16 (a 4 x 4 grid) neurons. The input pattern to the Kohonen feature map consisted of the 92 values corresponding to the number of preferred terms in the thesaurus. As pointed out above, the Kohonen self-organizing learning map found the relationships between the input patterns and assigned each document to one of the neurons in the Kohonen layer. The 16-element network turned out to yield results considerably closer to our expected classification scheme than the 9-element network. We therefore chose the 16-element classification scheme for the development of ProFile.

3.3.1 Training Accuracy

In order to assess the performance of the network at different training stages, we used two measures: the average distance of a document from the centroid of its (similarity) class, and the average distance between (the centroids of) different classes. The average distance of a document from the centroid of its class gives us a good idea of the type of clustering within a class achieved by the network. The average distance between classes provides a measure of the level of document separation achieved by the network. The goal is to train the network until a level of clustering has been achieved at which the clusters remain unchanged by further training, and where the distances between the documents in a cluster remain constant.
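These two measures might be computed as in the following sketch; the function and variable names are illustrative, not taken from ProFile.

```python
import numpy as np

def centroid_measures(docs, labels):
    """docs: array of document vectors; labels: cluster index per document.
    Returns (per-class average document-to-centroid distance,
             average pairwise distance between class centroids)."""
    classes = sorted(set(labels))
    centroids = {c: np.mean([d for d, l in zip(docs, labels) if l == c], axis=0)
                 for c in classes}
    # within-cluster tightness: mean distance of members to their centroid
    doc_dist = {c: float(np.mean([np.linalg.norm(d - centroids[c])
                                  for d, l in zip(docs, labels) if l == c]))
                for c in classes}
    # cluster separation: mean distance over all centroid pairs
    pairs = [(a, b) for i, a in enumerate(classes) for b in classes[i + 1:]]
    sep = float(np.mean([np.linalg.norm(centroids[a] - centroids[b])
                         for a, b in pairs]))
    return doc_dist, sep
```

Training stops when both quantities stabilize, i.e., further cycles leave the clusters and within-cluster distances unchanged.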

The following tables show the clustering aggregates before training, after 2,000 training cycles, and after 4,000 training cycles. Table 1 shows the initial random distribution of the training documents into the 16 classes C0-C15. Table 2 shows the average distance of the documents from the centroids of their respective classes. Table 3 shows the average distances between centroids of the different classes:

Table 1: Document Distribution

C0: 0  C1: 142  C2: 2  C3: 2  C4: 139  C5: 157  C6: 57  C7: 20  C8: 2  C9: 1  C10: 16  C11: 119  C12: 0  C13: 13  C14: 3  C15: 90

Table 2: Documents to Centroids

C0: -  C1: 0.5728  C2: 0.2953  C3: 0.5788  C4: 0.6595  C5: 0.7467  C6: 0.6857  C7: 0.4615  C8: 0.7071  C9: 0.0000  C10: 0.7406  C11: 0.6673  C12: -  C13: 0.7910  C14: 0.6285  C15: 0.4089

Table 3: Centroids to Centroids

C0: 9.2028  C1: 2.1828  C2: 2.2383  C3: 2.2348  C4: 2.0613  C5: 2.0989  C6: 2.1046  C7: 2.1578  C8: 2.1491  C9: 2.3526  C10: 2.0947  C11: 1.9445  C12: 9.2028  C13: 2.0462  C14: 2.1590  C15: 2.2358

After 2,000 training cycles, ProFile had clustered the documents as follows:

Table 4: Document Distribution

C0: 114  C1: 4  C2: 28  C3: 104  C4: 36  C5: 47  C6: 10  C7: 46  C8: 7  C9: 60  C10: 25  C11: 15  C12: 128  C13: 10  C14: 18  C15: 111

Table 5: Documents to Centroids

C0: 0.4367  C1: 0.5973  C2: 0.5763  C3: 0.1615  C4: 0.5149  C5: 0.7342  C6: 0.7452  C7: 0.5603  C8: 0.7323  C9: 0.3957  C10: 0.8964  C11: 0.7251  C12: 0.6094  C13: 0.6569  C14: 0.7638  C15: 0.3558

Table 6: Centroids to Centroids

C0: 0.9878  C1: 0.8959  C2: 0.8829  C3: 1.0237  C4: 0.9396  C5: 0.8569  C6: 0.8090  C7: 0.8926  C8: 0.8830  C9: 0.7463  C10: 0.7851  C11: 0.8624  C12: 1.0256  C13: 0.9779  C14: 0.8642  C15: 1.0707

After 4,000 training cycles, ProFile had reached a stable state in its training, since further training cycles produced the same clustering. The following tables show the final result:

Table 7: Document Distribution

C0: 150  C1: 44  C2: 29  C3: 129  C4: 6  C5: 9  C6: 18  C7: 26  C8: 5  C9: 60  C10: 27  C11: 7  C12: 108  C13: 22  C14: 14  C15: 109


Table 8: Documents to Centroids

C0: 0.5186  C1: 0.7274  C2: 0.6071  C3: 0.5492  C4: 0.6974  C5: 0.6471  C6: 0.2843  C7: 0.8389  C8: 0.7279  C9: 0.4838  C10: 0.4110  C11: 0.8916  C12: 0.5711  C13: 0.7824  C14: 0.6353  C15: 0.3359

Table 9: Centroids to Centroids

C0: 0.9902  C1: 0.9009  C2: [illegible]  C3: 1.0171  C4: 0.8865  C5: 0.8721  C6: 1.0105  C7: 0.8278  C8: 0.8742  C9: 0.9384  C10: 0.7371  C11: 0.7751  C12: 0.9113  C13: 0.8599  C14: 0.9238  C15: 1.0561

4 Conclusion

The system described in this paper contributes to our understanding of the extent to which a neural network can be expected to produce document clusters whose elements correspond to a measurable degree to our concept of semantic similarity.

The next step in the development of ProFile will be to validate the described technology in an operational environment in a specific department of the Government of Canada. In particular, the classification achieved by ProFile will have to take into account the hierarchical structure of most document spaces. Thus we intend to enrich the system in various ways in order to achieve semantically acceptable hierarchical partitions of real document spaces.

References

[1] J. S. Deogun, S. K. Bhatia and V. V. Raghavan, “Automatic Cluster Assignment for Documents,” in Proceedings of the 7th IEEE Conference on Artificial Intelligence Applications, Miami Beach, Florida, February 24-28, 1991, pp. 25-28.

[2] J. Farkas, “Neural Networks and Document Classification,” in Proceedings of the 1993 Canadian Conference on Electrical and Computer Engineering, vol. 1, Vancouver, Canada, September 14-17, 1993, pp. 1-4.

[3] J. Farkas, “Documents, Concepts and Neural Networks,” in Proceedings of the CASCON ’93 Conference, vol. II, Toronto, Ontario, October 24-28, 1993, pp. 1021-1031.

[4] J. Farkas, “Mathematical Search Techniques for Intelligent Document Management,” in Proceedings of the 1992 Canadian Conference on Electrical and Computer Engineering, vol. II, Toronto, Ontario, September 13-16, 1992, pp. TA9-1.18.1 - TA9-1.18.4.

[5] T. Kohonen, “The Self-Organizing Map,” in Neural Networks: Theoretical Foundations and Analysis (Clifford Lau, editor), IEEE Press, New York, 1992, pp. 74-90.

[6] K. L. Kwok, “A Neural Network for Probabilistic Information Retrieval,” in Proceedings of the 12th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, 1989, pp. 21-31.

[7] X. Lin, D. Soergel and G. Marchionini, “A Self-Organizing Map for Information Retrieval,” in SIGIR ’91, Proceedings of the 14th Annual International Conference on Research and Development in Information Retrieval, Chicago, Illinois, October 1991, pp. 262-269.

[8] K. J. MacLeod, “A Neural Algorithm for Document Clustering,” Information Processing and Management, vol. 27, pp. 337-346, 1991.

[9] G. Salton, Automatic Text Processing, Addison-Wesley Publishing Co., Reading, MA, 1989.

[10] G. Salton, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.

[11] S. Wermter, “Learning to Classify Natural Language Titles in a Recurrent Connectionist Model,” in Artificial Neural Networks: Proceedings of the 1991 International Conference on Artificial Neural Networks (ICANN-91), vol. 2, North-Holland, Amsterdam, 1991, pp. 1715-1718.