

Bar Ilan University

The Department of

Computer Science

Keyword based Text Categorization

by

Libby Barak

Submitted in partial fulfillment of the requirements for the Master's Degree in the Department of Computer Science, Bar Ilan University.

Ramat Gan, Israel June 2008, Sivan 5768


This work was carried out under the supervision of Dr. Ido Dagan,

Department of Computer Science,

Bar-Ilan University.


Acknowledgements

I would like to take this opportunity to thank the people whose joint efforts assisted

me in writing this thesis.

First and foremost, my greatest thanks go to Dr. Ido Dagan for introducing me to the

wonderful world of Natural Language Processing, and for supervising this research.

His constant support, thorough guidance, and great patience enabled this work.

My gratitude goes also to all my NLP lab members for sharing with me their time and

moral support. I especially want to express my appreciation to Idan Szpektor, Roy

Bar-Haim and Shachar Mirkin, for sharing with me their words of wisdom,

experience and advice when needed.

I would like to thank Michael Gutkin and Eyal Shnarch for their assistance beyond the

research processes throughout various academic tasks required along the way.

I am grateful to our Italian colleagues, Alfio Gliozzo and Carlo Strapparava from ITC-Irst, for setting up the groundwork for this research. I wish to thank them for their help in acquiring data structures and results, and for their assistance in implementing

some of the methods.

I want to thank my parents and my brothers for encouraging me to pursue my

academic goals and dreams, and for giving me the special kind of support only family

can provide. I would also like to thank my husband, Oren, for his unique support,

understanding and faith in me, which encouraged me greatly throughout this work.

This thesis was partly supported by the Negev Consortium (www.negev-initiative.org), funded by the Israeli Ministry of Industry, Trade and Labor.

Page 4: Keyword based Text Categorizationu.cs.biu.ac.il/~nlp/wp-content/uploads/libby-thesis.pdf · This thesis investigates Keyword-based Text Categorization (TC) using only a topical taxonomy

4

Table of Contents

Table of Contents
List of tables and figures
Abstract
1. Introduction
2. Background
    2.1. Supervised text categorization
    2.2. Keyword based TC
    2.3. Categorization based on category name
    2.4. Lexical entailment
        2.4.1. Lexical Entailment Resources
3. Text Categorization based on category name
    3.1. Research goals
    3.2. Categorization tasks
    3.3. Scoring methods
        3.3.1. Vector Space Model based on category seed terms
        3.3.2. Entailment expansion methods
        3.3.3. Context Similarity methods
        3.3.4. Gaussian Mixtures model
        3.3.5. Combination of knowledge and context
    3.4. Binary classification methods
4. Evaluation
    4.1. Data set and pre-processing
        4.1.1. Experimental settings
    4.2. Ranking
        4.2.1. Ranking measure
        4.2.2. Ranking results
        4.2.3. Analysis
    4.3. Classification
        4.3.1. Classification measure
        4.3.2. Classification results
        4.3.3. Analysis
    4.4. Reuters-10 results
5. Conclusion and future work
References
Appendix A – Latent Semantic Analysis
Appendix B – Gaussian Mixtures
Appendix C – Support Vector Machines
Abstract (Hebrew)

Page 6: Keyword based Text Categorizationu.cs.biu.ac.il/~nlp/wp-content/uploads/libby-thesis.pdf · This thesis investigates Keyword-based Text Categorization (TC) using only a topical taxonomy

6

List of tables and figures

List of tables:

Table 1 - WordNet synsets
Table 2 - Initial seeds for the 20 Newsgroups collection
Table 3 - Initial seeds for the Reuters-10 collection
Table 4 - MAP values for the 20 Newsgroups collection
Table 5 - Document samples for the passing reference phenomenon
Table 6 - Document samples for missing annotations
Table 7 - Micro average classification results for the 20 Newsgroups collection
Table 8 - Classification results per category for the 20 Newsgroups collection
Table 9 - Confusion matrix for the Sim_combined score based method
Table 10 - Confusion matrix for the Sim_context score based method
Table 11 - MAP values for the Reuters-10 collection
Table 12 - Micro average classification results for the Reuters-10 collection
Table 13 - Classification results per category for the Reuters-10 collection

List of figures:

Figure 1 - Recall-Precision curves for overall baselines
Figure 2 - Recall-Precision curves for entailment baselines
Figure 3 - Recall-Precision curves for specific categories
Figure 4 - Context Scoring influence


Abstract

This thesis investigates Keyword-based Text Categorization (TC) using only a topical taxonomy as input. The TC task is mostly approached via supervised or semi-supervised methods. Supervised TC requires extensive manual labor to annotate text samples as training data. Although several legacy categorization systems have already acquired labeled text samples, this solution is not feasible for most current systems. New taxonomies, new TC collections which require classification, and the rapid growth of unlabeled text documents are only some of the reasons to seek a more automated TC method.

Keyword-based semi-supervised TC methods have made the first step towards

a more automated TC framework. These methods recognize the potential of the vast amount of unlabeled data currently available for various domains and applications. Their basic idea is to represent each category by a set of characteristic keywords, which should capture the meaning of the category topic, and to define a similarity measure between texts and categories. The supervised aspect of these methods lies in the

specification of the characteristic keywords instead of manual classification of a large

amount of documents. This step is considered to require less work than the one

required by the fully supervised methods. Nevertheless, it still requires specific

manual annotation for each category, which requires certain expertise. Therefore, new

taxonomies require specific manual effort by domain experts once more.

Our research is based on a new approach, first proposed in (Gliozzo et al.,

2005), which does not require specific manual annotation for each category. This

research relies on the assumption, used also in previous works, that the category name

itself should be highly informative for the TC goal. Each category name is selected by

domain experts to represent most accurately the category topic. It, therefore,

encompasses useful information for the TC purpose. To obtain a set of characteristic terms for each category, the method automatically expands the initial category name. In (Gliozzo et al., 2005) the expansion is based on co-

occurrence information extracted from the TC collection used. Using Latent Semantic

Analysis (LSA) and a standard similarity measure they obtain an initial set of

automatically labeled documents, which are then used to train a supervised classifier, thereby acquiring the final classification.


Basing the similarity measure on co-occurrence data has several

disadvantages. First and foremost, co-occurrence data does not capture the exact semantic relation needed to support a classification decision. Co-occurrence data

typically models the broader context of the text and not the specific topic it discusses.

High similarity according to co-occurrence data indicates that the text is relevant to the category topic in a general contextual sense. It does not ensure that the topic itself is

mentioned in the text. For example, a text which discusses certain computer software

is relevant to the general computing context; however, it may not be directly related to any specific computing subtopic.

In this research we offer a novel taxonomy based approach for keyword-based

TC, which bases its similarity measure on a Lexical Entailment (LE) measure instead

of a context measure only. LE defines a more accurate semantic relation, which aims

to identify whether the meaning of a certain text is referenced by another text. This

measure aims at a more appropriate relation to base the TC assumption on, since it

requires the actual reference to the category topic in the text, rather than general

context similarity.

In order to identify whether the topic is addressed by the text as the main topic

and not as one of the text's minor topics, we integrate the context model into our overall framework. Once a reference to the category topic, i.e. entailment evidence, is recognized in a text, we also measure the text's context similarity to the category topic.

Using this novel integrated framework we achieve a complementary semantic

measure which quantifies the topics mentioned and the contextual relevancy at the

same time.

We utilize two preliminary resources for the LE methods. The first LE

knowledge base used is based on the WordNet (Fellbaum, 1998) semantic relation

ontology. This resource enables us to extract semantic relations from a dictionary

style knowledge base. It supplies necessary morphological variations and useful

entailing terms. As a complementary resource, we utilize a Wikipedia LE knowledge

base. This encyclopedia oriented knowledge base supplies us with entity names,

commercial products and general knowledge terms. The two resources are

complementary by nature, and as expected they contribute to different types of

categories and relations addressed in this research.

Our context based method is based on the co-occurrence based method used in

(Gliozzo et al., 2005). We utilize a Latent Semantic Analysis (LSA) method to


represent the context similarity of documents and categories. LSA is a dimensionality

reduction method which maps similar terms, by means of co-occurrence data, to a

lower dimensional space in which terms and documents are represented by

"concepts". Those "concepts" aims to capture the context similarity of the data. LSA

has the advantage of modeling both first-order and second-order similarity, thereby offering a powerful context-similarity measure. It measures not only the likelihood of terms to appear in the same document, as standard co-occurrence based methods do, but also captures the relatedness of terms through their joint mapping to the same LSA "concepts".

We applied the similarity measure described above for two TC goals. The first

is ranking of documents for each category according to the similarity score, and the

second is binary classification of documents into one or more categories. We compared our results to those of (Gliozzo et al., 2005), and also compared the component methods implemented as part of our full system. Ranking evaluation

enables the analysis of the accuracy gained for each category, since it creates a

separate list for each category. It also enables a more concrete analysis of the

methods' precision since it inspects the relative scoring of the documents inside the

category. Classification, on the other hand, enables the analysis of the comparative

scores obtained for a document per each category, and analysis of the inter-category

relations. It also enables the analysis of false negative misclassifications, thereby pointing to necessary improvements.

Positive empirical results are presented for our complete method. It indeed achieves higher precision, which supports the hypothesis that the LE based approach is more accurate than the context based approach. The results are

accompanied by comprehensive analysis of the expansion types and various

mechanisms needed to further improve the results.


1. Introduction

Text categorization (TC) is the task of classifying textual documents into preset

(topic) categories. The majority of research works in the TC area use supervised methods, which rely on a vast, sometimes prohibitive, amount of labeled training data

to achieve accurate classification. The lack of sufficient available resources of labeled

training data for those kinds of tasks requires manual annotation of textual data, which

is often a long and very expensive task. On the other hand, there are large available

resources of unlabeled training data which can aid TC tasks. In recent years there have been several research efforts to base methods on unlabeled data rather than resorting to manual annotation; one group of these methods is Keyword-based Text Categorization.

Keyword-based topical TC relies on keyword representation of categories and

documents, which requires only manual specification of the keyword representation

for the topical categories. The documents can be naturally represented by a processed

collection of the keywords contained in them. However, since the quality of the

classification is highly dependent on the accuracy of the keywords representing the topical category, it may require careful manual specification of those keywords.

Methods such as (Liu et al. 2004) tried to make the manual specification process more

efficient by partially automating it using a clustering method that creates a candidate

list of representing keywords for each category. Nevertheless, the method still

requires manual specification as part of the classification process.

To acquire a fully unsupervised set of the categories' characteristic keywords the

TC algorithm should use unsupervised methods to find relevant keywords in

unlabeled data. Therefore, the available input for such a method is

the names of the topical categories and unlabeled data alone. (Ko and Seo, 2004;

Gliozzo et al., 2005) recognized the category names as a significant part of the input

since they should capture the meaning of the topic. For that reason the category names

could be used as a seed for the collection of category keywords in unsupervised TC

methods.

Collecting category-representing keywords by using the category name as seed

relies on context models which use co-occurrence information to extract related

keywords from the text. (Ko and Seo, 2004) used a co-occurrence metric to employ a

feature-projection method in order to extract keywords for each category. Another


context model, Latent Semantic Analysis (LSA), which also relies on co-occurrence

data, was used in (Gliozzo et al. 2005) to measure similarity based on LSA topics.

However, basing the classification decisions on co-occurrence data has a major

drawback of relying on a weaker semantic relation than needed for this type of task.

Term co-occurrence only indicates that words tend to appear together and are therefore likely to be related to the same context. LSA captures a stronger

aspect of co-occurrence, since it captures both tendencies of terms to appear together

in the same document and to appear in similar contexts of other words. However it

still captures only the co-occurrence similarity and not necessarily a reference to the

category topic. For that reason, documents which share co-occurrence data with a category's characteristic terms might be topically related to the category, but do not necessarily discuss the category's topic specifically.

In this thesis we propose to base the classification method on a different lexical

relation, namely Lexical Entailment inference (LE). Lexical Entailment inference

(Glickman et al., 2006) aims to define a more concrete criterion for word similarity than the one defined by distributional similarity, enabling the deduction of whether a certain textual meaning can be inferred from a specific text. Generally speaking, a word w is

lexically referenced by a text t if there is an implicit or implied reference from a set of

words in t to a possible meaning of w. For instance, the word "Mercedes" entails the

word "car", since instead of "Danny drives his Mercedes", one may say "Danny has a

car". Entailment models should be helpful for the text categorization process in

finding texts for a specific category, which do not contain the exact keyword from the

initial characteristic terms. Texts about cars might contain only words like automobile, vehicle, or even car manufacturers, and not the exact term "car".

Lexical entailment can help us enrich the seed keyword of a category name with

entailing words, as shown above.

We propose a TC method which uses only the category names as seeds for an

expansion step with virtually no manual processing requirement during the

classification process. The method relies on external resources to acquire knowledge

needed for the TC, instead of requiring manual analysis per category. Our method

relies on LE expansions which are integrated with co-occurrence data. The

combination of LE, which indicates that the topic is indeed referred to in the

document, with a context model, which indicates that the topic was addressed

prominently within the document context, is more likely to capture the meaning


needed for accurate classification. In addition, we use the automatic expansions to

create an initial set of classified documents which are then used as input for a

supervised learner in a bootstrapping procedure in order to acquire a final

classification.

In section 2 we provide some background on recent works and the resources used

for our method. We describe the entailment and context models used in our method in

sections 3.3.2 and 3.3.3. Section 4 discusses the evaluation of the proposed method

and analyzes the results of each step of the method. We show that using an initial

entailment method as the basis for the classification decision provides preliminary

promising results, which are restricted mostly by the recall of the LE resource in use.

The proposed method reaches higher precision, which implies that the entailment assumption is indeed more suitable to the needs of the TC task. The analysis in section 4 describes the aspects in which the entailment based method outperforms the context based method. With the ongoing development of promising LE

resources it is highly expected that TC methods based on the LE approach can reach

further improved results.


2. Background

Text categorization (TC) is the task of classifying textual documents into a given set of categories. In this thesis we focus on keyword-based TC, which represents documents and categories as sets of keywords. This section describes related work and

provides motivation for our method. Supervised text categorization is presented (2.1),

and then keyword based text categorization methods are described (2.2). Next,

unsupervised methods for TC are presented, along with the framework and motivation of the method we employ (2.3). Finally, background on the lexical entailment framework and its resources, and the motivation to use them, is explained (2.4).

2.1. Supervised text categorization

The supervised approach for TC uses a set of labeled documents to train a supervised

classifier (learner). Most work in text categorization has used a "bag of words"

representation, in which each feature corresponds to a single word, which is then used

to train the supervised classifier. (Tan et al., 2002) added bigram features to the

standard use of unigrams by selecting bigrams according to their Information Gain for

the category, and showed improvement of F1 and break-even point measures. Other

works tried to improve the accuracy of supervised TC tasks by means that are

independent from the amount of labeled documents the method requires as input.

Several works, for example, tried to exploit lexical relations, such as hypernyms and

synonyms, to enrich the "bag of words" representation.

(Cai and Hofmann, 2003) used context models to enhance the feature

representation of documents. They used Probabilistic Latent Semantic Analysis to

automatically extract semantic concepts in order to achieve robustness with respect to

linguistic variations such as vocabulary and word choice. They used Adaboost, a

boosting algorithm proposed by (Freund et al., 1998), to combine the hypotheses

based both on the semantic concepts and on word features from the documents. The

combination of the two types of hypotheses showed an overall improvement of about

5% in accuracy.

WordNet was used as a source for synonyms and hypernyms to enhance

feature data for TC methods in several works. (de Buenaga Rodriguez et al. 1997)

utilized WordNet as a source for synonyms based on the assumption that the name of

the category can be a good predictor of its occurrence. They used WordNet synsets to


perform a category expansion, similar to query expansion, using the category

synonyms. This information was added to labeled training examples as the input of

supervised learning algorithms. The integrated algorithm achieved an improvement of

20 points in precision and was found to be extremely helpful for low frequency

categories, which have a smaller number of training examples. Another study that combined WordNet information with labeled training data is that of (Scott and Matwin,

1999) who used WordNet as a source for synonyms and hypernyms which were

added to the representation of each document.

2.2. Keyword based TC

This research focuses on keyword based TC which represents both categories

and documents by a set of keywords. One approach of supervised keyword based TC

is to ask the user to create a list of representing keywords for each category

(McCallum & Nigam, 1999), which will be used to identify an initial set of

documents for each category, instead of labeling training documents manually. (Liu et

al. 2004) recognized this step as a difficulty in the overall procedure, since the user can only provide a limited set of words, which might be insufficient for accurate learning.

They proposed a keyword based method consisting of the following steps: (a) Cluster the unlabeled texts with the k-means algorithm, using the cosine similarity metric

from information retrieval (Salton & McGill, 1983) as the distance measure. The

words from each cluster are ranked by their information gain value and are given to

the user as candidate words for the initial feature vector. The user chooses the most

descriptive words from the list for each category and is given the option to add more

keywords which do not appear in the top ranked list. (b) Create an initial set of

labeled documents using the cosine similarity metric (Salton & McGill, 1983) to

measure the similarity between documents and representing keywords, and finally (c)

train a Naïve Bayes (NB) classifier using the Expectation Maximization algorithm. In

each iteration the EM algorithm trains a NB classifier and re-estimates the probability

of a document to be classified to a category.
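To make steps (a) and (b) concrete, here is a minimal sketch (not the authors' code) using scikit-learn; the toy corpus, the centroid-weight ranking standing in for the information-gain ranking, and the "user-chosen" keywords are all assumptions of the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "the car engine and wheel need repair",
    "new car models with better engines",
    "the team won the baseball game",
    "pitchers and hitters decide the game",
]
vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# (a) cluster the unlabeled texts and surface top candidate keywords per cluster
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = np.array(vec.get_feature_names_out())
for c in range(km.n_clusters):
    top = terms[km.cluster_centers_[c].argsort()[::-1][:3]]
    print(f"cluster {c} candidate keywords: {list(top)}")

# (b) suppose the user selected these keywords; label each document by cosine
# similarity to the keyword vectors, yielding the initial labeled set for EM/NB
chosen = {"autos": "car engine wheel", "baseball": "game pitcher team"}
K = vec.transform(list(chosen.values()))
labels = list(chosen)
initial = [labels[i] for i in cosine_similarity(X, K).argmax(axis=1)]
print(initial)
```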

(Liu et al. 2004) performed their evaluation on four subsets of the 20 Newsgroups dataset, each composed of 4-5 of the 20 categories. The

evaluation compared the results of choosing the initial category seed keywords with the use of the original k-means keywords. It shows that classification

based on the selected keywords obtains better results by a large margin.


A more recent attempt to approach keyword based TC by categorizing

unlabeled text examples has been reported by (Ko and Seo, 2004). The first step of

their method used a bootstrapping algorithm on the co-occurrence information of the unlabeled data to extract keyword lists. The lists were based on frequent co-occurrences with seed keywords constructed from the category names. In the second step, the authors used a NB classifier to create an initial set

of labeled documents. The third step trained a classifier (Ko and Seo, 2002) based on

a feature projection technique which is robust to noisy data.

For evaluation, they compared the results to a semi-supervised and a

supervised method, and showed comparable results to the supervised categorization

method. The semi-supervised approach used experts to choose initial keywords as

seeds for the procedure described above. Using experts' knowledge was found to be

an expensive but worthwhile task, as it achieved significant improvement in one of

the data sets used for evaluation.

2.3. Categorization based on category name

TC approaches which do not require manual effort over the course of the

classification procedure have been attempted rather rarely in the literature. One of the

approaches to perform TC without any manual effort during the classification process

is to use keyword based methods by automated creation of category representations.

Based on the semi-supervised keyword based TC methods described in section 2.2,

(Gliozzo et al. 2005) introduced an unsupervised bootstrapping keyword based

method for text categorization, which uses only the category name as the input for the

bootstrapping algorithm. Their method consisted of the following three steps:

(i) Initial creation of representing vectors for each category - the category

name was generalized using Latent Semantic Analysis (LSA) in which

documents and categories are represented in a latent semantic space. LSA is a

dimension reduction method which decreases the number of dimensions in the

document-by-term matrix. It converts the co-occurrence data represented in

the matrix to a representation of implicit semantic concepts in the latent space.

(ii) Initialization of labeled documents set for the supervised learning – A

Gaussian Mixtures (GM) algorithm was employed to obtain uniform

classification using the similarity scores in the latent semantic space as input.


The GM algorithm outputs new values for the probability of each document being classified to a specific category given its similarity score. An initial set

of labeled documents is classified according to those probabilities, where each

document is classified to the best scoring category.

(iii) Supervised classifier training - an SVM classifier was trained to categorize

the unlabeled text based on the initial categorized set.

The authors reported results on two data sets – 20 Newsgroups [1] and Reuters-10 (the 10 most frequent categories [2] in Reuters-21578 [3]), showing improvement

relative to earlier keyword based methods. A more detailed description of each of the three methods used in this research can be found in Appendices A, B and C, respectively.
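As a rough illustration, the following sketch re-creates the three steps with off-the-shelf scikit-learn components (TruncatedSVD as the LSA step, GaussianMixture for step (ii), LinearSVC for step (iii)); the corpus, category names, and all parameter choices are invented for the example and should not be read as the authors' implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC
from sklearn.metrics.pairwise import cosine_similarity

docs = [  # hypothetical unlabeled collection
    "nasa launched the space shuttle into orbit",
    "the rocket reached orbit around the moon",
    "astronauts aboard the shuttle studied the moon",
    "the new car engine improves fuel mileage",
    "dealers sell the car with a bigger engine",
    "fuel prices worry car dealers",
]
category_names = ["space", "car"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# (i) represent documents and category names in a latent semantic space
lsa = TruncatedSVD(n_components=2, random_state=0).fit(X)
doc_vecs = lsa.transform(X)
cat_vecs = lsa.transform(vec.transform(category_names))
scores = cosine_similarity(doc_vecs, cat_vecs)   # document-by-category similarity

# (ii) per category, fit a two-component Gaussian mixture over the scores and
# read off the probability of the higher-mean ("in category") component
def gm_prob(col):
    gm = GaussianMixture(n_components=2, random_state=0).fit(col)
    pos = int(np.argmax(gm.means_.ravel()))
    return gm.predict_proba(col)[:, pos]

probs = np.column_stack([gm_prob(scores[:, [k]]) for k in range(len(category_names))])
initial_labels = probs.argmax(axis=1)            # each document to its best category

# (iii) bootstrap: train an SVM on the automatically labeled set
clf = LinearSVC().fit(X, initial_labels)
print([category_names[y] for y in clf.predict(X)])
```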

Another TC approach which does not require manual effort over the course of

the classification procedure is topical TC via clustering. This approach uses

unsupervised clustering methods to split the data into a preset number of clusters

which are then matched to the predefined categories. (Sahami et al., 1996; El-Yaniv and Souroujon, 2001) used clustering methods in both a semi-supervised setting, in

which a small set of training examples were used, and in an unsupervised setting,

where no training examples were used. Both methods were evaluated over small

datasets which were obtained from subsets of the Reuters collection and the 20-

Newsgroups collection.

We base our method on the automated keyword-based approach, and in

particular the approach described in (Gliozzo et al. 2005), by creating a two-phase method: (1) automatically create category representations to acquire an initial set of

labeled documents based on a similarity score between the categories and the

document representations, (2) classify the unlabeled documents based on the initial

categorized set using an SVM based classifier. As opposed to creating category representations based on context models such as LSA, we utilize an integrated model

based on an entailment requirement instead of just co-occurrence data. We will next

describe the lexical entailment framework and the lexical semantic relations resource

which were used to acquire lexical entailment rules.

[1] The collection is available at www.ai.mit.edu/people/jrennie/20Newsgroups.
[2] The first 10 categories are: Earn, Acquisition, Money-fx, Grain, Crude, Trade, Interest, Ship, Wheat and Corn.
[3] Available at http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.


2.4. Lexical entailment

The ability to identify semantically equivalent pieces of text is important to various

NLP tasks. The Textual Entailment (TE) framework addresses this task by trying to

formulate the degree of semantic matching between snippets of text (Dagan et al.,

2006), that is, to decide whether the meaning of one text, termed the hypothesis, can be

inferred (entailed) from another text. This framework was identified as a core

semantic inference paradigm in the NLP field (Giampiccolo et al., 2007). It can

contribute to various tasks such as Information Retrieval (IR), Query Expansions (in

IR) and Question Answering (QA). For example, to answer a question

such as “Whom did SCO sue?” a QA system should be able to deduce that "SCO sued

IBM" can be inferred from "SCO won a lawsuit against IBM".

This thesis focuses on a subtask of the TE framework, proposed in (Glickman

et al., 2006), which is termed Lexical Entailment (LE). LE aims to recognize whether

each lexical meaning in the hypothesis text is referenced by some meaning in the

entailing text. More concretely, a word w is lexically referenced by a text t if there is

an implicit or implied reference from a set of words in t to a possible meaning of w.

LE relations may be represented by rules, denoted LHS ⇒ RHS, where the term on

the right hand side is entailed by the term on the left hand side. For instance, the rule

"Toyota ⇒ car" which is equivalent to a hyponym (is-a) relation can be useful to help

classify documents related to a "Cars" category. Another example, is the rule

"Prisoner’s dilemma ⇒ game theory" which can be useful to help classify a document

discussing the "Prisoner's dilemma" to a "Game theory" category. We'll refer to this

type of rules as entailment rules for the TC task utilization.

In this thesis we aim to use entailment rules to expand the seed terms of the

category name in order to improve the accuracy of the TC task. Our application of LE

rules for the TC task is similar to the application of the LE framework for the Query

Expansion task, in which the entailing words are considered as expansions for the

query as suggested in (Clinchant et al., 2005). To integrate the LE rules in the TC

scheme described above, the initial seeds based on the category name are expanded

with entailing terms extracted from the LE rules. For each rule in which the RHS of

the rule is one of the seed terms for a specific category, the LHS term of this rule is

added to the seed terms of this category to create the set of representing keywords for


the category. Below we describe the external resources used by our method to extract

LE rules.
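To make this expansion scheme concrete, the following minimal sketch represents rules as (LHS, RHS) pairs; the rules themselves are invented for illustration.

```python
# Invented example rules of the form LHS => RHS; a category whose seed
# contains the RHS of a rule is expanded with that rule's LHS.
rules = [
    ("automobile", "auto"),
    ("Ford Escort", "auto"),
    ("spacecraft", "space"),
]

def expand_seeds(seeds, rules):
    """Return the seed terms plus all entailing terms (rule LHSs) for them."""
    return set(seeds) | {lhs for lhs, rhs in rules if rhs in seeds}

print(expand_seeds({"auto"}, rules))  # {'auto', 'automobile', 'Ford Escort'}
```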

2.4.1. Lexical Entailment Resources

In the absence of dedicated comprehensive knowledge bases for lexical entailment

rules we based our entailment rules extraction methods on external resources

available online. The resources utilized for this purpose are a lexical resource, the

WordNet lexical ontology, and a textual resource, Wikipedia, the online encyclopedia. Given the different nature of the two resources, the methods applied to each of them are quite different. Below we give a short description of each resource and its

characteristics.

#   Other terms    Gloss
1   Infinite       the unlimited expanse in which everything is located
2   -              an empty area (usually bounded in some way between things)
3   -              an area reserved for some particular purpose
4   Outer Space    any location outside the Earth's atmosphere

Table 1 - The top 4 synsets for the noun "space". Each row presents the other terms in the synset (if any exist) and the gloss for that sense.

WordNet

WordNet [4] is a fundamental computational lexical resource (Fellbaum,

1998) developed by a group of lexicographers led by Miller, Fellbaum and others at

Princeton University. It is a lexical ontology of semantic relations, available online,

widely used for natural language processing systems while being updated and

growing over the last fifteen years. It provides a large repository of English lexical

items and consists of nouns, verbs, adverbs and adjectives which are organized into

synsets.

[4] We used version 3.0 of WordNet, available at http://WordNet.princeton.edu/obtain

WordNet synsets are synonym sets which represent a single sense of an

English term. A sense is defined to be the meaning of a single term for a given Part

Of Speech (POS). A gloss definition of the concept of this sense is given for each

sense. The senses are ordered by decreasing frequency, as estimated from the SemCor corpus (Miller et al., 1993). Since the frequency is estimated

according to corpus statistics, synsets with no measured estimation appear at the



bottom of the synset list. Table 1 presents an example of the top four synsets for the

term "space", their glosses and their order according to the sense frequency.

WordNet contains two kinds of relations which link the different synsets,

lexical and semantic. Lexical relations hold between morphologically related word

forms such as derivations (formation of words from bases/words), while semantic

relations hold between word meanings. Among the semantic relations in WordNet are hyponymy (the is-a relation) and meronymy (the is-part-of relation), which are

used in our method. For instance, some of the hyponyms of the word "auto" are "cab"

and "minivan", its meronyms include the words "bumper" and "window" and its

derivations are "automobile" and "automobilist". Nouns and verbs are organized in a

hyponymy/meronymy hierarchy while adjectives are organized in clusters with a head

synset for which all related synsets have similar meaning. Adverbs usually point to

the adjective from which they are derived.

Various NLP tasks exploited WordNet as a source for lexical expansion.

Among those are the TC methods described earlier in section 2.1, such as (Scott

and Matwin, 1999) who used WordNet as a source for synonyms and hyponyms.

Automatic indexing has been improved by adding the synsets of query words and

their hypernyms to the query (Mihalcea and Moldovan, 2000). Our method exploits

derivations, synonyms, hyponyms and meronyms of the seed term to acquire LE rules

based on the WordNet knowledge, as explained in section 3.3.2.
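As an illustration, the following sketch harvests these four relation types for a seed term using NLTK (assuming NLTK and its WordNet data are installed); unlike the actual method, it applies no sense filtering (see the discussion of ambiguity in section 3.3.2).

```python
from nltk.corpus import wordnet as wn

def wordnet_expansions(seed, pos=wn.NOUN):
    """Collect synonyms, derivations, hyponyms and meronyms of a seed term."""
    terms = set()
    for syn in wn.synsets(seed, pos=pos):
        for lemma in syn.lemmas():
            terms.add(lemma.name())                          # synonyms
            for der in lemma.derivationally_related_forms():
                terms.add(der.name())                        # derivations
        for hypo in syn.hyponyms():                          # is-a relation
            terms.update(l.name() for l in hypo.lemmas())
        for mero in syn.part_meronyms():                     # is-part-of relation
            terms.update(l.name() for l in mero.lemmas())
    return terms - {seed}

print(sorted(wordnet_expansions("auto"))[:10])
```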

Wikipedia

Wikipedia [5] is a collaborative online encyclopedia which covers a wide variety of domains. Wikipedia is constantly growing and evolving based on the contributions of online users, and had more than 1,700,000 articles in its English version as of March 2007 (Kazama and Torisawa, 2007). (Giles, 2005) shows that the quality of Wikipedia articles is comparable to that of the Britannica online encyclopedia.

[5] We used the English version from February 2007, available at www.ukp.tudarmstadt.de/software/JWPL

Each Wikipedia article describes a unique concept, mostly a named entity. The article's text is a promising source of knowledge regarding the article's title, serving as an encyclopedic definition of it. Wikipedia also contains other sources of knowledge regarding article titles which are common to encyclopedias or Web knowledge

bases. As most encyclopedias, Wikipedia uses term canonization so that a single



article would represent a group of analogous terms, and denotes this connection as

redirection. All terms contained in this type of equivalence group are redirected to the

same article with a single title, and the redirection relation between them can be

extracted. Another typical type of connection, which is common to Web based pages,

is hyperlinks, which connect terms in an article's text to the article defining them in

Wikipedia. The common structure employed for all Wikipedia articles contains

several data fields and specific structures.

The Wikipedia-based LE resource, developed by Eyal Shnarch for his MSc thesis,

aims to exploit the knowledge available in Wikipedia as a general online resource, but

without committing the method to Wikipedia-specific structure. Therefore, it uses

only the article's text, and the common Web encyclopedia relations, redirection and

hyperlinks. The evaluation of our method in section 4 shows that, as expected,

WordNet and Wikipedia are complementary resources, providing the typical

knowledge which can be found in each of them as a dictionary and an encyclopedia.

We describe the details of this method in section 3.3.2.


3. Text Categorization based on category name

3.1. Research goals

As described in section 1, keyword based TC consists of two steps: (1) setup, in

which a set of characteristic terms for each category is assembled, constituting the

category's feature vector, and (2) classification, in which the term-based feature vector

of the classified document is compared with the feature vectors of all categories.

The framework for keyword based TC which omits the manual annotation in

the setup phase must face the challenge of assembling the characteristic terms for

each category automatically. The most natural seed for the representing terms is the

category name itself, as was suggested by several semi-supervised and unsupervised

methods (McCallum & Nigam, 1999; Ko and Seo, 2004; Gliozzo et al. 2005). The

category name is selected by domain experts to represent the category's topic as

precisely as possible, and therefore is likely to be the most appropriate seed for this

purpose.

Based on some analysis of labeled data, we identify two requirements that a document which belongs to a category should satisfy:

(i) Entailment requirement - the category name should be entailed by the

document text.

(ii) Context requirement – the category's general context should be

matched by the document.

The first requirement is that the category topic should be referred to at some semantic level in the text. This can often be identified by the appearance of terms

in the document which entail the topic name. We will refer to terms which entail the

category name as entailing terms for that category. This group of terms consists of

terms which entail at least one of the terms denoting the category's name, that is,

terms which appear on the left hand side of lexical entailment rules whose right hand

side is identical to the category name. For example, the category "Autos" is entailed

by the term "Car" according to the rule "car ⇒ auto" and by the term "Ford Escort"

extracted from the rule "Ford Escort ⇒ auto" etc. Therefore, the entailment

implementation for the setup phase will be done by expanding the category seed (its name) with terms which entail it.


The second requirement is that the overall context of the document should be

typical for the category topic. This is needed to assure that the entailing terms for that category (i) appear as part of the main topic of the text, and (ii) do not appear in a different sense than the one entailing the category name. This requirement can be captured by a

group of terms which describe typical category contexts, even though they do not

necessarily entail the category. Such terms frequently appear in the category context

and therefore tend to co-occur with the category's entailing terms. Occurrence of such

terms implies that the text might be related to the category. For example, the word

"wheel" does not entail the category "Autos", as it can appear within the context of

several other vehicle categories. However, the presence of a significant amount of

such context words in a document increases the likelihood that this document may be

related to the category's topic. On the other hand, the lack of any context word in a

document decreases the likelihood that this document is relevant to the category's

topic. For that purpose, we will use context models based on co-occurrence data of

terms.

Following this idea, the goal of our approach is to combine the likelihood that

a certain document entails the category name and the likelihood that its context is

relevant to the category. Each measure will be based on the seed category names and

the results will be combined to obtain a unified score. We aim to improve the

precision of the classification by requiring entailment evidence for each classification decision.

Overall, our method consists of the following steps:

(i) Initialization and scoring

a. Seeds: initialize each category vector with the category seed terms, which

correspond to the category name.

b. Entailment: represent each category by its seed terms along with the

entailing terms for the seeds, which together form the category's

entailment terms feature vector; and obtain an entailment similarity

score between the vectors of each document-category pair.

c. Context: represent each category and document by a co-occurrence-

based vector, and compute a context similarity score for each

document-category pair.

d. Combine the entailment score and context score to a single

categorization score for each document-category pair.


(ii) Classification –

a. Initial labeled set creation: use the scores obtained in step 1.d to

classify each document to the category with the best score.

b. Bootstrapping: use the initial labeled set to train a supervised classifier.
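Schematically, steps (i)d and (ii)a reduce to the following sketch, where sim_entail and sim_context stand in for the measures defined in sections 3.3.2 and 3.3.3, and the product is only a placeholder for the combination method of section 3.3.5.

```python
def categorize(doc, categories, sim_entail, sim_context):
    """Step (i)d: combine the two scores; step (ii)a: pick the best category."""
    scores = {c: sim_entail(doc, c) * sim_context(doc, c) for c in categories}
    return max(scores, key=scores.get)
```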

3.2. Categorization tasks

The categorization scoring of documents may be applied in two different task settings.

The first is ranking, where a ranked list of documents is created for each category.

The documents are sorted in a descending order according to their categorization

score. The second is classification of the documents, where each document is

classified to the best category according to its similarity score.

Ranking the documents aims at achieving better precision at the top of the

sorted list, which means ranking true category documents at the top of the list while

ranking irrelevant documents at the bottom. Ranking allows evaluation of the scoring

method quality per category.

On the other hand, binary document classification provides a complementary view of the scoring method. It makes a binary decision for each document-category pair, i.e., whether the document belongs to this category, according to the similarity score between them. It reflects the method's

ability to differentiate between categories and classify documents to the right

category. Section 4 evaluates and analyzes the methods investigated in this research

from the perspective of each of these two tasks.

3.3. Scoring methods

This section describes the scoring methods utilized as part of our TC method. The

scoring methods use a similarity measure to provide a similarity score for each

document-category pair. We first describe the basic similarity method which relies on

the categories' seed terms derived from the category names. We then describe the

entailment similarity method (3.3.2) and context similarity method (3.3.3) used to

calculate the total similarity score provided by our method (3.3.5).

3.3.1. Vector Space Model based on category seed terms

Keyword based TC is often approached by exploiting a Vector Space Model scheme

as in (Liu et al. 2004; Gliozzo et al. 2005). Documents and categories are represented


as vectors and their similarity is measured in the vector space. For each category a

ranked list of documents can be created using the similarity score, and the

classification of each document is set to the most suitable category according to the

similarity score.

The most natural representation for the documents' feature vectors is as

vectors in the term space, similar to what is suggested in (Salton & McGill, 1983). The

vectors are made out of feature-value pairs where in most cases each word in the

vocabulary corresponds to a single feature, and the value corresponds to its frequency in the document or its standard Term Frequency Inverse Document Frequency (TF-IDF) score.

We expanded the vectors to be unigrams and bigrams of POS-tagged lemmas, with the square root of their frequency in the document as the feature value, that is, $\sqrt{tf}$. The reason for using bigrams as well as unigrams is to capture multi-word expansion phrases as well as single-word expansions in the lexical expansion step of the method (for a

more detailed explanation of this, see 3.3.2). The vectors are filtered using a common

feature selection which removes the most common and the least common features in

the corpus.

Vector space models use a vector representation, similar to the document

feature-value representation, for the categories as well. The categories are represented

as feature vectors in the word space, containing characteristic words most suitable to

describe the category and a weight to signify their importance to the category. Since

the creation of a taxonomy of topics for TC task requires manual specification done

by domain experts, it can be assumed that the terms chosen to represent each topic as

the topic title in this process are selected carefully to encapsulate the common topic

the documents share. Therefore, those manually selected terms should be the best seed

terms to represent each category in its feature vector. Thus, as a baseline, we use a

category vector that includes only the seed category name or immediate variations of

it (without expansions). For example, the seed for the category "comp.graphics" was

simply taken to be the term "Graphics" as a noun term, for the "talk.politics.mideast"

on the other hand the term "Middle East" was chosen as the most suitable seed term

since the use of "mideast" is not common.

Then, we use a standard cosine similarity measure to measure the similarity

between document vectors and category vectors. Using cosine similarity as the

similarity measure, and (the square root of) word frequency as the value of the


features, the similarity between the vectors described above is analogous to measuring

the frequency of the category name in each document. The highest scores will be

given to documents with the highest frequency of the seeds. Since we used the square

root value of the term frequency of each term w, denoted ( )tf w , instead of simple

term frequency (tf), the impact of high term frequency of a single term on the

classification decision regarding a certain document is decreased, giving a higher

weight to documents that contain several of the seed terms.

More formally, let $t \in T$ be a document in the text collection $T$, described by a vector $\vec{t} \in R^{|V|}$, and let $seed(c) \subset V$ be the seed of category $c \in C$, where $V$ is the lexicon and $C$ is the set of categories in the collection taxonomy; let $\vec{c} \in R^{|V|}$ be the vector representing the category $c \in C$, which at this stage contains only the terms in $seed(c)$. The similarity function, $Sim_{seed}$, between the document vector and the category-representing vector is defined to be

$$Sim_{seed}(c,t) = \cos(\vec{c}, \vec{t}) = \frac{\vec{c} \cdot \vec{t}}{\|\vec{c}\| \, \|\vec{t}\|}$$

where for each term $t_i \in \vec{t}$ its value is set to be the square root of its term frequency, $t_i = \sqrt{tf(t_i)}$, and for each category-representing term $c_i \in \vec{c}$ its value is set to be $c_i = 1$, meaning the weights of the seed terms in the category's vector are all equal to 1.
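For concreteness, the following is a minimal sketch of this seed-based scoring in Python, assuming a document is given as a term-frequency dictionary and a category as its set of seed terms (the names seed_similarity, doc_tf and seeds are illustrative, not from the thesis):

    import math

    def seed_similarity(doc_tf, seeds):
        """Cosine between a sqrt-tf document vector and a binary seed vector."""
        doc_vec = {term: math.sqrt(tf) for term, tf in doc_tf.items()}
        dot = sum(doc_vec.get(s, 0.0) for s in seeds)   # seed weights are all 1
        if dot == 0.0:
            return 0.0
        doc_norm = math.sqrt(sum(v * v for v in doc_vec.values()))
        cat_norm = math.sqrt(len(seeds))
        return dot / (doc_norm * cat_norm)

    # e.g. seed_similarity({"graphics": 4, "image": 1}, {"graphics"}) ≈ 0.89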

3.3.2. Entailment expansion methods

Entailment expansions for the category vectors are done by expanding the seeds of each category, $seed(c)$, using entailment rules. The seed expansion is similar to the notion of query expansion in Information Retrieval (IR), where the seed is analogous to the query being expanded. Each category is expanded by the left hand side (LHS) of all lexical entailment rules (defined in 2.4) whose right hand side (RHS) is one of the category seeds. We will refer to the set of terms which appear on the LHS of the entailment rules extracted for a category as the entailing terms for this category. The entailing terms are added to the feature vector representing the category, which was described in the previous section. For example, the vector representing the category "Autos" will be expanded by the rule "automobile ⇒ auto".


In the absence of a generic lexical entailment rule knowledge base, we used two preliminary lexical entailment methods to obtain rules. The first method obtains entailment rules from WordNet, exploiting the lexical semantic relations it contains. The second method extracts entailment rules from Wikipedia, exploiting the vast amount of general definitions and information it holds. Both methods are applied for each of the categories, and all the LHS terms of the corresponding rules are then merged into the category feature vector.

WordNet The appearance of the category name in a document, although a very precise indicator, overlooks occurrences of entailing terms such as synonyms and derivations. The need for this type of relation motivates us to use WordNet as a large lexical resource which includes information on various lexical relations. For example, terms such as 'car' and 'automobile' can be extracted using the synonym lexical relation which associates them with the category name "Autos". Similarly, terms such as 'medical' and 'medication', which are morphological derivations of the category name "Medicine", may be extracted as well.

Ambiguity is a common problem when using WordNet lexical relations for expansions. Ambiguity of seed terms constructed from category names is rare in the TC framework, since the required sense is mostly the dominant sense in the TC corpus corresponding to the given taxonomy. However, when an irrelevant sense is expanded via an external resource, the expansions of the irrelevant senses may be frequent in the corpus. For example, for the category "Space", which is part of the science hierarchy, the "any location outside the Earth's atmosphere" sense can be considered relevant, while the "an empty area" sense is clearly irrelevant as a source of expansions for this category. Using entailment rules expanded from the irrelevant sense might add a significant number of frequent words irrelevant to the expanded category. Since WordNet knowledge is organized by synsets, synonym sets which represent one underlying sense, we can utilize this mechanism to guarantee that only the required senses will be used for the expansions, so that irrelevant expansions are avoided.

We base our utilization of WordNet on the assumption that during the manual specification of the corpus taxonomy it is possible to provide the relevant WordNet synset(s) for each category, since the taxonomy creator already addresses difficulties such as disambiguation in the taxonomy creation process. Moreover, the relatively small number of categories in an average corpus makes it worthwhile to add this information in order to overcome ambiguity difficulties. Thus, in our experiments, we manually indicated the appropriate WordNet sense(s) for each category seed word (see Table 2 and Table 3).

Our approach differentiates between several possible types of topics, which were identified by manual analysis of sample taxonomies for TC tasks. The manual analysis revealed the different expansion needs of different types of topics, such as topics which describe a general subject and topics which refer to a collection of component parts. For example, the topic "Middle East" relates to a geographic region and therefore requires expansion to its geographic parts. On the other hand, a topic such as "Medicine", which relates to a general scientific subject, requires expansions such as branches of medicine. The differentiation between the topic types revealed that some types of categories can be described as class topics and therefore require expansion based on the hyponymy relation (is-a), while other types can be described as ensembles of components and therefore require expansions based on the meronymy relation (is-a-part-of). The hyponymy relation enables us to expand seeds to the members of their class type; for example, "Autos" can be expanded with "ambulance" and "taxi", which are types of cars. The meronymy relation expands the topic to members of its group, such as "Iran" and "Israel" for the "Middle East" topic.

Accordingly, meronyms were found overall less useful for entailment rule extraction for "class type" categories, since they mostly describe common parts of the class and therefore relate as meronyms to most of its co-hyponyms; for "Autos", for instance, to transportation means in general. For example, the meronyms "wheel", "door" and "window" of the topic "Autos" describe technical components of that concept, while the meronyms described above for the "Middle East" category are parts of that specific entity. For that reason, we automated the extraction of rules for these two groups of categories by extracting rules based on the meronymy relation only when the topic seed has no hyponyms in WordNet. The WordNet expansion method starts its expansion process from the seeds based on the categories' names, which are then expanded in an iterative manner. For each term expanded in one of the expansion steps, the method checks whether it has hyponyms to expand to, or should instead be expanded to its meronyms.

The core of the WordNet extraction method is expanding terms to their hyponyms, or to their meronyms if no hyponym exists for the term. This core step of the method is augmented by basic steps of derivation and synonym expansion. The four steps of our WordNet expansion procedure are as follows:

(i) Expand each $seed(c)$ representing the category $c \in C$ to its derivations, $der(seed(c))$, and synonyms, $syn(seed(c))$.

(ii) For each term $w \in der(seed(c)) \cup syn(seed(c))$, if $w$ has any hyponyms,

   a. expand $w$ to its hyponyms and add them to form the group of terms $core(c) = der(seed(c)) \cup syn(seed(c)) \cup hyponym(w)$;

   b. else, expand $w$ to its meronyms and add them to form the group of terms $core(c) = der(seed(c)) \cup syn(seed(c)) \cup meronym(w)$.

(iii) Expand each term $w \in core(c)$ to its derivations and synonyms to form the group $wn(c)$.

(iv) For each $w \in wn(c)$, add $w$ to the expanded category vector $\vec{c}$, denoted $\vec{wn}(c)$.
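The following is a minimal sketch of these four steps, assuming NLTK's WordNet interface rather than the implementation actually used in the thesis; the function name expand_category and the inline maxSense filtering (see section 4.1.1) are illustrative:

    from nltk.corpus import wordnet as wn

    def expand_category(seed_lemma, pos=wn.NOUN, max_sense=4):
        """Steps (i)-(iv): seed -> derivations/synonyms -> hyponyms
        (or meronyms when no hyponym exists) -> derivations/synonyms."""
        def der_syn(term):
            out = set()
            for s in wn.synsets(term, pos=pos)[:max_sense]:  # maxSense filter
                for lemma in s.lemmas():
                    out.add(lemma.name())
                    out.update(d.name() for d in lemma.derivationally_related_forms())
            return out

        base = der_syn(seed_lemma)                  # step (i)
        core = set(base)
        for w in base:                              # step (ii), maxDepth = 1
            synsets = wn.synsets(w, pos=pos)[:max_sense]
            hyponyms = [h for s in synsets for h in s.hyponyms()]
            related = hyponyms or [m for s in synsets for m in s.part_meronyms()]
            for r in related:
                core.update(lemma.name() for lemma in r.lemmas())
        expanded = set()
        for w in core:                              # steps (iii)-(iv)
            expanded.update(der_syn(w))
        return expanded

    # e.g. expand_category("car") should contain terms such as "taxi" and "automobile"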

Accordingly, the likelihood of category $c$, represented by the expanded vector $\vec{wn}(c)$, to be entailed by each document represented by the term vector $\vec{t}$, is measured by the same similarity function used to measure the seed-document similarity. The WordNet similarity function, $Sim_{wn}$, is defined to be

$$Sim_{wn} = sim(wn(c), t) = \cos(\vec{wn}(c), \vec{t})$$

The weights of the expanded terms in the new category vector remain equal and set to one as before, that is, for each $wn(c)_i \in \vec{wn}(c)$, $weight(wn(c)_i) = 1$.

As described above, our utilization of WordNet addresses the ambiguity of the seed terms by specifying the required senses for them. In addition, our method partly addresses the potential ambiguity of the extracted expanding terms. The expanding terms are extracted from synsets of different sense frequencies, as they are extracted from the sense which entails the seed term's sense. Infrequent senses of an expanding term might cause misclassification due to their ambiguity, while they rarely increase the recall, since they seldom appear in the required sense. Our method uses the WordNet synset order to filter infrequent synsets. WordNet synsets are ordered by their frequency in the SemCor corpus (Miller et al., 1993), where the most frequent sense is listed first. For example, the verb "steal" entails the seed term "Baseball" in its fifth sense, while its frequent sense in the collection is its first sense, "take without the owner's consent", which is equally likely to appear in the context of any of the other categories. Our method filters infrequent synsets using a configuration parameter, maxSense, which specifies the maximum ordinal number of a synset to be used as a source for entailment rule extraction.

Another parameter which influences the accuracy of the expanding terms extracted from WordNet is the depth within the hierarchy used to expand terms. The WordNet structure enables recursive expansion, for example by expanding a term w to all the hyponyms of all its descendants in the hyponym hierarchy. We will refer to this parameter, denoted maxDepth, within the parameter settings discussed in 4.1.1.

Wikipedia Wikipedia is a collaborative online encyclopedia which covers a wide variety of domains. Extraction of entailment rules from an online encyclopedia can rely not only on the definitions themselves but also on HTML links and references. The Wikipedia resource, being an encyclopedic resource containing cultural and day-to-day terms, is complementary by nature to the type of rules extracted from the WordNet resource, which provides language-oriented terms similar to those found in a dictionary. We used a lexical entailment resource extracted from Wikipedia, developed and utilized by Eyal Shnarch for his MSc thesis, and integrated it into the general scheme of our TC method to expand the seed terms of each category.

Each Wikipedia article describes a specific subject, which is denoted by the title of the Wikipedia entry. Following the notion suggested in (Kazama and Torisawa, 2007), the extraction method assumes that the best source for that definition lies in the article's opening sentence. The motivation for this assumption is that the definition of the topic mostly appears at the beginning of an encyclopedia article. It should be noted that extending the source of the definition to the first paragraph as a whole was found to be less accurate, and not beneficial in terms of lexical entailment rule recall.

The subject of an encyclopedia entry is mostly generalized by its definition. Therefore, the rules extracted from each entry are entailment rules in which the title of the article appears on the left hand side of the rule and the terms extracted from the definition sentence appear on the right hand side; that is, the article's title is assumed to entail the terms from its definition. The terms extracted are nouns and noun phrases from the title and the definition, since the majority of Wikipedia titles are noun phrases. For example, the rule "Yamaha SR500 ⇒ motorcycle" can be extracted from the article defining "Yamaha SR500", to expand the seed name of the category "Motorcycles".

Several extraction methods from the definition have been explored. We chose to use only the extraction types that were found to be the most precise. Prior to all rule extraction, the Wikipedia-based method parses the definition sentence to facilitate extraction of the preferred noun phrases. This method extracts the noun phrases which appear as a nominal complement of the 'be' verb in the definition.
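As an illustration, a minimal sketch of this extraction using spaCy's English dependency parser (an assumed stand-in; the thesis does not specify which parser was used), where the 'attr' dependency marks the nominal complement of a copular 'be':

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def extract_rules(title, definition_sentence):
        """Extract 'title => complement' rules from the nominal complement
        of the 'be' verb in an article's opening sentence."""
        doc = nlp(definition_sentence)
        rules = []
        for token in doc:
            if token.dep_ == "attr" and token.head.lemma_ == "be":
                rules.append((title, token.lemma_))   # head noun of the complement
        return rules

    # e.g. extract_rules("Yamaha SR500",
    #                    "The Yamaha SR500 is a motorcycle made by Yamaha.")
    # would yield [("Yamaha SR500", "motorcycle")]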

In addition to extraction from the definition, the encyclopedia structure was also utilized to extract rules. Similar to traditional encyclopedias and dictionaries, Wikipedia authors provide manual canonization of the terms defined in it. All terms contained in the same canonized group are redirected to the same Wikipedia article. Since all terms in such a group are semantically equivalent, the rules extracted this way are considered bi-directional. For example, a search for the term "mac" is redirected to the article titled "Macintosh", which is also the category name. Based on this knowledge, the Wikipedia-based method can extract the bi-directional rule "mac ⇔ Macintosh", which results in the expansion of the seed term "Macintosh" with the term "mac".

Unsurprisingly, a significant part of the terms extracted from Wikipedia-based rules are noun phrases (NPs) longer than one word. For example, the expanding terms extracted from Wikipedia include names of car types for the category "Autos" and names of baseball players for the category "Baseball". The complex NPs extracted from this resource motivated us to incorporate word bigrams as part of the features extracted from the corpus of documents to be classified. The addition of bigrams to the set of features included in the corpus vocabulary does not overload the classification in terms of noise or computational overhead, since the cosine similarity measure only considers terms which appear in the category's representing vector. Thus, the vast majority of bigrams extracted from the corpus will be ignored. Moreover, only NP bigrams are included by our method in the corpus vocabulary. Consequently, the bigrams influence the classification measure only if they are extracted by the entailment expansion method and also appear in the corpus, and are therefore assumed to be of high importance.

Due to the detailed nature of an encyclopedia definition, we employed feature selection based on frequency statistics to filter common English words extracted from Wikipedia, in addition to the feature selection done on the corpus itself. The feature selection performed on the corpus was based on the corpus statistics; the feature selection performed on the Wikipedia output rules, on the other hand, was based on general English statistics. The first efreqFilter most frequent words in the English language according to the Brown corpus (available at http://www.edict.com.hk/lexiconindex/frequencylists/words2000.htm) were filtered. In contrast, terms extracted from WordNet did not require the same type of filtering, since they were more precise than the terms extracted from Wikipedia.

We tried to enhance the terms extracted by the Wikipedia resource using several methods. First, we used the Wikipedia method to expand the characteristic terms of each category as they were extracted from WordNet; that is, expanding the seeds with their WordNet expansions, and then expanding those terms by Wikipedia expansion. For example, instead of expanding the category seed "Auto" only by WordNet to acquire terms such as the seed's synonyms ("Car", "Automobile") and hyponyms ("Convertible", "Minivan"), those entailing terms extracted from WordNet can be used as input for the Wikipedia method. Although this addition increased the recall, it decreased the overall precision of the classification more substantially.

Accordingly, the seed terms based on the category names are expanded independently from each resource, and the extracted terms are then merged to generate a single vector, $\vec{entail}(c)$, for each category $c$. The entailment similarity score, $Sim_{entail}$, is obtained by

$$Sim_{entail} = sim(entail(c), t) = \cos(\vec{entail}(c), \vec{t})$$

Our system allows applying each of the expansion methods for each of the two resources separately, to evaluate partial configurations. We denote by $Sim_{wn}$ the similarity score between the vector $\vec{wn}(c)$, which contains the expansions from WordNet, and each document vector $\vec{t}$. Similarly, the similarity score between the vector $\vec{wiki}(c)$, containing the expansions from Wikipedia, and each document vector $\vec{t}$ is referred to as $Sim_{wiki}$, where $Sim_{wiki} = sim(wiki(c), t) = \cos(\vec{wiki}(c), \vec{t})$.

3.3.3. Context Similarity methods

The occurrence of entailing words in a document suggests that the topic they entail was mentioned in the document. However, it does not guarantee that this is one of the main topics discussed. Such an occurrence of the category's entailing terms may sometimes appear as a passing reference, that is, when the term is mentioned in the correct sense as part of the context of another topic, or due to ambiguity of the entailing term. For example, the term "car", which entails the category "Autos", may also appear in a "Politics" context, as in a document which includes a political discussion over environmental pollution: "...equal to the combined formic acid contributions of automobiles...". A different example, of ambiguity of an entailing term, is the word "race", which entails the category "Autos" in the sense of "a contest of speed". This word may also appear in its other sense, "people who are believed to belong to the same genetic stock", in a political discussion regarding racism as part of the "Politics" category, which may result in scoring errors.

Hence, the general context of the category topic should be prominent in the document. We define context words as words which are likely to appear in the context of a certain topic, although they do not entail the topic directly. For instance, words such as "driver" and "wheel" do not entail the topic "Autos", although they tend to co-occur with this topic. Based on that assumption, we aim to measure the likelihood of a document to belong to a certain context, to complement the entailment measure described above. For this purpose our method uses the Latent Semantic Analysis (LSA) method, which is described below.

Latent Semantic Analysis To measure context similarity we employed a Latent Semantic Analysis (LSA) method (Deerwester et al., 1990) on the documents' vectors. LSA is a dimensionality reduction method for co-occurrence data. The main idea of LSA is to map each document vector into a lower dimensional space in which the vector is represented by "concepts" instead of terms. The dimension reduction is performed using Singular Value Decomposition (SVD) to map the term-by-document matrix into a lower dimensional latent space.

The latent semantic vectors for terms and documents were calculated by a variation of the fold-in documents methodology suggested by (Berry, 1992; Gliozzo and Strapparava, 2005). As explained in Appendix A, LSA reduces the dimensions of the term-by-document matrix to obtain a new representation of each term in a lower dimensional space. The matrix contains the co-occurrence data of the terms at the document level. That is, given the matrix $\mathbf{M}$, a term-by-document matrix which represents the terms in the original space, each entry $m_{i,j} \in \mathbf{M}$ satisfies $m_{i,j} = tf(w_i, d_j)$, where $tf$ is the term frequency of $w_i$, the $i$th term in the lexicon, in $d_j$, the $j$th document. The matrix $\mathbf{M}$ is of size $V \times N$, where $V$ is the number of distinct terms in the corpus and $N$ is the number of documents in the corpus. Similarity between terms is measured as the similarity between their representing vectors. It therefore considers both first order similarity, in which terms tend to appear together in the same document, and second order similarity, in which terms tend to appear in similar contexts of other words. Since terms which tend to appear together are likely to be mapped to the same "concept" in the latent space, similarity between vectors measures the closeness of those concepts, as expected of second order similarity.

The terms in the reduced space are represented as vectors in the latent space, denoted $\vec{LSA}(w_i)$. We follow the scheme used in (Gliozzo and Strapparava, 2005), in which an Inverse Document Frequency (idf) weighting is applied on top of the LSA representation of each term. The scheme multiplies each term vector by its corpus idf value and normalizes the vector:

$$\vec{LSA}_{norm}(w_i) = \frac{idf(w_i) \cdot \vec{LSA}(w_i)}{\sqrt{\langle \vec{LSA}(w_i), \vec{LSA}(w_i) \rangle}}$$

The representing vectors of the documents in the latent space, referred to as $\vec{LSA}(t)$, are obtained by averaging the latent space vectors $\vec{LSA}_{norm}(w_i)$ for each $w_i$ which appears in the document:

$$\vec{LSA}(t) = \sum_{\forall w_i \in t} tf(w_i, t) \cdot \vec{LSA}_{norm}(w_i)$$

where $tf(w_i, t)$ is the term frequency of the term $w_i$ in $t$. Similarly, the representing vectors of the categories, $\vec{LSA}(c)$, are obtained by averaging the LSA vectors of the seed terms of each category, constructed from the category name. The $\vec{LSA}(w_i)$ vector for each unigram term was supplied to us by A. Gliozzo and C. Strapparava. Hence, the documents were represented by unigram features in the LSA implementation. Bigram terms which represent a category were split into their unigram components and treated as separate words in the LSA implementation.

The LSA similarity score between documents and categories, $Sim_{LSA}$, is obtained by calculating the cosine similarity between the representing LSA vectors:

$$Sim_{LSA} = sim(LSA(c), LSA(t)) = \cos(\vec{LSA}(c), \vec{LSA}(t))$$
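To make the construction concrete, here is a minimal sketch in Python/NumPy, assuming a dense term-by-document count matrix M of size V × N (a plain truncated SVD stand-in for the fold-in methodology described above; dimNum = 400 as in section 4.1.1):

    import numpy as np

    def lsa_term_vectors(M, dim=400):
        """LSA(w_i): truncated-SVD term vectors, idf-weighted and normalized."""
        V, N = M.shape
        U, S, Vt = np.linalg.svd(M, full_matrices=False)
        term_vecs = U[:, :dim] * S[:dim]            # one latent row per term
        df = (M > 0).sum(axis=1)                    # document frequency per term
        idf = np.log(N / np.maximum(df, 1))
        norms = np.linalg.norm(term_vecs, axis=1, keepdims=True)
        return idf[:, None] * term_vecs / np.maximum(norms, 1e-12)

    def lsa_doc_vector(doc_tf, term_vecs):
        """LSA(t): tf-weighted sum of the normalized term vectors (doc_tf is a length-V tf row)."""
        return doc_tf @ term_vecs

    def sim_lsa(cat_vec, doc_vec):
        return float(cat_vec @ doc_vec /
                     (np.linalg.norm(cat_vec) * np.linalg.norm(doc_vec)))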

3.3.4. Gaussian Mixtures model

The final similarity score is obtained by employing a Gaussian Mixture (GM) model, which rescales the scores obtained for each document-category pair, to obtain scores on a common scale for all categories. The importance of the GM step stems from the need to classify each document to the best scoring category to acquire a final classification. Comparison between the scores of a document for each of the categories is more reliable when the scores are normalized through the GM estimation. The GM algorithm is described in detail in Appendix B.

In essence, the GM algorithm aims to estimate the probability that a document is classified to a category given the similarity score of the document and the category, i.e. $P(C \mid Sim_{LSA}(\vec{c}, \vec{t}))$ (referred to as $Sim(id_c, d_i)$ in Appendix B). The estimation of this probability is based on the assumption that the similarity scores are derived from two Gaussian Probability Density Functions (PDFs), for category and non-category documents. The GM uses the similarity scores achieved by the methods described above to obtain the parameters of the two Gaussians using an EM algorithm.

In our implementation, the similarity scores are taken from the cosine similarity results between the LSA vectors of documents and category seeds. For each $Sim_{LSA}$ score, measured by the cosine similarity function, the algorithm performs the following steps:

(i) It assumes that the similarity scores of the positive and negative documents in the (unlabeled) training set can be described by two "hypothetical" Gaussian distributions, for $C$ and $\bar{C}$, which compose the empirical distribution as a Gaussian mixture.

(ii) The conditional probability $P(C \mid Sim_{LSA}(\vec{c}, \vec{t}))$ is estimated by applying Bayes' theorem on the distributions of $C$ and $\bar{C}$.

The GM probabilities are then used as the scoring method for the context step in our method, denoted as Simcontext. Each document is classified to the category with the maximum value of $P(C \mid Sim_{LSA}(\vec{c}, \vec{t}))$:

$$Sim_{context} = P(C \mid Sim_{LSA}(\vec{c}, \vec{t}))$$
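A minimal sketch of this rescaling, assuming scikit-learn's GaussianMixture as a stand-in for the EM procedure of Appendix B:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def gm_rescale(lsa_scores):
        """Fit a two-component mixture (category / non-category) to one
        category's LSA similarity scores and return P(C | score)."""
        s = np.asarray(lsa_scores).reshape(-1, 1)
        gm = GaussianMixture(n_components=2, random_state=0).fit(s)
        c = int(np.argmax(gm.means_.ravel()))   # higher-mean Gaussian is C
        return gm.predict_proba(s)[:, c]        # Bayes posterior per document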

Unfortunately, it was impossible to use the GM model to rescale the entailment similarity score as well, since an entailment similarity score is obtained only for documents which include at least one of the category's entailing terms. Due to the lack of sufficiently many entailment rules at this point of our research, most of the documents obtain a zero score and therefore do not provide the input data needed for the GM algorithm. The zero scores could not be used as input for the GM estimation, since they cause the GM model parameters (mean and variance) to diverge to zero, and the algorithm reaches computational underflow. We also tried to disregard the zero scores, under the assumption that the corresponding documents were not actually classified as negative, but were unclassified due to data sparseness. Disregarding the zero scores given to documents which were not classified to any category is still insufficient due to the sparseness of the data: most documents obtain some positive score for only one or two categories, while they obtain a zero score for all other categories. Given that, omitting the documents which obtained a zero score for all categories still outputs a set of mostly zero scores for each category, and the original problem holds for this score set as well.

GM models could be employed on the LSA similarity results, since for LSA vectors a positive score is obtained for each category and document pair. The set obtained from the similarity scores for each document is sufficient as input for the GM algorithm to obtain the two groups of negative and positive classifications for each of the categories.

While in our system the scoring method which uses the GM probabilities as the similarity scores of documents and categories is only one of the steps within the complete method, it was used as the sole scoring method in (Gliozzo et al. 2005). We denote the scoring method based on the GM probabilities as Simcontext, as defined above, to reflect the probability that a document would be classified to the category. We refer to Simcontext as a baseline for evaluation purposes in section 4.

3.3.5. Combination of knowledge and context

As mentioned in section 2, basing TC methods on context similarity alone overlooks the importance of the actual appearance of the category topic within a document and evaluates only the general context of the category. We aim at a more accurate measure of relevance for a category by basing our method on entailment of the category name instead of mere evidence of contextually related text, and by combining it with a context model.

To combine the scores obtained by these two components of our scoring method, we examined integration of the two scores both by addition and by multiplication. The latter was found to be the more suitable method, producing better accuracy. Moreover, a multiplicative integration scheme combines the scores without being obliged to address the different scaling of the two methods. Therefore, the combined similarity score, denoted Simcombined, is obtained as follows:

$$Sim_{combined} = Sim_{entail} \cdot Sim_{context}$$

Using multiplication as the integration method of the scoring methods reduces the score of documents which contain entailing terms but relate to an irrelevant context. Moreover, when the score obtained by the entailment scoring method is equal to zero, the integrated score is also zero. Ideally, given perfect entailment knowledge, this means that when the text does not entail the category topic, it would not be classified to it even if it involves related context. For that reason, we find the combination, by multiplication, of each entailment scoring method with the context scoring method useful for increasing precision and gaining a more accurate measure of classification. The results obtained by using the Simcombined similarity measure, and all other possible combinations of the entailment scores available in our system with the context score, are evaluated and compared in section 4.

3.4. Binary classification methods

One of the goals of TC is to obtain a binary classification of each document, deciding whether it belongs to a given category or not. Categorization of documents can be done under two different approaches: classifying each document to a single category, referred to here as single-class, or classifying each document to several categories, referred to here as multi-class. Using the similarity score obtained by one of the scoring models described in sections 3.2 and 3.3, our method can initialize a preliminary set of labeled documents by classifying each document to the category for which it obtained the best score. This set of labeled documents can then be used to train a supervised classifier and obtain classification results, allowing also multi-class classification.

Classification bootstrapping The similarity scores obtained by the Simcombined measure were used to produce an initial labeled set of documents for training a supervised classifier. We used the initial labeled set, in which each document is considered as classified only to its best category, to train an SVM classifier for each category. For this purpose, we used SVMlight (Joachims, 1999), a state-of-the-art SVM classifier, with the documents' feature-value vectors as input. The features used were the same terms used for the scoring methods described above, namely the unigrams and noun phrase bigrams, with the square root of the term frequency as the value. The vectors were fed to the SVM learning module, whose goal is to find an optimal separating hyperplane between the positive and negative classifications of the documents. We used the default settings for most of the parameters, excluding the parameters j and c, which were manually tuned to obtain optimal classification. The SVM scores for each document-category pair can be used for either of the classification approaches, depending on user choice.
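For illustration, a minimal sketch of this bootstrapping step, assuming scikit-learn's LinearSVC as a stand-in for SVMlight; X is the document feature matrix and sim holds each document's Simcombined score per category (all names here are illustrative):

    import numpy as np
    from sklearn.svm import LinearSVC

    def bootstrap_classifiers(X, sim, C=0.01):
        """Label each document with its best-scoring category, then train
        one binary SVM per category on this initial labeled set."""
        n_docs, n_cats = sim.shape
        best = sim.argmax(axis=1)              # single-class initial labels
        classifiers = {}
        for c in range(n_cats):
            y = (best == c).astype(int)        # positives vs. all the rest
            # class_weight loosely plays the role of SVMlight's j parameter,
            # compensating for the skewed positive/negative proportion
            clf = LinearSVC(C=C, class_weight={1: n_cats - 1, 0: 1})
            classifiers[c] = clf.fit(X, y)
        return classifiers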

Appendix C includes more technical detail of the training procedure and the SVMlight parameters. More details about the parameter settings can be found in section 4.1.1 below.


4. Evaluation

The evaluation of the categorization method was performed over datasets dedicated to topical text categorization. These datasets supply a pre-defined set of categories, with training and test sets carrying gold standard annotation. The evaluation analyzes the performance of our method compared with a baseline based on the work of (Gliozzo et al. 2005). For this purpose we replicated their work and compared our scoring method, which uses the Simcombined similarity score, with their scoring method, based on the Simcontext similarity score. We also applied the bootstrapping phase on both scoring methods for comparison. We used the components of our scoring method, Simseed, Simwn, Simwiki and Simentail, as additional baselines, to evaluate the contribution of each. We first describe the data sets and settings of our evaluation (4.1), followed by an evaluation of our method in a ranking task (4.2), an evaluation of our method in a classification task (4.3), and then an evaluation on the Reuters-10 collection (4.4).

4.1. Data set and Pre-processing

In this section we describe the datasets used for the evaluation of our method and the data pre-processing steps.

20 Newsgroups The 20 Newsgroups corpus is a collection of newsgroup documents gathered from 20 different categories of the Usenet Newsgroups hierarchy, which are detailed in Table 2. We used the "bydate" version of the corpus (available at www.ai.mit.edu/people/jrennie/20Newsgroups), which is the recommended version for TC tasks. This version contains approximately 20,000 documents partitioned (nearly) evenly across the 20 categories and divided in advance into train (60%) and test (40%) parts.

The categories are taken from the Usenet hierarchy, which originally consists of eight major topics at the top level. This hierarchy also contains a non-topical category, which is more difficult to classify, since the documents that belong to it do not "discuss" its topic in the common manner. For example, the "Forsale" category, which is a non-topical category, belongs to the Miscellaneous branch, containing topics irrelevant to the other seven major topics. Moreover, some of the categories are topically closer than others, such as the three religion categories: "Atheism", "Religion" and "Christianity". Since those categories originate from different major topics at the top level of the hierarchy, they are not organized as sub-categories of each other or in any other formal relationship. Therefore, the semantic relations between those categories are not reflected in the taxonomy hierarchy, which creates difficulties in the categorization, as described in section 4.2.3.

Category                    Seed
alt.atheism                 atheism#n#1_2
comp.graphics               graphic#n#1
comp.os.ms-windows.misc     microsoft windows#n#-1
comp.sys.ibm.pc.hardware    ibm#n#-1;pc#n#1
comp.sys.mac.hardware       mac#n#-1;macintosh#n#-1
comp.windows.x              x11#n#-1;x-windows#n#-1
misc.forsale                sale#n#1
rec.autos                   car#n#1
rec.motorcycles             motorcycle#n#1
rec.sport.baseball          baseball#n#1
rec.sport.hockey            hockey#n#2
sci.crypt                   cryptography#n#1_2
sci.electronics             electronics#n#1
sci.med                     medicine#n#1_3
sci.space                   outer space#n#1
soc.religion.christian      christian#n#1;christian#a#2
talk.politics.guns          gun#n#1_2
talk.politics.mideast       mideast#n#1;middle east#n#1
talk.politics.misc          politics#n#3_4_5
talk.religion.misc          religion#n#1

Table 2 - Initial seeds for the 20 Newsgroups collection; each seed is described in the structure lemma#pos#wordnet-sense.

Overall, although the categories are represented in a flat hierarchy in the 20 Newsgroups collection, they can be divided into six main topically related subjects: scientific, computers, religion, politics, recreation and miscellaneous, the latter containing only the topic "Forsale". Another example of the complexity of this collection is that it also contains miscellaneous topics within the six subjects described above. For example, the politics branch contains categories such as "Guns" and "Middle East", in addition to a miscellaneous "Politics" category. Finally, it should be noted that manual examination of documents raised the hypothesis that some of the documents belong to more than the single category assigned to them by the gold standard. Given that the collection is based on online collaborative posting of opinions and discussions, it is highly possible that some of the postings should have been classified to several categories.

Despite its limitations, the 20 Newsgroups corpus is widely used for TC tasks, which enables comparison and further development of methods. In addition, the collection does not include cross-posts (duplicates) or the non-textual headers which often help classification (such as Xref, Newsgroups, Follow-up-To, Date).

The seeds used for category expansion and the corresponding WordNet senses chosen for them are listed in Table 2. The seeds were derived from the topic names themselves and are given in a lemma#pos#wordnet-sense structure.

Reuters-10 The Reuters-10 is a sub-corpus of the Reuters-21578 collection (available at http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html), constructed from the 10 most frequent categories in the Reuters taxonomy, which are detailed in Table 3. We used the Apté split of the Reuters-21578 collection, which is often used in TC tasks. The complete collection contains 12,902 documents for 90 categories, of which the 10 top categories include 9,296 documents. The documents are divided unevenly over the 10 categories, where most of the documents belong to the "Acquisition" and "Earn" categories. The documents are divided in advance into train (70%) and test (30%) parts. The collection's gold standard is multi-class, hence each document is classified to one or more categories.

The Reuters categories are domain specific, and are all related to economic topics. The categories are organized in a flat taxonomy structure, meaning there is no defined hierarchy over the categories in the corpus. Moreover, some of the categories are non-topical, such as the "Money-fx" category, which contains documents discussing foreign-exchange transactions; these are mostly given in a specific table structure, with no significant textual data contained in them.

For these reasons we chose to focus on the 20 Newsgroups collection for our evaluation. We found it more appropriate to the type of topical categorization addressed by our research. Furthermore, our research was conducted as part of a media research project, and it aims to improve TC accuracy for topical taxonomies. The 20 Newsgroups collection contains various topical categories, common interest categories and general subjects, which are more suited to this type of research. Given that, our analyses are based on the 20 Newsgroups collection, while the results for the Reuters-10 dataset are briefly described in section 4.4.

Category     Seed
acquisition  acquisition#n#1_2
corn         corn#n#1
crude        crude#n#1
earn         earn#v#1;earnings#n#1
grain        grain#n#2
interest     interest#n#4
money-fx     money#n#3;foreign exchange#n#1
ship         ship#n#1
trade        trade#n#1
wheat        wheat#n#1

Table 3 - Initial seeds for the Reuters-10 collection; each seed is described in the structure lemma#pos#wordnet-sense.

Pre-processing The textual documents were split into sentences and tagged for Part-Of-Speech (POS) using the Opennlp toolkit (version 1.3.0, available at http://opennlp.sourceforge.net/), and the tagged terms were then lemmatized using the Gate toolkit (version 3.1, available at http://gate.ac.uk/). The features considered for TC are noun, verb, adjective and adverb unigrams, and noun bigrams. We used standard feature selection to remove highly frequent features, which appeared in more than 20% of the documents, and infrequent features, which occurred fewer than three times in the corpus. All features were taken from the text part of the documents, omitting titles and headers.
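A minimal sketch of this frequency-based feature selection, assuming each document is given as a list of its extracted features (the name select_features is illustrative):

    from collections import Counter

    def select_features(docs, max_df=0.20, min_tf=3):
        """Keep features in at most 20% of documents and occurring
        at least three times in the corpus."""
        n_docs = len(docs)
        doc_freq, term_freq = Counter(), Counter()
        for doc in docs:
            term_freq.update(doc)
            doc_freq.update(set(doc))
        return {f for f in term_freq
                if doc_freq[f] <= max_df * n_docs and term_freq[f] >= min_tf}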

4.1.1. Experimental settings

The settings for the evaluation of our method require the tuning of several parameters for the different methods. For that purpose, the training documents of the 20 Newsgroups collection were split into two groups, training (60%) and test (40%), keeping the original proportions of the collection.

For the entailment method, described in section 3.3.2, three parameters were tuned:

(i) maxSense – the number of senses considered when expanding a term, which controls the level of infrequent synsets taken, was set to 4 (inclusive), for which the method obtained the most accurate results.



(ii) maxDepth – the depth of the hyponymy/meronymy hierarchy considered for the semantic relation expansions from WordNet, was set to 1. Preliminary experiments showed that when the entire hierarchy was considered, many irrelevant terms were added, creating noisy classification results.

(iii) efreqFilter – the number of most frequent English words filtered from the final expansion list (mostly from the Wikipedia expansions, since the intersection of the WordNet expansions with this list was empty). This parameter was set to the 500 most frequent words according to the Brown corpus (available at http://www.edict.com.hk/lexiconindex/frequencylists/words2000.htm), for which the method obtained the most accurate results.

The context model required a setting for the dimNum parameter, which controls the number of dimensions of the latent semantic space. We adopted the setting suggested by (Gliozzo et al. 2005), which sets the number of dimensions to 400.

The final step of our method applies a bootstrapping method to the initial labeled set of documents. As described in section 3.4, we used SVMlight for this purpose with its default parameters, excluding the parameters j and c, which were manually tuned. The parameter j was set to correspond to the number of categories, which hypothetically corresponds to the proportion between positive and negative training instances in each category, as suggested by (Morik et al., 1999). Therefore, the j parameter was set to 20 for the 20 Newsgroups collection, corresponding to its 20 categories, and to 10 for the Reuters-10 collection, corresponding to its 10 categories. The second parameter, c, was manually tuned and set to 0.01, which gave the best classification results on the development set described earlier.

4.2. Ranking

4.2.1. Ranking measure

As described in section 3, the first categorization goal which we examine is ranking, which ranks the documents of each category according to their score. A ranking evaluation measure quantifies the system's ability to rank documents for a given category, preferring a ranking which places documents that truly belong to the category before the others. Average precision is a common evaluation measure for system rankings, and is computed as the average of the system's precision values at all points in the ranked list where recall increases (Voorhees and Harman 1999). In our case, a point where recall increases corresponds to the ranking of a category document, as annotated in the gold standard. More formally, it can be written as follows:

$$AP(c) = \frac{1}{R} \sum_{i=1}^{n} E(i) \cdot \frac{CorrectUpToRank(i)}{i}$$

where $n$ is the number of documents classified to a specific category $c$ in the test set, $R$ is the total number of correct classifications in the test set for this category, $E(i)$ is 1 if the $i$th document is classified to this category according to the gold standard and 0 otherwise, and $i$ ranges over the documents, ordered by their ranking.

The score calculated by the average precision measure ranges between 0 and 1, where 1 stands for a perfect ranking which places all the category documents before the non-category ones. This value corresponds to the area under the non-interpolated recall-precision curve for the target category. Mean Average Precision (MAP) is defined as the mean of the average precision values over all the categories.
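The measure can be sketched directly from the definition (ranked_labels holds the gold labels of the documents in ranked order; both names are illustrative):

    def average_precision(ranked_labels, R):
        """AP(c): mean of precision values at the ranks where recall increases."""
        correct, ap = 0, 0.0
        for i, is_category_doc in enumerate(ranked_labels, start=1):
            if is_category_doc:            # recall increases at this rank
                correct += 1
                ap += correct / i          # precision at rank i
        return ap / R if R else 0.0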

4.2.2. Ranking results

We evaluated the ranking quality of the Simcombined scoring method described in section 3.3, using the MAP measure explained above. The scoring method ranks documents which contain at least a single occurrence of an entailing term, as explained in section 3. Given that, documents which do not contain any entailing term are not ranked by our method. We therefore considered the parameter R in the Average Precision measure to be the number of documents which contain at least one entailing term, which corresponds to the maximal coverage that can be obtained by the knowledge available to our system.

We compared our scoring method, which uses the Simcombined similarity score, to three baselines. The first ranks the documents by the Simseed similarity score, as explained in section 3.3.1. This illustrates a naive method which uses only the most basic information provided by the category names. The second baseline uses the Simentail similarity score, which relies only on entailment expansions to set the score. Finally, we took the Simcontext similarity score used in (Gliozzo et al. 2005), which uses an LSA similarity score rescaled by the GM algorithm to rank each document for all categories (we replicated the context method originally suggested by Gliozzo et al., and the results reported here are based on that replication). The methods were compared within the range of knowledge acquired by our method, meaning that the number of correct rankings considered for the MAP calculation of all methods was limited to the first R rankings which our method also ranked, covering an average of 37% of the documents. This comparison corresponds to comparing the area under the non-interpolated recall-precision curve of each of the methods up to the recall level achieved by using the Simcombined similarity score.

Figure 1 - Recall-precision curve for the sub-set of documents which match entailment knowledge. The results are shown for the Simcombined, Simcontext, Simentail and Simseed similarity methods within this range of documents.

Figure 1 presents the recall-precision curves obtained by using the following similarity scores: Simcombined, Simseed, Simentail and Simcontext. It shows that using Simcombined outperforms the other methods (almost) consistently by several percent. Although the score obtained by Simseed is extremely precise, it is limited in recall since it relies on limited knowledge. Consequently, the curve denoting this scoring method decreases rapidly, since for most of the categories recall does not reach 100% and the average precision decreases. Comparison between the two curves denoting the Simcombined scoring and the Simentail scoring shows that integrating the entailment knowledge with the LSA-based context model achieves higher precision, with an average improvement of 3 points. Moreover, the Simcombined score shows a steady advantage, of 5.5 points in average precision on average, over the Simcontext scoring method suggested in (Gliozzo et al. 2005). This implies that using the entailment hypothesis results in a more precise measure than relying on context modeling alone, as will be discussed in detail in the analysis section below.


In order to evaluate the contribution of each of the entailment expansion methods, we also compared the Average Precision gained by each of the scoring methods Simwn · Simcontext, Simwiki · Simcontext and Simentail, as can be seen in Figure 2. Since the scores obtained by the Simentail score are based on the knowledge gained by combining the WordNet and Wikipedia resources used for the Simwn and Simwiki scores respectively, the two partial scoring methods have a lower or equal number of correct classifications than Simentail. Consequently, the curves of those methods end at a lower recall point. It can be seen that the combination of the two methods not only improves the recall but also improves precision, by relying on more knowledge, which results in a more accurate estimation of the entailment evidence in the documents. It is also clear that the Simwiki score is highly precise, although limited in recall.

Figure 2 - Recall-precision curve for the sub-set of documents which match entailment knowledge. The results are shown for the entailment methods, Simentail, Simwn and Simwiki, integrated with the context method, Simcontext, within this range of documents.


Category       Simseed·Simcontext   Simwn·Simcontext   Simwiki·Simcontext   Simcontext   Simcombined
Atheism        0.31                 0.66               0.67                 0.56         0.67
Graphics       0.60                 0.60               0.60                 0.77         0.60
Ms-Windows     0.03                 0.03               0.68                 0.37         0.68
IBM PC         0.23                 0.23               0.23                 0.25         0.23
Macintosh      0.51                 0.51               0.55                 0.34         0.55
Windows-X      0.74                 0.74               0.74                 0.31         0.74
Forsale        0.77                 0.77               0.77                 0.92         0.77
Autos          0.74                 0.83               0.75                 0.86         0.83
Motorcycles    0.30                 0.88               0.35                 0.92         0.92
Baseball       0.55                 0.68               0.56                 0.97         0.69
Hockey         0.94                 0.94               0.94                 0.90         0.94
Cryptography   0.21                 0.97               0.23                 0.96         0.98
Electronics    0.57                 0.60               0.57                 0.23         0.61
Medicine       0.39                 0.83               0.46                 0.78         0.84
Space          0.01                 0.91               0.03                 0.87         0.90
Christian      0.26                 0.59               0.26                 0.57         0.57
Guns           0.62                 0.77               0.62                 0.84         0.72
Middle East    0.10                 0.89               0.10                 0.68         0.89
Politics       0.03                 0.20               0.03                 0.13         0.18
Religion       0.07                 0.22               0.07                 0.17         0.18
Average        0.40                 0.64               0.46                 0.62         0.67

Table 4 - MAP values for each of the methods, within the range of documents which contain entailing terms.

To conclude the ranking evaluation, Table 4 presents the MAP values calculated for the ranking of each category according to each of the following scores: Simcombined, Simwiki, Simwn, Simseed and Simcontext. Ranking by the Simcombined score achieves a higher MAP value than all other methods. In particular, it achieves a MAP value 5.5 points higher than ranking according to the Simcontext score, and obtains a higher MAP value in 12 of the 20 categories and a comparable MAP value in 2 of the categories. It should be noted that the MAP values of the methods which rely on less knowledge (the rankings based on Simseed, Simwn and Simwiki, which rely on partial entailment knowledge relative to the rankings according to Simentail and Simcombined) are lower on average, since the R parameter is larger than the number of correct classifications they achieve.


4.2.3. Analysis

Ranking the documents according to their score revealed interesting characteristics of each of the scoring methods and of topical categorization in general. As opposed to a single-class classification, the ranking approach shows the score that each document obtained under the scoring method used. It allows us to explore the different phenomena which affect the categorization. Below is a detailed analysis of some of the aspects in which our method showed improvements, followed by an error analysis explaining some of the causes of the errors and suggesting how our method can be improved in future work.

Figure 3 - Recall-precision curves for 4 categories of the 20 Newsgroups collection, over the documents for which our method has knowledge: "Medicine", which shows how rescaling by the context score increases the accuracy of the Simcombined method; "Electronics", which demonstrates higher accuracy for the entailment score than for the context score; "Atheism", which demonstrates misclassification at high rank; "Forsale", which demonstrates results for a non-topical category.

Successful ranking of a category's documents We first describe several aspects in which our method showed improvements and demonstrated appropriate behavior.

1. Passing reference: Our preliminary manual analysis of TC behavior, using the ranking by the Simseed score as a baseline, showed that the dominant phenomenon causing misclassifications is passing references. A passing reference tends to occur when the topic name, or any partial group of its characteristic terms, appears in a


document in its required sense, without being the main topic of the document. This phenomenon is relevant to all types of topics, including named entities such as software names which are commonly mentioned, general topics which may be discussed as an allegory, and objects which are referred to widely in the language. Table 5 shows several examples of documents which contain a passing reference to one of the 20 Newsgroups collection topics.

Our method is designed to identify this phenomenon by two different mechanisms. The first is the entailment expansion of the characteristic terms, which results in higher scores for documents which contain multiple occurrences of entailing terms. The entailment score integrated in our ranking method uses the cosine similarity score based on the square root of the term frequency of the entailing terms which appear in the document. This scoring method gives higher ranks to documents with occurrences of several different entailing terms. Re-occurring terms would also enhance the document score, but more moderately, since the term frequency is measured using a square root function; occurrences of different terms in the document contribute more to its total score. This scheme prefers documents which use multiple terms that entail the topic. Documents which only address the topic as a passing reference would most likely contain a single occurrence of one of the terms and will be ranked lower.
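As a worked illustration of this preference, under the square-root scheme a document containing four occurrences of a single entailing term contributes $\sqrt{4} = 2$ to the dot product with the (uniformly weighted) category vector, whereas a document containing four different entailing terms, each occurring once, contributes $4 \cdot \sqrt{1} = 4$ and is therefore ranked higher, assuming comparable document norms.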

The second mechanism which decreases the score obtained by documents with a passing reference is the use of context models. When a term which entails a certain topic appears out of context, a context model should give the document a lower score, since its context is irrelevant for this topic. Indeed, by multiplying the score obtained by the entailment scoring method with the context score, documents which obtained a high entailment score were re-ranked due to a low context score. Figure 3 demonstrates the improvement in accuracy when using this mechanism for the "Medicine" category: the irrelevant documents which decrease the precision are re-ranked at the bottom of the ranked list in the Simcombined curve. High scores of documents which received a high entailment score are decreased by the context model multiplication. Unfortunately, when the context model fails to identify the irrelevant context correctly, this mechanism does not improve the ranking; we discuss the misclassifications caused by this later in this section. Figure 4 demonstrates the improvement gained by using the Simcontext score for the "Autos" category, in which false positive rankings are rescaled to lower scores while true positive rankings are (mostly) left the same.
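As a hedged illustration of this multiplication scheme (with invented scores, not data from the experiments), the following Python sketch shows how a passing-reference document with high entailment evidence but an off-topic context drops to the bottom of the ranking:

```python
# Invented scores for three documents with respect to one category.
docs = {
    "doc_topical": {"entail": 0.9, "context": 0.8},
    "doc_passing": {"entail": 0.9, "context": 0.1},  # passing reference, wrong context
    "doc_weak":    {"entail": 0.3, "context": 0.7},
}

# Simcombined-style score: entailment score multiplied by context score.
combined = {name: s["entail"] * s["context"] for name, s in docs.items()}
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)  # ['doc_topical', 'doc_weak', 'doc_passing']
```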

Gold Standard Category | Method's Classification | Document Example
Religion | Christianity | "…The fictional Christian or Moslem or Jew who is…"
Windows-X | MS-Windows | "…I was wondering if Microsoft had bought Xhibition? ... I thought Xhibition was for "X-Windows"…"
Forsale | IBM Pc | "…the title says it all (not IBM brand)"
Graphics | Macintosh | "Which newsgroup discusses graphic design on PCs and macs? ..."
Guns | Forsale | "…my opinions are mine and are not for sale…"

Table 5 – Document samples for the passing reference phenomenon.

2. Topically close categories: topically close categories are mostly sister terms of the same level in the topical taxonomy hierarchy. In the 20 Newsgroups collection, for instance, topically close categories exist as sister terms in the computers group of topics, such as "MS-Windows" and "IBM Pc". Topically close categories also exist in different branches of the taxonomy: the "Electronics" topic, for example, is part of the science branch but is highly related to the topics in the computers branch, mainly the computer hardware topics.

Analysis of the ranking obtained using the Simcontext score shows that this model fails to differentiate clearly between closely related topics. Typically, for a document which belongs to a category that has several close topics in the category set, the context model scores the document similarly for all those categories. A document which received a high score for computer "Graphics", for instance, would mostly be ranked likewise for the "Electronics" science topic. The entailment models, however, expand each category name with terms which entail that particular category specifically, and therefore obtain better results empirically. The similarity of the terms' contexts has minimal effect on the entailment model ranking, since it is not designed to judge context. Nevertheless, there is another aspect of topically close categories in which our full model tends to misjudge documents; it is discussed in more detail below, under the "Taxonomy structure" subsection. Figure 3 demonstrates the superior results of the Simentail based ranking over the Simcontext based ranking for the topically close category case, showing the curves for the "Electronics" category. It can be noticed that the ranking based on the Simentail

score obtains the best accuracy for the "Electronics" category, while the ranking according to the Simcontext score fails to obtain high precision. Moreover, the combination of the two methods obtains results comparable to using Simentail alone, since the Simcontext scores are not accurate enough.

3. Infrequent seed term: a common difficulty which may cause classification errors is terms which have low frequency in the collection, i.e. infrequent terms. Methods which base their scoring on the collection's data, such as distributional, statistical and co-occurrence methods (among them the Simcontext based method), cannot obtain sufficient data for those categories. Analysis of the ranking according to the Simcontext scores, for example, suggests that the low frequency of certain terms results in poor LSA vectors for describing those categories. When the topic name appears rarely in the collection, most of the examples appear negative to the unsupervised LSA method, and the method cannot collect sufficient data to capture the category's characteristics. For example, the category name of the "Windows-X" category is relatively rare, and the Simcontext scores obtained for it are inaccurate, resulting in the poor MAP results seen in Table 4.

Our method is not restricted to the amount of knowledge which exists in the collection itself. It uses external resources to enrich the knowledge about the categories and exploits it to make better classifications. Categories such as "Middle East", "Macintosh" and "MS-Windows" obtained poor ranking results using the Simcontext score, which is based on co-occurrence data from the collection, compared with our method, as demonstrated in Figure 3. Moreover, the extent of external knowledge used by our system increases both recall and precision. The recall increases since more entailing words are considered, so a score can be provided for a larger portion of the documents; the precision increases since the similarity score calculation is based on more data and thereby reflects the frequency of entailing words in each document more accurately. The effect of infrequent terms on the scoring quality is also reflected in Figure 4, which demonstrates the Simcontext influence on the Simcombined score. It can be seen that the combined scores for "Middle East" are mostly decreased by the context score, for both false and true positive rankings; the context of the "Middle East" category is thus not recognized accurately enough.

Nevertheless, we should point out the advantage of a collection-based method, which can be used to complement external knowledge. Collections are often biased and reflect topic frequencies which differ from their frequencies in general English.


For example, the most frequent sense of the topic name "Space" in the 20 Newsgroups collection is space as a science, while in general English this is only the fourth sense according to sense frequency. Using scoring methods based on the collection data can be helpful for disambiguation and tuning, while not limiting the knowledge learned by our method.

Error analysis  Several error cases that were detected and categorized are detailed below. The first type of error causes irrelevant documents to be ranked within categories to which they do not belong. The second type causes scaling errors, where documents are ranked higher (or at the same rank) for the parent of the category they truly belong to than for the true category according to the gold standard; in this case the documents are mostly relevant to the parent category at some level, in terms of being a close topic or having some context relevancy. The third type addresses collection annotation problems and discusses the single-class classification complexity. The last case addresses a difficulty which originates in the category definitions.

1. Ambiguity of expanding terms: the seed terms of our method are extracted from the topical taxonomy of the TC collection, by taking the topic name as the category seed. Ambiguity of the topic name within the collection is rare, since it is chosen to be very precise and to capture the full meaning behind the topic. However, our method adds entailment expansions to the seed term to represent the category, and one of the reasons for high ranking of irrelevant documents is ambiguity of these expanding terms.

As described in section 3.3.1, we employed an initial method to filter ambiguous terms originating in the WordNet expansion, using the WordNet sense information as a threshold parameter. Unfortunately, it made only a partial improvement, and infrequent senses of terms were still added as expanding terms. For example, the noun "steal" was added as an expansion to the category "Baseball" in its second sense.
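As a hypothetical illustration of such a sense-rank filter (assuming NLTK with the WordNet corpus installed; the exact filter and threshold used in the thesis may differ), one could keep an expanding term only if the sense linking it to the category is among the word's k most frequent senses:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def sense_rank_ok(word, category_synset, max_rank=1):
    # NLTK lists synsets in decreasing order of sense frequency, so this keeps
    # the term only if the linking sense is among its max_rank top senses.
    senses = wn.synsets(word, pos=wn.NOUN)
    return category_synset in senses[:max_rank]

# "steal" relates to baseball only through its second noun sense, so a
# threshold of one sense would filter it out of the "Baseball" expansion.
baseball_sense = wn.synsets("steal", pos=wn.NOUN)[1]
print(sense_rank_ok("steal", baseball_sense, max_rank=1))  # False
print(sense_rank_ok("steal", baseball_sense, max_rank=2))  # True
```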

2. Taxonomy structure: taxonomy structure is another aspect of topic closeness, which was discussed earlier as a challenge our method addresses relatively successfully. Hypothetically, the structure of a given taxonomy for text categorization should be hierarchical, such that each topic logically contains the terms below it in the hierarchy. Miscellaneous topics, by this logic, should include documents that belong to the parent topic while not belonging to any co-hyponym topic under that parent. In the 20 Newsgroups collection the documents are categorized by user annotation, owing to their nature as newsgroup postings. The categorization, as well as the basic structure of the taxonomy, does not always follow the hierarchical reasoning described above.

For example, the category "Atheism" is not a sister term of the category "Religion" in the given taxonomy, nor is the category "Christianity"; each of these three categories has a different parent in the hierarchy of the collection's taxonomy. This topical closeness creates classification difficulties for the scoring methods, both entailment-based and context-based. For the entailment methods, terms which entail both the term and its hypernym (its parent in the hierarchy) can be used for expansion, which results in the same documents being ranked highly for both. The context model, on the other hand, identifies similar contexts as the representing context of both categories.

Figure 3 demonstrates this type of error for the "Atheism" category. It can be noticed that all methods obtain low precision at low recall, indicating that documents which are irrelevant according to the gold standard were ranked at the beginning of the list. It can also be noticed that the Simentail score obtains better results, since it relies on entailing words, which can be distinguished better than context for those categories. We believe that this type of mistake can be avoided by exploiting the taxonomy structure in the classification procedure, that is, by ranking the documents for each level of the taxonomy iteratively, corresponding to the topics' semantic relations. We discuss the details of this idea in section 5.

Gold Standard Category | Method's Classification | Document Example
Christianity | Atheism | "…I am interested in finding out why people become atheists…"
MS-Windows | IBM Pc | "Because of the technology… IBM PC can't read them without special hardware…"
Macintosh | Forsale | "I have for sale a Hayes 2400…"
Religion | Christianity | "If a Christian means someone who believes in the divinity of Jesus…"
Guns | Autos | "…question quiz and to drive a car around the block…Most states do not require the registration of cars that are not…"

Table 6 - Document samples for missing annotations.


3. Gold standard missing classifications: this type of error causes a significant number of false rankings and might also be the cause of some errors of the other types described in this section. The gold standard of the 20 Newsgroups collection classifies each document to a single category, based on the newsgroup to which the user posted the document. Some documents contain a comparison between topics discussed in different categories, such as "Baseball" vs. "Hockey", or "Autos" vs. "Motorcycles". Other documents discuss topically related categories, such as a "Religion" discussion which relates to different religions, among them "Christianity". Looking at the 10 top-ranked documents of each category, out of 66 errors we identified 30 documents which should be classified to both categories: the gold standard one and the one identified by our method. Table 6 shows several examples of missing annotations in the 20 Newsgroups collection.

Figure 4 - Context scoring influence. The influence of the context scoring on the entailment scoring as reflected in the Simcombined score, where the y-axis is the score and the x-axis is the number of documents. The dark (blue) curve stands for the Simentail score and the bright (pink) curve stands for the Simcombined score. It can be noticed that the context score mostly decreases false positive scores for the "Autos" category, whereas for the "Middle East" category it decreases the scores of true positives as well.

4. Non-topical categories: our method aims to capture the topic of each category in order to make correct classifications. Non-topical categories, which do not gather documents concerning a joint topic, fall outside the scope we define for the text classification task. The category "Forsale" in the 20 Newsgroups collection is one example of a non-topical category. Figure 3 shows that, indeed, our method obtains lower accuracy for this category than the context model. We aim to use a corpus of topical categories in future work, since we believe it would better reflect the essence of the task.

4.3. Classification

4.3.1. Classification measure

The second goal of TC is the binary classification of documents into the predefined categories. We report the results of the scoring-based classification, which uses cosine similarity to obtain a classification score for each document-category pair. As in the ranking evaluation, we report results using the Simcombined score-based method as the final scoring method of our system, and the Simcontext score-based method as the baseline proposed by Gliozzo et al. (2005). In addition, we present the results obtained by each of the partial score-based methods from which the Simcombined score is composed: Simseed, Simwn, Simwiki and Simentail. We also report the classification results obtained by the bootstrapping step, both for our final method based on the Simcombined score and for the baseline method based on the Simcontext score.

Given the gold standard of the collection used for the evaluation, standard accuracy measures can be calculated. For each evaluated method the following measures were calculated:

\[ \mathrm{Recall} = \frac{\#\text{correct classifications per category}}{\#\text{documents in category}} \]

\[ \mathrm{Precision} = \frac{\#\text{correct classifications per category}}{\#\text{documents classified to category}} \]

\[ F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
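As a small illustrative sketch (with invented counts, not the thesis data), the micro-averaged versions of these measures sum the counts over all categories before computing the ratios:

```python
def micro_average(per_category):
    # per_category: (correct, #docs in category, #docs classified to category) tuples.
    correct = sum(c for c, _, _ in per_category)
    in_category = sum(g for _, g, _ in per_category)
    classified = sum(p for _, _, p in per_category)
    recall = correct / in_category
    precision = correct / classified
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Two hypothetical categories: 30/100 and 20/50 documents correctly retrieved.
print(micro_average([(30, 100, 50), (20, 50, 40)]))
```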

We report the micro-average values of these measures for each of the methods, and the detailed per-category results for the methods based on Simcombined and Simcontext. As in the evaluation of the ranking goal, we calculate the results of all methods over the maximum knowledge obtained by our most comprehensive score, Simcombined. For the 20 Newsgroups collection, the Simcombined knowledge covers about 60% of the documents in the test set, and this portion was therefore treated as the full collection for the calculations. Only documents which were given a score by Simcombined participated in the evaluation. This perspective lets us measure the results over the amount of knowledge we are currently capable of obtaining, within which the comparison between the two methods is meaningful.

The classifications for all datasets were performed with the single-class classification approach for all score-based methods. The classification for the Reuters-10 collection according to the bootstrapping results in section 4.4 was performed with the multi-class classification approach, since the gold standard for this collection is given as a multi-class classification.

4.3.2. Classification results

TC tasks have the advantage that each document can be classified to one or more categories, depending on the application requirements. We followed the classification standard used in the collection settings. Therefore, the classification for the 20 Newsgroups collection was a single-class classification, meaning each document was classified to a single category. Documents were classified to the best-scoring category according to the score obtained by the scoring method employed.
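The decision rule itself is simple; the following sketch (illustrative only, using the lexicographic tie-breaking that section 4.3.3 describes) assigns a document to its best-scoring category:

```python
def classify(doc_scores):
    # doc_scores: category name -> score for one document.
    best = max(doc_scores.values())
    tied = sorted(c for c, s in doc_scores.items() if s == best)
    return tied[0]  # ties broken by lexicographic order of category names

print(classify({"Hockey": 0.8, "Baseball": 0.8, "Autos": 0.2}))  # 'Baseball'
```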

Scoring method | Recall | Precision | F1
Simseed | 0.19 | 0.55 | 0.28
Simwn | 0.29 | 0.56 | 0.38
Simwiki | 0.22 | 0.57 | 0.31
Simentail | 0.31 | 0.57 | 0.40
Simcombined | 0.32 | 0.58 | 0.41
Simcontext | 0.30 | 0.55 | 0.39
Bootstrap Simcombined | 0.33 | 0.63 | 0.44
Bootstrap Simcontext | 0.51 | 0.53 | 0.52

Table 7 - Micro-average classification results for all methods within the portion of the 20 Newsgroups collection covered by the entailment knowledge.14

Table 7 presents the classification results for the methods composing our Simcombined score and for the Simcombined based method itself, showing the advantage of the Simcombined based method in both recall and precision. In addition, Table 7 shows the classification results for the bootstrapping step applied to the initial classified document sets obtained by the Simcombined and Simcontext scoring methods.

14 The documents considered for the recall and precision calculations of the Simcontext based method are only documents which contain at least a single piece of entailment evidence, which is 55% of the documents.


The bootstrapping step achieves different results depending on which of the two methods created the labeled document set. It achieves higher precision based on the Simcombined initial set of documents, and better recall based on the training set constructed by the Simcontext score. The reason lies in the accuracy of the SVM separation on the training set (the SVM algorithm is described in Appendix C). The SVMlight algorithm achieved 97.74% average precision on the training set when separating the sample documents obtained using the Simcombined score, while it achieved an average precision of 50.51% on the sample training documents based on the Simcontext score. This implies that the sample documents obtained using our method, Simcombined, form a more precise set, which enables a cleaner separation with fewer misclassifications on the training set.
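The following hedged sketch (using scikit-learn's LinearSVC as a stand-in for SVMlight; the texts, vectorizer settings and initial labeling are illustrative, not the thesis setup) shows the shape of this bootstrapping step: train an SVM on the documents the unsupervised scores managed to label, then re-classify the whole collection:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def bootstrap(docs, initial_labels):
    # docs: raw texts; initial_labels: a label per document, or None if unscored.
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)
    labeled = [i for i, y in enumerate(initial_labels) if y is not None]
    clf = LinearSVC()
    clf.fit(X[labeled], [initial_labels[i] for i in labeled])
    return clf.predict(X)  # re-classify every document, labeled or not

docs = ["the nhl playoffs start", "new pc motherboard released", "ice hockey score"]
print(bootstrap(docs, ["Hockey", "IBM Pc", None]))
```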

Category | Simcombined (Recall / Precision / F1) | Simcontext (Recall / Precision / F1)
Atheism | 0.24 / 0.73 / 0.36 | 0.43 / 0.49 / 0.46
Graphics | 0.19 / 0.50 / 0.28 | 0.22 / 0.43 / 0.29
Ms-Windows | 0.37 / 0.66 / 0.48 | 0.35 / 0.61 / 0.44
IBM PC | 0.24 / 0.26 / 0.25 | 0.32 / 0.30 / 0.31
Macintosh | 0.35 / 0.54 / 0.43 | 0.14 / 0.19 / 0.16
Windows-X | 0.06 / 0.75 / 0.11 | 0.06 / 0.24 / 0.10
Forsale | 0.33 / 0.68 / 0.44 | 0.23 / 0.83 / 0.36
Autos | 0.61 / 0.62 / 0.62 | 0.66 / 0.72 / 0.69
Motorcycles | 0.48 / 0.90 / 0.63 | 0.50 / 0.90 / 0.65
Baseball | 0.34 / 0.76 / 0.47 | 0.38 / 0.90 / 0.54
Hockey | 0.34 / 0.96 / 0.50 | 0.43 / 0.72 / 0.54
Cryptography | 0.38 / 0.86 / 0.52 | 0.38 / 0.93 / 0.54
Electronics | 0.12 / 0.54 / 0.20 | 0.10 / 0.51 / 0.17
Medicine | 0.38 / 0.60 / 0.47 | 0.28 / 0.77 / 0.41
Space | 0.33 / 0.68 / 0.45 | 0.34 / 0.84 / 0.48
Christianity | 0.60 / 0.57 / 0.58 | 0.56 / 0.58 / 0.57
Guns | 0.25 / 0.52 / 0.34 | 0.32 / 0.66 / 0.43
Middle East | 0.27 / 0.80 / 0.41 | 0.02 / 0.53 / 0.05
Politics | 0.15 / 0.27 / 0.19 | 0.08 / 0.13 / 0.10
Religion | 0.16 / 0.15 / 0.15 | 0.11 / 0.12 / 0.11
Micro average | 0.32 / 0.58 / 0.41 | 0.30 / 0.55 / 0.39

Table 8 - Classification results per category for the Simcombined and Simcontext methods for the 20 Newsgroups collection.

Table 8 presents the classification results obtained by the Simcombined and Simcontext scoring methods for each of the categories, as well as the micro-average results of these scoring methods. Our method shows results comparable to the Simcontext based method. A full analysis of both tables is given in the following section.

4.3.3. Analysis

The type of classification used in our method was determined by the scheme applied in the gold standard of the test collection. The 20 Newsgroups gold standard is annotated as a single-class classification, and the analysis is therefore based on the results obtained by this scheme. Single-class results tend to be misleading for three main reasons:

(i) Flat taxonomy structure – TC taxonomies should by definition be hierarchical, to encapsulate the relations between the topics: the hypernym of a group of topics should be their parent in the taxonomy. The 20 Newsgroups collection is not structured in this manner. For example, the "Religion" topic is a sister topic of "Christianity", and the "MS-Windows" topic is a sister term of "Windows-X". Intuitive annotation rules would, for instance, automatically classify documents which belong to the "Christianity" topic to the "Religion" topic as well. Since single-class classification selects a single category per document, this hierarchical classification cannot be obtained. This is one of the causes of the classification errors which we identify later in this section.

(ii) Equal scores for different categories – If a document obtains identical scores for several categories, only one of them can be selected as the "best" category in the single-class scheme. The method can then choose the first category in some fixed order, or one of the categories at random. Our method simply chooses the first category in lexicographic order. We also tried random selection, which did not obtain better results and, on the contrary, caused inconsistent classifications which made the analysis more difficult.

(iii) Gold standard missing classifications – as described in section 4.2.2, taking the missing classifications into account, it is possible that the classification method classified such a document to both of the required categories, while the one chosen was precisely the one the gold standard classification left out.


These three difficulties of single-class classification were taken into consideration when analyzing the classification results. They also underscore the advantage of the ranking-based evaluation, which lets us concentrate on the scoring results without having to account for these weaknesses of the evaluation. Below we describe the classification phenomena which correspond to each type of method, entailment-based and context-based. The entailment analysis is based on the results using the Simentail score, and the context analysis on the Simcontext score. We first describe the entailment phenomena, then the context phenomena, and conclude with an error analysis.

Entailment behavior  Document classification based on entailment knowledge depends on the quality of the expansions of the seed terms. Following are the main types of category topics, each of which requires a different type of entailment expansion. The expansions needed may differ both in the semantic relation required for the expansion and in the resource in which they can potentially be found. These differences influence the number of expansion rules our method succeeded in extracting for each category type, and also the accuracy of the classifications. The four types are described below:

1. General topic: includes categories which describe a general topic or field of interest. Categories which belong to this group are the science categories, such as "Medicine" and "Space"; religion categories, such as "Atheism" and "Religion"; the "Politics" category; and the technical categories "Graphics" and "Electronics", which describe a general field in the technical world rather than a specific brand or machine. This type of category requires expansion to its sub-fields, such as types of medicine, types of religion, etc. It can also benefit from expansions to names of people identified with the field, such as known politicians, known cryptography scholars, etc.

Therefore, the quality of classification for this type of topic depends on the amount of knowledge obtained from our resources and on the terms' likelihood to appear outside their context. The popular fields of interest obtained greater coverage from WordNet. "Space", for example, in the sense of the science, was added only in the version of WordNet used for our method (3.0) and has a limited amount of data related to it. Moreover, while Wikipedia is a potentially good resource for the second type of expansion, the expansion rules reached insufficient

coverage for this type of category. We wish to investigate using derivations and synonyms as input to the Wikipedia resource, to enhance the amount and quality of the data we can obtain from Wikipedia.

2. Commercial brands: this sub-group of categories relates to types of products available on the market. WordNet does not contain most of them, and their expansions are therefore mostly based on Wikipedia data. We found Wikipedia to be a promising resource for expansions of this sort, due to its continuously updated and growing nature: new products and their commercial brands are often covered in Wikipedia articles. Indeed, the categories "MS-Windows", "IBM Pc", "Macintosh" and "Windows-X" obtained very precise expansions and showed a significant improvement in recall due to them. The reason is that these expansions are far more indicative than the general context of those computer products.

3. Classes: under this group of categories we find classes of NPs, such as "Autos", which contains types of cars; "Motorcycles", which includes types of motorcycles; "Baseball" and "Hockey", which include the sport teams of their respective sports; and "Guns", which describes the class of guns. On top of synonyms and domain terms which entail the topic, the main type of entailing terms needed as expansions for this group of categories is class members. WordNet and Wikipedia are complementary resources with regard to the types of expansions they can supply, and both are therefore needed for this type of category.

The accuracy obtained for this group depends mainly on the likelihood of the category's entailing terms to appear outside their context and on their frequency in the general language. The expansions obtained for this group are very preliminary in scope, and we aim to improve them in future work. Their likelihood to appear in the context of other categories therefore becomes more influential and decreases precision, on top of the low recall caused by the limited expansions.

It should be noted that the "Middle East" category falls between the types we defined for the categories. It is close to the Classes type, since it includes the geographical regions and countries of the Middle East; as a collective group, most of the expansions for that topic originated from the meronym relation in WordNet, which manual analysis found to be highly precise.


Table 9 - Simcombined confusion matrix. Each row specifies the categories' distribution according to the gold standard, while each column specifies the category to which the method classified the documents. The true positive classifications, i.e. documents classified to the correct category by the method, are shaded.
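As a small illustration (with invented labels, not the experiment's data) of how such a confusion matrix is computed, each cell counts the documents of a gold-standard category (row) that the method classified to a given category (column):

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    # counts[(gold category, predicted category)] = number of documents
    return Counter(zip(gold, predicted))

gold = ["Religion", "Religion", "Atheism", "Medicine"]
pred = ["Christianity", "Religion", "Religion", "Medicine"]
print(confusion_matrix(gold, pred))
```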


4. Non-topical: this group includes the "Forsale" category. It is hard to expand a category name which is not a topic, and the accuracy obtained for it therefore depends mainly on the accuracy of the original category name.

Context behavior  The analysis of the context classification showed that the context model used does not recognize the topic of a document, but rather divides the documents into semantic clusters. This phenomenon can be observed clearly in the confusion matrix of the Simcontext based method in Table 10. Although this is the expected behavior of a context model, in practice it enables us to use the model as an indicator of a general context rather than of a specific one. For instance, it can indicate when the topic name "Medicine" appears in a "Religion" context, by giving the document a low context score for its likelihood to discuss a "Medicine" context. However, if the "Atheism" topic name appears in a "Religion" context, the context model is of little help for such disambiguation.

Overall, for each semantic domain which includes several topics we can identify the dominant topic of the domain: the topic to which the largest number of documents from this semantic domain is classified. For example, most of the documents in the religion topical context obtain their highest score for the "Christianity" category, which results in the highest number of classifications of religion-related documents to this category. Another example can be found in the technical categories domain, where the leading category in terms of scores and number of classifications is the "IBM Pc" category. The leading categories obtained the best accuracy under this method's classification, and the rest of the categories in the domain received correspondingly lower accuracy results.

It should be noted that for infrequent category names the Simcontext scoring method obtained poor statistics and accordingly yielded low accuracy. Examples of such category names are the "Windows-X" and "Middle East" categories, which obtained low accuracy compared with the Simcombined scoring method. Table 9 presents the confusion matrix for the Simcombined score, in which the superior results for infrequent category names can be seen, since the method is based on external resources for expansions. Moreover, as opposed to the phenomena noticed for other categories, infrequent category names also suffer from inexplicable classification mistakes made by the context model, probably due to general context words rather than topical context.


Table 10 - Simcontext confusion matrix on the portion of the corpus for which Simcombined acquired knowledge. Each row specifies the categories' distribution according to the gold standard, while each column specifies the category to which the method classified the documents. The true positive classifications, i.e. documents classified to the correct category by the method, are shaded.


For example, the space document summarized by the line "Want to obtain fax/email address for Planetary Society" obtains a high score for the "MS-Windows" category, with no clear reason for this high score.

Error analysis  The error analysis of the classification mistakes mostly corresponds to the behavior described in this section for each of the methods, and to some of the error types described in section 4.2.2. We therefore summarize below the main causes of classification errors:

1. Taxonomy structure: classification to a single category when the category and its hypernym are both sister terms in the given flat structure. The document is classified to the "stronger" category, which may be the wrong one. For example, if a "Christianity" document uses mostly general religion terms, it might be classified only to the "Religion" category, whereas according to the gold standard it belongs to "Christianity". Both the Simcombined and Simcontext scores are affected by this cause of errors. For instance, Table 9 shows the Simcombined misclassifications of "Windows-X" documents to the "MS-Windows" category, as well as of "Atheism" and "Christianity" documents to the "Religion" category.

2. Identical scores: sometimes documents obtain identical scores for several categories and are arbitrarily classified to one of them, which may be the incorrect one. This type of error mostly occurs when the Simcontext based method is used for context identification for a category which belongs to a broader semantic class. For example, for a document which received high scores for several computer categories, the context model is unable to rescale the scores appropriately.

3. Ambiguity of expanding terms (as for ranking): when a category name is expanded with an ambiguous term, it results in low scores for documents from several unrelated categories in which this term appeared in one of its other senses. An example of such a term is the noun "steal", which relates to baseball in its second most frequent sense – "a stolen base; an instance in which a base runner advances safely during the delivery of a pitch" – but is more frequent in its first sense – "an advantageous purchase". Most of the documents which obtain a high score based on this type of term do not belong to the corresponding category. Moreover, the score would mostly be based solely on the single ambiguous term. If the document was given a score for this reason alone, and other categories did not

obtain any evidence appearing in it, the document would be mistakenly classified to the inappropriate category.

4. Single appearance of a single word: the bottom of the ranked list consists mainly of documents which obtained low scores based on a single appearance of a single term. This section of the list contains more errors and irrelevant documents. The context model sometimes cannot significantly decrease the score when the document belongs to a related topical category, such as a computer category other than the one the document belongs to. If a document obtained a score for only a single category, it is classified to it, even though the classification is likely wrong. We aim to expand the amount of knowledge acquired by the entailment methods in order to improve accuracy and diminish this error type. Another possible solution for this type of error, which will be discussed in section 5, is applying the GM algorithm or a similar algorithm to rescale the classification scores and enable a common threshold for filtering them. In the ranking analysis this is not a dominant error cause, since those documents are ranked at the bottom of the list.

5. False negative errors: this type of error is generally caused by infrequent terms and insufficient data. The expansion method acquired minimal knowledge for such topic names, e.g. "Space" in the science sense and "Windows-X". The result is low recall, and sometimes decreased precision for documents which obtain low scores for other categories and no score for their true category, so that they are misclassified due to lack of knowledge.

4.4. Reuters-10 results

Category | Simseed ⋅ Simcontext | Simwn ⋅ Simcontext | Simwiki ⋅ Simcontext | Simcontext | Simcombined
Acquisition | 0.24 | 0.78 | 0.24 | 0.93 | 0.80
Corn | 0.66 | 0.94 | 0.66 | 0.61 | 0.94
Crude | 0.21 | 0.91 | 0.21 | 0.80 | 0.91
Earn | 0.08 | 0.83 | 0.11 | 0.53 | 0.82
Grain | 0.42 | 0.98 | 0.42 | 0.93 | 0.98
Interest | 0.47 | 0.47 | 0.47 | 0.84 | 0.47
Money-fx | 0.32 | 0.51 | 0.38 | 0.58 | 0.51
Ship | 0.87 | 0.99 | 0.87 | 1.00 | 0.99
Trade | 0.75 | 0.74 | 0.75 | 0.89 | 0.81
Wheat | 0.97 | 0.97 | 0.97 | 0.85 | 0.97
Average | 0.25 | 0.41 | 0.25 | 0.40 | 0.41

Table 11 - MAP values for each of the methods, within the range of documents which contain entailment terms. The best result for each category is indicated in bold.


In this section we present the results obtained for the Reuters-10 collection. As mentioned above, we decided to focus most of our evaluation work on the 20 Newsgroups collection rather than on Reuters-10. The main reason is that the 20 Newsgroups categories better suit the topical categorization scheme addressed by this research: its categories are not domain specific, and its taxonomy is better structured than the flat Reuters-10 taxonomy.

The ranking results for the Reuters-10 corpus for Simcombined, Simwn, Simwiki, Simseed and Simcontext are presented in Table 11. The table gives the MAP values for all 10 categories within the range of knowledge obtained by the entailment methods, which covers on average 87% of the documents in the collection. It shows an advantage for the Simcombined method in half of the categories (5 out of 10). Overall, the average MAP achieved by Simcombined slightly outperforms the average of the Simcontext score, by 1.1 points. It should be noted that several of the categories in which Simcombined did not achieve a better MAP value than Simcontext are non-topical, such as the "Money-fx" category.

Scoring method | Recall | Precision | F1
Simseed | 0.22 | 0.67 | 0.33
Simwn | 0.67 | 0.78 | 0.72
Simwiki | 0.24 | 0.68 | 0.35
Simentail | 0.69 | 0.80 | 0.74
Simcombined | 0.66 | 0.77 | 0.71
Simcontext | 0.47 | 0.54 | 0.50
Bootstrap Simcombined | 0.78 | 0.74 | 0.76
Bootstrap Simcontext | 0.66 | 0.48 | 0.55

Table 12 - Micro-average classification results for all methods within the entailment knowledge portion of the Reuters-10 collection.15

The classification results for Reuters-10 are presented in Table 12 for the Simseed, Simwn, Simwiki, Simentail, Simcombined and Simcontext methods. First, it can be noticed that the names of the Reuters-10 categories are very indicative, and the Simseed score, which is based on the category name seed alone, therefore acquires relatively high precision. Another consequence of the indicative category names is that the WordNet based expansions significantly improve the recall obtained by the Simwn score-based classification. Some of the categories require just a synonym or a derivational expansion to reach high recall: for the category "Corn", for instance, the expansion based on the rule "maize ⇒ corn" alone nearly doubles the recall.

15 The documents considered for the recall and precision calculations of the Simcontext based method are only documents which contain at least a single piece of entailment evidence, which is 87% of the documents.


On the other hand, the Wikipedia based expansions make only a small contribution to the recall, since the categories mostly do not describe a broad topic but rather a narrow topic within the economy domain. Nevertheless, the Wikipedia based expansions do contribute to the precision, increasing it by two points; this implies that although the expansions do not result in significantly more classified documents, they improve the accuracy of the score obtained for the documents which contain the seed terms. Moreover, the combination of WordNet and Wikipedia expansions results in higher precision and recall than each of them alone, since the expansions are complementary in this dataset's domain as well.

The accuracy of the context model for this dataset is not comparable to that of our method, as can be seen in Table 12. It should be noted that the results obtained by the SimLSA score-based method were higher for the Reuters-10 dataset, exceeding the Simcontext score-based method by about 10% in all measures: recall, precision and F1. This phenomenon can be explained by the vulnerability of our system to identical scores. The Simcontext scores, which are based on the SimLSA scores rescaled by the GM algorithm as explained in section 3.3.4, are often identical for several categories, since the high scores are mapped to a probability value of 1. The system then classifies the document to a single arbitrary category, while the rest of the categories do not obtain any classification for such documents; for example, documents which belong to the "Interest" category were falsely classified to other categories.

Category | Simcombined (Recall / Precision / F1) | Simcontext (Recall / Precision / F1)
Acquisition | 0.80 / 0.65 / 0.72 | 0.90 / 0.42 / 0.57
Corn | 0.54 / 0.93 / 0.68 | 0.52 / 0.71 / 0.60
Crude | 0.72 / 0.91 / 0.81 | 0.60 / 0.83 / 0.69
Earn | 0.68 / 0.96 / 0.79 | 0.14 / 0.99 / 0.24
Grain | 0.34 / 0.98 / 0.50 | 0.26 / 0.85 / 0.40
Interest | 0.32 / 0.35 / 0.33 | 0.00 / 0.00 / 0.00
Money-fx | 0.31 / 0.56 / 0.40 | 0.72 / 0.58 / 0.64
Ship | 0.42 / 0.97 / 0.59 | 0.73 / 0.73 / 0.73
Trade | 0.96 / 0.54 / 0.69 | 0.91 / 0.76 / 0.83
Wheat | 0.62 / 0.91 / 0.74 | 0.58 / 0.73 / 0.64
Micro average | 0.66 / 0.77 / 0.71 | 0.47 / 0.54 / 0.50

Table 13 - Classification results per category for the Simcombined and Simcontext methods for the Reuters-10 collection.16

16 The documents considered for the recall and precision calculations of the Simcontext based method are only documents which contain at least a single piece of entailment evidence.


Even though the results for SimLSA are higher than those obtained by the Simcontext based method, they are still significantly lower than the results obtained by our method, based on the Simcombined score. This can be explained simply by the type of categories composing the Reuters-10 category set, which are all domain specific. As mentioned in the analysis section earlier, the context model is not highly sensitive to context differentiation within specific domains; the closeness of the categories' contexts in the Reuters-10 data yields less accurate results.

Table 13 presents the results obtained for each of the categories by the Simcombined and Simcontext score-based methods. It shows the high recall achieved by the Simcombined method relative to the recall obtained for most of the 20 Newsgroups categories, presented in Table 8. This is a result of the indicative nature of the Reuters-10 categories described above, and for the same reason the precision obtained for most categories is also high.


5. Conclusion and future work

The main contribution of this thesis is a novel approach to TC based on the entailment model. The proposed method integrates entailment models and context models into a scoring method for the text categorization task, which has so far been approached mostly through context models. Our investigation highlights the importance of the entailment assumption for the TC task and the complementary nature of entailment and context models. We suggest that this line of research be investigated further, enriching and optimizing the entailment models used in order to exploit additional entailment knowledge.

Our research revealed several important conclusions about the integration of the two models and about each type of model separately:

(i) Indeed, our analysis reveals that the entailment requirement, as the basis for the TC score, helps to classify documents according to the topic they actually discuss, as opposed to context models, which only reveal the documents' broader context. Strong entailment evidence within a document implies that the topic is discussed specifically in the document, as its main topic or one of its sub-topics.

(ii) Notably, the context model's score complements the entailment model's score by identifying cases where the entailment model picked up a passing reference to a topic, or entailing terms used in a sense other than the one which entails the category topic. The multiplication scheme for integrating the similarity scores of the two models was found effective, yielding noticeably improved accuracy.

(iii) Context models tend to split the documents into general semantic clusters. They do not recognize the specific context discussed in a document, but rather the domains of contexts the collection discusses. Therefore, categorizing documents according to context-based models amounts to categorizing each document to the category which most prominently represents its semantic cluster.

(iv) The combination of dictionary-based and encyclopedia-based expansions gives a more complete perspective and better expansion abilities. On one hand, morphological variants and general definitions are needed due to the richness of the language. On the other hand, to better deal with named entities and current general topics, encyclopedias provide important complementary definitions and knowledge.

Our study highlights the potential of combining the two methods and constructing meaningful scores from them. Still, the results we achieved can be improved in many respects. Most obviously, the recall of our method can be improved by utilizing further entailment knowledge resources: by extracting more entailment rules, more evidence is obtained, the statistics become more significant, and the number of entailing terms is likely to increase. To ensure an increase in accuracy as a whole rather than in recall alone, the accuracy of the additional entailment rules should be verified, so that precision increases as well. Apart from technical issues, such as improving the pre-processing steps or employing additional Wikipedia-based rule types, we now describe several promising research directions:

(i) Hierarchy-based categorization: the topical hierarchical taxonomy, which stands at the base of topical TC tasks, can be exploited in more ways than just using its category names as seeds for expansion. Hierarchy-related errors might be avoided by a hierarchy-oriented categorization approach. First, the categorization method can be performed iteratively for each level of the taxonomy hierarchy: documents which belong to a certain branch of the taxonomy would first be categorized to it, and then distributed among the sub-topics constituting that branch. Moreover, the taxonomy hierarchy can be exploited to identify and eliminate the use of entailing terms as expansions for multiple sister categories, thereby eliminating the classification of documents to several such categories when a document should belong to only one of them. Finally, the hierarchy's topic names can be used to disambiguate the required sense of a topic name at the leaves of the taxonomy, by measuring the association between the entailing term and the topic names along the path to that leaf.

(ii) Entailing term weighting scheme: our method used a uniform weight for all the entailing terms. Future work may consider using the entailment rule weighting scheme to derive weights for the entailing terms, or an independent scheme to weigh the entailing terms directly. For example, a weighting scheme based on SemCor probabilities (Miller et al., 1993) or on Information Gain statistics of the terms can be utilized. Weighting the entailing terms can help diminish the influence of ambiguous terms and of terms which entail senses of the topic name other than the sense denoted by the category.

(iii) Re-scaling of the complete score – currently, the GM re-scaling is applied only to the similarity score obtained by the LSA, due to the sparseness of the data obtained by the entailment methods. Augmenting the number of entailment rules used in our method may make it possible to apply the GM to them as well. Otherwise, if the data sparseness problem is not resolved, a different re-scaling method may be considered, to allow setting a threshold for filtering document scores.

(iv) Context model research – the context model used in our method is LSA similarity re-scaled using a GM model. This model tends to give similar scores to similar contexts, i.e. most documents obtain equal scores for topically related categories, such as the computer categories. In these cases the context model does not provide the context differentiation we need in order to filter mistaken entailment scores. Since the LSA model is difficult to analyze and improve, it may be useful to evaluate other context models, such as co-occurrence models.

(v) Evaluation and analysis – we believe that evaluation on a topical collection with a standard, fully defined taxonomy structure would diminish the errors originating from the technical problems described above, and might even yield additional interesting analytic conclusions and research directions. Moreover, such a collection with multiple-class annotations would be useful for a full analysis of the results obtained for each category.


References

Berry, M. 1992. Large-scale sparse singular value computations. International Journal of Supercomputer Applications, 6(1):13-49.

Cai, L. and T. Hofmann. 2003. Text categorization by boosting automatically extracted concepts. In Proc. of the 26th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada. ACM Press.

Clinchant, S., C. Goutte, and E. Gaussier. 2006. Lexical entailment for information retrieval. In Proceedings of the 28th European Conference on Information Retrieval, volume 3936 of Lecture Notes in Computer Science, pages 217-228. Springer-Verlag.

Dagan, I., O. Glickman, and B. Magnini, editors. 2006. The PASCAL Recognising Textual Entailment Challenge, volume 3944 of Lecture Notes in Computer Science.

Deerwester, S., S. Dumais, G. Furnas, T. Landauer, and R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science.

El-Yaniv, R. and O. Souroujon. 2001. Iterative double clustering for unsupervised and semi-supervised learning. In Advances in Neural Information Processing Systems (NIPS) 14.

Fellbaum, C., editor. 1998. WordNet: An Electronic Lexical Database (Language, Speech and Communication). The MIT Press.

Freund, Y., R. Iyer, R. E. Schapire, and Y. Singer. 1998. An efficient boosting algorithm for combining preferences. In Proceedings of the 15th International Conference on Machine Learning, pages 170-178.


Giampiccolo, D., B. Magnini, I. Dagan, and B. Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-WTEP Workshop.

Glickman, O., E. Shnarch, and I. Dagan. 2006. Lexical reference: a semantic matching subtask. In Proceedings of EMNLP.

Gliozzo, A. and C. Strapparava. 2005. Domain kernels for text categorization. In Proc. of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), Ann Arbor, June.

Gliozzo, A., C. Strapparava, and I. Dagan. 2005. Investigating unsupervised learning for text categorization bootstrapping. In Proc. of the Joint Conference on Human Language Technology / Empirical Methods in Natural Language Processing (HLT/EMNLP), Vancouver.

Joachims, T. 1999. Making large-scale SVM learning practical. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning, chapter 11, pages 169-184. MIT Press, Cambridge, MA, USA.

Kazama, J. and K. Torisawa. 2007. Exploiting Wikipedia as external knowledge for named entity recognition. In Proceedings of EMNLP-CoNLL.

Ko, Y. and J. Seo. 2002. Text categorization using feature projections. In Proc. of

COLING'2002.

Ko, Y. and J. Seo. 2004. Learning with unlabeled data for text categorization using

bootstrapping and feature projection techniques. In Proc. of the ACL-04, Barcelona,

Spain, July.

Liu, B., X. Li, W. S. Lee, and P. S. Yu. 2004. Text classification by labeling words. In

Proc. of AAAI-04, San Jose, July.


McCallum, A. and K. Nigam. 1999. Text classification by bootstrapping with keywords, EM and shrinkage. In Proceedings of the ACL'99 Workshop on Unsupervised Learning in Natural Language Processing.

Mihalcea R. and D. Moldovan. 2000. Semantic Indexing using WordNet Senses. In

Proceedings of ACL Workshop on IR and NLP.

Miller, G.A., C. Leacock, R. Tengi and R.T. Bunker. 1993. A semantic concordance.

In Proceedings of HLT.

Morik, K., P. Brockhausen, and T. Joachims. 1999. Combining statistical learning with a knowledge-based approach - a case study in intensive care monitoring. In Proc. of the 16th Int'l Conf. on Machine Learning (ICML-99).

de Buenaga, M., J.M. Gomez, and B. Diaz. 1997. Using WordNet to complement training information in text categorization. In Recent Advances in Natural Language Processing II: Selected Papers from RANLP'97, volume 189 of Current Issues in Linguistic Theory (CILT), pages 353-364. John Benjamins, 2000.


Sahami, M., M. Hearst, and E. Saund. 1996. Applying the multiple cause mixture model to text categorization. In Proceedings of the 13th International Machine Learning Conference.

Salton, G. and M. H. McGill. 1983. Introduction to modern information retrieval.

McGraw-Hill, New York.

Scott, S. and S. Matwin. 1998. Text classification using WordNet hypernyms. In

Proceedings of the COLING / ACL Workshop on Usage of WordNet in Natural

Language Processing Systems. Montreal, Canada.

S.Scott and S.Matwin.(1999).Feature engineering for text classification.Proc.of 16th

International Conference on Machine Learning,Bled,Slovenia.

Page 74: Keyword based Text Categorizationu.cs.biu.ac.il/~nlp/wp-content/uploads/libby-thesis.pdf · This thesis investigates Keyword-based Text Categorization (TC) using only a topical taxonomy

74

Chade-Meng Tan, Yuan-Fang Wang, Chan-Do Lee: The Effectiveness of Bigrams in

Automated Text Categorization. ICMLA 2002: 275-281

E. Voorhees and D. Harmann, editors. 1999. Proceedings of the Seventh Text

REtrieval Conference (TREC-7), Gaithersburg, MD, USA, July. NIST Special

Publication.

Page 75: Keyword based Text Categorizationu.cs.biu.ac.il/~nlp/wp-content/uploads/libby-thesis.pdf · This thesis investigates Keyword-based Text Categorization (TC) using only a topical taxonomy

75

Appendix A – Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a dimension reduction method for co-occurrence data. The main idea of LSA is to map the original representation of documents to a lower dimensional space (the latent space), in which documents are represented by "concepts" instead of terms, where the number of "concepts" is significantly lower than the number of terms.

More formally, let $t$ be the number of terms in the corpus and $N$ the total number of documents. Define $M = (m_{ij})$ to be the term-by-document association matrix with $t$ rows and $N$ columns, where $m_{ij}$ is the weight of term $i$ in document $j$. The matrix $M$ is decomposed using SVD into three matrices:

$$M = K \cdot S \cdot D^t$$

where $K$ is the matrix of eigenvectors derived from the term-to-term correlation matrix given by $M \cdot M^t$, and $D^t$ is the matrix of eigenvectors derived from the document-to-document matrix given by $M^t \cdot M$. $S$ is an $r \times r$ diagonal matrix of singular values, where $r = \min(t, N)$ is the rank of $M$.

The dimensions of the original space are then reduced by selecting a rank $s$, which stands for the number of dimensions, or "concepts", in the latent space. Only the $s$ largest singular values of $M$ are kept, along with their corresponding columns in $K$ and $D^t$. Accordingly, only the top $s$ singular values in $S$ are kept, and the rest are set to zero.

The rank $s$ should be selected so that it is large enough to represent all the concepts in the original data, while also being small enough to filter out the unnecessary data carried by the original number of dimensions (i.e., the number of terms in the original corpus). The original co-occurrence representation captures first-order similarity between terms, meaning their likelihood to co-occur in the same text. The "concepts" amalgamate the co-occurrence data of the terms and thereby capture second-order similarity: terms that tend to appear together are mapped to the same concepts, so the resulting similarity measures the joint occurrence of terms.

Finally, the terms in the latent semantic space are represented by the rows of the reduced matrix $K$, of dimension $t \times s$. That is, the $i$th term in the lexicon, $w_i$, is represented by the $i$th row of the matrix $K$. Each document vector $\vec{d}$ is then represented by the weighted sum of the LSA vectors of its constituent terms: $\vec{d}^{\,t} \cdot K$.
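To make the decomposition concrete, the following is a minimal sketch of the truncated SVD in Python with NumPy. The matrix size, the random weights and the choice of $s = 2$ are illustrative assumptions, not the setup used in this thesis.

```python
import numpy as np

# Illustrative term-by-document matrix M (t terms x N documents);
# in practice m_ij would hold the weight of term i in document j.
t, N = 6, 4
rng = np.random.default_rng(0)
M = rng.random((t, N))

# SVD: M = K S D^t, where K holds the term eigenvectors,
# D^t the document eigenvectors, and S the singular values.
K, S, Dt = np.linalg.svd(M, full_matrices=False)

# Keep only the s largest singular values ("concepts") and the
# corresponding columns of K and rows of D^t; the rest are dropped.
s = 2
K_s, S_s, Dt_s = K[:, :s], S[:s], Dt[:s, :]

# The i-th term in the lexicon is represented by the i-th row of
# the reduced K (a t x s matrix).
term_vectors = K_s

# A document vector d (its term weights) is mapped into the latent
# space as d^t . K, a weighted sum of its terms' LSA vectors.
d = M[:, 0]         # the first document, as an example
d_latent = d @ K_s  # an s-dimensional "concept" representation
```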


Appendix B – Gaussian Mixtures

The Gaussian Mixture (GM) model aims to estimate the probability that a document is classified to a given class, based on the similarity score between the class and the document. In essence, the GM algorithm differentiates between relevant and non-relevant category documents using similarity statistics of the unlabeled data. The algorithm assumes that the distribution of a given similarity function is in fact a mixture of two distributions, and approximates the unknown density of the two assuming that they are Gaussian functions. Below we give a short formal description of the algorithm.

For each document $d_i \in T$ and for each category $c$, where $id_c \subset V$ is the term-based representation of category $c$, we define $Sim(id_c, d_i) \in \mathbb{R}$ to be the similarity function between the documents and the category ids. The similarity function is taken to be monotonically increasing with the "closeness" of the documents and the category ids. As a first step, the algorithm obtains similarity scores for each document-category pair, and assumes that the similarity scores obtained for each category are a mixture of the distributions of the relevant and the non-relevant category documents.

In the second step, the algorithm aims to estimate the conditional probability $P(c \mid Sim(id_c, d_i))$ as a mixture of the probability of this pair being a positive example and its probability of being a negative example. For that purpose, it defines the two following probabilities: $P(Sim(id_c, d_i) \mid c)$, the probability that the similarity between a document $d_i$ and a category id $id_c$ is drawn from the category's distribution function, hence that they are a positive example; and $P(Sim(id_c, d_i) \mid \bar{c})$, which is the distribution from which the negative examples are drawn. Each probability is assumed to be drawn from a Gaussian distribution, for example the probability of the positive examples:

$$P(Sim(id_c, d_i) \mid c) = G(Sim(id_c, d_i), \mu_c, \sigma_c) = \frac{1}{\sqrt{2\pi} \cdot \sigma_c} \cdot e^{-\frac{(Sim(id_c, d_i) - \mu_c)^2}{2\sigma_c^2}}$$

(similarly for the probability of the negative examples). To achieve its goal the algorithm should acquire an estimation of the values of the parameters $\mu_c$, $\sigma_c$, $\mu_{\bar{c}}$ and $\sigma_{\bar{c}}$, that is, the mean and variance of each probability function, as well as the weight of each function in the overall Gaussian mixture. The mixture of the functions is defined to be:

$$P(Sim(id_c, d_i) \mid \mu_c, \sigma_c, w_c, \mu_{\bar{c}}, \sigma_{\bar{c}}, w_{\bar{c}}) = w_c \cdot P(Sim(id_c, d_i) \mid c) + w_{\bar{c}} \cdot P(Sim(id_c, d_i) \mid \bar{c})$$

where $w_c$ is the weight of the positive Gaussian function as defined above, and $w_{\bar{c}}$ is the corresponding weight of the negative Gaussian function. The weights of the two functions are the prior probabilities $P(c)$ and $P(\bar{c})$, and therefore $P(c) + P(\bar{c}) = 1$. The calculation of these parameters is obtained via an EM procedure, described in detail in (Gliozzo et al., 2005). Using the two estimated functions, the algorithm can apply Bayes' rule to obtain the smoothed mixture of the two:

$$P(c \mid d_i) = P(c \mid Sim(id_c, d_i)) = \frac{P(Sim(id_c, d_i) \mid c) \cdot P(c)}{P(Sim(id_c, d_i) \mid c) \cdot P(c) + P(Sim(id_c, d_i) \mid \bar{c}) \cdot P(\bar{c})}$$

By acquiring the final estimation of the probability $P(c \mid d_i) = P(c \mid Sim(id_c, d_i))$, the algorithm achieves its goal of obtaining a smoothed estimation of the classification probability. Following the single-class paradigm, it can then be used to assign the most likely category to each document, that is, $\arg\max_c P(c \mid d_i)$.

In short, the algorithm consists of the following steps, illustrated by the sketch after this list:

(i) For each document $d_i \in T$ and category id $id_c \subset V$, calculate the similarity score between them, $Sim(id_c, d_i) \in \mathbb{R}$.

(ii) Estimate the probability $P(c \mid d_i) = P(c \mid Sim(id_c, d_i))$ using the following steps:

a. Define $P(Sim(id_c, d_i) \mid c)$ and $P(Sim(id_c, d_i) \mid \bar{c})$ to be the complementary probabilities for the positive and the negative examples.

b. EM step: estimate the values of the parameters $\mu_c$, $\sigma_c$, $\mu_{\bar{c}}$, $\sigma_{\bar{c}}$, $w_c$ and $w_{\bar{c}}$ to acquire an estimation of:
$$P(Sim(id_c, d_i) \mid \mu_c, \sigma_c, w_c, \mu_{\bar{c}}, \sigma_{\bar{c}}, w_{\bar{c}}) = w_c \cdot P(Sim(id_c, d_i) \mid c) + w_{\bar{c}} \cdot P(Sim(id_c, d_i) \mid \bar{c})$$

c. Estimate $P(c \mid d_i) = P(c \mid Sim(id_c, d_i))$ using Bayes' rule:
$$P(c \mid Sim(id_c, d_i)) = \frac{P(Sim(id_c, d_i) \mid c) \cdot P(c)}{P(Sim(id_c, d_i) \mid c) \cdot P(c) + P(Sim(id_c, d_i) \mid \bar{c}) \cdot P(\bar{c})}$$

(iii) For each document $d_i$, assign the best category according to $\arg\max_c P(c \mid d_i)$.
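As an illustration of steps (i)-(iii), here is a minimal sketch of the two-Gaussian EM fit and the Bayes-rule posterior in Python with NumPy. The synthetic similarity scores, the helper names, the quartile-based initialization and the fixed iteration count are simplifying assumptions for illustration; they are not the exact EM procedure of (Gliozzo et al., 2005).

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Gaussian density G(x, mu, sigma)."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def fit_two_gaussians(sim, n_iter=100):
    """EM for a one-dimensional mixture of a positive (c) and a
    negative (c-bar) Gaussian over one category's similarity scores."""
    # Rough initialization from the score quartiles (an assumption).
    mu_neg, mu_pos = np.percentile(sim, [25, 75])
    sigma_pos = sigma_neg = sim.std() + 1e-9
    w_pos = w_neg = 0.5
    for _ in range(n_iter):
        # E-step: responsibility P(c | Sim) per score, via Bayes rule.
        p_pos = w_pos * gaussian(sim, mu_pos, sigma_pos)
        p_neg = w_neg * gaussian(sim, mu_neg, sigma_neg)
        r = p_pos / (p_pos + p_neg)
        # M-step: re-estimate the weights, means and variances.
        w_pos, w_neg = r.mean(), 1.0 - r.mean()
        mu_pos = (r * sim).sum() / r.sum()
        mu_neg = ((1 - r) * sim).sum() / (1 - r).sum()
        sigma_pos = np.sqrt((r * (sim - mu_pos) ** 2).sum() / r.sum()) + 1e-9
        sigma_neg = np.sqrt(((1 - r) * (sim - mu_neg) ** 2).sum() / (1 - r).sum()) + 1e-9
    return mu_pos, sigma_pos, w_pos, mu_neg, sigma_neg, w_neg

def posterior(sim, params):
    """Smoothed P(c | Sim(id_c, d_i)) from the fitted mixture."""
    mu_p, s_p, w_p, mu_n, s_n, w_n = params
    pos = w_p * gaussian(sim, mu_p, s_p)
    neg = w_n * gaussian(sim, mu_n, s_n)
    return pos / (pos + neg)

# Synthetic scores: a few relevant and many non-relevant documents.
rng = np.random.default_rng(1)
sim = np.concatenate([rng.normal(0.7, 0.1, 20), rng.normal(0.2, 0.1, 200)])
params = fit_two_gaussians(sim)
print(posterior(np.array([0.25, 0.65]), params))  # low vs. high P(c | Sim)
```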


Appendix C – Support Vector Machines

Support Vector Machines (SVM) is a state-of-the-art framework for supervised learning which can be used to train linear classifiers. Linear classification maps the input data, constructed from positive and negative classes of vectors, to a higher dimensional space in order to separate the two classes by a hyperplane, that is, a multi-dimensional plane. The hyperplane is chosen to maximize the distance between the closest vectors of each of the two classes. This distance is denoted the margin, and the vectors closest to the hyperplane are named support vectors. Given a new vector, the classifier uses this separating hyperplane to classify the new vector into one of the classes, positive or negative.

The goal of the SVM algorithm is to maximize the hyperplane margin in order to create a clear separation between the classes. On the other hand, it aims to minimize the risk of classification mistakes, which may result from a larger margin. To control the tradeoff between these goals, the SVM algorithm includes a regularization parameter, denoted c. The margin chosen by the SVM is smaller for higher c values, and therefore the number of False Positive classifications decreases for higher c values. On the other hand, for lower c values the margin is larger and therefore more vectors are left unclassified, meaning that False Negative errors are created. Hence, this parameter can be used to control the tradeoff between Precision and Recall, since Precision is associated with the portion of False Positive classifications, and Recall with False Negative classifications.

When using the SVM algorithm one must also regard the unbalanced nature of the classified instances from which the SVM creates the separating hyperplane. Often, the number of negative instances is significantly larger than the number of available positive instances. The j parameter is a cost-factor by which training errors on positive instances outweigh errors on negative instances, providing a way to compensate for unbalanced training data.

We used the SVMlight implementation of the SVM algorithm, which supports tuning of the two parameters described above. Our settings of SVMlight and its parameters are described in Section 4.1.1.
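For illustration only, the sketch below reproduces the two-parameter tradeoff with scikit-learn's SVC rather than SVMlight (a deliberate substitution: scikit-learn expresses a cost-factor like SVMlight's j through per-class weights that multiply C). The toy data and the values c = 1.0 and j = 10.0 are arbitrary assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Toy unbalanced data: few positive vectors vs. many negative ones.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(1.0, 1.0, (10, 5)),     # 10 positive instances
               rng.normal(-1.0, 1.0, (200, 5))])  # 200 negative instances
y = np.array([1] * 10 + [0] * 200)

c, j = 1.0, 10.0  # margin/error tradeoff and positive cost-factor

# class_weight multiplies the penalty C per class, so errors on
# positive training instances cost j times more than negative ones.
clf = SVC(kernel="linear", C=c, class_weight={1: j, 0: 1.0})
clf.fit(X, y)

print(clf.predict(X[:3]))  # the first three (positive) examples
```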


Abstract (Hebrew)

This thesis conducts research in the field of keyword-based text categorization, which relies on a topical taxonomy as the only input for the categorization. The prevailing research approach for text categorization is supervised or semi-supervised. The supervised approach to text categorization requires substantial manual work to label the texts needed as classified training examples for the supervised task. Although a number of past text categorization systems exist for which a large document collection was manually classified, this solution is not feasible today for most systems that require text categorization. The growth rate of new text collections and new topical taxonomies, as well as the growth rate of the amounts of unclassified text, are only some of the reasons for the need for text categorization approaches with a higher degree of automation.

Semi-supervised text categorization based on keywords took the first step towards a text categorization approach with a higher degree of automation. Methods based on this approach recognized the great computational potential behind the significant amounts of unclassified texts available for various knowledge domains and applications. The basic idea behind this approach is to describe the categories by a set of representative keywords, and to determine a similarity measure between the category representation and the text documents. The keyword-based representation of a category should represent the category topic as completely and accurately as possible. The supervised component in tasks based on this approach is the collection of the representative keywords for each category, replacing the supervised component that required manual classification of a large amount of texts. This task requires a limited amount of manual work compared to the work needed in fully supervised categorization approaches. Despite the reduction in the amount of work required, it is still manual work specific to each category, which requires a degree of expertise in the topical domain of the categorization. Therefore, new topical taxonomies would require analysis and manual work by experts in the topical domain.

The research in this thesis builds on a new approach to text categorization, first proposed by (Gliozzo et al., 2005), which does not require manual analysis for each category. This research approach is based on the assumption, which also served supervised and semi-supervised studies, that the category name is highly informative for the purpose of the text categorization task. Category names are chosen by professionals from the topical domain so that they describe the category topic in the most accurate and complete manner. For this reason, the category name contains useful information for the text categorization task and can serve as a starting point for the automatic algorithm. To automatically create the set of representative keywords, this approach performs an automatic expansion of the category names into the representative set. The categorization method proposed in (Gliozzo et al., 2005) bases the automatic expansion on the co-occurrence statistics of the lexicon words, according to the information in the unlabeled training collection; henceforth, the context-based approach. Using the Latent Semantic Analysis (LSA) method and a standard vector similarity measure, an initial set of classified documents is produced according to the similarity score. The final classification results are obtained by training a standard supervised text classifier on this document collection.

There are several drawbacks to basing the automatic expansion on a context-based approach, i.e., on statistics of word co-occurrences. The first and most important of these is that such statistical information does not represent the semantic relation required for making text categorization decisions. Typical models based on statistical co-occurrence information model the broad context of the text, and not necessarily the specific topic discussed in it. A high similarity score between a text and co-occurrence data indicates a general belonging to a similar topical domain, and not necessarily a discussion of the specific topic described by the category. For example, a text that deals with some computer software belongs to the topical context of the computer science domain, and therefore co-occurrence-based similarity will assess the closeness of the text to the computer science domain in general, rather than to one piece of software or another.

In this research we propose a new approach to keyword-based text categorization based on a topical taxonomy, which bases the similarity measure between texts and categories on the Lexical Entailment (LE) approach, instead of on word co-occurrence data alone. The lexical entailment approach defines a more precise semantic similarity relation, whose purpose is to identify references in a specific source text to another text fragment. This semantic relation is more suitable for the text categorization task, since it requires a mention of the discussed topic and not only a general similarity of context.

In order to identify whether a certain topic is discussed in the examined text as its central topic, or merely mentioned as a secondary topic, we propose combining the lexical entailment approach with the context-based approach as an additional component in the overall text categorization system of this research. The text categorization method proposed here sets as a baseline requirement at least one mention of a text segment (a word or a multi-word expression) that entails the topic. For texts that satisfy this requirement, the context-based similarity measure between them and the category whose topic was mentioned in the text is examined as well. The use of this unified similarity measure creates a novel method that models topical reference and topical context at the same time.

To compute the lexical entailment based similarity measure we use two knowledge sources. The first knowledge source for collecting lexical entailment information, which served us in this research, is the semantic ontology WordNet, developed by (Fellbaum, 1998). This source enables the collection of semantic relations from a lexicographic knowledge source; it provides semantic relations such as morphological derivations and additional entailment relations required for the categorization task. As a complementary knowledge source, we use a Wikipedia-based lexical entailment resource. This encyclopedic knowledge source provides terms such as entities, commercial products and terms from general and up-to-date knowledge domains that are semantically related to the category name, which constitutes the basis for expansion by semantic entailment within the method. These two sources are complementary by nature, and therefore they serve for expansions derived from different semantic relations, as well as for expansions for different types of categories.

The context-based similarity measure in this research is based on the approach proposed in (Gliozzo et al., 2005). For this purpose, the LSA method was implemented within this thesis work to represent context-based similarity between categories and text documents. LSA is a method for reducing dimensions from an original space to a space with a number of dimensions significantly smaller than the number of dimensions in the original representation space. The method maps similar terms, in the sense of similarity between their co-occurrence statistics, to "concepts" that represent a similar semantic context. The advantage of using LSA as a method for context representation is that LSA models first-order semantic similarity, which represents the similarity of words that tend to appear in the same documents, as well as second-order similarity, of words that tend to appear together. Standard methods for context representation usually model only first-order similarity. LSA represents second-order similarity as well, through the joint mapping of words to the same "concepts", and therein lies its relative advantage.

The similarity measure defined above was applied in this research for two different text categorization goals. The first goal is ranking the text documents for each of the categories according to the similarity score between the document and the category. The second goal is binary classification of texts into one or more categories. The results of the method were evaluated in comparison to the categorization method proposed by (Gliozzo et al., 2005), as well as in comparison to each similarity measure composing the overall method proposed in this research. The performance evaluations for the ranking task enable an analysis of the correctness (accuracy) of the ranking obtained for each category, since a separate list of ranked documents is created for each category. In addition, the document ranking enables a detailed analysis of the ranking precision for each category, since the documents are ranked relative to each other by their similarity scores. On the other hand, the text classification task enables an analysis of the relative similarity score determined for each document for each category, and makes it possible to examine the relations between the categories according to the classifications made to them. Furthermore, the analysis of this method reveals classification errors, where documents are assigned to a category other than the one they belong to, and thereby raises additional possible research directions.

We present positive empirical results for the complete text categorization method proposed in this thesis. Indeed, the proposed method achieves higher precision, which supports the assumption that the lexical entailment similarity measure is more accurate for text categorization purposes. The results are accompanied by a comprehensive analysis of the types of expansions and of the mechanisms required for further improvement of the results.

This work was carried out under the supervision of Dr. Ido Dagan from the Department of Computer Science of Bar-Ilan University.

Bar-Ilan University
The Department of Computer Science

Keyword based Text Categorization

Libby Barak

This work is submitted as part of the requirements for the Master's degree in the Faculty of Computer Science of Bar-Ilan University.

Ramat Gan, Israel
June 2008, Sivan 5768