


Information Processing and Management 43 (2007) 154–168

www.elsevier.com/locate/infoproman

Dynamic category profiling for text filtering and classification

Rey-Long Liu *

Department of Medical Informatics, Tzu Chi University, No. 701, Chung Yang Rd., Sec. 3, Hualien 970, Taiwan, ROC

Received 22 November 2005; received in revised form 24 February 2006; accepted 28 February 2006; available online 18 April 2006

Abstract

Information is often represented in text form and classified into categories. Unfortunately, automatic classifiers often misclassify documents. One of the reasons is that the documents for training the classifiers come mainly from the categories themselves, leading the classifiers to derive category profiles that distinguish each category from the others, rather than measuring the extent to which a document's content overlaps that of a category. To tackle the problem, we present a technique, DP4FC, that selects suitable features to construct category profiles that distinguish relevant documents from irrelevant documents. More specifically, DP4FC is associated with various classifiers. Upon receiving a document, it helps the classifiers create dynamic category profiles with respect to the document, and accordingly make proper decisions in filtering and classification. Theoretical analysis and empirical results show that DP4FC may significantly promote different classifiers' performances under various environments.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Text filtering; Text classification; Dynamic profiling

1. Introduction

Information is often represented in text form and classified into multiple categories. In the information space spanned by the categories, upon receiving a document, automatic text filtering and text classification are essential. For each input document d, text filtering aims to filter out d if d falls outside the information space. On the other hand, text classification aims to classify d into suitable categories if d lies in the information space.

One of the popular ways to achieve the task is to delegate a classifier to each category. The classifier is associated with a threshold, and upon receiving a document, it may autonomously make a yes–no decision for the corresponding category. A document is "accepted" by the classifier if its degree of acceptance (DOA) with respect to the category (e.g. similarity with the category, or probability of belonging to the category) is higher than or equal to the corresponding threshold; otherwise it is "rejected."

0306-4573/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.

doi:10.1016/j.ipm.2006.02.008

* Tel.: +886 3 8565301x7193; fax: +886 3 8579409. E-mail address: [email protected]

R.-L. Liu / Information Processing and Management 43 (2007) 154–168 155

With the help of the thresholds, text filtering is actually achieved in the course of text classification. Each document may be classified into zero, one, or several categories.
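The thresholded accept/reject mechanism can be sketched as follows (a minimal illustration, not code from the paper; the category names and DOA values are invented):

```python
def decide(doa: float, threshold: float) -> bool:
    """Accept the document for a category iff its DOA reaches the threshold."""
    return doa >= threshold

def classify(doas: dict, thresholds: dict) -> set:
    """Return every category whose classifier accepts the document.

    The result may be empty (the document is filtered out),
    or contain one or several categories.
    """
    return {c for c, doa in doas.items() if decide(doa, thresholds[c])}

# Hypothetical DOA values for one document against two categories:
accepted = classify({"grain": 0.62, "trade": 0.18},
                    {"grain": 0.50, "trade": 0.40})
```

Here "grain" is accepted (0.62 >= 0.50) while "trade" is rejected (0.18 < 0.40), so the document is classified into exactly one category.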

Unfortunately, since no classifier can be perfectly tuned (Arampatzis, Beney, Koster, & van der Weide, 2000; Liu & Lin, 2004; Zhang & Callan, 2001), estimation of DOA values is not always proper. A document that is similar to a category does not always get a higher DOA value with respect to the category. Similarly, a document that is not similar to a category does not always get a lower DOA value with respect to the category. Improper DOA estimations may heavily deteriorate the performance of both filtering and classification.

1.1. Problem definition and motivation

In this paper, we explore how various classifiers' performances in text filtering and classification may be improved by selecting and integrating more suitable features (keywords) to distinguish relevant documents from irrelevant documents for each category. This goal differs from many previous attempts, which often aimed at improving the processes of classifier building (Wu, Phang, Liu, & Li, 2002), threshold tuning (Liu & Lin, 2004), and document selection (Iyengar, Apte, & Zhang, 2000; Schapire, Singer, & Singhal, 1998; Singhal, Mitra, & Buckley, 1997). The research results of the paper may complement the previous techniques.

Feature selection was often an experimental issue in previous studies (McCallum, Rosenfeld, Mitchell, & Ng, 1998; Mladenic, Brank, Grobelnik, & Milic-Frayling, 2004; Yang & Pedersen, 1997), although many feature selection techniques have been developed (Mladenic et al., 2004; Yang & Pedersen, 1997). There were also studies that maintained an evolvable feature set covering all features seen so far (e.g. Cohen & Singer, 1996), although inappropriate features may introduce inefficiency (Yang & Pedersen, 1997) and poorer performance in text classification. Therefore, there was no standard guideline for constructing a perfect feature set. A feature set was often determined by an experimental tuning process.

However, even if a feature set is perfectly tuned to distinguish the categories, it is not necessarily suitable for filtering out documents that belong to none of the categories. This problem is due to the common goal of previous feature selection techniques: selecting those features that may be used to distinguish a category from others. Under such a goal, whether a feature is selected mainly depends on the content relatedness among the categories, without paying much attention to how the contents of a category c and a document d overlap with each other. If d contains much information not in c, or vice versa, d should not be classified into c, even though d mentions some content of c. Similar problems may also be found in similarity measurement between two documents: two documents should not be similar to each other if they have a lot of different content, no matter how the feature set is tuned using training documents. This problem motivates the research in this paper.

To tackle the problem, features should be dynamically selected in response to each individual input document (rather than to the training documents in the categories). The features may help to measure the extent to which the content of the document overlaps with that of each category. They are helpful when a document of a category does not employ a terminology different from that employed by the training documents of the category.

1.2. Organization of the paper

In the next section, we present an analysis that provides significant hints for conducting the research. Accordingly, in Section 3, we present a novel technique DP4FC (Dynamic Profiling for Filtering and Classification) that helps various classifiers dynamically create category profiles so that the performance of text filtering and classification may be improved. DP4FC has been empirically evaluated under different circumstances, and hence in Section 4, we present and analyze the results. The paper is concluded in Section 5.

2. Misclassification of irrelevant documents: an analysis

Table 1 presents an analysis of the possible reasons for misclassifying an irrelevant document d into a category c. A feature tends to lead the classifier to make an error (i.e. classifying d into c) when the feature is included in the feature set, and it also appears in both c and d (i.e. Case 1) or it does not appear in both c


Table 1
Effects of a feature f in classifying an irrelevant document d into a category c

Case  f appears in c  f appears in d  f is included in the feature set  Effect of the feature
1     Yes             Yes             Yes                               Type I: Incurring misclassification
2     Yes             Yes             No                                Type II: No effect
3     Yes             No              Yes                               Type III: Avoiding misclassification
4     Yes             No              No                                Type II: No effect
5     No              Yes             Yes                               Type III: Avoiding misclassification
6     No              Yes             No                                Type II: No effect
7     No              No              Yes                               Type I: Incurring misclassification
8     No              No              No                                Type II: No effect

Table 2
Complementing the underlying classifier by dynamic profiling

                       To discriminate c from others            To validate content overlapping
                       Features that     Features that          Features that appear in   Features that do not appear
                       correlate with c  correlate with         c but do not appear in d  in c but appear in d
                                         other categories
Underlying classifier  Considered        Considered             Not considered            Not considered
Dynamic profiling      Not considered    Not considered         Considered                Considered


and d (Case 7). In both cases, d is similar to the category in that dimension (feature). The feature thus has the effect of suggesting that c accept d (and thus of making an error). On the other hand, there are two cases where a feature may help the classifier to avoid making the error: the feature is included in the feature set, but it only appears in c (i.e. Case 3) or in d (Case 5), not both. In both cases, d is not similar to c in that dimension (feature). The feature thus has the effect of suggesting that c reject d.

The analysis suggests a dynamic profiling strategy: (1) employing those terms that appear in c but do not appear in d (increasing the probability of Case 3 while reducing the probability of Case 1), and conversely (2) employing those terms that appear in d but do not appear in c (increasing the probability of Case 5 while reducing the probability of Case 7). Therefore, each category's profile should be composed of a feature set, which is dynamic in the sense that it is reconstructed each time a test document is entered.
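The two-part strategy can be sketched with plain set operations (an illustrative sketch only; the term sets below are hypothetical, and in the actual technique each term is further weighted by its correlation strength, as described in Section 3):

```python
def dynamic_profile(category_terms: set, doc_terms: set):
    """Split a category's dynamic profile into the two kinds of terms:
    (1) terms in c's training documents but absent from document d,
    (2) terms in d but absent from c's training documents."""
    in_c_not_d = category_terms - doc_terms   # strategy (1): raises Case 3
    in_d_not_c = doc_terms - category_terms   # strategy (2): raises Case 5
    return in_c_not_d, in_d_not_c

# Hypothetical vocabulary for a category and an incoming document:
only_c, only_d = dynamic_profile({"wheat", "corn"}, {"corn", "loan"})
```

The profile is rebuilt from scratch for every incoming document, which is exactly what makes it "dynamic".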

Dynamic profiling may complement the functionality of those classifiers that distinguish c from other categories using static category profiles. As summarized in Table 2, the profiles are static in the sense that they are constructed with respect to all categories (rather than individual input documents) and hence do not vary with each input document. Dynamic profiling complements the classifiers by considering how d contains content not in c and how c contains content not in d. If d contains much information not in c, or c contains much information not in d, d should not be classified into c, even though it mentions some content of c.

On the other hand, the implementation of dynamic profiling is challenging. Three tasks should be considered: (1) estimating the correlation strength of each feature with respect to c and d, since a feature that happens to appear in c or d is not necessarily a good feature, (2) tackling the side effect of rejecting many relevant documents, and (3) integrating dynamic profiling with the underlying classifier in order to make filtering and classification decisions. These tasks should be achieved before dynamic profiling can become really helpful for various classifiers.

3. Dynamic profiling for filtering and classification

Based on the above analysis, we develop a dynamic profiling technique DP4FC (Dynamic Profiling for Filtering and Classification) to promote various classifiers' performances in text filtering and classification. Fig. 1 illustrates the introduction of DP4FC to a classifier. In training, DP4FC joins the thresholding process, while in testing, DP4FC joins the process of making filtering and classification decisions. Both the underlying classifier and DP4FC estimate each document's DOA with respect to each category. The key point is that the DOA


[Fig. 1 here. Training: documents for classifier building feed classifier building, and documents for threshold tuning feed threshold tuning; both the underlying classifier's DOA estimation and DP4FC's DOA estimation by dynamic profiling take part in the tuning. Testing: documents for filtering and classification pass through both DOA estimations into integrated filtering and classification, which outputs filtered documents and classified documents.]

Fig. 1. Associating various classifiers with DP4FC.


values estimated by DP4FC are based on dynamic profiling, which aims to measure the extent to which a document's content overlaps that of a category.

The algorithm is depicted in Table 3. Given a category c and a document d, the dynamic profile of c is composed of two kinds of terms: those terms that are positively correlated with c but do not appear in d (ref. Step 2), and those terms that are negatively correlated with c but appear in d (ref. Step 3). Both kinds of terms lead to the reduction of the DOA value estimated by DP4FC (ref. Steps 2.2 and 3.2). Therefore, a smaller DOA value indicates that d contains more information not in c, and vice versa; it indicates that we have lower confidence in classifying d into c.

DP4FC employs the χ² (chi-square) method to estimate the correlation strengths. For a term t and a category c, χ²(t, c) = [N × (A × D − B × C)²] / [(A + B) × (A + C) × (B + D) × (C + D)], where N is the total number of documents, A is the number of documents that are in c and contain t, B is the number of documents that are not in c but contain t, C is the number of documents that are in c but do not contain t, and D is the number of documents that are not in c and do not contain t. Therefore, χ²(t, c) indicates the strength of correlation between t and c. We say that c and t are positively correlated if A × D > B × C; otherwise they are negatively correlated. Note that t may appear in an input document d but in no training document. According to the dynamic profiling strategy, t should be considered. However, its χ² value with respect to each category is incomputable (since both A and B are zero). DP4FC tackles the problem by treating d as a training document (i.e. both N and B are incremented by 1).
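A direct transcription of the χ² formula, using the contingency counts N, A, B, C, D defined above (returning 0 for a zero denominator is our own convention; as just noted, the paper instead handles unseen terms by treating d as a training document):

```python
def chi_square(N, A, B, C, D):
    """chi2(t, c) = N*(A*D - B*C)^2 / ((A+B)*(A+C)*(B+D)*(C+D)).

    Returns (strength, positively_correlated); the correlation is
    positive iff A*D > B*C.
    """
    denominator = (A + B) * (A + C) * (B + D) * (C + D)
    if denominator == 0:          # e.g. a term unseen in training (A = B = 0)
        return 0.0, False
    strength = N * (A * D - B * C) ** 2 / denominator
    return strength, A * D > B * C
```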

With the DOA estimation, DP4FC may join the thresholding process to help the underlying classifier derive proper thresholds for each individual category. The basic idea is that each category has two thresholds:

Table 3
DOA estimation by dynamic profiling

Procedure DOAEstimationByDP(c, d), where
  (1) c is a category,
  (2) d is a document for thresholding or testing
Return: DOA value of d with respect to c
Begin
  (1) DOAbyDP = 0;
  (2) For each term t that is positively correlated with c but does not appear in d, do
      (2.1) DOAReduction = χ²(t, c);
      (2.2) DOAbyDP = DOAbyDP − DOAReduction;
  (3) For each term t that is negatively correlated with c but appears in d, do
      (3.1) DOAReduction = (number of occurrences of t in d) × χ²(t, c);
      (3.2) DOAbyDP = DOAbyDP − DOAReduction;
  (4) Return DOAbyDP;
End
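The procedure of Table 3 translates almost line by line into Python (a sketch; the data structures for the correlated-term sets and χ² strengths are our assumptions):

```python
from collections import Counter

def doa_by_dp(pos_terms, neg_terms, chi2, doc_terms):
    """Table 3 in code: the DOA starts at 0 and is only ever reduced.

    pos_terms / neg_terms: terms positively / negatively correlated with c;
    chi2: mapping term -> chi-square strength with respect to c;
    doc_terms: Counter of term occurrences in document d.
    """
    doa = 0.0
    for t in pos_terms:                 # Step 2: in-c terms absent from d
        if t not in doc_terms:
            doa -= chi2[t]
    for t in neg_terms:                 # Step 3: out-of-c terms present in d
        if t in doc_terms:
            doa -= doc_terms[t] * chi2[t]
    return doa
```

A value closer to zero means less non-overlapping content between d and c, and hence higher confidence in classifying d into c.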


one for thresholding the DOA values produced by DP4FC, while the other is for thresholding the original DOA values produced by the underlying classifier. The former threshold helps to reduce the number of documents that should not be considered in tuning the latter threshold. The two thresholds work together in the hope of optimizing the category's performance on a predefined criterion.

Once a document is entered, its two DOA values (i.e. by DP4FC and by the underlying classifier) are produced, and the corresponding thresholds are consulted. The document may be classified into a category only if both DOA values are higher than or equal to their corresponding thresholds. DP4FC and the underlying classifier thus work together, complementing each other to make proper decisions.
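The combined decision rule is simply a conjunction of the two thresholded tests (a sketch; the variable names are ours):

```python
def accept(doa_classifier, doa_dp, th_classifier, th_dp):
    """Classify the document into the category only if BOTH the underlying
    classifier's DOA and DP4FC's DOA reach their respective thresholds."""
    return doa_classifier >= th_classifier and doa_dp >= th_dp
```

Note that DP4FC's DOA values are non-positive (Table 3 only subtracts), so its threshold th_dp is naturally non-positive as well.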

4. Experiments

Experiments are conducted to investigate the contributions of DP4FC. For an objective and thorough investigation, DP4FC is evaluated under different circumstances, including (1) different sources of experimental data, (2) different kinds of test data, (3) different settings for training data, (4) different underlying classification methodologies, and (5) different parameter settings for the classifier. Table 4 summarizes the different circumstances, which are explained in the following subsections.

4.1. Experimental data

Experimental data is from Reuters-21578, a public collection for related studies (http://www.daviddlewis.com/resources/testcollections/reuters21578). There are 135 categories (topics) in the collection. We employ the ModLewis split, which skips unused documents and separates the documents into two parts based on the time at which they were written: (1) the test set, which consists of the documents written on or after April 8, 1987, and (2) the training set, which consists of the documents written on or before April 7, 1987. The test set is further split into two subsets: (1) the in-space subset, which consists of 3022 test documents that belong to some of the categories (i.e. fall into the category space), and (2) the out-space subset, which consists of 3168 documents that belong to none of the categories. They help to investigate the systems' performances in text classification and text filtering, respectively. An integrated text filtering and classification system should (1) properly classify in-space documents, and (2) properly filter out out-space documents.

As suggested by previous studies (e.g. Yang, 2001), the training set is randomly split into two subsets as well: the classifier-building subset and the threshold-tuning (or validation) subset. The former is used to build the classifier, while the latter is used to tune a threshold for each category. Therefore, to guarantee that each category has at least one document for classifier building and one document for threshold tuning, we remove those categories that have fewer than two training documents, and hence 95 categories remain. Among the 95 categories, 12 categories have no test documents. From both theoretical and practical standpoints, these categories deserve investigation (Lewis, 1997), although they were excluded by several previous studies (e.g. Chai,

Table 4
Experimental design for thorough investigation

Aspects                                          Settings
(1) Source of experimental data                  (A) Reuters-21578
                                                 (B) A Yahoo text hierarchy
(2) Split of test data                           (A) In-space test data (for evaluating text classification)
                                                 (B) Out-space test data (for evaluating text filtering)
(3) Split of the training data for classifier    (A) 50% for CB; 50% for TT (with twofold cross-validation)
    building (CB) and threshold tuning (TT)      (B) 80% for CB; 20% for TT (with fivefold cross-validation)
(4) Underlying classification methodologies      (A) Vector-based methodology: Rocchio method with thresholding (RO)
                                                 (B) Probability-based methodology: Naive Bayes method with thresholding (NB)
                                                 (C) Probability-based methodology with fixed thresholds: Naive Bayes method with a fixed threshold of 0.5 (NBFix05)
(5) Parameter settings for the classifier        (A) Different sizes of feature sets on which the classifiers were built
                                                 (B) Different parameter settings for RO


Ng, & Chieu, 2002; Yang, 2001). After removing those documents to which no categories are assigned (i.e. belonging to none of the 95 categories), the training set contains 7780 documents. Moreover, since previous studies did not suggest a way of setting the documents for classifier building and threshold tuning, we try different settings to conduct a thorough investigation: 50–50% and 80–20%, which conduct twofold and fivefold cross-validation, respectively. In the twofold cross-validation, 50% of the data is used for classifier building and the remaining 50% for threshold tuning, and the process repeats two times so that each training document is used for threshold tuning exactly once. Similarly, in the fivefold cross-validation, 80% of the data is used for classifier building and the remaining 20% for threshold tuning, and the process repeats five times.
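The two split settings can be reproduced with a simple k-fold helper (our sketch, not the paper's code): k = 2 gives the 50–50% setting and k = 5 the 80–20% setting, and every training document serves for threshold tuning exactly once.

```python
import random

def cv_folds(docs, k, seed=0):
    """Yield k (classifier_building, threshold_tuning) document splits."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)      # random split, reproducible via seed
    for i in range(k):
        tuning = docs[i::k]                # hold out 1/k of the data
        building = [d for j, d in enumerate(docs) if j % k != i]
        yield building, tuning
```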

Moreover, to test out-space documents that are less related to the categories, we randomly sample 370 documents from a text hierarchy that was extracted from Yahoo! (http://www.yahoo.com) and employed by previous studies (Liu & Lin, 2005; Liu & Lin, 2003). The documents are randomly extracted from the categories of science, computers and Internet, and society and culture, and hence are less related to the content of the Reuters categories. With the help of the Yahoo out-space documents, we may measure a system's text filtering performance in processing out-space documents with different degrees of relatedness to the categories.

4.2. Evaluation criteria

The classification of in-space test documents and the filtering of out-space test documents require different evaluation criteria. For the former, we employ precision (P) and recall (R). Both P and R are common evaluation criteria in previous studies (Lewis, 1995; Yang, 2001). To integrate P and R into a single measure, we employ F1 = 2PR/(P + R), which places equal emphasis on P and R.
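For reference, the three criteria in code (a direct transcription of the definitions above, with the counts expressed as true positives, false positives, and false negatives):

```python
def precision_recall_f1(tp, fp, fn):
    """tp: correct classifications made; fp: incorrect classifications made;
    fn: correct classifications that should have been made but were not."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)
```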

As in many previous studies, P, R, and F1 have two versions: the micro-averaged version and the macro-averaged version. The micro-averaged version views all categories as a single system, and hence estimates P by [total number of correct classifications / total number of classifications made], and R by [total number of correct classifications / total number of correct classifications that should be made]. On the other hand, the macro-averaged version views each individual category as a system, and hence estimates P, R, and F1 as the averages of the P, R, and F1 values on individual categories, respectively. When computing the macro-averaged values, we exclude incomputable values (i.e. those whose denominators are zero).
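The two averaging schemes, transcribed directly (a sketch; per_category holds one (tp, fp, fn) triple per category, and skipping zero denominators mirrors the exclusion of incomputable values):

```python
def micro_macro(per_category):
    """Return (micro_P, micro_R, macro_P, macro_R)."""
    TP = sum(tp for tp, _, _ in per_category)
    FP = sum(fp for _, fp, _ in per_category)
    FN = sum(fn for _, _, fn in per_category)
    micro_p = TP / (TP + FP)          # counts pooled over all categories
    micro_r = TP / (TP + FN)
    # Per-category scores, skipping incomputable (zero-denominator) cases:
    ps = [tp / (tp + fp) for tp, fp, _ in per_category if tp + fp > 0]
    rs = [tp / (tp + fn) for tp, _, fn in per_category if tp + fn > 0]
    return micro_p, micro_r, sum(ps) / len(ps), sum(rs) / len(rs)
```

Micro-averaging weights categories by their size, whereas macro-averaging gives every category equal weight, which is why the two can diverge sharply on skewed collections such as Reuters.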

On the other hand, to evaluate the filtering of out-space test documents, we employ the misclassification ratio (MR), which is estimated by [number of misclassifications of the out-space documents / number of out-space documents]. A system should avoid misclassifying out-space documents into many categories (i.e. should achieve a lower MR).

4.3. The underlying classifiers

Each category c is associated with a classifier. Upon receiving a document d, the classifier estimates the similarity between d and c (i.e. the DOA of d with respect to c) in order to make a binary decision for d: accepting d or rejecting d. To investigate the contributions of DP4FC to different kinds of classification methodologies, DP4FC is applied to two kinds of classifiers: the Rocchio classifier (RO, based on the vector-based methodology) and the Naive Bayes classifier (NB, based on the probability-based methodology).

RO was originally designed for query optimization in relevance feedback (Rocchio, 1971) and has been commonly employed in text classification (e.g. Wu et al., 2002), text filtering (e.g. Schapire et al., 1998; Singhal et al., 1997), and retrieval (e.g. Iwayama, 2000) as well. Some studies even showed that its performance was more promising in user interest identification (e.g. Liu & Lin, 2005) and text filtering (Liu & Lin, 2004).

RO constructs a vector for each category, and the similarity between a document d and a category c is estimated by the cosine similarity between the vector of d and the vector of c. More specifically, the vector for a category c is constructed by considering both relevant documents and irrelevant documents of c: η1 × Σ_{Doc∈P} Doc/|P| − η2 × Σ_{Doc∈N} Doc/|N|, where P is the set of vectors of relevant documents (i.e. the documents in c), while N is the set of vectors of irrelevant documents (i.e. the documents not in c). Each document vector is built by the TF–IDF (Term Frequency–Inverse Document Frequency) technique, which gives


each feature (term) w a weight of TF(w, d) × log₂ IDF(w), where TF(w, d) is the number of times w appears in the document d, and IDF(w) is [total number of training documents / number of documents that contain w]. Moreover, the parameters η1 and η2 govern the weights of relevant documents and irrelevant documents, respectively. In the experiments, η1 = 16 and η2 = 4, since previous studies showed that such a setting was promising (e.g. Wu et al., 2002). For a more complete evaluation, the effects of different settings of η1 and η2 will be investigated as well (ref. Section 4.4.3).
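A sketch of the RO pieces just described (pure Python with dense vectors for brevity; a real implementation would use sparse vectors over the feature set):

```python
import math

def tfidf_weight(tf, n_docs, df):
    """TF(w, d) * log2(total training docs / docs containing w)."""
    return tf * math.log2(n_docs / df)

def rocchio_centroid(pos_vecs, neg_vecs, eta1=16.0, eta2=4.0):
    """Category vector = eta1 * mean(relevant) - eta2 * mean(irrelevant)."""
    dim = len(pos_vecs[0])
    def mean(vecs, j):
        return sum(v[j] for v in vecs) / len(vecs)
    return [eta1 * mean(pos_vecs, j) - eta2 * mean(neg_vecs, j)
            for j in range(dim)]

def cosine(u, v):
    """Cosine similarity between a document vector and a category vector."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))
```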

On the other hand, NB has been frequently employed and evaluated with respect to various techniques, including text filtering (e.g. Kim, Hahn, & Zhang, 2000), non-hierarchical text classification (e.g. Larkey & Croft, 1996; Yang & Liu, 1999) and hierarchical text classification (e.g. Koller & Sahami, 1997; Dhillon, Mallela, & Kumar, 2002; McCallum et al., 1998). It was shown to be competitive (and even better) when compared with various state-of-the-art text classification techniques, such as neural networks and support vector machines (Dhillon et al., 2002; Yang & Liu, 1999). In particular, it pre-estimates the conditional probability P(w|c) for every feature w and category c (with standard Laplace smoothing to avoid zero probabilities). The "similarity" between a document d and a category c is estimated by P(c) × Π_{w∈d} P(w|c)^TF(w,d) / [P(c) × Π_{w∈d} P(w|c)^TF(w,d) + P(not c) × Π_{w∈d} P(w|not c)^TF(w,d)], where TF(w, d) is the number of times a feature w appears in d, and "not c" is a dummy category covering all documents irrelevant to c.

All the classifiers require a fixed (predefined) feature set, which is built using the documents for classifier building. Each term that is not a stop word may be a candidate feature. No phrase extraction routine is invoked. Features are selected according to their weights, which are estimated by the χ² (chi-square) weighting technique. The technique has been investigated and shown to be more promising than others (Yang, 1999; Yang & Pedersen, 1997). As noted above, there is no perfect way to determine the size of the feature set. Setting a proper feature set size was often an experimental issue in previous studies (e.g. McCallum et al., 1998; Yang & Pedersen, 1997). Therefore, to conduct a more thorough investigation, we try five feature set sizes: 1000, 5000, 10,000, 15,000, and 20,000, since there are about 20,000 different features in the fivefold training data.
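The NB "similarity" formula given earlier can be sketched as follows; computing in log space is our own change for numeric stability (long products of small probabilities underflow), but the result is the same ratio:

```python
import math

def nb_doa(doc_tf, logp_w_c, logp_w_notc, logp_c, logp_notc):
    """P(c)*prod P(w|c)^TF / [same + P(not c)*prod P(w|not c)^TF],
    evaluated via log-probabilities (assumed pre-smoothed, e.g. Laplace)."""
    log_c = logp_c + sum(tf * logp_w_c[w] for w, tf in doc_tf.items())
    log_n = logp_notc + sum(tf * logp_w_notc[w] for w, tf in doc_tf.items())
    m = max(log_c, log_n)             # subtract the max before exponentiating
    return math.exp(log_c - m) / (math.exp(log_c - m) + math.exp(log_n - m))
```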

To make filtering and classification decisions, both RO and NB require a thresholding strategy to set a threshold for each category. As in many previous studies (e.g. Callan, 1998; Chai et al., 2002; Lewis, Schapire, Callan, & Papka, 1996; Schapire et al., 1998; Yang, 2001; Yang & Liu, 1999; Zhang & Callan, 2001), RO and NB tune a relative threshold for each category by analyzing document–category similarities. The threshold-tuning documents are used to tune each relative threshold. As suggested by many studies (e.g. Yang, 2001), the thresholds are tuned in the hope of optimizing the system's performance with respect to F1. Moreover, we also design a version of NB that employs a fixed threshold of 0.5 for each category (i.e. no threshold tuning). This version of NB is named NBFix05. It was tested in several previous studies as well (e.g. Chai et al., 2002).

4.4. Result and discussion

We separately discuss the results for text classification (i.e. classification of in-space test data) and text filtering (i.e. filtering of out-space test data). The results show that DP4FC significantly promotes all the classifiers' performances in both filtering and classification under different environments.

4.4.1. Results on in-space test data

Figs. 2 and 3 illustrate the contribution of DP4FC to RO in classifying in-space documents. The data is averaged across all seven folds (two 50–50% folds plus five 80–20% folds). The figures show micro-average and macro-average results, respectively. In micro-average performance, DP4FC provides significant improvement to RO under all feature set sizes. When comparing the average performances, it provides a 14.8% improvement in F1 (0.7032 vs. 0.6127). Moreover, in macro-average performance, DP4FC provides even more significant improvement to RO under all feature set sizes. When comparing the average performances, it provides a 25.2% improvement in F1 (0.6487 vs. 0.5182).

Fig. 4 shows the performances of RO and RO + DP4FC under different folds (i.e. the two 50–50% folds and the five 80–20% folds). The results are average performances under all feature set sizes. They show that DP4FC stably provides significant improvement to RO in all folds.


[Fig. 2 here. Three panels plot micro-average recall, precision, and F1 (y-axis range 0.4–0.8) against feature set size (1000, 5000, 10000, 15000, 20000), comparing RO and RO + DP4FC.]

Fig. 2. Contributions of DP4FC to RO in text classification: micro-average results.

R.-L. Liu / Information Processing and Management 43 (2007) 154–168 161

We are also concerned with the contributions of DP4FC to NB and NBFix05. Figs. 5 and 6 illustrate the micro-average results and macro-average results, respectively. In micro-average performances, DP4FC provides significant improvement to both NB and NBFix05 under all different feature set sizes. When comparing average performances, it provides 396.3% (0.7370 vs. 0.1485) and 31.3% (0.7855 vs. 0.5984) improvements in F1 to NB and NBFix05, respectively. Moreover, in macro-average performances, DP4FC provides significant improvement to both NB and NBFix05 under all different feature set sizes as well. When comparing average performances, it provides 102.5% (0.6380 vs. 0.3150) and 62.8% (0.5802 vs. 0.3564) improvements in F1 to NB and NBFix05, respectively.

It is interesting to note that DP4FC provides significant improvements in macro-average precision and recall as well. DP4FC successfully helps RO (ref. Fig. 3) and NB (ref. Fig. 6) to achieve both better and more stable precision and recall under different feature set sizes. Moreover, it also significantly promotes the precision rates achieved by both NB and NBFix05, which are otherwise quite poor (about 0.2).

Fig. 7 shows the contributions of DP4FC to NB and NBFix05 under different folds. The results are average performances under all feature set sizes. Again, the results show that DP4FC stably provides significant improvement to NB and NBFix05 in all folds.

The results together show that DP4FC may significantly promote the performances of RO, NB, and NBFix05 in processing in-space documents. Moreover, the performance improvements are stable in the sense that they occur under various circumstances, including different feature set sizes and cross-validation folds. The contributions justify the design of DP4FC. With the help of DP4FC, the classifiers may achieve both better and more stable performances under various circumstances.

It is also interesting to investigate the contributions of DP4FC under the environmental settings in which the underlying classifiers achieve their best performances in micro-average F1. Table 5 summarizes the best settings and the improvements provided by DP4FC under those settings. DP4FC successfully promotes the


[Fig. 3 here: three panels plotting macro-average precision, recall, and F1 (averaged across all folds) against feature set size (1000–20,000), for RO and RO + DP4FC.]

Fig. 3. Contributions of DP4FC to RO in text classification: macro-average results.

[Fig. 4 here: two panels plotting micro-average and macro-average F1 (averaged across all feature set sizes) for RO and RO + DP4FC in each of the seven folds.]

Fig. 4. Contributions of DP4FC to RO in text classification under different folds.


performances of all the well-tuned classifiers. It also tends to provide more significant improvements in macro-average F1. This indicates that DP4FC may help the classifiers to uniformly achieve better performances on individual categories, rather than on larger categories only. Moreover, when comparing all the classifiers, no classifier may be the best one in both micro-average and macro-average performances. NBFix05 and RO may only perform better in micro-average F1 (0.7496) and macro-average F1 (0.5364), respectively. However, after the introduction of DP4FC, the best version became RO + DP4FC, which achieves 0.7952 micro-average F1 and 0.6903 macro-average F1.

[Fig. 5 here: three panels plotting micro-average precision, recall, and F1 (averaged across all folds) against feature set size (1000–20,000), for NB, NB + DP4FC, NBFix05, and NBFix05 + DP4FC.]

Fig. 5. Contributions of DP4FC to NB and NBFix05 in text classification: micro-average results.

4.4.2. Results on out-space test data

We are also concerned with the contributions of DP4FC in the filtering of out-space documents. Fig. 8 shows the contributions of DP4FC to RO. For Reuters out-space documents, when comparing their average performances, DP4FC provides 20.5% reduction in MR (0.9412 vs. 1.1842). On the other hand, for Yahoo out-space documents, DP4FC provides 10.0% reduction in MR (0.5105 vs. 0.5671).
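The misclassification ratio (MR) is defined earlier in the paper; the sketch below assumes MR is the total number of category assignments wrongly made to out-space documents divided by the number of out-space documents, which is consistent with MR values above 1 when one document is accepted by several categories. This reading is an assumption, not the paper's exact formula.

```python
# Hypothetical sketch of a misclassification ratio for out-space documents.
# assignments[i] = number of categories that wrongly accepted out-space
# document i (ideally 0 for every out-space document).

def misclassification_ratio(assignments):
    return sum(assignments) / len(assignments)

# three out-space documents: one rejected everywhere, two wrongly accepted
print(misclassification_ratio([0, 1, 3]))  # 1.333...
```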

Fig. 9 shows the contributions of DP4FC to NB and NBFix05. Without DP4FC, the MR performances of both NB and NBFix05 oscillate dramatically. For Reuters out-space documents, when comparing their average performances, DP4FC provides 92.2% (1.2100 vs. 15.5616) and 57.2% (0.8517 vs. 1.9879) MR reductions to NB and NBFix05, respectively. On the other hand, for Yahoo out-space documents, DP4FC provides 91.4% (2.1376 vs. 24.8666) and 94.6% (0.9614 vs. 17.9442) MR reductions to NB and NBFix05, respectively.

Together with the results on in-space documents, the results on out-space documents further justify the contributions of DP4FC: it successfully promotes text classification performances, while at the same time preventing the classifiers from over-fitting themselves to in-space documents. DP4FC achieves this by basing its judgment on content overlap, which is a general guideline for measuring the relevance of each document. The contributions are particularly meaningful since, in practice, there should be many more out-space documents than in-space documents.


[Fig. 6 here: three panels plotting macro-average precision, recall, and F1 (averaged across all folds) against feature set size (1000–20,000), for NB, NB + DP4FC, NBFix05, and NBFix05 + DP4FC.]

Fig. 6. Contributions of DP4FC to NB and NBFix05 in text classification: macro-average results.

[Fig. 7 here: two panels plotting micro-average and macro-average F1 (averaged across all feature set sizes) for NB, NB + DP4FC, NBFix05, and NBFix05 + DP4FC in each of the seven folds.]

Fig. 7. Contributions of DP4FC to NB and NBFix05 in text classification under different folds.


4.4.3. Effects of different parameter settings

We are also interested in the effects of different parameter settings for the underlying classifiers. Unlike NB and NBFix05, RO has two parameters, g1 and g2, which govern the weights for relevant documents and irrelevant documents, respectively. In the experiments reported above, we follow the suggestions from previous


Table 5
Contributions of DP4FC under the best settings of the classifiers

Classifier  Setting to achieve the best micro-average F1                      Improvement by DP4FC in micro-average F1  Improvement by DP4FC in macro-average F1
RO          Twofold experiment: feature set size = 5000 in the first fold     11.6% (0.6997 vs. 0.6267)                 33.3% (0.6730 vs. 0.5050)
            Fivefold experiment: feature set size = 5000 in the third fold    9.7% (0.7952 vs. 0.7246)                  28.7% (0.6903 vs. 0.5364)
NB          Twofold experiment: feature set size = 1000 in the first fold     120.1% (0.7375 vs. 0.3351)                130.9% (0.6586 vs. 0.2852)
            Fivefold experiment: feature set size = 1000 in the first fold    138.2% (0.7117 vs. 0.2988)                104.6% (0.6479 vs. 0.3167)
NBFix05     Twofold experiment: feature set size = 15,000 in the second fold  5.3% (0.7807 vs. 0.7416)                  15.4% (0.5066 vs. 0.4389)
            Fivefold experiment: feature set size = 20,000 in the first fold  4.1% (0.7802 vs. 0.7496)                  23.8% (0.5517 vs. 0.4455)

[Fig. 8 here: two panels plotting the misclassification ratio (averaged across all folds) against feature set size (1000–20,000) for RO and RO + DP4FC, for Reuters and Yahoo out-space documents, respectively.]

Fig. 8. Contributions of DP4FC to RO in text filtering: reducing misclassifications.


studies and set their ratio to 4:1 (i.e. g1 = 16 and g2 = 4). We now turn to investigate the effects of different settings under the best environment for RO (i.e. feature set size = 5000 in the third fold).

The result is shown in Fig. 10. It indicates that, when the ratio becomes smaller, RO tends to misclassify more Yahoo out-space documents. When the ratio ranges from 1:1 to 1:3, RO has much poorer performance for Yahoo documents (ref. the dramatically increasing MR), although it achieves slightly better performance for in-space documents (ref. the performance in F1). Since Yahoo documents are less related to the training documents, this indicates that RO tends to over-fit itself to the training data when the ratio becomes too small. The over-fitting is due to the fact that, when the ratio becomes too small, the irrelevant documents for a category c dominate the computation of the vector for c. In that case, features that appear in irrelevant documents tend to get negative weights in the vector for c. Unfortunately, when compared with the less related out-space documents, threshold-tuning documents have a higher probability of containing these features, making their DOA values smaller. This leads to a threshold that is too low to reject Yahoo out-space documents.
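The role of the g1:g2 ratio can be illustrated with the standard Rocchio formulation (Rocchio, 1971), on which RO is based: the category vector is a weighted difference between the centroids of relevant and irrelevant training documents. The sketch below is illustrative; variable names and the toy vectors are not from the paper.

```python
# A minimal Rocchio-style category vector: g1 weights the centroid of
# relevant documents, g2 the centroid of irrelevant ones. When g1:g2 is
# small, features frequent in irrelevant documents drag their weights
# negative, which is the over-fitting effect discussed above.

def rocchio_vector(relevant, irrelevant, g1=16.0, g2=4.0):
    """relevant / irrelevant: lists of equal-length feature-weight vectors."""
    dims = len(relevant[0])
    centroid = lambda docs, j: sum(d[j] for d in docs) / len(docs)
    return [g1 * centroid(relevant, j) - g2 * centroid(irrelevant, j)
            for j in range(dims)]

# one relevant and one irrelevant document over two features
print(rocchio_vector([[1.0, 0.0]], [[0.0, 1.0]]))  # [16.0, -4.0]
```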

As noted above, in practice, there should be many more input documents that are less related to the training data. Therefore, when considering the overall performance of RO on all kinds of data (i.e. Reuters in-space, Reuters out-space, and Yahoo out-space data), the best setting for the ratio should range from 2:1 to 5:1, including the one suggested by previous studies and employed in the above experiments (i.e. 4:1). As shown in Fig. 10, under these settings, DP4FC consistently promotes the performances of RO in classifying Reuters in-space documents and filtering out Reuters out-space documents.


[Fig. 9 here: two panels plotting the misclassification ratio (averaged across all folds) against feature set size (1000–20,000) for NB, NB + DP4FC, NBFix05, and NBFix05 + DP4FC, for Reuters and Yahoo out-space documents, respectively.]

Fig. 9. Contributions of DP4FC to NB and NBFix05 in text filtering: reducing misclassifications.

[Fig. 10 here: two panels in the third fold with feature set size = 5000: micro-average F1 of RO and RO + DP4FC, and MR for Reuters and Yahoo out-space documents, plotted against the ratio of the weight for positives to the weight for negatives (6:1 down to 1:3).]

Fig. 10. Effects of different parameter settings for RO.


5. Conclusion

Given an information space spanned by a set of categories, misclassification of documents into the information space may deteriorate the management, dissemination, and retrieval of information. We thus present a technique, DP4FC, to complement and promote various classifiers' performance in text filtering and classification. Instead of aiming at distinguishing a category from other categories, DP4FC aims at measuring whether a document d contains too much information not in a category c, or vice versa. If so, d should not be classified into c, even though d mentions some content of c. DP4FC helps the underlying classifier to create dynamic category profiles with respect to each individual document. It then works with the classifier to set proper thresholds, and accordingly make proper filtering and classification decisions. Empirical results show that DP4FC may significantly promote different classifiers' performances under different circumstances. The contributions are of both theoretical and practical significance to the automatic classification of suitable information into suitable categories.

Acknowledgement

This research was supported by the National Science Council of the Republic of China under the grant NSC 94-2213-E-320-001.

References

Arampatzis, A., Beney, J., Koster, C. H. A., & van der Weide, T. P. (2000). Incrementality, half-life, and threshold optimization for adaptive document filtering. In Proceedings of the 9th text retrieval conference, Gaithersburg, Maryland (pp. 589–600).

Callan, J. (1998). Learning while filtering documents. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, Melbourne, Australia (pp. 224–231).

Chai, K. M. A., Ng, H. T., & Chieu, H. L. (2002). Bayesian online classifiers for text classification and filtering. In Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval, Tampere, Finland (pp. 97–104).

Cohen, W. W., & Singer, Y. (1996). Context-sensitive mining methods for text categorization. In Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval, Zurich, Switzerland (pp. 307–315).

Dhillon, I. S., Mallela, S., & Kumar, R. (2002). Enhanced word clustering for hierarchical text classification. In Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, Edmonton, Alberta, Canada (pp. 191–200).

Iwayama, M. (2000). Relevance feedback with a small number of relevance judgments: incremental relevance feedback vs. document clustering. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, Athens, Greece (pp. 10–16).

Iyengar, V. S., Apte, C., & Zhang, T. (2000). Active learning using adaptive resampling. In Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, Boston, Massachusetts (pp. 91–98).

Kim, Y.-H., Hahn, S.-Y., & Zhang, B.-T. (2000). Text filtering by boosting Naive Bayes classifiers. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, Athens, Greece (pp. 168–175).

Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. In Proceedings of the 14th international conference on machine learning, Nashville, Tennessee (pp. 170–178).

Larkey, L. S., & Croft, W. B. (1996). Combining classifiers in text categorization. In Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval, Zurich, Switzerland (pp. 289–297).

Lewis, D. D. (1995). Evaluating and optimizing autonomous text classification systems. In Proceedings of the 18th annual international ACM SIGIR conference on research and development in information retrieval, Seattle, Washington (pp. 246–254).

Lewis, D. D. (1997). Reuters-21578 text categorization test collection distribution 1.0 README file (v 1.2). Available from http://www.daviddlewis.com/resources/testcollections/reuters21578.

Lewis, D. D., Schapire, R. E., Callan, P., & Papka, R. (1996). Training algorithms for linear text classifiers. In Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval, Zurich, Switzerland (pp. 298–306).

Liu, R.-L., & Lin, W.-J. (2003). Mining for interactive identification of users' information needs. Information Systems, 28(7), 815–833.

Liu, R.-L., & Lin, W.-J. (2004). Adaptive sampling for thresholding in document filtering and classification. Information Processing and Management, 41(4), 745–758.

Liu, R.-L., & Lin, W.-J. (2005). Incremental mining of information interest for personalized web scanning. Information Systems, 30(8), 630–648.

McCallum, A., Rosenfeld, R., Mitchell, T., & Ng, A. Y. (1998). Improving text classification by shrinkage in a hierarchy of classes. In Proceedings of the 15th international conference on machine learning, Madison, Wisconsin (pp. 359–367).

Mladenic, D., Brank, J., Grobelnik, M., & Milic-Frayling, N. (2004). Feature selection using linear classifier weights: interaction with classification models. In Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval, Sheffield, South Yorkshire, UK (pp. 234–241).

Rocchio, J. (1971). Relevance feedback in information retrieval. In G. Salton (Ed.), The SMART retrieval system: experiments in automatic document processing (Chapter 14) (pp. 313–323). Englewood Cliffs, New Jersey: Prentice-Hall.

Schapire, R. E., Singer, Y., & Singhal, A. (1998). Boosting and Rocchio applied to text filtering. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, Melbourne, Australia (pp. 215–223).

Singhal, A., Mitra, M., & Buckley, C. (1997). Learning routing queries in a query zone. In Proceedings of the 20th annual international ACM SIGIR conference on research and development in information retrieval, Philadelphia, Pennsylvania (pp. 25–32).

Wu, H., Phang, T. H., Liu, B., & Li, X. (2002). A refinement approach to handling model misfit in text categorization. In Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining, Edmonton, Alberta, Canada (pp. 207–216).

Yang, Y. (1999). An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1–2), 69–90.

Yang, Y. (2001). A study of thresholding strategies for text categorization. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval, New Orleans, Louisiana (pp. 137–145).

Yang, Y., & Lin, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval, Berkeley, California (pp. 42–49).

Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of the 14th international conference on machine learning, Nashville, Tennessee (pp. 412–420).

Zhang, Y., & Callan, J. (2001). Maximum likelihood estimation for filtering thresholds. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval, New Orleans, Louisiana (pp. 294–302).