
Information Processing & Management, Vol. 12, pp. 333-337. Pergamon Press 1976. Printed in Great Britain

ERROR DETECTION IN MECHANIZED CLASSIFICATION SYSTEMS

W. G. HOYLE

Network Planning and Research, Scientific and Technical Information Services, National Research Council, Ottawa, K1A 0R6, Canada

(Received 8 January 1976)

Abstract-When documentary material is indexed by a mechanized classification system, and the results judged by trained professionals, the number of documents in disagreement, after suitable adjustment, defines the error rate of the system. In a test case disagreement was 22% and, of this 22%, the computer correctly identified two thirds of the decisions as doubtful. Professional examination of this doubtful group could further improve performance. The characteristics of the classification system, and of the material being classified, are mainly responsible for disagreement, and the size of the computer-identified doubtful group is a basic measure of the suitability of the system for the test material being processed. It is further suggested that if two professionals were compared on the same material, their disagreements would be mainly over the same documents.

Documents entering a classification system are directed by the indexer to the various categories of the system. When the classification is done by computer algorithm, performance is measured by comparing results with those of professional indexers. If we take a suitable sample of documents from the various categories and make adjustment for errors of judgement, number of categories, document distribution, etc., then the number of disagreements between the machine and the professionals is a fundamental measure of the error rate of the mechanized system. The paper describes a method whereby the computer estimates this disagreement; in a test it correctly forecast two thirds of the cases of disagreement with professionals and identified the documents concerned. If the cases of disagreement were re-examined, then improved agreement should result. More important, the procedure provides a non-subjective measure of system performance and a better understanding of the difficulties involved in classification.

The categories of a classification system (whether it be manual or mechanical) can be represented by weighted vocabulary lists[1]. The first stage in forming the lists consists of taking sample documents from each category and counting the number of documents in which each word occurs. (The number of occurrences is ignored[2].) If the document count for each word is divided by the number of documents in which the word occurs in the total sample, then we have a measure of the probability that a word occurs in the documents of any category (it may be zero).

With the above procedure, then, we obtain the probability for any word when the category is specified. What we would like to know, however, is the probability for any category when the word is specified, i.e. the inverse probability. Fortunately there is a theorem in mathematical probability (Bayes’s theorem) which tells us how to make this calculation[3]. In summary, from the probability of any word, when the category is specified, we compute the probability for each category when the word is specified. A list of such words, with the attached probabilities, is prepared to represent each category of the system. A sample part of such a list is shown in Table 1. There will be as many lists as there are categories.
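A minimal sketch may make this construction concrete. The code below is not the author's program: it assumes the training sample is given as word sets grouped by category, counts the number of documents in which each word occurs, and applies Bayes's theorem with equal a priori category probabilities to obtain the weights that make up the lists. All function and variable names are invented for illustration.

```python
from collections import Counter, defaultdict

def build_weighted_lists(samples_by_category):
    """Build one weighted vocabulary list per category.

    samples_by_category: {category: [set of words in each sample document, ...]}
    Returns {category: {word: P(category | word)}}, assuming equal a priori
    probabilities for the categories, as in the paper's nine-category test.
    """
    # Document frequency of each word within each category
    # (repeated occurrences inside one document are ignored).
    doc_freq = {c: Counter(w for doc in docs for w in doc)
                for c, docs in samples_by_category.items()}
    n_docs = {c: len(docs) for c, docs in samples_by_category.items()}

    # P(word | category): fraction of the category's sample documents containing the word.
    p_word_given_cat = {c: {w: doc_freq[c][w] / n_docs[c] for w in doc_freq[c]}
                        for c in doc_freq}

    # Bayes's theorem with equal priors: P(category | word) is proportional
    # to P(word | category); normalize over the categories.
    weighted = defaultdict(dict)
    vocabulary = {w for probs in p_word_given_cat.values() for w in probs}
    for w in vocabulary:
        total = sum(p.get(w, 0.0) for p in p_word_given_cat.values())
        for c, p in p_word_given_cat.items():
            weighted[c][w] = p.get(w, 0.0) / total
    return dict(weighted)
```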

We first prepared such lists for use in automatic indexing[1]. When so used, the words of an incoming word group, i.e. a document, are compared with each list in turn. The document is assigned to the category whose list indicates the highest probability for the word group.†

Comparison with manual indexing indicates accuracies of about 80% in this application[4] and it is assumed, therefore, that the lists represent the categories well.

†It is not difficult to find the probability for a group of words when the probability for each individual word is known. For example, for the two-word group "zero" and "root" from the above list, we obtain (ignoring a constant divisor) the value (1 - (1 - 0.216/9)(1 - 0.720/9)) or 0.102 as the group probability for the category listed. If this were the maximum, the document containing the two words "zero" and "root" would be said to belong to this category. The value 1/9 is the a priori probability of the category; here it is 1/9 because we have assumed equal probabilities for the 9 categories.
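The footnote's combination rule can be checked directly. The sketch below (the names are mine, not the paper's) multiplies the complements of the per-word contributions p/9, where 1/9 is the assumed a priori category probability, and reproduces the value 0.102 for the two-word group.

```python
def group_probability(word_weights, prior=1/9):
    """Combine the weighted-list values of a document's words for one category.

    word_weights: the list entries for those words of the document that appear
    in the category's weighted vocabulary list; `prior` is the assumed equal
    a priori probability of the category (1/9 for nine categories).
    """
    product = 1.0
    for p in word_weights:
        product *= 1.0 - p * prior
    return 1.0 - product  # ignoring the constant divisor, as in the footnote

# The footnote's two-word group "zero" (0.216) and "root" (0.720):
print(round(group_probability([0.216, 0.720]), 3))  # 0.102
```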


Table 1. Sections of a typical weighted vocabulary list. The subject represented is "mathematics" under the main heading "computers"

We have also used the category lists to measure overlap between the subject categories of a system[5]. In an ideal classification system the vocabulary of each list would be unique. Unfortunately, in any practical system, the lists overlap, though overlapping words are still useful. The vocabulary overlap can be given a measure, and we showed in Ref. [5] that it is a measure of the error probabilities of the system. The measure is obtained by considering each list in turn as a document and matching its words against the other lists in turn and also against itself. The result is a square matrix of order equal to the number of categories. Against itself, of course, each list gives a value of unity and ideally would give a value of zero against all the other lists. Typical values[5] for an actual case (after normalizing) are shown in Table 2. The off-diagonal terms are the error probabilities. Each row, of course, must sum to unity, indicating a probability of one, or certainty, that the document must go into one category or another, as rejection is not allowed.
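A sketch of this overlap calculation, following the description above and reusing the hypothetical helpers from the earlier sketches, might look as follows: each category list is scored as if it were a document against every list, including itself, and each row is then normalized to sum to unity.

```python
def overlap_matrix(weighted_lists, prior=None):
    """Score each category list against every list and normalize the rows.

    weighted_lists: {category: {word: weight}}, as built above.
    Returns {row_category: {column_category: probability}}; the off-diagonal
    entries estimate the error probabilities of the system.
    """
    categories = sorted(weighted_lists)
    if prior is None:
        prior = 1.0 / len(categories)
    matrix = {}
    for row in categories:
        words = weighted_lists[row]          # the row list treated as a document
        scores = {
            col: group_probability(
                [weighted_lists[col][w] for w in words if w in weighted_lists[col]],
                prior)
            for col in categories
        }
        total = sum(scores.values())
        matrix[row] = {col: score / total for col, score in scores.items()}
    return matrix
```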

It was realized that this classification matrix was identical to the channel matrix of communication theory, and that it also appears in papers on character recognition. This realization raised the question of whether any of the error-correction, or at least error-detection, techniques from those fields could be used to improve performance still further. Perhaps the classifier, instead of being forced to decide, should have the option of "no decision", that is, rejection and, probably, reconsideration of some sort.

With rejection permitted, an additional matrix column (and category) would be needed for the rejects. Two possible types of error could then occur: (1) The classifier could fail to find a category for a document. (2) The classifier could put a document into a wrong category.

Obviously type 1 errors are less serious than type 2, and system performance could be evaluated accordingly. A system with no rejection has, of course, only type 2 errors. Hopefully, when the classifier is modified for rejection, some type 2 errors will be changed to type 1. Again, hopefully, there will be no unacceptable increase in total errors. How then can we obtain such a classifier? In a manual system we can employ two people and, if they disagree, we can say we have a doubtful document. Unfortunately we cannot afford two independent mechanized systems. Perhaps not even one! We shall, however, return to the idea of two classifiers later. There is no use submitting a document twice for classification, because mechanized systems are wonderfully self-consistent, even when wrong. One possibility, however, is to divide a document into two parts and classify the parts separately. If the classifier gives the same decision for both parts, then we accept the decision. If we receive a split decision, then we reject the document; we say that such a document "splits". An example may help.

Table 2. Probability that a document from a category number listed in the column at left will be put into a category having the number in the row at the top of the table

[Table 2 body: a 9 x 9 matrix of normalized probabilities; the individual values are not legible in the scanned original.]


When document 7357 was originally classified (see Ref. [4]), the probabilities for the nine categories were (except for a constant multiplier):

Category     1     2     3     4     5     6     7     8     9
Probability  4.65  5.11  3.29  0.44  1.11  2.40  4.06  1.58  1.23

where the maximum indicates that the document belongs to category 2. When the document was halved, the probabilities for the two parts were:

Category     1     2     3     4     5     6     7     8     9
First part   4.66  5.03  3.92  0.00  0.80  3.11  5.67  0.00  0.94
Second part  4.66  5.20  2.69  0.90  1.55  1.66  2.60  3.00  1.50

indicating category 7 for the first part and category 2 for the second. The document "splits" and is therefore rejected.
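The decision rule itself is only a comparison of the two winning categories. The fragment below, with invented names, reproduces the check for document 7357 from the probabilities listed above.

```python
# Probabilities (up to a constant multiplier) for the two halves of document 7357.
first_part  = [4.66, 5.03, 3.92, 0.00, 0.80, 3.11, 5.67, 0.00, 0.94]
second_part = [4.66, 5.20, 2.69, 0.90, 1.55, 1.66, 2.60, 3.00, 1.50]

def best_category(scores):
    """Return the winning category number (categories are numbered from 1)."""
    return max(range(len(scores)), key=scores.__getitem__) + 1

cat_a, cat_b = best_category(first_part), best_category(second_part)
print(cat_a, cat_b, "split" if cat_a != cat_b else "accepted")  # 7 2 split
```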

How to divide a document in two is not a trivial problem. First half and second half would be the obvious method. Another simple method (for the computer) is to divide the document by taking alternate words. The latter method was used in the example above. Surprisingly, perhaps, the two methods give different results. With a 124 document sample and the half-and-half method of division, 40% (49) of the documents split. With the other method (alternate words) only 25% (31) of the documents split. The latter method also split far fewer (13% vs 28%) of the documents on which the classifier and the professionals agreed, and is therefore preferable.
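Both ways of dividing a document are easy to state precisely. The sketch below, again purely illustrative, treats a document as a list of words and returns the two parts for either method; `classify` stands for any routine, such as the weighted-list classifier above, that maps a word list to a category number.

```python
def split_halves(words):
    """First-half / second-half division of a document's word list."""
    mid = len(words) // 2
    return words[:mid], words[mid:]

def split_alternate(words):
    """Alternate-word division: words 1, 3, 5, ... and words 2, 4, 6, ..."""
    return words[0::2], words[1::2]

def classify_with_rejection(words, classify, split=split_alternate):
    """Accept a category only if both parts agree; otherwise reject ("split")."""
    part_a, part_b = split(words)
    cat_a, cat_b = classify(part_a), classify(part_b)
    return cat_a if cat_a == cat_b else None  # None marks a rejected document
```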

The classifier, operating on whole documents, had previously agreed with the professionals in classifying 97 of the 124 documents and had disagreed in the other 27 cases.

For the technique to be useful, a significant fraction of these 27 must split, whereas for the group of 97, the fraction splitting must be relatively small. Results with our 124 document group are summarized in Table 3.

We see, then, that with the very simplest procedure the computer classification results agree in 78% of the cases with professional judgement. Further, BORKO[6] has shown that the judges themselves are not infallible, and his attenuation factor indicates that 78% agreement means about 90% correct classification by the computer. For those still hesitant to trust the machine completely, Table 3(b) indicates that the computer can, if requested, identify two thirds of those of its decisions with which professionals would disagree, so that these can readily be re-examined manually.

For those interested solely in an operating system the above is perhaps sufficient. However, for those interested in the study of the classification process itself, there are other things of interest: the group of 31 split documents in Table 3(b), for example. We have indicated that these might be examined manually with a view to improving performance. The improvement, however, may be less than expected, because these documents are not simply a random selection, wrongly classified because of poor indexer training, or ineptness, or laziness. These documents have identifiable characteristics, and in fact, since the computer behaves so much like a human indexer, it seems plausible that if two humans were compared at this same indexing task, their disagreements would also be found in this same group of "split" documents, a postulate which should be easy to verify.

Much of the criticism about indexer inconsistency may be unjustified. It is largely the characteristics of the documents, together with those of the classification system, which result in disagreement, and not lack of training or skill. It is also plausible that the number of "split" documents, for any particular sample, is a function of the classification system, i.e. of the fuzziness of the category definitions, and may in fact be a useful measure of this unwelcome quality. Due regard, of course, must be paid to the subject area covered, the number of categories, the document distribution, and so on. Examination of the groups of split and unsplit documents should permit assessment of changes needed in the category descriptions, or of the need for additions or deletions of categories, in order to improve the classification system. The unsplit documents in Table 3(b) are 90% in agreement, an indication of what might be achieved.

We chose the procedure of Table 3(b), rather than 3(a), for the above discussion because it gives the smaller number of split documents. (Actually both procedures split more documents than we would like.) A second reason for the choice is that the procedure of 3(b) involves a lengthwise splitting of the whole document. The procedure for 3(a), on the other hand, splits a document into first and second halves, and it could well be that some documents discuss one subject, or aspect, in the first half and some other in the second. For example the first half might discuss some mathematical theory and the second half a specific application; or a document might describe a new hardware development and then deal with its effect in some specific subject area. In such cases document splitting is inevitable and not related to weakness in the classification system.

In the case of the procedure in Table 3(b), however, where alternate words are taken, no such argument can account for the splitting. The most reasonable explanation is that the 31 documents deal with borderline subjects, i.e. they do not fit cleanly into any one of the nine available categories. The group of 49 documents from Table 3(a), of course, includes splits from both causes and should be expected to include the 31 of Table 3(b), plus those documents splitting because they deal with different subjects in their first and second halves. On examination, the group of 49 documents did include 22 from the group of 31. The probability of such a large common group occurring by chance is less than 1%. It might be interesting also to examine the group of 18 documents present in the group of 49, but not in the group of 31, and see if it did, in fact, consist of documents dealing with two subjects. As an interesting sidelight, the above analysis would appear to justify the well-known practice of indexers of examining both the beginning and the end of a document before deciding on its subject category.
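The paper does not say how the figure of less than 1% was obtained. One way to check its order of magnitude is a hypergeometric tail probability: if 49 of the 124 documents were chosen at random, how likely is it that 22 or more of the fixed group of 31 would fall among them? The sketch below, using SciPy, performs that check; it is offered as a plausibility verification, not as the author's calculation.

```python
from scipy.stats import hypergeom

# 124 documents in all; 49 split under half-and-half division;
# of the 31 alternate-word splits, 22 fell among those 49.
total_docs, half_splits, alt_splits, common = 124, 49, 31, 22

# Probability that 22 or more of the 31 land in the group of 49 by chance alone.
p = hypergeom.sf(common - 1, total_docs, half_splits, alt_splits)
print(p)  # well below 1%, consistent with the figure quoted above
```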

A rejection rate of 25%, let alone 40%, is probably unacceptably high in many applications because of the cost of re-examination. The rate may be even higher (or lower) with other collections. Ideally, we should have control of the rejection rate. To attain this control, the split documents must be graded, somehow or other, from strongly split to weakly split. To find a suitable measure is difficult. We made two unsuccessful attempts.

We examined the probabilities for the two halves of the split documents. Looking again at the data for document 7357, for example, we had values of 5.67 and 5.20 as maximums for the two halves. What we seek is some way of interpreting these two maximums, 5.67 and 5.20, as a measure of the degree of split between the two parts of the document. We took the ratio 5.67/5.20 = 1.09 and computed a similar value for every document in the collection. The values ranged from 2.13 to 1.02. Those documents having this ratio above the median and those having it below were tested to see whether the two groups correlated with the groups of documents in agreement and disagreement. Fisher's test was applied, but the results were not significant at the 5% level. Apparently the ratio of the two probabilities for the document halves is of little importance.
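The test just described amounts to a 2 x 2 contingency table: ratio above or below the median on one axis, agreement or disagreement with the professionals on the other. The sketch below shows the mechanics with SciPy's Fisher exact test; the counts are invented placeholders consistent with the totals (62 documents on each side of the median, 97 agreements, 27 disagreements), since the paper does not report the actual table.

```python
from scipy.stats import fisher_exact

# Hypothetical 2 x 2 table (the actual counts are not reported in the paper):
# rows: ratio above / below the median; columns: agreement / disagreement.
table = [[47, 15],
         [50, 12]]

odds_ratio, p_value = fisher_exact(table)
print(odds_ratio, p_value)  # the paper reports no significance at the 5% level
```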

We next computed the deviation of the two maximum probability values from the mean of the 9 readings, in terms of the standard deviation (i.e. standard scores). Typically, document 7357 gave values of 1.40 and 1.85, a ratio of 1.32. Again the documents were divided into two groups having ratios above and below the median, and Fisher's test employed, but again the results were not significant at the 5% level.
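The standard scores quoted for document 7357 can be recovered from the probabilities listed earlier. The snippet below does so; it uses the population standard deviation of the nine readings, which appears to be what was intended, since it reproduces 1.40 and 1.85.

```python
from statistics import mean, pstdev

first_part  = [4.66, 5.03, 3.92, 0.00, 0.80, 3.11, 5.67, 0.00, 0.94]
second_part = [4.66, 5.20, 2.69, 0.90, 1.55, 1.66, 2.60, 3.00, 1.50]

def standard_score_of_max(scores):
    """Standard score of the largest of the nine category probabilities."""
    return (max(scores) - mean(scores)) / pstdev(scores)

z1 = standard_score_of_max(first_part)
z2 = standard_score_of_max(second_part)
print(f"{z1:.2f} {z2:.2f} {z2 / z1:.2f}")  # 1.40 1.85 1.32
```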

We are compelled, then, to accept the rejection ratio without control, if we employ our error detection technique. There are, of course, expedients, such as varying the amount of word overlap in the two parts of the document. For example, we could divide the document by selecting words in the following order:

1,2 - 4,5 - 7,8 - ...  for the first part
2,3 - 5,6 - 8,9 - ...  for the second part
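A sketch of this overlapping selection is given below; word positions are counted from 1, as in the pattern above, and every third word is shared by the two parts.

```python
def split_with_overlap(words):
    """Overlapping division: parts 1,2 - 4,5 - 7,8 - ... and 2,3 - 5,6 - 8,9 - ...

    Word positions are 1-based in the pattern; every third word (2, 5, 8, ...)
    appears in both parts.
    """
    first  = [w for i, w in enumerate(words, start=1) if i % 3 != 0]  # 1, 2, 4, 5, 7, 8, ...
    second = [w for i, w in enumerate(words, start=1) if i % 3 != 1]  # 2, 3, 5, 6, 8, 9, ...
    return first, second
```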

Such an artifice is purely arbitrary and pragmatic.

With our technique we obtained an improvement in agreement between machine and professionals from 78 to 90%, at the cost of re-examining 25% of the material. Whether the improvement is worth it depends on the case, but the procedure does offer a non-subjective means of measuring the performance of a classification system and, for a given system, does identify the "difficult" documents. Keywords of the two document parts can be listed separately to furnish clues as to why the document split. Hopefully these factors will lead to improvements. The whole subject of classification, or rather misclassification, already has an extensive literature[7].

REFERENCES

[1] W. G. HOYLE, In Mezhdunarodny forum po informatike, Proceedings, Vol. 2. International Federation for Documentation, 7 Hofweg, The Hague, Netherlands (Sept. 1968).

[2] H. P. LUHN, A statistical approach to mechanized encoding and searching of literary information. IBM J. Res. Devel. (1957).

[3] A. BIRNBAUM and A. E. MAXWELL, Classification procedures based on Bayes' formula. Appl. Statist. 1960, 9, 152-169.

[4] W. G. HOYLE, Automatic indexing and the generation of classification systems by algorithm. Inform. Stor. Retr. 1973, 9, 233-243.

[5] W. G. HOYLE, A measure of overlap in classification systems. Proc. of the 3rd Int. Study Conf. on Classification Research, Bombay, India, pp. 611 (Jan. 1975).

[6] HAROLD BORKO, Measuring the reliability of subject classification by men and machines. Am. Docum. 1964, p. 268.

[7] G. T. TOUSSAINT, Bibliography on estimation of misclassification. IEEE Trans. Inform. Theory 1974, IT-20(4).