
Knowledge-Based Indexing of the Medical Literature: The Indexing Aid Project

Susanne M. Humphrey and Nancy E. Miller
National Library of Medicine, Bethesda, MD 20894

This article describes the Indexing Aid Project for conducting research in the areas of knowledge representation and indexing for information retrieval in order to develop interactive knowledge-based systems for computer-assisted indexing of the periodical medical literature. The system uses an experimental frame-based knowledge representation language, FrameKit, implemented in Franz Lisp. The initial prototype is designed to interact with trained MEDLINE indexers who will be prompted to enter subject terms as slot values in filling in document-specific frame data structures that are derived from the knowledge-base frames. In addition, the automatic application of rules associated with the knowledge-base frames produces a set of Medical Subject Heading (MeSH) keyword indices to the document. Important features of the system are representation of explicit relationships through slots which express the relations; slot values, restrictions, and rules made available by inheritance through “is-a” hierarchies; slot values denoted by functions that retrieve values from other slots; and restrictions on slot values displayable during data entry.

I. Introduction

This report describes the Indexing Aid Project. This ongoing research is part of the Automated Classification and Retrieval Program (ACRP) and is being conducted in the Computer Science Branch of the National Library of Medicine’s (NLM’s) Lister Hill National Center for Biomedical Communications. The objective of the ACRP is to conduct research leading to development of automated systems for identifying, representing, and retrieving relevant information from medical documents. Specifically, the research problem is how to build systems to aid indexing and retrieval of documents from the vast and continually growing (300,000 new citations added per year) MEDLINE database. Accordingly, the theory and methodology of two major research disciplines of information/computer science are being applied to this problem: information retrieval and artificial intelligence (AI). Specific subdisciplines of AI that are relevant to work in the ACRP are knowledge representation and natural-language understanding.

Received December 2, 1985; revised March 18, 1986; accepted May 9, 1986. Not subject to copyright within the United States. Published by John Wiley & Sons, Inc.

The Indexing Aid Project entails research in two areas, indexing for information retrieval and knowledge representation, manifested in the construction and encoding of knowledge bases (collections of facts and the rules for applying these facts to perform an intellectual task) in automated systems that will assist indexers interacting with them. The general purpose of the project is to develop and test interactive knowledge-based systems for computer-assisted indexing of the medical literature currently indexed in the MEDLINE database.

The following guidelines have been established for the domain of the ACRP and therefore apply to the Indexing Aid Project as well:

Clinical practice of medicine in the United States.
Of immediate and ongoing interest to the U.S. medical community.
Information published in high-quality journals readily available to U.S. practitioners.

In addition, we have stipulated that the documents to be indexed be original reports that have abstracts in the MEDLINE database.

This report describes research in progress. Next we will review the problem areas of indexing consistency and quality that the Indexing Aid System we are developing will address. Subsequently, we will characterize the system at its current stage of development. We will then state our research plan for the immediate future, followed by an enumeration of future research areas in the long term. Appendix A furnishes background information about NLM indexing; Appendix B, the notion of frames in the disciplines of computational linguistics, cognitive psychology, and artificial intelligence.

II. The Indexing Problem

A. Purpose of the Indexing Aid System

The Indexing Aid System will be evaluated in terms of improved indexing. Specifically, the MeSH indexing terms that have been assigned to a document as a result of interactions between indexers and the system will serve to test the system by providing a basis for comparison with the current system (MEDLINE), in which there is minimal procedural knowledge provided by the computer in support of indexing (see Appendix A for an overview of MEDLINE indexing at NLM).

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. 38(3):184-196, 1987 CCC 0002-8231/871030184-13$04.00

However, a more direct result of the interaction will be a set of data structures, known as frames (discussed in detail later in this paper), which are derived from a computerized knowledge base. The frames of the knowledge base may be viewed as an extension of the records for indexing terms in a controlled vocabulary. The important difference is that many different kinds of relationships between knowledge entities can be expressed explicitly using frames, whereas typically the relations that may be expressed explicitly in a controlled vocabulary are limited to those involving generalization-specialization. Furthermore, a system which uses frames can encode the rules for building other frames that are linked to specific documents. This procedural knowledge can be used to assist indexers in interactively building these document-specific frames, as well as contribute to the automation of this task.

We will now review the issues of indexing consistency and quality, defining the latter in terms of an expert standard for evaluating the system. Then, before describing the Indexing Aid System in detail, we will explain briefly how the system we are developing uses computer prompts to promote more complete indexing and also adds certain types of MeSH terms fully automatically by virtue of inferencing rules in the knowledge base. These rules in effect encode the sorts of indexing rules furnished by existing tools used by indexers, which are currently in the form of instructions, definitions, and illustrations. Some of these indexing rules are cited in the discussion.

B. Indexing Consistency Studies

Consistency studies measure the degree to which searchers ultimately may rely on the assignment of like headings to like concepts. Numerous studies on indexing consistency/variability of the 1960s and 1970s have been cited by White [1] and Leonard [2]. White reviewed 41 reports published from 1961-1971. Leonard reviewed and summarized 34 studies resulting in one 1954 report and 31 reports published from 1961-1975.

In their MEDLINE indexing study Funk and Reid [3] discussed the studies cited by Leonard and noted that almost all had assumed that highly consistent indexing was inherently good and therefore could be used as a measure of indexing quality, an exception being Cooper’s opinion paper [4] (which began with a quote from Emerson, “A foolish consistency is the hobgoblin of little minds.”). Cooper’s mathematical argument, based on a counterexample, was that because bad indexing could have high levels of consistency, measuring these levels would tell us nothing of a system’s performance. Funk and Reid stated that Leonard’s experiment [5] seemed to have ended this debate. Leonard tested the effect of indexing consistency levels on retrieval effectiveness and concluded that

Indexer consistency and retrieval effectiveness exhibit a tendency toward a direct, positive relationship, i.e., high interindexer consistency in assignment of terms appears to be associated with a high retrieval effectiveness of the documents indexed. The concern for maintaining high consistency in indexing seems to have been valid and the work of developing and applying indexing rules and vocabulary control seems justified.

(Note: Cooper also concluded that, based on an equation that he derived, it was a “plausable conjecture” that there were circumstances under which an increase in interindexer consistency must necessarily result in an increase in retrieval effectiveness.)

In addition, Funk and Reid reviewed previous indexing studies of the NLM system [5-7]. All used Hooper’s measure of indexing consistency [8].

The first study was performed by Lancaster as part of his evaluation of the first-generation MEDLARS, a batch retrieval system predating the existence of MEDLINE (MEDlars onLINE). Sixteen documents were each reindexed by three NLM indexers. Two sets of measurements were performed: one for consistency in assignment of check tags (like HUMAN, ANIMAL, REVIEW, IN VITRO) and heading/subheading combinations; the other for consistency in assignment of check tags and headings disregarding subheadings. Consistency percentages were 34.4% and 46.1%, respectively.

In his study relating interindexer consistency and retrieval effectiveness Leonard applied the same measurements as Lancaster. Ten indexers under contract to NLM indexed a total of ten document groups; each indexer was randomly assigned to index five document groups, five indexers to a document group. His consistency results were 36.5% for check tags and heading/subheading combinations, and 48.2% for check tags and headings disregarding subheadings.

Marcetich and Schuyler studied whether indexing consistency might be improved by using the AID (Associative Interactive Dictionary) system, developed by Doszkocs at NLM [9], which suggests terms based on occurrence frequencies. Fifty English-language documents were indexed by four indexers using AID and four indexers in the usual unAIDed way. Consistency for assignment of headings, excluding check tags and disregarding subheadings, was 39% for routine indexing and 43% for indexing using the AID system. When only printed Index Medicus headings (representing central concepts of documents) were considered, consistency rose to 48% for routine indexing and 55% for indexing using the AID system.
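The percentages reported in these studies rest on Hooper's measure, whose formula is not given in the text. As an assumption supplied here, it is commonly stated as A / (A + M + N), where A is the number of terms both indexers assigned and M and N are the numbers of terms assigned by only one of them. The following minimal Python sketch applies that formula to the partial Indexer X and Indexer Y term sets shown later in this article. The function names are hypothetical, the central-concept asterisk is dropped, and the resulting figures illustrate the arithmetic only, since the article displays just part of each indexer's output.

```python
def hooper_consistency(terms_a, terms_b):
    """Hooper's measure as commonly stated: A / (A + M + N)."""
    a, b = set(terms_a), set(terms_b)
    agreed = len(a & b)   # A: terms assigned by both indexers
    only_a = len(a - b)   # M: terms assigned only by the first indexer
    only_b = len(b - a)   # N: terms assigned only by the second indexer
    total = agreed + only_a + only_b
    # Assumption: two empty term sets are scored 0.0 rather than left undefined.
    return agreed / total if total else 0.0


def strip_subheadings(terms):
    """Reduce heading/subheading combinations to bare headings."""
    return {term.split("/")[0] for term in terms}


# Partial indexing taken from the article's Indexer X / Indexer Y example.
indexer_x = {
    "COMBINED MODALITY THERAPY",
    "LYMPH NODE EXCISION",
    "PROSTATIC NEOPLASMS/radiotherapy",
    "PROSTATIC NEOPLASMS/surgery",
}
indexer_y = {
    "LYMPH NODE EXCISION",
    "PELVIS",
    "PROSTATIC NEOPLASMS/radiotherapy",
}

with_subheadings = hooper_consistency(indexer_x, indexer_y)
headings_only = hooper_consistency(
    strip_subheadings(indexer_x), strip_subheadings(indexer_y)
)
print(with_subheadings, headings_only)
```

On these two sets the score is lower when heading/subheading combinations are compared than when subheadings are disregarded, which is the same direction of effect the studies above report.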

Funk and Reid noted the following limitations of these studies:

Small samples of documents, which may therefore not be representative of overall indexing consistency.
Indexers’ awareness of being “tested,” which distinguished the situation from normal working conditions.
Measurement of only a few categories of index terms.

In contrast, the number of documents in their indexing consistency study was 760, and consisted of documents that had been indexed twice by NLM. (In some cases the duplicate indexing was inadvertent due to an error during serial check in; in others, the identical documents appeared in separate publications. Problems leading to inadvertent duplication in the NLM system have since been resolved.) Using Hooper’s measurement, consistency for nine categories of index terms was calculated, resulting in the following percentages:

Categories of Index Terms                                                  Percentages

Check tags                                                                 74.7
Central-concept headings ignoring subheadings                              61.1
Geographics                                                                56.6
Check tags, geographics, and headings ignoring subheadings                 55.4
Central-concept subheadings                                                54.9
Subheadings                                                                48.7
Headings excluding check tags and geographics and ignoring subheadings     48.2
Central-concept heading/subheading combinations                            43.1
Heading/subheading combinations                                            33.8

Note that the other studies corroborate Lancaster’s findings of lower consistency for categories considering heading/subheading combinations than for categories including headings but ignoring subheadings, indicating difficulty in the consistent use of more complicated thesauri.

C. The Need for Expert Indexing

In addition to interindexer consistency, our concerns are somewhat along the line of Cooper’s, mentioned earlier, namely, suppose a concept is being indexed wrongly most of the time. In our research we are introducing the element of a standard for indexing. The standard we are using is the version of indexing that conforms to the thesaurus, the indexing rules, and to the practice of those indexers who write the rules and train and revise others in their application.

If one accepts the notion of expert indexing (as one might accept expert diagnosis or expert treatment), an appropriate measurement of an indexing system with respect to consistency would be consistency with expert indexing. Furthermore, since consistency is a binary relation exhibiting transitive properties (a binary relation R in a set S is transitive if and only if whenever aRb and bRc, then aRc, where a, b, and c are elements of S), then by demonstrating indexing consistent with a standard, we also show that conformity to expert indexing perforce leads to consistent indexing.

The following are types of indexing inconsistent with expert indexing (also known as errors) [10]. “Concept” here may correspond to a heading or a heading/subheading combination.

Indexing of a concept with an inappropriate term.
Failure to index a concept.
Indexing of a concept that should be ignored.

These may occur because of lack of awareness of the concept, conceptual analysis failure, conceptual translation failure, unawareness of one or more indexing rules, misunderstanding of one or more indexing rules, or simple oversight.

The following example shows a case of inconsistent indexing of a document that was inadvertently indexed twice. We will highlight only two areas of inconsistency, and therefore limit the display to the indexing needed for illustrative purposes. We will also describe the implications of inexpert indexing for retrieval, but our primary criterion for judging the indexing is conformity with a standard.

The following is the citation to the document in question:

AU  Syed AM; Futhawala A; Tansey LA; Shanberg AM
AU  Neblett D; Mendez R; Barloon JW; Ingram JE
AU  Naftel WT; McNamara C
TI  Management of prostate carcinoma. Combination of pelvic lymphadenectomy, temporary Ir-192 implantation, and external irradiation.
SO  Radiology 1983 Dec; 149(3): 829-33

The following are versions of indexing by Indexer X and Indexer Y:

Indexer X

Combined Modality Therapy
Lymph Node Excision
Prostatic Neoplasms/*RADIOTHERAPY/SURGERY

Indexer Y

Lymph Node Excision
Pelvis
Prostatic Neoplasms/*RADIOTHERAPY

The first area for comparison involves the disease management aspect. Only Indexer X recognized the concept of combined modality therapy, as evidenced by assignment of the corresponding heading. Furthermore, Indexer X applied the rule for assigning the subheading “surgery” in combination with the disease heading. The portion of the MeSH scope note for “surgery” that applies here is “Used for operative procedures on organs, regions, or tissues in the treatment of disease. . . .”

The rule is illustrated by the following example in the MEDLARS Indexing Manual:

Hepatectomy in cancer of the liver:
LIVER NEOPLASMS/surgery
HEPATECTOMY

Both indexers assigned the subheading “radiotherapy” to the disease heading and expressed the emphasis in the document on radiotherapy by preceding this subheading with an asterisk, which is a central-concept indicator. They both also recognized the concept of lymph node excision and assigned the heading for this concept. However, Indexer Y did not assign the subheading “surgery” in combination with the disease heading. Perhaps Indexer Y misapplied the following rule in the Manual:

/surgery is used for surgical intervention upon an organ involved in a given disease with the intention of curing that disease. When surgery is performed . . . in a patient with an unrelated disease, the subheading /surgery cannot apply.

The example shown in the Manual is:

Mastectomy in a diabetic:
DIABETES MELLITUS
MASTECTOMY

Following this rule in this instance would result in the second type of error, namely, failure to index a concept (in this case a heading/subheading combination). If this rule was indeed considered (and it might not have been), the reason for the error might have been conceptual analysis failure or misunderstanding of the rule. Another possible explanation for this error is oversight.

From the retrieval standpoint, omission of the subheading “surgery” might cause difficulties. One should be able to retrieve all citations referring to surgery for prostatic cancer using the search expression PROSTATIC NEOPLASMS/surgery. A wise searcher might anticipate omission of the subheading “surgery” in isolated cases and compensate by also searching the logical intersection of PROSTATIC NEOPLASMS with specific surgery headings that might apply, for instance PROSTATIC NEOPLASMS AND LYMPH NODE EXCISION. However, this alternate strategy places a burden on the searcher to know or find out what types of surgery are performed for this disease and furthermore to determine appropriate headings for them.

Feedback during interactive retrieval is frequently used for finding appropriate concepts and terms. In this case the searcher might enter PROSTATIC NEOPLASMS/surgery and notice citations in the retrieval-set display indexed under LYMPH NODE EXCISION (as in the Indexer-X version). Other surgery headings might be discovered as well, such as PROSTATECTOMY or ORCHIECTOMY, and, as an alternative to PROSTATIC NEOPLASMS/surgery, the searcher would intersect PROSTATIC NEOPLASMS with the union of the surgery headings, i.e., PROSTATIC NEOPLASMS/SURGERY OR PROSTATIC NEOPLASMS AND (LYMPH NODE EXCISION OR PROSTATECTOMY OR ORCHIECTOMY).
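The compensating search expression can be read as set operations over postings lists. The sketch below is a minimal illustration in Python under stated assumptions: the inverted index, its citation numbers, and the function name are all hypothetical, and nothing here reflects how MEDLINE retrieval software was actually implemented.

```python
# Hypothetical inverted index: search term -> set of citation numbers.
# Citations 3 and 4 stand for documents indexed like the Indexer-Y version,
# with a surgery heading but without the subheading "surgery".
index = {
    "PROSTATIC NEOPLASMS": {1, 2, 3, 4, 5},
    "PROSTATIC NEOPLASMS/surgery": {1, 2},
    "LYMPH NODE EXCISION": {1, 3, 6},
    "PROSTATECTOMY": {2, 4},
    "ORCHIECTOMY": {7},
}


def postings(term):
    """Return the citations indexed under a term (empty set if none)."""
    return index.get(term, set())


# PROSTATIC NEOPLASMS/surgery
direct = postings("PROSTATIC NEOPLASMS/surgery")

# (LYMPH NODE EXCISION OR PROSTATECTOMY OR ORCHIECTOMY)
surgery_headings = (
    postings("LYMPH NODE EXCISION")
    | postings("PROSTATECTOMY")
    | postings("ORCHIECTOMY")
)

# PROSTATIC NEOPLASMS AND (union of the surgery headings)
compensating = postings("PROSTATIC NEOPLASMS") & surgery_headings

# The full expression: the direct search OR the compensating intersection.
broadened = direct | compensating
recovered = broadened - direct
print(sorted(broadened), sorted(recovered))
```

In this toy index the broadened search recovers the two citations that lack the subheading, but only because the searcher already knew which surgery headings to include in the union.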

If the search query were for citations referring to all nondrug treatment of prostatic cancer, the burden of alternate strategies would be increased, since one might surmise that application of the subheading “radiotherapy” may also be unreliable. The searcher might therefore feel compelled to discover specific radiotherapy headings and intersect them with the disease heading as well. If the radiotherapy headings include individual radionuclides, this search could become time consuming.

An even worse case would be a search query for citations referring to treatment of prostatic cancer (including chemotherapy). If one were afraid that the subheading “drug therapy” might not have been assigned in combination with the disease, i.e., PROSTATIC NEOPLASMS/drug therapy, imagine having to, in addition to the previous efforts, intersect PROSTATIC NEOPLASMS with all appropriate drug headings, including all the specific estrogen compounds that might be used. The chances of false coordinations would be greatly increased with such a strategy.

We may justifiably conclude that indexing rules for assigning subheadings are designed to facilitate retrieval (both computerized and manual; Index Medicus is still maintained by the NLM) and should be consistently applied to save searchers time, money, and aggravation.

The second area of indexing policy is specification of the location of the lymph nodes that were excised. Indexer Y assigned the heading PELVIS as a coordinate to the heading LYMPH NODE EXCISION. Within the context of this document, the location of the lymph nodes might not be necessary. However, the context of search queries cannot be assumed.

Indexing policy does cover the locational aspect, but not in a straightforward way; that is, there is no section in the Manual on indexing for location of tissues and organs that occur in many places of the body, nor is there a rule for coordinating precoordinate organ-surgery headings (like LYMPH NODE EXCISION) with location of the organ, which would be just the rule needed.

In this case, help may be found in the indexing annotations in Alphabetic MeSH. The applicable annotations are as follows:

LYMPH NODES
‘axillary lymph nodes’ = LYMPH NODES (IM) + AXILLA (NIM, no qualif); ‘cervical lymph nodes’ = LYMPH NODES (IM) + NECK (NIM, no qualif)

AXILLA
‘axillary lymph nodes’ = LYMPH NODES (IM) + AXILLA (NIM, no qualif)

NECK
‘cervical lymph nodes’ = LYMPH NODES (IM) + NECK (NIM, no qualif)

Note, however, the absence of reference to “pelvic lymph nodes” in the LYMPH NODES annotation, nor does MeSH include this reference in the PELVIS annotation. However, one might draw the appropriate analogy from the LYMPH NODES annotation and thereby apply the heading PELVIS. [Simply by way of explanation, (IM) means that LYMPH NODES should be indexed as a central concept, (NIM) that the locational headings AXILLA and NECK and, by analogy, PELVIS should not be indexed as central concepts, and no qualif that the locational headings should not be qualified by a subheading.]


D. Role of the Indexing Aid System

The Indexing Aid System is designed to provide interactive assistance to indexers. A script will be presented further on that will demonstrate prompting the indexer for MeSH headings by displaying certain relations terms, for instance, body-part, symptom, and procedure, in relation to a disease heading that the indexer has already entered for describing the current document. The script will also demonstrate completely automatic assignment of subheadings by the system, based on inferencing rules encoded in the system’s knowledge base. These features should serve to mitigate the problems discussed previously, namely, the inadvertent omission of subheadings and of headings for which the system would prompt, such as body-area headings.

In addition, the system is designed to automatically assign certain MeSH headings which are not unit concepts (i.e., do not correspond to frames, which will be described shortly) in the system’s knowledge base. For instance, the system might add the heading COMBINED MODALITY THERAPY based on a rule which states that this heading is needed when the document has already been indexed for two or more types of procedure used adjunctively to treat the same disease. An example of another rule for adding a heading would be, if the purpose of the procedure is diagnosis and the age of the patient is fetus, then add the heading PRENATAL DIAGNOSIS.
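The two heading-adding rules just described can be sketched as simple condition checks. The Python fragment below is a hedged illustration only: the dictionary layout and the slot names "type", "purpose", and "patient-age" are hypothetical stand-ins, and the actual system encodes its rules as Lisp functions in FrameKit rather than in this form.

```python
def add_rule_based_headings(disease_frame):
    """Return headings implied by the two example rules (hypothetical slot names)."""
    headings = []
    procedures = disease_frame.get("procedure", [])

    # Rule 1: two or more types of procedure used to treat the same disease.
    treatment_types = {
        p["type"] for p in procedures if p.get("purpose") == "treatment"
    }
    if len(treatment_types) >= 2:
        headings.append("COMBINED MODALITY THERAPY")

    # Rule 2: the purpose of a procedure is diagnosis and the patient is a fetus.
    is_diagnostic = any(p.get("purpose") == "diagnosis" for p in procedures)
    if is_diagnostic and disease_frame.get("patient-age") == "fetus":
        headings.append("PRENATAL DIAGNOSIS")

    return headings


# A frame resembling the prostate carcinoma example: surgery plus radiotherapy.
prostate_case = {
    "procedure": [
        {"type": "surgery", "purpose": "treatment"},
        {"type": "radiotherapy", "purpose": "treatment"},
    ]
}
# A hypothetical fetal case with a diagnostic procedure.
fetal_case = {
    "patient-age": "fetus",
    "procedure": [{"type": "ultrasonography", "purpose": "diagnosis"}],
}

prostate_headings = add_rule_based_headings(prostate_case)
fetal_headings = add_rule_based_headings(fetal_case)
print(prostate_headings, fetal_headings)
```

Under these assumptions the heading that Indexer Y missed would be supplied without relying on the indexer to recognize the combined modality concept.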

III. The Indexing Aid System

A. Overview of the Frame Data Structure

The Indexing Aid System uses frames as data structures (see Appendix B for a discussion of the abstract notion of frames). The system is written in an experimental Lisp-based knowledge representation language, FrameKit, developed by Carnegie-Mellon University [11], and an extension package based on this language. An initial prototype has been developed which operates on NLM’s VAX 11/780 minicomputer under the Unix BSD 4.2 operating system. The University of Maryland Window Package (written in C) has been adapted for the user interface.

Three types of entities are represented in the system: the document, knowledge, and the journal. There are only one generic document frame and one generic journal frame. The generic knowledge frames comprise the knowledge base.

The name of a frame identifies the type of entity. A frame is linked to another frame by its slots; that is, a slot name identifies a relation. The following is the generic document frame with its relations, unique-id, author, title, and so on:

(document
  (unique-id)
  (author)
  (title)
  (abstract)
  (source)
  (contained-in)
  (indices)
  (contents))

When a document is selected for indexing, it is stored in the system as a document-specific document frame. Document-specific frames are known as instantiation frames, and are said to have been instantiated from their generic parent frames. The slots of the generic document frame do not have values. However, slots of frames may have values, and these would appear in the value facet of the slot. The following is a document instantiation frame corresponding to a MEDLINE citation:

(document-85140959
  (is-a (value document))
  (unique-id (value 85140959))
  (author
    (value (Wilbur AC |;| Woelfel GF |;| Meyer JP |;| Flanigan DP |;| Spigos DG)))
  (title
    (value (Adventitial cystic disease of the popliteal artery.)))
  (abstract
    (value (Adventitial cystic disease of the popliteal artery is an unusual condition of uncertain etiology in which a mucin-containing cyst forms in the wall of the popliteal artery and causes symptoms of intermittent claudication |,| typically in young adults whose arteries are otherwise normal. Arteriography characteristically shows a smooth-walled |,| curvilinear narrowing. In the case described |,| a combination of findings from arteriography |,| computed tomography |,| and ultrasound resulted in a highly specific preoperative diagnosis.)))
  (source
    (value (Radiology 1985 Apr |;| 155 |(| 1 |)| : 63-4)))
  (contained-in (value journal-rad8504)))

The name of the preceding frame identifies it as a document instantiation frame. It is linked to the generic document frame by the is-a relation, that is, (document-85140959 (is-a (value document)) . . . ). Each slot has a value facet with the value filled in. A value may serve as a pointer to another frame. For instance, the contained-in slot points to a journal instantiation frame: (document-85140959 (contained-in (value journal-rad8504))), where journal-rad8504 refers to the April 1985 issue of the journal Radiology.
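The idea that a slot value can act as a pointer to another frame can be illustrated outside FrameKit. The following is a minimal Python sketch, assuming frames are stored as nested dictionaries keyed by frame name. The helper names get_values and follow are hypothetical, as are the slots shown on the journal frame, since the article does not display that frame.

```python
# Frames modeled as: frame name -> slot name -> facet name -> list of values.
frames = {
    # Generic document frame: slots exist but carry no values.
    "document": {
        "unique-id": {}, "author": {}, "title": {}, "abstract": {},
        "source": {}, "contained-in": {}, "indices": {}, "contents": {},
    },
    # Document instantiation frame (abridged from the listing above).
    "document-85140959": {
        "is-a": {"value": ["document"]},
        "unique-id": {"value": ["85140959"]},
        "contained-in": {"value": ["journal-rad8504"]},
    },
    # Journal instantiation frame; these slots are hypothetical.
    "journal-rad8504": {
        "is-a": {"value": ["journal"]},
        "title": {"value": ["Radiology"]},
        "issue": {"value": ["1985 Apr"]},
    },
}


def get_values(frame_name, slot):
    """Return the value facet of a slot, or an empty list if it is unfilled."""
    return frames[frame_name].get(slot, {}).get("value", [])


def follow(frame_name, slot):
    """Treat each slot value as a frame name and return the frames it points to."""
    return [frames[name] for name in get_values(frame_name, slot) if name in frames]


parent = get_values("document-85140959", "is-a")
journal = follow("document-85140959", "contained-in")[0]
print(parent, journal["title"]["value"])
```

Because a pointer is just a frame name, a value that names no stored frame is simply skipped by follow rather than raising an error; that choice is a simplification made for this sketch.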

The knowledge-base frames represent indexable knowledge entities in the medical domain for processes, procedures, biological structures, and chemical substances. These are the same sorts of entities as appear in the Medical Subject Headings (MeSH) thesaurus, which contains the controlled vocabulary for indexing the MEDLINE database. Unlike MeSH, however, the frames of the knowledge base encode the factual knowledge (indexable knowledge entities) as a semantic network whereby the relationships between entities are expressed explicitly. The following are three frames from the knowledge base representing disease, neoplasm, and cyst. One may follow the is-a relation to display the relationships as cyst is-a neoplasm is-a disease:

(disease
  (is-a (value medical-subject))
  (instances (value neoplasm))
  (body-part
    (restrictions ((Lisp function)))
    (if-added ((Lisp function))))
  (procedure
    (restrictions ((Lisp function)))
    (if-added ((Lisp function))))
  (symptom
    (restrictions ((Lisp function)))
    (if-added ((Lisp function)))))

(neoplasm
  (is-a (value disease))
  (instances (value cyst)))

(cyst
  (is-a (value neoplasm)))

A slot may have facets other than the value facet. Two of these, restrictions and if-added, are shown in the preceding disease frame. The restrictions facet of the body-part slot in the disease frame contains Lisp functions that define the allowable values for the current slot. The if-added facet contains rules which are fired when a value has been added to a slot. Rules in these facets are known as demons, and if-added rules are also known as if-added demons. The following summarizes the slot facets currently used in the knowledge base of the Indexing Aid System:

value: stores the value of the slot
restrictions: contains restrictions on what can be stored in the value facet of the slot
if-added: contains demons to be fired if a slot value is added
if-needed: contains demons to be fired to retrieve a slot value from somewhere else
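Since the cyst and neoplasm frames declare no slots of their own, their restrictions and rules must be found by climbing the is-a hierarchy. The sketch below shows one way such a lookup could work, written in Python rather than FrameKit. It assumes a single parent per frame, and the strings "body-part-p" and "add-body-part-heading" are hypothetical placeholders for the Lisp functions elided in the listing above.

```python
# Knowledge-base frames: frame name -> slot name -> facet name -> list.
kb = {
    "medical-subject": {},
    "disease": {
        "is-a": {"value": ["medical-subject"]},
        "instances": {"value": ["neoplasm"]},
        "body-part": {
            "restrictions": ["body-part-p"],        # placeholder name
            "if-added": ["add-body-part-heading"],  # placeholder name
        },
    },
    "neoplasm": {
        "is-a": {"value": ["disease"]},
        "instances": {"value": ["cyst"]},
    },
    "cyst": {"is-a": {"value": ["neoplasm"]}},
}


def is_a_chain(frame_name):
    """List a frame and its ancestors, following the first is-a value only."""
    chain = []
    current = frame_name
    while current is not None and current not in chain:
        chain.append(current)
        parents = kb.get(current, {}).get("is-a", {}).get("value", [])
        current = parents[0] if parents else None
    return chain


def inherited_facet(frame_name, slot, facet):
    """Return the nearest non-empty facet along the is-a chain and its owner."""
    for name in is_a_chain(frame_name):
        values = kb.get(name, {}).get(slot, {}).get(facet)
        if values:
            return values, name
    return [], None


chain = is_a_chain("cyst")
restrictions, owner = inherited_facet("cyst", "body-part", "restrictions")
print(chain, restrictions, owner)
```

Stopping at the nearest frame that supplies the facet lets a more specific frame override an ancestor, which is one common inheritance policy; whether the Indexing Aid System combines or overrides inherited rules is not stated in the text.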

B. Overview of the Use of Procedural Knowledge and Inheritance

A feature of the system is the encoding of procedural knowledge which enables the system to interact with indexers. This code is in a program known as Process 2, or Indexer Interface, and is described later in this paper as software that is specific to the system. During this interaction, controlled by the Process-2 program, frames are instantiated from the knowledge base, meaning that they are copied from the knowledge base and linked to a particular MEDLINE citation by a unique number. These document-specific knowledge-base frames are referred to as instantiation frames, and are instantiations of the corresponding knowledge-base frame just as document-specific document frames are instantiations of the generic document frame. The following is an example of an instantiation frame of the cyst frame corresponding to document-85140959. The completion of this frame is demonstrated in a script of an indexer-computer interaction displayed later.

    (cyst-85140959
      (is-a (value cyst))
      (body-part (value popliteal-artery))
      (procedure (value angiography x-ray-computed-tomography ultrasonography))
      (symptom (value intermittent-claudication))
      (contained-in (value document-85140959)))

The interaction under the control of the Process-2 program guides the indexers through the network of entities and relationships in the knowledge base for purposes of describing the subject content of the document. Knowledge entities, encoded as instantiation frames, are systematically brought to the indexers’ attention. The system actively suggests associated relations, which are the slots of the frames. Indexers enter values in response, which in turn cause the system to display additional instantiation frames, and so on.

A second way to encode procedural knowledge would be as procedural attachments. A procedural attachment is a function in a slot facet used for controlling the interface. In the current design of the Indexing Aid System, the only facet that contains procedural attachments is the restrictions facet. If the restrictions are not met by the value that is entered by the indexer, the value will not be added to the slot, and the system will display an error message. Indexers may enter a command to display the restrictions for a slot during the indexer-computer interaction. The other facets do not contain procedural attachments but contain functions that are used only for automatically generating MeSH keyword indices in accordance with the rules of current MEDLINE indexing. We are currently investigating designing the system to make it more “frame driven,” that is, to use procedural attachments in the if-needed facets of slots to control the interface from the knowledge base rather than from a separate program.
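The behavior of a restrictions facet during data entry might be sketched as follows. This is a minimal Python illustration, not the system's Lisp code; the set BODY_PARTS, the function names, and the wording of the error message are all assumptions.

```python
# Hypothetical stand-in for the body-part tree of the knowledge base.
BODY_PARTS = {"artery", "popliteal-artery", "prostate"}


def body_part_restriction(filler):
    """Allow any known body part, or None for an empty slot."""
    return filler is None or filler in BODY_PARTS


def enter_value(frame, slot, filler, restriction):
    """Add filler to the slot only if the restriction accepts it."""
    if not restriction(filler):
        # The value is rejected and never reaches the slot.
        return f"error: {filler!r} is not allowed in slot {slot!r}"
    frame.setdefault(slot, []).append(filler)
    return "ok"


frame = {}
print(enter_value(frame, "body-part", "popliteal-artery", body_part_restriction))
print(enter_value(frame, "body-part", "angiography", body_part_restriction))
print(frame)
```

Keeping the restriction as a function stored with the slot, rather than in the interface program, is what would let the same check be displayed to the indexer on request.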

Another feature of frame-based systems is inheritance through the relations, in particular the is-a relation. Inheritance means that when frames are related to each other through a relation defined by the system as exhibiting inheritance, the slots of the parent frame, together with their contents, serve as defaults in the child frames. In fact, inheritance is exhibited in the MEDLINE indexing system: The basis of allowability of heading/subheading combinations is the hierarchical relation of the MeSH Tree Structures; that is, a subheading may form a heading/subheading combination according to whether the heading is a member of a MeSH category or subcategory (see Appendix A for background information).

After the script showing the indexer-computer interaction, we will illustrate procedural attachments, facet rules, and inheritance through the knowledge-base hierarchy, using selected portions of a set of knowledge-base frames.

Some of the top-level frames in the knowledge-base hierarchy of the Indexing Aid System may ultimately correspond to the major divisions of two well-known separately developed thesauri, MeSH and SNOMed: anatomical structures, substances, procedures, and biological processes. The correspondences may be displayed as follows:

    Major Divisions         MeSH                                SNOMed

    Anatomical Structures   Category A: Anatomy                 The Topography Field

    Procedures              Category E: Analytical,             The Procedure Field
                            Diagnostic and Therapeutic
                            Techniques

    Biological Processes    Category G: Biological Sciences,    The Function Field, the
                            and Category C: Diseases            Morphology Field, and the
                                                                Disease Field

    Substances              Category D: Chemicals and Drugs     The Etiology Field: Sections
                                                                5 & 6 (Chemicals, Chemical
                                                                Elements, Chemical Products),
                                                                and Sections 7 & 8 (Drugs
                                                                and Biologicals)

C. Script of Indexer-Computer Interaction

We will now display a script of an indexer-computer interaction for indexing the following document (MeSH headings provided for comparison with the system’s output):

    AU  Wilbur AC; Woelfel GF; Meyer JP; Flanigan DP
    AU  Spigos DG
    TI  Adventitial cystic disease of the popliteal artery.
    AB  Adventitial cystic disease of the popliteal artery is an unusual condition of uncertain etiology in which a mucin-containing cyst forms in the wall of the popliteal artery and causes symptoms of intermittent claudication, typically in young adults whose arteries are otherwise normal. Arteriography characteristically shows a smooth-walled, curvilinear narrowing. In the case described, a combination of findings from arteriography, computed tomography, and ultrasound resulted in a highly specific preoperative diagnosis.
    UI  85140959
    MH  Adult; Angiography; Case Report;
    MH  Cysts/COMPLICATIONS/DIAGNOSIS/*RADIOGRAPHY;
    MH  Human; Intermittent Claudication/ETIOLOGY; Male;
    MH  Popliteal Artery/*RADIOGRAPHY;
    MH  Tomography, X-Ray Computed; Ultrasonic Diagnosis
    SO  Radiology 1985 Apr; 155(1): 63-4

In the interest of brevity, the following script shows completion of only the disease-type frame, and does not reflect the window package, which is used in the actual system. After this frame is completed, the system would, in turn, display and prompt for completion of other instantiation frames corresponding to the slot values of previously completed instantiation frames. The final result would be a set of linked knowledge-base instantiation frames, each linked to the document instantiation frame of the document being indexed. The system will also build the MeSH indices, which will comprise the value of the indices slot of the document instantiation frame. The parts of the script are enumerated for clarity. The script is annotated with comments in brackets. The -> is the computer prompt for a command.

(1) [The askindexer routine, whereby the indexer enters MeSH terms as prompted by the program. Computer prompts for slot values are in boldface.]

    -> (askindexer)
    disease: cysts
    body-part:
      value: Popliteal Artery
    procedure:
      value: Angiography + Tomography X-Ray Computed + Ultrasonic Diagnosis
    symptom:
      value: Intermittent Claudication

(2) [Display of computer-generated cyst-85140959 frame resulting from askindexer interaction. Shows substitution of MeSH terms Cysts, Tomography X-Ray Computed, and Ultrasonic Diagnosis with preferred terms of knowledge-base (KB) frames cyst, x-ray-computed-tomography, and ultrasonography, respectively.]

    -> (ppf cyst-85140959)
    (setq frame:
      '(cyst-85140959
         (is-a (value cyst))
         (body-part (value popliteal-artery))
         (procedure
           (value angiography x-ray-computed-tomography ultrasonography))
         (symptom (value intermittent-claudication))
         (contained-in (value document-85140959))))

(3) [Display of knowledge-base cyst frame. Shows pointer to instantiated cyst frame cyst-85140959 as value of instances slot. KB cyst frame inherits restrictions and rules from the KB disease frame through is-a link (cyst is-a neoplasm is-a disease). (Portion of the KB disease frame will be shown later in this paper.)]

    -> (ppf cyst)
    (setq frame:
      '(cyst
         (is-a (value neoplasm))
         (instances (value cyst-85140959))))

(4) [Display of portion of computer-generated document-85140959 frame resulting from askindexer interaction. The lists of MeSH terms in the indices slot indicate that MeSH subheadings (diag, compl, radiogr, and etiol) have been assigned completely automatically by rules in the KB disease frame.]

    -> (ppf document-85140959)
    (setq frame:
      '(document-85140959
         (indices
           (value (Cysts diag compl radiogr)
                  (Popliteal-Artery radiogr)
                  (Angiography)
                  (Tomography-X-Ray-Computed)
                  (Ultrasonic-Diagnosis)
                  (Intermittent-Claudication etiol)))
         (is-a (value document))
         (contents (value cyst-85140959))))

D. Inheritance, Restrictions, and If-Added Demons

In order to illustrate inheritance, procedural attachments, and firing of demons (facet rules), we will display portions of knowledge-base frames for disease, neoplasm, prostatic-disease, and prostatic-neoplasm and focus on only two slots, is-a and body-part, and on the value, restrictions, and if-added facets. When a document-specific instantiation frame is created, this frame assumes the slots, values, procedural attachments, and demons of its parent knowledge-base frame. This form of inheritance is due to the is-a relation linking the instantiation frame to its parent frame as illustrated in the preceding script by (cyst-85140959 (is-a (value cyst)) . . . ).

    (disease
      (body-part
        (restrictions
          (member !filler
                  (append1 (get-tree-with-root 'body-part 'instances) nil)))
        (if-added
          (cond (!filler
                 (add-term (get-mesh-term !filler) 'document !document-number))))))

    (neoplasm
      (is-a (value disease)))

    (prostatic-disease
      (is-a (value disease))
      (body-part (value prostate)))

    (prostatic-neoplasm
      (is-a (value prostatic-disease neoplasm)))

Note the is-a hierarchies, namely, prostatic-neoplasm is-a prostatic-disease is-a disease and prostatic-neoplasm is-a neoplasm is-a disease.

The restrictions facet defines the allowable values for the slot. In this example, the restrictions are that the filler of the body-part slot in the disease frame be either a member of a body-part tree or nil:

    (restrictions
      (member !filler
              (append1 (get-tree-with-root 'body-part 'instances) nil)))

Restrictions are inherited through the is-a link. In other words, the restrictions apply automatically to the neoplasm frame by virtue of this hierarchy even though the body-part slot is not explicit in the neoplasm frame. It would be possible to override the restrictions by explicitly repeating the body-part slot in the neoplasm frame with different restrictions. Where there are always-true values for a knowledge-base entity, they appear as slot values in the frame. For instance, is-a relationships shown here are always true. Furthermore, the body-part value for prostatic-disease is always prostate. Although the body-part slot for prostatic-neoplasm is absent, the slot and its value are inherited through the is-a link to the prostatic-disease frame. In other words, the prostatic-neoplasm frame always behaves as if its body-part value were prostate by virtue of this hierarchy.
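The inheritance of a slot value through more than one is-a parent can be sketched in Python as follows. The search order used here (local value first, then parents depth-first in the order listed) is an assumption for illustration; the text does not state how FrameKit orders the search, and the function name inherited_value is hypothetical.

```python
# Toy knowledge base mirroring the four frames shown above.
KB = {
    "disease": {"is-a": [], "slots": {}},
    "neoplasm": {"is-a": ["disease"], "slots": {}},
    "prostatic-disease": {
        "is-a": ["disease"],
        "slots": {"body-part": ["prostate"]},
    },
    "prostatic-neoplasm": {
        "is-a": ["prostatic-disease", "neoplasm"],
        "slots": {},
    },
}


def inherited_value(kb, frame, slot):
    """Return the local value of a slot, else the first value found in a parent.

    Assumed order: parents are searched depth-first in their listed order.
    """
    local = kb[frame]["slots"].get(slot)
    if local:
        return local
    for parent in kb[frame]["is-a"]:
        found = inherited_value(kb, parent, slot)
        if found:
            return found
    return None


print(inherited_value(KB, "prostatic-neoplasm", "body-part"))
print(inherited_value(KB, "neoplasm", "body-part"))
```

Note that the lookup never copies prostate into the prostatic-neoplasm frame; the value is only found on demand, which matches the observation that inherited values are not physically added to the slot.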

The if-added facet embodies the rules for adding MeSH headings to the indices list, which is the value of the indices slot of the document instantiation frame, and for automatically assigning MeSH subheadings to the indices list, as illustrated in the script displayed earlier. The if-added shown is inherited from the disease frame down through the prostatic-neoplasm frame. In addition, it includes a condition that there be a non-nil slot value:

(cond (!filler . . . ))

The slot value is added to the MeSH list by calling the function add-term:

    (add-term (get-mesh-term !filler) 'document !document-number)

If the indexer entered an official MeSH heading equivalent such as an entry term (see reference) or entry version (abbreviation or other alternate form) instead of the MeSH heading, the heading would still be added to the indices list because of the get-mesh-term function which accesses a wordlist program containing the entry vocabulary and MeSH heading equivalents. MeSH subheadings are automatically appended to appropriate headings in the indices list by another if-added rule, the function insert-subhead (not shown in this example).
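The if-added rule and its two helper functions might be sketched in Python as follows. The names mirror add-term and get-mesh-term, but the bodies are assumptions: the wordlist entries are a small hypothetical sample, the fallback of returning an unknown term unchanged is a simplification, and skipping duplicate headings is a choice made only for this example.

```python
# Hypothetical wordlist: frame names and entry terms mapped to MeSH headings.
WORDLIST = {
    "popliteal-artery": "Popliteal Artery",
    "ultrasonography": "Ultrasonic Diagnosis",
    "implant radiotherapy": "Brachytherapy",
}


def get_mesh_term(filler):
    """Return the MeSH heading for an entered term (assumed: unknown terms pass through)."""
    return WORDLIST.get(filler.lower(), filler)


def add_term(heading, documents, document_number):
    """Append a heading to the indices list of the given document."""
    indices = documents.setdefault(document_number, [])
    if heading not in indices:  # assumed: duplicates are not repeated
        indices.append(heading)


def body_part_if_added(filler, documents, document_number):
    """Counterpart of (cond (!filler (add-term ...))): act only on a non-nil filler."""
    if filler:
        add_term(get_mesh_term(filler), documents, document_number)


documents = {}
body_part_if_added("popliteal-artery", documents, "85140959")
body_part_if_added(None, documents, "85140959")
print(documents)
```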

Normally the system does not fire an if-added rule unless a value has been physically added to the slot. However, we have modified the code to fire the rule when a slot value is available some other way, either inherited or retrieved by a function in the if-needed facet of the slot. When a value is inherited or retrieved, it is not physically added to the slot. For instance, here we would want the add-term function of the body-part slot to be performed when a prostatic-neoplasm instantiation frame has been completed; that is, we want the MeSH heading for prostate to be added to the indices list. However, the value for this slot will not have been added by the indexer, but rather inherited from the knowledge-base prostatic-disease frame through the knowledge-base prostatic-neoplasm frame, as explained previously.
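One way to picture this modified firing behavior is the following Python sketch, in which a demon fires for a value whether it was entered by the indexer or is merely available by inheritance. The function complete_frame and the representation of inherited defaults as a plain dictionary are assumptions; the actual modification to the Lisp code is not shown in this paper.

```python
def complete_frame(instance, inherited_defaults, demons):
    """Fire each slot's if-added demon for entered values, else for inherited ones.

    Inherited values are used for firing only; they are not stored in the instance.
    """
    fired = []
    for slot, demon in demons.items():
        values = instance.get(slot) or inherited_defaults.get(slot) or []
        for value in values:
            fired.append(demon(value))
    return fired


# Hypothetical prostatic-neoplasm instantiation frame with no body-part entered.
instance = {"is-a": ["prostatic-neoplasm"]}
inherited_defaults = {"body-part": ["prostate"]}
demons = {"body-part": lambda value: ("add-term", value)}

print(complete_frame(instance, inherited_defaults, demons))
print("body-part" in instance)
```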


E. Components of the System

We will now describe the software, inputs, and outputs of the Indexing Aid System.

Software that Exists Independently of the System.

    FrameKit, a knowledge representation language for building knowledge bases as frame data structures.

    University of Maryland Window Package, providing a character-based buffered approach to window displays. Possible operations with this software: creation and initialization, scrolling, overlays, text display and deletion, and cursor control.

Software Specific to the System.

    The Knowledge Base, consisting of frames representing medical subjects. These frames have been defined and entered using FrameKit.

    The Wordlist, containing MeSH terms and frame name equivalents.

    Process 1 - Journal Assignment. Controls the assignment of Journal Source Files, consisting of MEDLINE bibliographic citations with abstracts, to indexers. The prototype system allows the Project Officer to allocate an individual Journal Source File to multiple indexers in order to conduct experiments associated with the project. Assignments may be made, deleted, or merely displayed using this utility. When a Journal Source File is assigned for the first time, it is evaluated according to a character string search as to its acceptability for the available Knowledge Base.

    Process 2 - Indexer Interface (the extension package to FrameKit mentioned earlier). Provides users (indexers) a copy of instantiated document frames, corresponding to citations in Journal Source Files that have been assigned according to Process 1; uses a fill-in form to prompt for values of slots in subject frames which have been instantiated from the Knowledge Base; and prompts indexers to fill in additional frames that have been instantiated.

    Process 3 - MeSH Indices Report Generator. Produces the report needed by the Project Officer to evaluate the MeSH indices generated by rules in the Knowledge Base.

System Inputs.

    Journal Source File, consisting of MEDLINE citations with abstracts.

    Expert MeSH Indices File, consisting of expert MeSH indices corresponding to the Journal Source File.

System Outputs.

    Indexer Frame File, consisting, for each indexer, of a database file of instantiated frames for each Journal Source File indexed.

    Citations Accepted Report, consisting of citations from the Journal Source File that were accepted when assignments were made during Process 1.

    Citations Rejected Report, consisting of citations from the Journal Source File that were rejected when assignments were made during Process 1.

    MeSH Indices Report, consisting of, for each document assigned, the MEDLINE citation with abstract, the expert MeSH indices, and the respective indexers’ MeSH indices.

IV. Research Plan

We have designed and implemented a pilot version of the Indexing Aid System. Activities for the near future are to analyze the design of this pilot version and make recommended changes, develop and expand the knowledge base, and continue designing the computer programs to implement the new rules. The fully developed knowledge representation scheme should contain rules that are generalizable enough to produce reasonably correct and complete MeSH indices for documents in a test database. When we have established the validity of the system against an expert standard, we will test indexers’ application of the system.

V. Future Research

We will now enumerate areas of future research that may be engendered by the Indexing Aid Project.

1. Enhancement of the Indexer-Computer Interface. Since the Indexing Aid System is designed to interact with indexers, programs might be developed to facilitate filling in the slots, make changes to previously completed instantiation frames, and simultaneously display sets of related frames. Techniques to be considered in the interface design include interaction style (menu selection, form fill-in, command language, direct manipulation) [12], error messages, screen considerations, help and explanatory messages, and alternate versions for novice and experienced indexers.

2. The Cognitive Process of Indexing. Indexing involves making numerous decisions during the process. Although manuals and training provide some basic guidelines, the approach an indexer takes in going from one topic to the next and becoming cognizant of relationships expressed in the document may vary from one indexer to another. The proposed design of the Indexing Aid System will limit the freedom indexers now have as they navigate a document. Perhaps the system can be modified to accommodate some individual differences, or it can be partly user driven rather than prompting the indexer for all inputs. Feedback from indexers as they use the system might stimulate future research in this area.

3. Retrieval from Representation Frames. Frames produced by the Indexing Aid System will express relationships explicitly, thereby providing opportunity for precise retrieval in response to queries that express relationships. Retrieval algorithms and programs will be developed which take advantage of this potential for precise retrieval. Emphasis should be placed upon the efficiency and practicality of computer algorithms for matching document representations with query representations.

4. Computerized Natural-Language Understanding. This particular area pertains to the ultimate objective of the Automated Classification and Retrieval Program: to automatically classify documents by having the machine read the text. With regard to the Indexing Aid Project, one might wish to investigate whether the frames representing documents generated by the Indexing Aid System can interface with a natural-language understanding program.


Fillmore’s case grammar [13] has been proposed and explored as the linguistic basis for analysis of textual material of databases and queries in information-retrieval systems [14-16]. Montgomery outlined a cooperative research venture between linguists and information scientists for developing a metalanguage based on Fillmore’s case frame formalism which would entail elaboration of the metalanguage by a natural-language thesaurus she defined as an extension of a document-classification schema. She suggested this metalanguage as the basis for an automated indexing application. Lewis investigated linguistic analysis for studying relevance of documents to queries using Fillmore’s case grammar as the theoretical basis for indicating functional relations between keyterms in query statements and in abstracts from various databases in the DIALOG Information Services retrieval system. Fillmore’s case frames in conjunction with a knowledge representation scheme for describing the subject content of MEDLINE titles and abstracts in the domain of rheumatic disorders have been investigated this past year by Carbonell et al. in the NLM-CMU Contract Program.

Sager and Kosaka analyzed linguistically lipoprotein texts from the periodical literature in order to determine the basic relations in the material and to provide a formal representation of the sentences that carry one or more of these relations [17]. They employed Sager’s method of computerized information formatting of text in a domain of discourse based on the concept of sublanguage [18], in this instance establishing a lipoprotein metabolism sublanguage including rules of sentence well-formedness that distinguish what is sayable in lipoprotein texts from that in other English sentences.

5. Searcher-Computer Interface. This would be highly dependent upon the retrieval strategies developed in connection with item 3. Similar techniques to those described in item 1 are applicable here as well.

6. Automatic Text Generation. The frame representations of medical documents produced by the Indexing Aid System might serve as a basis for computer-generated indicative abstracts of the documents.

7. Automatic Updating of Thesaurus or Knowledge Base (Machine Learning). One can imagine how the Indexing Aid System might facilitate automatically updating MeSH. For instance, if a document discussed the radionuclide iridium-192, the indexer-computer interaction would result in MeSH indices containing the headings RADIOISOTOPES and IRIDIUM since there is no heading “iridium radioisotopes.” However, there exist in MeSH numerous other specific radioisotope headings (CALCIUM RADIOISOTOPES, PHOSPHORUS RADIOISOTOPES, etc.). The system might note this situation during the processing of this document and by analogy suggest that IRIDIUM RADIOISOTOPES be added to MeSH.

Furthermore, failed restrictions might serve to update the knowledge base. For instance, if the specific value for a substance slot failed the restrictions for that slot, the failure may signify that the restrictions should be broadened by adding to the restrictions facet the class to which that substance belongs.

8. Indexing Rules. We think it likely that the Indexing Aid experiment would suggest possible extension or revision of some of the current expert indexing rules and guidelines, such as expansion of rules for indexing location in the body mentioned earlier in the example of pelvic lymph node excision. Research might be performed that would investigate the rules in question, and suggestions regarding such rules might be solicited from indexers using the system.

Appendix A: NLM Indexing and Thesaurus

Indexing at NLM is described by Bachrach and Charen [19]. Indexing is the process of assigning to documents the MeSH terms, maintained by NLM’s MeSH Section, which best describe their content and substance. A seven-step method is outlined for indexers to follow involving a combination of reading and scanning the document. Interestingly, the abstract is not scanned until nearly the end of the process (followed only by scanning author-supplied keywords) for the purpose of discovering items missed in the text; their existence in the text must, however, be confirmed in order for them to be considered.

NLM uses a system of coordinate indexing; that is, concepts in the text are expressed by the combination of two or more terms. Coordination is achieved by two or more headings assigned to a document, heading/subheading combinations, and precoordinate headings.

For instance, LIVER GLYCOGEN (coordinating LIVER + GLYCOGEN) is a precoordinate heading. PANCREAS/radiography is a heading/subheading combination. If LIVER GLYCOGEN were not in MeSH, and a document were indexed by the coordination of LIVER + GLYCOGEN, one could only infer that “glycogen in the liver” is a subject in the document. Similarly, “radiography of the pancreas” could only be implied from the coordination PANCREAS + RADIOGRAPHY assigned to a document if the heading/subheading combination were not available.

In addition to coordinate indexing, two other general principles should be mentioned: specificity and multiplicity. Indexers are committed to the greatest specificity. For instance, a document specifically on leukocytes merits the heading LEUKOCYTES, and it is wrong to assign a general heading like BLOOD CELLS instead. Indexers are also committed to providing for each document as many headings as necessary to index the article adequately from all its multiple aspects.

The Indexing Aid research will require intimate knowledge about how to index. Published sources for this knowledge are the MEDLARS Indexing Manual, indexing annotations displayed with individual terms in MeSH Annotated Alphabetic List, Technical Notes, and Technical Notes Supplements.

The Manual outlines general indexing policy. Annotations in the Alphabetic MeSH reflect this policy, but are specific to the individual MeSH terms with which they appear. The amplified indexing instructions of Technical Notes are of intermediate specificity. These tools are essentially the work of Thelma Charen of NLM’s Index Section. Technical Notes Supplements, published in specialized subject areas, are the work of indexing personnel with expertise in various fields.

An absolutely vital resource for knowledge of indexing is provided by expert indexers at NLM. We will interact with these experts as we develop and test our Indexing Aid System.

A thesaurus is defined by [20] as a “compilation of words and phrases showing synonymous, hierarchical, and other relationships and dependencies, the function of which is to provide a standardized vocabulary for information storage and retrieval.”

The relationships in thesauri may be prescriptive or suggestive [21]. The MeSH see reference is a prescriptive indicator. For instance, the reference IMPLANT RADIOTHERAPY see BRACHYTHERAPY prescribes that documents on implant radiotherapy be indexed to the term BRACHYTHERAPY. If unmechanized, a system would require indexers to use the referred-to term in a see reference. In NLM’s system, this prescription is automatic; that is, the system substitutes the referred-to form when the referred-from term is used by indexers. In this case, the indexer may enter IMPLANT RADIOTHERAPY, and the system will substitute BRACHYTHERAPY.

The MeSH see related reference is a suggestive indicator, referring indexers to terms in addition to the one under consideration. The relations are not explicit but are subsumed under this general indicator. Sample see related references in Annotated Alphabetic MeSH are

ABNORMALITIES, DRUG INDUCED see related TERATOGENS

ACCIDENT PREVENTION see related PROTECTIVE DEVICES

AFTER CARE see related HALFWAY HOUSES

CATARACT see related LENSES, INTRAOCULAR

CELL COMMUNICATION see related INTERCELLULAR JUNCTIONS

DERMATOLOGY see related SKIN DISEASES

FEVER see related ANALGESICS, ANTI-INFLAMMATORY

SPORTS see related ATHLETIC INJURIES

MeSH tree numbers indicating hierarchical relationships are also used as suggestive indicators. IS-A relationships [22] are the primary hierarchical relationships in MeSH, for instance, the following hierarchy displayed in MeSH Tree Structures:

    PNEUMOCONIOSIS              C8.381.655
      ASBESTOSIS                C8.381.655.115
      BERYLLIOSIS               C8.381.655.205
      BYSSINOSIS                C8.381.655.287
      SIDEROSIS                 C8.381.655.773
      SILICOSIS                 C8.381.655.845
        ANTHRACOSILICOSIS       C8.381.655.845.201
        SILICOTUBERCULOSIS      C8.381.655.845.752
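Because each tree number extends that of its parent, a superordinate-subordinate test can be made by comparing number prefixes. The following Python sketch illustrates this under that assumption; the table is a small excerpt of the hierarchy displayed above, and the function name is_descendant is hypothetical.

```python
# Excerpt of the tree numbers listed above.
TREE = {
    "PNEUMOCONIOSIS": "C8.381.655",
    "ASBESTOSIS": "C8.381.655.115",
    "SILICOSIS": "C8.381.655.845",
    "ANTHRACOSILICOSIS": "C8.381.655.845.201",
}


def is_descendant(child, ancestor):
    """True if the child's tree number extends the ancestor's by further dotted parts."""
    return TREE[child].startswith(TREE[ancestor] + ".")


print(is_descendant("ANTHRACOSILICOSIS", "PNEUMOCONIOSIS"))
print(is_descendant("ASBESTOSIS", "SILICOSIS"))
```

Appending the dot before comparing keeps a heading from counting as its own descendant and avoids false matches between numbers that merely share leading digits.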

However, MeSH hierarchies include other types of relations as shown by the following superordinate-subordinate pairs incorporated in the MeSH Tree Structures:

    HAND - FINGERS
    JOINTS - SYNOVIAL MEMBRANE
    RESPIRATORY SYSTEM - LUNG
    ABORTION, HABITUAL - CERVIX INCOMPETENCE
    ARTHROPLASTY - JOINT PROSTHESIS
    IMMUNOGENETICS - ANTIBODY DIVERSITY
    NUTRITION - NUTRITIVE VALUE
    PHYSICS - OPTICS
    COMPUTERS - ARTIFICIAL INTELLIGENCE

The notion of what constitutes a hierarchical relationship appears to be a functional one. A term is subordinate to another term if the relationship provides a candidate for applying the specificity rule of indexing, that is, if it is generally not desirable for documents to be “double-indexed” under both if the subordinate term applies.

MeSH contains 76 subheadings naming frequently discussed aspects of subjects. A subheading, when attached to one of the 14,000 headings, forms a precoordinate unit known as a heading/subheading combination. The rules for forming combinations depend on the hierarchical categorization of MeSH. Headings are classified into categories (Category A = Anatomy; B = Organisms; C = Diseases . . . ); these may be further subdivided into subcategories (Subcategory A1 = Body Regions; A2 = Musculoskeletal System; A3 = Digestive System . . . ). The MeSH record for a subheading specifies headings with which it may form a combination according to membership of headings in categories and subcategories. For instance, the subheading “cytology” is allowed in combination with headings from only Categories A (Anatomy) or B (Organisms). LUNG/cytology is thereby acceptable, whereas LUNG DISEASES/cytology is not.
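The allowability rule can be sketched in Python as a lookup of the heading's category against the categories permitted for the subheading. The two small tables below are assumptions built only from the cytology example in the text; the real MeSH records also work at the subcategory level, which this sketch omits.

```python
# Hypothetical excerpt: heading -> MeSH category letter.
HEADING_CATEGORY = {"LUNG": "A", "LUNG DISEASES": "C"}

# Hypothetical excerpt: subheading -> categories whose headings it may join.
SUBHEADING_ALLOWED = {"cytology": {"A", "B"}}


def combination_allowed(heading, subheading):
    """True if the heading's category is permitted for the subheading."""
    return HEADING_CATEGORY[heading] in SUBHEADING_ALLOWED[subheading]


print(combination_allowed("LUNG", "cytology"))
print(combination_allowed("LUNG DISEASES", "cytology"))
```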

MeSH contains a few precoordinate headings which may be expressed as seemingly allowable heading/subheading combinations. For instance, HEART SURGERY is a precoordinate heading. Since “surgery” is an allowable subheading in combination with headings from the anatomy category, the heading/subheading combination HEART/surgery would appear to be allowed. However, the MeSH record for HEART points to the official form HEART SURGERY.

There is nothing in the definition of thesaurus, presented earlier, that prevents imagining a rather elaborate thesaurus which would form a semantic network [23] that explicitly names many relations and indicates specifically the words and phrases related to each other accordingly. However, the main limitation that prevents thesauri from being regarded as knowledge bases is that they typically contain only declarative knowledge (facts), whereas a knowledge base typically is not limited in this way. As a computerized thesaurus, MeSH provides very little procedural knowledge (active assistance in applying the facts) in support of the indexing process. Computerized help during indexing is confined to the following areas.

(1) Verification that the indexer’s entry is a MeSH heading, entry term (referred-from term of see reference), or entry version (official abbreviation or other alternate form of MeSH heading).

(2) Replacement of entry term or entry version by corresponding MeSH heading.

(3) Verification of allowable MeSH heading/subheading combination.

(4) Replacement of MeSH heading/subheading combination by corresponding precoordinate heading.

(5) Simple checking for assignment of check tags (for instance, verification that if an age-group heading, such as CHILD or ADULT, was used, the check tag HUMAN was assigned as well).
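The check-tag verification in item (5) amounts to a simple consistency test over the assigned headings. The Python sketch below covers only the age-group example given in the text; the function name missing_check_tags and the two-member set of age-group headings are assumptions made for illustration.

```python
# Age-group headings named in the text; the real list is longer.
AGE_GROUP_HEADINGS = {"CHILD", "ADULT"}


def missing_check_tags(headings):
    """Return check tags that should accompany the assigned headings but are absent."""
    assigned = {heading.upper() for heading in headings}
    if assigned & AGE_GROUP_HEADINGS and "HUMAN" not in assigned:
        return ["HUMAN"]
    return []


print(missing_check_tags(["Adult", "Angiography"]))
print(missing_check_tags(["Adult", "Angiography", "Human"]))
```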

In using MeSH, indexers usually consult hardcopy versions: MeSH Annotated Alphabetic List, MeSH Tree Structures, and Permuted MeSH. Interestingly, the indexing annotations cause the published thesaurus to operate more as a knowledge base than the computerized version, which is rarely, if ever, accessed for the annotations.

Appendix B: Frames and Related Notions

The following notions are related to one another: frames, scripts, schemas, and semantic networks. These notions have their origin in the fields of computational linguistics, cognitive psychology, and artificial intelligence, where they are now frequently discussed along with knowledge bases. In speaking of the basic idea of these similar notions as applied to research at NLM, we have used terminology that refers to the artificial-intelligence term “knowledge representation,” such as knowledge base, knowledge-based system, and knowledge representation scheme.

Fuller discusses scripts, frames, and schemas in relation to the representation and analysis of text [24]. In linguistics van Dijk defines a frame as a “unit of conventional knowledge, according to which mutual expectations and interactions are organized” and applies this notion to the analysis of text [25]. In cognitive psychology Schank and Abelson use the “script” formalism, defining a script as a “structure that describes appropriate sequences of events in a particular context,” and apply it to the use of restaurant scripts for understanding stories about visiting restaurants [26]. Minsky’s frames [27] are defined in terms of a network of nodes and relations. The “top levels” of a frame are fixed and represent things that are always true about the supposed situation. The lower levels have many terminals - “slots” that must be filled by specific instances or data. Collections of related frames are linked together in frame systems. Different frames of a system share the same terminals; this is the critical point that makes it possible to coordinate information gathered from different viewpoints.
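Minsky's description of fixed top levels, fillable terminals, and terminals shared across a frame system can be pictured with a small Python sketch. This is only one possible reading, not FrameKit and not Minsky's own formalism; the class names, the method names, and the example frames ("disease-view", "procedure-view") are hypothetical.

```python
class Frame:
    """One viewpoint: fixed top-level facts plus named terminals (slots)."""

    def __init__(self, name, fixed, terminals, shared):
        self.name = name
        self.fixed = dict(fixed)            # always true of the supposed situation
        self.terminals = tuple(terminals)   # slot names this frame may fill or read
        self._shared = shared               # slot storage owned by the frame system

    def fill(self, slot, value):
        if slot not in self.terminals:
            raise KeyError(f"{self.name} has no terminal {slot!r}")
        self._shared[slot] = value

    def get(self, slot):
        if slot in self.fixed:
            return self.fixed[slot]
        if slot not in self.terminals:
            raise KeyError(f"{self.name} has no terminal {slot!r}")
        return self._shared.get(slot)       # None until some frame fills it


class FrameSystem:
    """A collection of related frames that share the same terminal storage."""

    def __init__(self):
        self.shared = {}
        self.frames = {}

    def add_frame(self, name, fixed, terminals):
        frame = Frame(name, fixed, terminals, self.shared)
        self.frames[name] = frame
        return frame


# Hypothetical example: two viewpoints on the same document.
system = FrameSystem()
disease = system.add_frame("disease-view", {"kind": "disease report"}, ["body-site"])
procedure = system.add_frame(
    "procedure-view", {"kind": "procedure report"}, ["body-site", "technique"]
)

disease.fill("body-site", "liver")
print(procedure.get("body-site"))  # value entered from the other viewpoint
```

Because both frames write to the same storage, a value entered from one viewpoint is immediately available from the other, which is the coordination Minsky identifies as the critical point.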

Winograd discusses the notion of schema in association with the cognitive processes of language production and comprehension [28]. He defines schema as “description of a complex object, situation, process, or structure” containing a “body of related knowledge to be used in reasoning.” Graesser defines schemas as “generic knowledge structures that guide the comprehender’s interpretations, inferences, expectations, and attention when passages are comprehended” [29]. Although Graesser discusses these functions in relation to everyday sentences, fairy tales, and news stories, it is not hard to imagine their application as guiding the indexing process.

The notion of semantic networks was mentioned earlier in the context of expanding thesauri to encompass numerous relationships. Ritchie and Hanna present a general definition of this kind of structure and use it as a basis for a survey of semantic net systems [30]. It appears that a semantic network is not as easy to characterize as specific systems on which it is based, for at the end of their lengthy and far-reaching paper, the authors conclude: “It follows that to describe a system as ‘using a semantic network’ says very little about it.”
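To make the idea of a thesaurus that explicitly names its relations concrete, here is a minimal Python sketch of labeled relations between terms. It is not drawn from Ritchie and Hanna's definition or from any particular system; the class, the relation names, and the example terms are assumptions chosen for illustration, not actual MeSH structure.

```python
from collections import defaultdict


class SemanticNetwork:
    """Terms as nodes, explicitly named relations as labeled edges."""

    def __init__(self):
        self._edges = defaultdict(set)  # (source, relation) -> set of targets

    def relate(self, source, relation, target):
        self._edges[(source, relation)].add(target)

    def related(self, source, relation):
        """Terms directly linked to `source` by `relation`."""
        return sorted(self._edges.get((source, relation), ()))

    def ancestors(self, term, relation="is-a"):
        """All terms reachable from `term` by following `relation` repeatedly."""
        found, stack = set(), [term]
        while stack:
            for parent in self._edges.get((stack.pop(), relation), ()):
                if parent not in found:
                    found.add(parent)
                    stack.append(parent)
        return sorted(found)


# Hypothetical terms and relations.
net = SemanticNetwork()
net.relate("HEPATITIS", "is-a", "LIVER DISEASES")
net.relate("LIVER DISEASES", "is-a", "DIGESTIVE SYSTEM DISEASES")
net.relate("HEPATITIS", "has-site", "LIVER")

print(net.related("HEPATITIS", "has-site"))
print(net.ancestors("HEPATITIS"))
```

The sketch also illustrates the authors' caution: the data structure itself says little, and what distinguishes one system from another is which relations are named and what procedures operate over them.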

Acknowledgments

We would like to thank Mike Dorsey, Daniel Mendez, and Anil Kapoor of Online Computer Systems, Inc., for their work in adapting the University of Maryland Window Package as an interface for the Indexing Aid System.

References

1. White, L. “Minimizing variabilities in indexing.” Research Report No. NTIS PB-237-989. Tucson, AZ: Arizona University; November 1973.

2. Leonard, L. E. Inter-Indexer Consistency Studies, 1954-1975: A Review of the Literature and Summary of Study Results. Champaign, IL: University of Illinois Graduate School of Library Science; December 1977 (Occasional Papers, No. 131).

3. Funk, M. E.; Reid, C. A. “Indexing consistency in MEDLINE.” Bulletin of the Medical Library Association. 71(2):176-183; April 1983.

4. Cooper, W. S. “Is interindexer consistency a hobgoblin?” American Documentation. 20(3):268-78; July 1969.

5. Leonard, L. E. “Inter-indexer consistency and retrieval effectiveness: measurement of relationships.” Ph.D. thesis. Champaign, IL: University of Illinois; 1975.

6. Lancaster, F. W. Evaluation of the MEDLARS Demand Search Service. Bethesda, MD: National Library of Medicine; January 1968: 178-180.

7. Marcetich, J.; Schuyler, P. “The use of AID to promote indexing consistency at the National Library of Medicine.” Paper presented at the Eighty-first Annual Meeting of the Medical Library Association, Montreal, Quebec, Canada; June 1981.

8. Hooper, R. S. Indexer Consistency Tests: Origin, Measurement, Results, and Utilization. Bethesda, MD: IBM Corporation; 1965 (TR95-56).

9. Doszkocs, T. E. “An Associative Interactive Dictionary (AID) for online bibliographic searching.” In: Proceedings of the ASIS Annual Meeting, 1978, Volume 15, 41st Annual Meeting, New York, NY, November 13-17, 1978. White Plains, NY: Knowledge Industry Publications; 1978: 105-109.

10. Humphrey, S. M.; Melloni, B. J. Databases: A Primer for Retrieving Information by Computer. Englewood Cliffs, NJ: Prentice-Hall; 1986: 217-218.


11. Carbonell, J. G.; Evans, D. A.; Scott, D. S.; Thomason, R. H. Final Report on the Automated Classification and Retrieval Project: the MedSORT Project, Carnegie-Mellon University. Pittsburgh, PA: Departments of Philosophy and Computer Science, Carnegie-Mellon University; December 1985: 16-19, Appendix C.

12. Shneiderman, B. Designing the User Interface: Strategies for Effective Human-Computer Interaction. Reading, MA: Addison-Wesley; 1987: 83-223.

13. Fillmore, C. J. “The case for case.” In: E. Bach and R.T. Harms, Eds. Universals in Linguistic Theory. New York: Holt, Rinehart and Winston; 1968.

14. Montgomery, C. A. “Linguistics and information science.” Journal of the American Society for Information Science. 23(3):195-219; May-June 1972.

15. Lewis, D. E. “Case grammar and functional relations in aboutness recognition and relevance decision-making in the bibliographic retrieval environment.” Ph.D. thesis. London, Ontario, Canada: The University of Western Ontario; 1984.

16. Carbonell, J. G.; Evans, D. A.; Scott, D. S.; Thomason, R. H. Final Report on the Automated Classification and Retrieval Project: the MedSORT Project, Carnegie-Mellon University. Pittsburgh, PA: Departments of Philosophy and Computer Science, Carnegie-Mellon University; December 1985: 14-15, Appendix D.1.

17. Sager, N. and Kosaka, M. “A database of literature organized by relations.” In: Proceedings of the Seventh Annual Symposium on Computer Applications in Medical Care, Washington, DC, October 23-26, 1983. Silver Spring, MD: IEEE Computer Society Press; 1983: 692-695.

18. Sager, N. Natural Language Information Processing: A Computer Grammar of English and its Applications. Reading, MA: Addison-Wesley; 1981: 213-231.

19. Bachrach, C. A.; Charen, T. “Selection of MEDLINE contents, the development of its thesaurus, and the indexing process.” Medical Informatics. 3(3):237-254; September 1978.

20. American National Standards Institute (ANSI) Subcommittee 25 on Thesaurus Rules and Conventions. American National Standard Guidelines for Thesaurus Structure, Construction, and Use. New York: ANSI; 1980.

21. Slamecka, V. “Classificatory, alphabetical, and associative schedules as aids in coordinate indexing.” American Documentation. 14(3):223-228; July 1963.

22. Brachman, R. J. “What IS-A is and isn’t: An analysis of taxonomic links in semantic networks.” Computer. 16(10):30-36; October 1983.

23. Sowa, J. F. Conceptual Structures: Information Processing in Mind and Machine. Reading, MA: Addison-Wesley; 1984: 76-83.

24. Fuller, S. S. “Schema theory in the representation and analysis of text.” Ph.D. dissertation. Los Angeles, CA: University of Southern California; 1984.

25. van Dijk, T. A. Some Aspects of Text Grammars. The Hague, The Netherlands: Mouton; 1972: 17.

26. Schank, R. C.; Abelson, R. P. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. Hillsdale, NJ: Erlbaum; 1977: 41.

27. Minsky, M. “A framework for representing knowledge.” In: P. Winston, Ed. The Psychology of Computer Vision. New York: McGraw-Hill; 1975: 211.

28. Winograd, T. “A framework for understanding discourse.” In: M. A. Just and P. A. Carpenter, Eds. Cognitive Processes in Comprehension. Hillsdale, NJ: Erlbaum; 1977: 72.

29. Graesser, A. C. Prose Comprehension Beyond the Word. New York: Springer-Verlag; 1981: 29.

30. Ritchie, G. D.; Hanna, F. K. “Semantic networks - a general definition and a survey.” Information Technology. 2(4):187-231; October 1983.

196 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE-May 1987