the automatic building and expressions of complex concepts

Jennifer Norris

The automatic building and expression of complex concepts: the generation of novel compound nominals to express the 'aboutness' concepts of a text

ITRI-93-2

September 1993

Information Technology Research Institute Technical Report Series

ITRI, University of Brighton, Lewes Road, Brighton, East Sussex BN2 4AT

The automatic building and expression of complex concepts: the generation of novel compound nominals to express the 'aboutness' concepts of a text

Jennifer Norris

Rediffusion Simulation Research Centre, 58-64 Grand Parade, Brighton, BN2 2JY

and

Information Technology Research Institute, Lewes Road, Brighton, BN2 4AT

10 September, 1993

Abstract

The work in progress described here concerns the problem of generating compound nominals (such as

'electronic games industry growth', 'electronic games company advertising budgets') in an appropriate

context. Past approaches to the problems presented to linguists by compound nominals (CNs) have

had limited success. This report presents a new way of looking at CNs, with the emphasis on their

construction and expression from a piece of text. CNs are constructed by building

a network of heads and premodifiers, incorporating nominal forms of verbs and establishing links via

semantic relatedness. The linkage between constituent items of a CN constructed in this manner is

based on information contained within the dictionary definitions of terms which occur within the text.

This linkage is exploited here in the generation of novel CNs for use as search terms in the field of

information retrieval.

1 Introduction: hypothesis and motivation

There are two main strands to this work: the 'theoretical' and the 'practical'. It is the intention to pursue a specific

theoretical viewpoint within the context of a practical system, with the twofold aims of producing a workable

system which is useful in its own right, as well as providing a tool for the subsequent study of the constraints

which govern the use of compound nominal expressions within the context of abstracts.

1.1 Theoretical motivation

The theoretical thrust of this work concerns the following hypothesis, referred to below as the strong hypothesis:

most of the information necessary for the generation of appropriate compound nominals is present within

the dictionary definitions of their key composite terms. On the basis of commonalities within the

dictionary definitions of key terms occurring within the same piece of text, semantic links may be

assumed to hold between items.

There is a weak version of this hypothesis which states that:

some interesting compound nominals may be produced by linking key terms solely on the basis of the

semantic information existing within their dictionary definitions.

It is an aim of this project to investigate the extent to which the strong hypothesis holds, bearing in mind the

alternative weak form. In practical terms, the system under development takes as input an abstract and generates

as output a number of novel compound nominals which represent the 'aboutness concepts' (ie the main concepts

developed within the text, which reflect the essence of what the text is about). The ultimate research aim is to

use this system as a tool for investigating the appropriateness of the compound nominals generated by this

process, and for the subsequent specification of general pragmatic constraints on the production of appropriate

compound nominals.

Since this project grew out of the very general aim of tackling the problematic field of compound nominals,

it has become a relevant theoretical aim to investigate a new approach to the problems presented by compound

nominals to computational linguists.

Previous work has shown both the inadequacy of a syntactic approach and the need for extensive semantic

and pragmatic knowledge (eg Downing 1977). In addition, there are problems associated with the semantic

representations used, which commonly centre around features and frames. Their shortcomings are briefly

discussed in section 4.

The essence of the new approach adopted here is that, at least from the perspective of the generation of

compound nominals, semantic information can remain implicit. Whilst ambiguities may arise at the surface

level, they are not necessarily problematic, because they are reflected at the conceptual level.

This is contrary to popular methods of representing semantic and pragmatic knowledge, which rely on the

explicit specification of all relationships that hold between items which are linked in a network. It may be that

this disposition for precision can be traced back to Schank's 'Conceptual Dependency' (Schank, 1972), which

requires that concepts differing in meaning have different (and therefore unambiguous) conceptual

representations. The fact that ambiguity can be tolerated within this implementation is not based on any

cognitive claims about the nature of conceptual representations in humans: in other words, I do not intend to

imply that human cognitive representations of complex concepts are, or are not, ambiguous. The method of

deliberately leaving information implicit (rather than explicitly specified) is thus not necessarily cognitively

motivated and is, I believe, novel within the field of computational linguistics.

1.2 The application area: practical motivation

Most of us are often frustrated by the amount of information that we do not have time to gather from the wealth

of articles in journals, newspapers and other sources now available.

If it is specific information that we need for a particular purpose, hours may be spent fruitlessly searching

through library books and databases for appropriate sources.

Databases which previously comprised bibliographic details only are now beginning to include abstracts,

with the result that users can save hours previously wasted on looking up articles with likely-sounding titles, but

inappropriate actual content.

Current procedures for accessing the information in databases leave a lot to be desired. They rely on the

Boolean combination of sets, each of which is formed by searching for a direct match with the search term given

by the user. Searching is not limited to individual words, as terms may include multi-word expressions, but the

form of such multi-term search expressions must match directly with the text.
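
The direct-match procedure described above can be illustrated with a minimal sketch. The second abstract, the function names and the queries are invented for illustration; they are not taken from any actual retrieval system.

```python
# Toy illustration of Boolean, direct-match retrieval.
# Abstract 2 and all queries are invented for illustration.
abstracts = {
    1: "The video games industry is growing fast and will dominate the toy market.",
    2: "Sales of electronic games rose sharply in 1992.",
}

def matching_set(term, docs):
    """Return the ids of documents whose text contains the term verbatim."""
    term = term.lower()
    return {doc_id for doc_id, text in docs.items() if term in text.lower()}

def boolean_and(terms, docs):
    """Intersect the direct-match sets formed for each search term."""
    sets = [matching_set(t, docs) for t in terms]
    return set.intersection(*sets) if sets else set()

# A multi-word term only matches text that contains it exactly as typed.
exact = matching_set("video games industry", abstracts)
# Neither abstract contains this precise phrase, so nothing is retrieved,
# although both abstracts are arguably relevant to the query.
missed = matching_set("electronic games industry", abstracts)
# Boolean combination of two single-word sets.
combined = boolean_and(["games", "market"], abstracts)
```

Here `exact` and `combined` each contain only abstract 1, while `missed` is empty: the query fails because no abstract happens to use the user's exact wording.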

In many cases the user has a highly specific concept about which they wish to access information, and it is

likely that many articles which are highly relevant to the user's query will fail to be accessed because they do not

contain a direct match for the precise query term, or combination of terms, used. Similarly, the user may have a

very general query and type in a correspondingly general search term, but fail to access articles which are

relevant but whose constituent terms are too specific. Articles (and their abstracts) whose coverage is very

specific may not necessarily contain the more general terminology to be matched against a user's query term.

The facility of 'key terms' may overcome some of these problems. These are terms, usually used for the

purpose of indexing, which abstractors (whether or not they are the author of the original article) are required to

provide alongside the abstract. They typically include one or two terms indicating the general subject area, and

may include a specific term or two which indicate the actual concepts being discussed, although these are often

restricted to terms already present in the text of the abstract.

From the point of view of a user with a specific query, it would be useful to have a facility which

automatically processes the texts of abstracts within a database, and produces novel 'aboutness' terms which

refer to the main concepts developed within them. The 'aboutness' of an article or abstract may be represented

by a collection of terms representing the concepts which are referred to, and which are developed throughout the

course of the text. Such terms are well expressed by compound nominal expressions which comprise the main

noun being referred to, preceded by a modifying phrase which may be very complex.

It is the aim of this project to automate the process of generating novel, appropriate compound nominals

from abstracts, in order to give a more accurate linguistic representation of concepts which the abstract, and

therefore the original article, are about. Such a facility would thus:

• enable specific 'aboutness' terms (generated by the system) to be matched against specific user-defined search

terms corresponding to the 'aboutness' concepts of an abstract;

• decrease the reliance on the 'direct match' techniques and Boolean operations on sets (the

Boolean Combination Problem) which currently prevail.

There is currently much interest in the abstracting process, particularly in regard to its automation, although

this interest is not new (eg Luhn, 1958; Baxendale, 1958; Rau et al, 1989; Paice, 1990; Gladwin et al, 1991). Other

work is concerned with specifying the discourse structure of abstracts (eg Liddy, 1991) or using discourse

modelling as the basis for the automation of summarising (eg Sparck Jones, 1993). The current work may be seen

as a continuation of the summarising / abstracting process, in that it takes an abstract as the starting point for the

production of more succinct aboutness expressions.

1.3 Outline of this report

Section 2 of this report discusses the linguistic phenomenon of compound nominals. It emphasises the

importance of a broad conceptual view, and gives examples of restricted definitions used by a variety of

authors. It discusses the particular subset of compound nominals dealt with in the implementation, giving the

reasons behind this restriction.

Section 3 of this report gives a general description of the system under development, with specific examples

of the input and desired output.

Section 4 contains a discussion of other approaches to compound nominals, a description of the general

approach taken, and a specification of the assumptions which underlie this approach.

In Section 5, the methodology is explained using a worked example for clarity. This section also discusses

the overall aim for which this tool is being developed: as an aid in the specification of constraints on the

generation of compound nominals.

The problems which are anticipated are discussed in section 6.

The report is summarised in section 7.

2 Compound nominals

2.1 The conceptual viewpoint

From the conceptual point of view, the approach adopted in this work relies on the assumption that information

about a specific entity can be incorporated into its conceptual representation, yielding increasingly larger

representations [1] which have individual conceptual status as 'entities' [2]. The linguistic corollary is that

information which modifies a particular noun, or nominal expression, can be incorporated into the nominal

expression to form a larger nominal expression. It has been argued (eg Halliday, 1988) that there are occasions

(such as arise in the language of Physical Science) in which nominalisation is actually necessary rather than an

option. There are different means of representing such nominal expressions linguistically: thus we can have the

'loose' nominal clause: 'growth of the industry relating to electronic games'; or a compacted version: 'electronic

games industry growth'. There has been much discussion recently (eg Cumming, 1991) regarding the definition

and classification of nominalizations. In regard to compound nominal expressions, however, the focus of interest

is on the degree of compaction of information: the greater the degree of compaction, the more compounded is

the nominal expression.

[1] The complex conceptual structures described here are similar to the 'Macrostructures' described by Van Dijk (1980).

[2] The notion of conceptual entities is similar to the stance taken by Langacker (1987, 1990).

2.2 Different definitions of the phenomenon

As a general phenomenon seen from the point of view of compact expression of conceptual entities, the term

'compound nominal' has enormous coverage. Some examples of the kinds of things we want a definition to cover

appear below:

chicken and egg situation

what you see is what you get approach

a 'you scratch my back and I'll scratch yours' attitude

it's 'speak now or forever hold your tongue' time again

fan belt drive motor

no quibble 30 day money back guarantee

hand bag, plastic bag, toilet bag, shoulder bag, shopping bag

Previous researchers working on specific types of compounds have described and defined particular

subsets of the more general phenomenon, for example:-

• Winograd (1972) refers to a main noun preceded by classifiers, which may often be other nouns;

• Downing (1977) considers "the simple concatenation of any two or more nouns functioning as a third

nominal".

• Sparck Jones (1983) restricts her study to strings of nouns, adopting the term 'compound nouns'.

• Levi (1978) states that a "complex nominal [is] a head noun preceded

by a modifier which is either another noun or a nominal adjective". This definition has subsequently been

adopted by other authors (eg Finin (1986)).

• Quirk et al (1985) make the more general point that "a compound [is] a lexical unit consisting of more

than one base and functioning both grammatically and semantically as a single word."

It is this unitary nature of compounds, referred to by Quirk, that this work aims to reflect.

What is required is a definition which is sufficiently broad to include the range of examples given above,

but which also eliminates those nominal expressions which consist merely of a head noun preceded by one or

more adjectives. In the absence of any existing adequate definition, we can offer the following suggestion:

A compound nominal (CN) is a complex linguistic nominal expression comprising a head

noun preceded by a modifying phrase of any (well-formed) syntactic description. A CN is

distinguished from a simple 'ADJ-NOUN' nominal phrase by the presence of some complexity,

which can exist within the premodifying phrase or in the relationship that holds between the

premodifier/s and head.

There is a problem with this definition in regard to the notion of complexity. The topics of predication and

compositionality (eg van Deemter, 1991; Hintikka, 1980; Katz, 1981; Partee, 1984; Montague, 1970; Sager, 1990)

are relevant here and have featured in previous treatments of compound nouns (eg Levi 1978 and Leonard 1984

on using the distinction between predicating and non-predicating adjectives as a means of distinguishing

'complex nominals'). These are, however, controversial areas of research in themselves (eg Lahav, 1989 on

compositionality; Beardon & Turner, 1993 on predication).

2.3 The precise area of coverage

Whilst it is important from the theoretical perspective to recognise the broader category included in the coverage

of the above definition, the practical side of this work only deals with a subset of CNs, in which nouns,

adjectives, adverbs and nominalised verbs are the only premodifiers. There are two main reasons for this

restriction:-

1) The domain of abstracting (particularly the genre of expository text) is sufficiently formal in writing style

as to render the use of the relatively informal, sententially premodified CNs inappropriate. This notion of

formality has not been investigated rigorously in the current work, but merely reflects the views of an

individual professional abstractor (Heather Downy, personal communication) who has kindly given her

opinions based on her own experience;

2) As discussed in section 4.2 below, an essential part of the hypothesis underlying this work is to rely as

far as possible on semantic and pragmatic information within dictionary definitions of terms occurring in a piece

of text. If it were an aim to encompass sententially premodified CNs within the practical implementation,

syntactic parsing would be required. This would be perfectly feasible, but would present problems of time and

resources if implemented within the course of this project. In addition, the inclusion of a strong syntactic element

at this point would be counter to the main thrust of the work. It is, of course, entirely feasible that future work

could extend the implementation to include this subset of CNs.

3 An overview of the system under development

3.1 Overall aim

The overall aim of this work is the implementation of a system which will take as input a section of naturally

produced text, and generate as output a number of novel compound nominals (ie that do not necessarily appear

in the original text) which give comprehensive expression to the main concepts developed within the course of

the text. The acceptability of the output expressions will be used to assess the degree to which the strong

hypothesis holds, and thus the degree to which the approach can be usefully pushed.

The system could subsequently be used in two areas: firstly, as a tool for studying and specifying the

constraints which govern the generation of compound nominals within a particular genre (such as expository

writing); secondly, it could be integrated within working bibliographic databases, as an actual aid to information

retrieval.

3.2 Domain

The domain selected for use is that of abstracts written by professional abstractors. If an abstract is seen as

comprising the essence of the article from which it is taken, then we may see the subsequent construction of

'aboutness' terms for abstracts as an elaborate indexing facility. The practical aim, in effect, is to produce the

ultimate in compaction of meaning into 'unitary' representations, and to express these linguistically as

compound nominals.

The particular database used here is the General Academic Index, which is part of a larger set produced by

the Information Access Company. Research into a variety of sources of abstracts has shown this source to be the

most appropriate in the following aspects:-

• the abstracts are produced by professional abstractors rather than authors of the original texts. This is

considered appropriate in the current work, since abstracts produced by authors very often comprise 'cut and

paste' sections from the original text, rather than a more systematic 'objectively' produced presentation of the

salient points. Whereas professional abstractors may be seen as beginning the compaction process, author-

abstractors often tend to merely 'chunk' sections of the text together, a process which is less related, if at all, to

that of compaction;

• the constraints applying to the abstracting done by employees of this particular company are specified

explicitly;

• the database covers articles from a variety of different genres, which will be advantageous should the

project develop further into a study on genre-related constraints;

• the abstracts already form part of a database which is used on-line in several institutions.

3.3 Using 'real-life' input to produce useful output

One of the aims of this work is to try to avoid the kinds of problems typically faced by computational linguists regarding

the precise specification of the content and representation of their input source. McDonald (1993) makes some pertinent

comments in this regard. As far as the practical implementation is concerned, the intention from the start has been to use

unadulterated input to produce useful output. The actual input is thus the text of an abstract, such as the example which

appears below:

An Example Input Abstract [3]:

The video games industry is growing fast and will dominate the toy market and become an

established part of home entertainment. The 1991 computer games market was worth 275 million

pounds sterling growing to 500 million in 1992, half the toy market. Hardware sales will rise from

261 to 635 million pounds sterling in 1994. Associated software sales are forecast at 645 million

pounds sterling in 1993. The compact disc market is worth 345 million pounds sterling. The main

competitors in the market are Sega and Nintendo. Nintendo will spend 15 million pounds sterling

on advertising over Oct-Dec 1992.

The professional abstractors who produce abstracts such as the above from original text are required to

specify indexing terms which reflect the subject matter of the article. These terms are constrained in that they

must refer to specified sub-indexing headings; they are generally fairly short, and they tend to be sequences of

words which actually occur in the abstract itself. Thus, from the point of view of providing additional search

terms, they are usually redundant. For the example abstract given above, the corresponding 'key terms'

provided by the abstractor are as follows:-

Key Term equivalents currently accompanying abstract:

Subjects: Video game industry; Market share

Companies: Nintendo Company Ltd.

From the point of view of a user searching a large database for relevant articles, the kind of output that

would be far more useful would be terms referring to larger, more complex concepts developed over the course

of the text. These may be said to represent the 'aboutness' of the abstract concerned.

The kinds of compound nominal terms which would be useful as output from the above example abstract

are as follows:-

• electronic games industry / market

• electronic games industry growth

• computer games toy market domination

• rising electronic games sales

• electronic games market competitors.

[3] This example, and the associated key terms listed, are taken from the General Academic Index database.

The original article appeared in The Observer of October 11th, 1992.

4 General approaches adopted and assumptions made in this approach

4.1 Other approaches

One of the biggest problems in Natural Language Generation (NLG) concerns the representation of semantic

knowledge: what is the knowledge that needs to be represented for the task under consideration, and what is the

best way of representing it?

Semantic / pragmatic approaches commonly adopt a frame-based perspective in which semantic

knowledge must be explicitly stated within what is normally a feature-based system. In such a system, a set of

(binary) features must be selected and ascribed associated feature values. There are two main (well recognised)

problems with adopting this approach:

• the selection, specification and values of features, with the associated problem of deciding on the

granularity of the representational level;

• the 'small world' problem, in which there is potentially so much semantic and pragmatic knowledge that

is relevant to the problem domain that it must be limited to a tiny microworld.

A further problem associated with frame-based systems in general is the underlying assumption that there

are a number of predefined categories (in this case of CNs) according to which a particular CN can be

categorised. For example, Gay & Croft (1990) have implemented a system which analyses what they term

'nominal compounds'. An assumption underlying their system is that there are categories of nominal

compounds (such as 'instrument', 'transitive event') according to which they can be classified. In this system each

concept in the knowledge base has roles associated with it (eg 'agent', 'object', 'instrument', 'location', 'time'), these

roles varying according to the category of a particular compound. The roles for each concept in the knowledge

base indicate the type of relationships in which that concept can occur, with each role having semantic

preferences that limit the categories of concepts that may occur in each relationship.

The underlying problem with this kind of approach is its reliance on the basic assumption of there being a

set of specifiable categories, according to which all nominal compounds may be classified. A closed set of

relations (semantic roles) is prespecified for each type of category, and nominal compounds may be formulated

according to the specifications. It is this kind of prespecification of features, relations, roles and so on which is

inevitable in frame-based systems, and which the work described in this report aims to avoid.

4.2 The approach adopted here

One of the aims of the work presented here is to avoid the knowledge specification and representation problems

associated with the use of frames. The claim here is that there is an enormous amount of semantic and pragmatic

information implicitly present within the definitions of a normal dictionary of contemporary language (in this

case English), and that this can be exploited in the processing of an existing piece of text to generate a linguistic

representation of the salient concepts which are developed within the course of the text. An on-line dictionary

thus constitutes, in effect, a rich knowledge base which is task- and domain-independent.

The general approach adopted here is to use commonalities in the dictionary definitions of distinct terms to

build up a network which links key terms, their modifiers, and associated verbal information. The links acquire

weightings according to the salience of the linked items within the text, and the type of link involved. Section 4.3

elaborates on the notion of salience, and section 4.4 specifies the different kinds of links involved in the

construction of such networks.
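
The idea of linking terms through commonalities in their definitions can be sketched as follows. This is only an illustration under stated assumptions: the mini-dictionary is invented (it is not taken from any real dictionary), the stopword list is arbitrary, and the names `content_words` and `link_terms` are hypothetical. Link weighting is left out here.

```python
from itertools import combinations

# Invented definitions, for illustration only.
TOY_DICTIONARY = {
    "industry": "the companies involved in the production and trade of goods",
    "market": "the trade in goods of a particular kind",
    "sales": "the quantity of goods sold",
    "games": "an activity engaged in for amusement",
}

# Arbitrary list of function words ignored when comparing definitions.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "for", "to"}

def content_words(definition):
    """Return the set of non-stopword tokens in a definition."""
    return {w for w in definition.lower().split() if w not in STOPWORDS}

def link_terms(terms, dictionary):
    """Link every pair of terms whose definitions share a content word.

    The result maps each linked pair to the words that the two definitions
    have in common; the nature of the relation itself stays implicit.
    """
    links = {}
    for a, b in combinations(sorted(terms), 2):
        shared = content_words(dictionary[a]) & content_words(dictionary[b])
        if shared:
            links[(a, b)] = shared
    return links

links = link_terms(TOY_DICTIONARY.keys(), TOY_DICTIONARY)
```

With these toy definitions, 'industry' and 'market' are linked through 'trade' and 'goods', whereas 'games' shares no definitional vocabulary with the other terms and remains unlinked.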

It is expected that the strong version of the hypothesis will not be completely supported by the results

obtained from (ie the acceptability of the compound nominals generated by) the system. As mentioned briefly

above, the approach is to explore the extent to which the strong hypothesis holds, and use the expected

shortcomings of this hypothesis to formulate constraints which can be used to improve the system. It may be that

some of these constraints will be implementable using information within the dictionary listings, but additional

knowledge may be required.

4.3 Underlying assumptions

There are a number of basic assumptions that have been adopted in this work, and these are specified below:-

• for any piece of coherent text (such as an abstract) it is possible to construct a number of conceptual

entities of varying complexity to express the 'aboutness' (main content) of the text. These may be used

recursively to build up 'unitary' representations of increasingly more complex 'concepts';

• for any sentence / phrase which refers either explicitly or implicitly to some complex referent (entity), it is

possible to construct a compound nominal expression as its linguistic representation;

• compound nominals offer a linguistic means of compacting large amounts of information into

premodified nominal form, so that complex conceptual 'entities' may be expressed. Compaction may be

optional or necessary, and may be conceptually or editorially motivated, or related to a particular style (such as

expository text);

• there is a direct relationship between the salience of a term (in regard to the 'aboutness' of the overall text)

and its frequency of occurrence within a piece of text. To some extent the salience of an item is reflected in its

position in the text, particularly when it occurs in the first sentence (see eg Kieras, 1980 on the signalling of the

thematic content by items appearing in the first sentence). This is specifically true for abstracts occurring in the

database used here, since one of the specific constraints on the abstractors is to make the first sentence as

indicative as possible of the overall content of the article. The salience of a complex concept is assumed to be

directly related to the combined strengths of the linkages leading to it in the network.
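
The salience assumption in the last point above can be made concrete with a small sketch. It assumes that salience is simply the frequency of a word plus a fixed bonus for words occurring in the first sentence. The bonus value, the tokenisation and the function names are hypothetical, since no particular scoring scheme is specified here.

```python
import re
from collections import Counter

# Hypothetical bonus for words in the first sentence; no value is given in the report.
FIRST_SENTENCE_BONUS = 1

def words(sentence):
    """Return the lower-case alphabetic tokens of a sentence."""
    return re.findall(r"[a-z]+", sentence.lower())

def salience_scores(text):
    """Score each word by its frequency, plus a bonus if it occurs in the first sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    scores = Counter()
    if not sentences:
        return scores
    for sentence in sentences:
        scores.update(words(sentence))
    for word in set(words(sentences[0])):
        scores[word] += FIRST_SENTENCE_BONUS
    return scores

# Shortened sentences based on the example abstract used in this report.
text = (
    "The video games industry is growing fast. "
    "The computer games market was worth 275 million pounds. "
    "Hardware sales will rise."
)
scores = salience_scores(text)
```

In this sketch 'games' scores highest among the content words, since it occurs twice and appears in the first sentence. A fuller version would also exclude function words such as 'the'.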

4.4 Types of linkage between items in a network

As will be seen in section 5, which describes the methodology used in this implementation, terms are linked

into a network both at the pre-lookup stage (ie before any definitional lookup has occurred) and after up to two

levels of definitional lookup.

There are different types of linkage in a network, which reflect either different types or different strengths of

relationships.

4.4.1 Types of relationships

As far as compound nominals are concerned, there are two distinct types of relationships that can be described:

• Semantic relations.

These are relations of the type discussed by many authors, and typically associated with the work of, for

example, Lees (1960) and Levi (1978). They concern the relationship that holds between a head noun and its

premodifier/s and are typically things like 'cause' (eg 'flu virus'), 'part-of' (eg 'table leg') etc. It should be

emphasised that although this work specifically aims not to make these types of relations explicit (but rather

leave them as implicit), the fact that such relationships do hold between heads and premodifiers is not disputed

here. The area of dispute would be rather with suggestions that there is a closed class of such relations. Looking

at data from a variety of sources strongly suggests that this is not the case at all, and that the relationships that do

hold between items are highly context dependent, and do not form a closed set.

An example of this kind of relation in the example abstract considered in this report could be labelled

'dealing in', ie that holding between the terms 'computer games' and 'market', or 'toy' and 'market'. There is

clearly a further semantic relation holding between 'computer' and 'games' in the first example, which could be

labelled 'to be played on'.

• Synonymy or semi-synonymy relations of meaning.

The subsumption of the meaning of specific terms within more general terms (eg both 'computer' and 'video'

within the meaning of the more general 'electronic'), and the means of representing both direct property

inheritance and less distinct relations of meaning are well-established problems in the field of Artificial

Intelligence. From the point of view of the work reported here, we will want to recognise the more general terms,

such as the 'parent' term 'electronic' which subsumes 'video games', 'computer games' and 'compact disc'. Such

generic terms, which do not appear at all in the input text, will be required in order to generate generalised

aboutness terms. The approach taken in this work is to assume that the dictionary definitions of related specific

terms will each contain the more general term, and that the latter will thus be selected for inclusion in the

network on the basis of the matching between the common terms.

We are clearly dealing with degrees of synonymy, so that if we consider elements in a hierarchical

representation of meaning, two terms may be:-

• partially synonymous, where either a parent term may subsume more specific daughter terms (eg parent

'electronic' of 'computer', 'video', 'compact disc'), or an uncle / aunt may partially subsume more specific niece /

nephew terms;

• directly synonymous, where sister terms are judged as having the same meaning.

To some extent it is expected that the degree of synonymy will be reflected in the depth of the lookup level

required to give a match between definitions of terms.
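
The relationship between lookup level and the matching of definitions can be sketched as below. The sketch assumes that lookup proceeds level by level, replacing each word by the content words of its definition, and stops at the first level at which the two expansions share a word. The dictionary entries, the stopword list and the name `lookup_match` are invented for illustration.

```python
# Invented definitions, for illustration only.
TOY_DICTIONARY = {
    "computer": "an electronic machine for processing data",
    "video": "an electronic system for recording moving pictures",
    "machine": "an apparatus using mechanical power",
    "cartridge": "a case holding a program for a console",
    "console": "an electronic unit for playing games",
    "toy": "an object for a child to play with",
}

STOPWORDS = {"a", "an", "the", "for", "of", "to", "with"}

def definition_words(term, dictionary):
    """Content words of a term's definition (empty if the term is not listed)."""
    definition = dictionary.get(term, "")
    return {w for w in definition.lower().split() if w not in STOPWORDS}

def expand(terms, dictionary):
    """One level of definitional lookup applied to a set of terms."""
    expanded = set()
    for term in terms:
        expanded |= definition_words(term, dictionary)
    return expanded

def lookup_match(term_a, term_b, dictionary, max_level=2):
    """Return (level, shared words) for the shallowest match, or None."""
    if term_a == term_b:
        return 0, {term_a}  # direct word match, before any lookup
    seen_a, seen_b = {term_a}, {term_b}
    frontier_a, frontier_b = {term_a}, {term_b}
    for level in range(1, max_level + 1):
        frontier_a = expand(frontier_a, dictionary)
        frontier_b = expand(frontier_b, dictionary)
        seen_a |= frontier_a
        seen_b |= frontier_b
        common = seen_a & seen_b
        if common:
            return level, common
    return None

close = lookup_match("computer", "video", TOY_DICTIONARY)
distant = lookup_match("computer", "cartridge", TOY_DICTIONARY)
unrelated = lookup_match("computer", "toy", TOY_DICTIONARY)
```

With these toy entries, 'computer' and 'video' meet at 'electronic' after one level of lookup, 'computer' and 'cartridge' only after two, and 'computer' and 'toy' not at all within the two-level limit.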

4.4.1.1 Relationships within and between compound nominals

In regard to the current work, we should specify the kinds of relationships that hold both within and between

compound nominals. Consider two compound nominals of the form P1-H1 and P2-H2 respectively, where P1

and P2 are the premodifiers, and H1 and H2 are the heads of the respective CNs: it is clear that we are in fact

dealing with a mixture of the two types of linkage, which can be expressed as follows:-

• Relationship/s between a head noun and its premodifier/s: ie those holding between P1 and H1, or

between P2 and H2. These are the typical 'semantic relations' attributed to compound nouns, and remain implicit

in this approach;

• Meaning relationships of the synonymy type, holding between distinct heads (ie synonymy relations

between H1 and H2). Linkage is by direct word match or via matching of dictionary lookup terms;

• Meaning relationships of the synonymy type, holding between premodifiers of distinct heads (ie between

P1 and P2 of distinct CNs). The heads (H1 and H2) may, but need not, be related;

• Semantic relationships holding between the head of one compound nominal and the premodifier/s of

another (ie between H1 and P2 or H2 and P1), based on there being a synonymy type of meaning relation

established between the two heads (ie based on the establishment of H1 - H2 synonymy linkage).
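The three between-CN relationship types listed above can be sketched as a small classifier. This is an illustration under stated assumptions: the class `CompoundNominal`, the function names and the representation of synonymy as a set of word pairs are all hypothetical, and the within-CN (premodifier to head) relation is left implicit, as in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CompoundNominal:
    premodifiers: tuple  # e.g. ("computer", "games")
    head: str            # e.g. "market"

def related(a, b, synonyms):
    """True if two words match directly or are recorded as synonymous."""
    return a == b or frozenset((a, b)) in synonyms

def relationships_between(cn1, cn2, synonyms):
    """List the between-CN relationship types that hold for cn1 and cn2."""
    found = []
    heads_linked = related(cn1.head, cn2.head, synonyms)
    if heads_linked:
        found.append("head-head")
    if any(related(p1, p2, synonyms)
           for p1 in cn1.premodifiers for p2 in cn2.premodifiers):
        found.append("premodifier-premodifier")
    if heads_linked and (cn1.premodifiers or cn2.premodifiers):
        # Licensed only by the H1-H2 link: H1 may take P2, and H2 may take P1.
        found.append("head-premodifier")
    return found

a = CompoundNominal(("computer", "games"), "market")
b = CompoundNominal(("toy",), "market")
c = CompoundNominal(("video", "games"), "industry")
print(relationships_between(a, b, set()))
print(relationships_between(a, c, set()))
```

Here 'computer games market' and 'toy market' share a head, so the cross head-premodifier relation is licensed, whereas 'computer games market' and 'video games industry' are related only through the shared premodifier 'games'.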

4.4.2 Strength of relationships

The strength of the relationship that holds between two nodes in a network is reflected in the weighting value

associated with the link between them. The particular relative and actual values of weighting factors remain to

be specified, but the factors involved in the ascription of weights are discussed here. There are two principal

factors which affect the strength of links, these being the proximity between items in the original text (or the

definitional descendents of these items) which become linked, and the depth of definitional lookup that has been

involved in the assignment of the link. This latter factor will be referred to as the lookup level.


4.4.2.1 Proximity

It seems reasonable to assume that to some extent the distance (in the original text) between items which become

linked in some way in a network is relevant to the strength (weighting) ascribed to that link. Items which occur

in adjacent sentences are likely to be more strongly linked than those which occur in sentences which are

separated by one or more other sentences. It may be that related work, such as that done on collocations by

Smadja (1993), can be of some help in determining the likely strengths of relations between separated words.

Smadja's work refers to 'neighbourhoods' of five words in length: if words recurrently occur within 5 words of one

another then they can be designated as collocates. Whether or not this inter-word distance may be appropriate in

the generation of compound nominals in the current work remains to be seen.

Although this proximity assumption seems over-simplistic if applied to extended text, if we consider the claim in regard

to the type of short, concise expository text that we see in the abstracts used in this work, it does not seem too

unreasonable. The exception is where we find multiple occurrences of the same term within a fairly short piece

of text. In this case it seems reasonable to assume that the referent is either the same, or that the reference is to a

similar kind of entity, and thus that there is justification for combining the modifiers to form a larger referring

expression.
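Since the actual weighting values remain to be specified, the following is only a sketch of how the two ideas in this section might be expressed. The reciprocal weighting formula and both function names are assumptions made for illustration.

```python
def proximity_weight(sentence_a, sentence_b):
    """Hypothetical weighting: 1.0 within a sentence, falling as sentences separate."""
    return 1.0 / (1 + abs(sentence_a - sentence_b))

def pool_modifiers(occurrences):
    """Combine the modifiers of a repeated head.

    occurrences: list of (sentence_number, [modifiers]) pairs for one head.
    Modifiers are pooled in order of first appearance, without duplicates.
    """
    pooled = []
    for _, modifiers in sorted(occurrences):
        for m in modifiers:
            if m not in pooled:
                pooled.append(m)
    return pooled

# Occurrences of the head 'market' in the example abstract of section 5.2.
market = [(1, ["toy"]), (2, ["computer", "games"]), (2, ["toy"]), (5, ["compact_disc"])]
print(proximity_weight(1, 2), proximity_weight(1, 4))
print(pool_modifiers(market))
```

Any monotonically decreasing function of sentence distance would serve equally well here; the reciprocal form is chosen only for simplicity.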

4.4.2.2 Lookup Level

We need to be able to distinguish between terms which occur in the original abstract text and those which are

inserted into the network by virtue of their appearing in the dictionary definitions of the original terms. There

are also instances where the terms are from a secondary level of lookup. Linkages which are formed on the basis

of information obtained from successively 'deeper' levels of lookup must reflect the assumption that the

justification for forming a link decreases with successive lookup levels. Linkages must therefore become

successively weaker with increasing depth of dictionary lookup level.

For the purposes of distinguishing the origin of terms and links, the following terminology is used:-

• terms which occur in the original abstract are referred to as L0 terms (because zero lookup procedures

have occurred to obtain them);

• terms which come from the first level of dictionary lookup are referred to as L1 terms (since they have

been yielded by one lookup level);

• terms which come from the second level of lookup are referred to as L2 terms (since they have been

yielded by a second lookup level).

Links are labelled according to the labelling of the terms they join, so that a link between two terms

occurring in the original text of an abstract would be designated an [L0 -> L0] link, whereas one which linked a

term in the original text to one occurring in the definition of a definition would be designated an [L0 -> L2] link,

and so on.
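The labelling scheme, and the requirement that links weaken with lookup depth, can be sketched as follows. The decay factor of 0.5 is an arbitrary assumption (the report leaves the actual values unspecified), and the function names are hypothetical.

```python
def link_label(level_a, level_b):
    """Label a link by the lookup levels of the two terms it joins."""
    return f"[L{level_a} -> L{level_b}]"

def lookup_weight(level_a, level_b, decay=0.5):
    """Hypothetical weighting: each lookup step multiplies the strength by `decay`."""
    return decay ** (level_a + level_b)

for levels in [(0, 0), (0, 1), (1, 1), (0, 2), (2, 2)]:
    print(link_label(*levels), lookup_weight(*levels))
```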


5 Methodology

5.1 General methodology used in the implementation

In very general terms, the methodology used follows the steps outlined below. In effect, an on-line dictionary is

used as the main knowledge base. It is hoped that an on-line thesaurus will also be available to provide more

information on synonymy than typically appears in a standard dictionary. If it does become available, the

thesaurus will be consulted before each dictionary lookup procedure.

The system takes as input the text of an abstract, deletes all closed-class words, and builds up a network of

heads and premodifiers. There is an initial stage of linkage which occurs before any lookup procedures at all,

and it is at this stage that syntactic information is taken into consideration. In this stage the aim is to incorporate

all 'key' verbal information (ie from lexical verbs) into the relevant CNs which can initially be constructed from

the abstract. This is achieved by converting all lexical verbs into noun form, and then using syntactic

information (regarding the items which have been deleted), to compact strings of nouns and nominalised verbs

so as to construct additional compound nominals. Thus, in the example abstract which appears in section 5.2, the

system would initially form the compound nominal 'video games industry growth' from the actual input 'the

video games industry is growing ...'. The set thus formed is therefore an addition to the initial list of compound

nominals which appeared in the unprocessed abstract.

It should be noted that some of the deleted items are expected to act as 'blocks' to the formation of

compound nominals (ie across the boundaries indicated by the deletion tags), whereas others (such as

prepositions) seem to be universally deletable in the formation of CNs. Although some initial generalisations

about such 'enhancers' and 'blocks' to CN formation have been made, a more precise, widespread analysis

remains to be done.

After this initial pre-definitional lookup stage, the main linking stages begin, with items being linked by

virtue of either direct matching of terms or implied semantic or synonymy relationships (as discussed in section

4.4.1 above).

The links between items have associated weighting values, which are ascribed according to the factors of

proximity and lookup level, as discussed above in section 4.4.2. The linkage processes are repeated until all items

have been pursued for possible linkage, to the L2 level of lookup. Note that it is only if L1 linkage fails that L2

linkage is investigated.

Once all the potential linkages have been pursued, the resulting network, consisting of weighted links

between different heads (including novel terms) and their associated modifiers must be partitioned into a

number of discrete (possibly overlapping) representations of complex concepts. Although the methodology of

this stage is still to be determined, it is clear that the weighting values associated with 'clusters' will be taken as

reflecting the relative salience of the concepts.


It then remains to express each grouping in compound nominal form, as the main head preceded by an

ordered list of the modifiers. The factors which determine this ordering are complex, and remain to be specified

at this stage.

5.2 A more specific description of implementation methodology: a worked example

This section contains a more detailed account of the steps taken in the processing of an individual abstract. It

consists of a worked example, analysed by hand, for which the Collins English Dictionary (2nd edition, 1986)

was used. For the purposes of clarity this section is presented in note form, with the addition of some explanatory

sections of text where necessary. It should be noted that the methodology described here makes no reference

to an on-line thesaurus. If this facility is added, it would entail an additional lookup procedure (ie in

the thesaurus) to precede each dictionary lookup. Figure 1: 'establishing a new link', shows this ordering in the

form of a flow chart.
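One possible reading of the flow chart is sketched below: try a direct match, then the thesaurus, then a further level of dictionary lookup, stopping at L = 2. The function names, the dictionary-as-set-of-words representation and the cumulative treatment of reached terms are assumptions. The toy dictionary abbreviates (and stems by hand) the definitions quoted in the worked example, and the frequency filtering used there before L2 lookup is omitted.

```python
TOY_DICTIONARY = {
    "sterling": {"british", "money"},
    "sale": {"exchange", "goods", "property", "services", "money", "credit"},
    "market": {"event", "occasion", "people", "meet", "buy", "sell", "merchandise"},
    "merchandise": {"commercial", "goods", "commodities"},
    "sell": {"dispose", "transfer", "purchaser", "exchange", "money"},
}

def expand(terms, dictionary):
    """One lookup level: add the definition words of every term reached so far."""
    reached = set(terms)
    for t in terms:
        reached |= dictionary.get(t, set())
    return reached

def synonyms_of(terms, thesaurus):
    """Collect the thesaurus synonyms of a set of terms."""
    found = set()
    for t in terms:
        found |= thesaurus.get(t, set())
    return found

def establish_link(term_a, term_b, dictionary, thesaurus=None, max_level=2):
    """Return the lookup level (0, 1 or 2) at which a link is found, else None."""
    thesaurus = thesaurus or {}
    reached_a, reached_b = {term_a}, {term_b}
    level = 0
    while True:
        if reached_a & reached_b:                      # direct match
            return level
        if (synonyms_of(reached_a, thesaurus) & reached_b
                or synonyms_of(reached_b, thesaurus) & reached_a):
            return level                               # thesaurus match
        if level == max_level:                         # L = 2: give up
            return None
        reached_a = expand(reached_a, dictionary)      # dictionary lookup
        reached_b = expand(reached_b, dictionary)
        level += 1                                     # L + 1 -> L

print(establish_link("sterling", "sale", TOY_DICTIONARY))    # linked at L1 via 'money'
print(establish_link("market", "sale", TOY_DICTIONARY))      # linked only at L2
print(establish_link("sterling", "entertainment", TOY_DICTIONARY))
```

The level returned could then feed a weighting function, so that links found only at L2 are weaker than those found at L1 or by direct match.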

The example input abstract:

The video games industry is growing fast and will dominate the toy market and become an established part

of home entertainment. The 1991 computer games market was worth 275 million pounds sterling growing to 500

million in 1992, half the toy market. Hardware sales will rise from 261 to 635 million pounds sterling in 1994.

Associated software sales are forecast at 645 million pounds sterling in 1993. The compact disc market is worth

345 million pounds sterling. The main competitors in the market are Sega and Nintendo. Nintendo will spend 15

million pounds sterling on advertising over Oct-Dec 1992.


Figure 1: establishing a new link. [Flow chart not reproduced. Its boxes read: 'Check new potential link. Set L=0 *'; 'Do terms match directly?'; 'Establish link with weightings'; 'Check thesaurus for synonymy link. Direct match?'; 'L=2?'; 'Dictionary lookup, L+1 -> L'; 'Have all potential links been checked?'; 'Check next potential link'; 'END'.]

* L = Lookup Level.


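Figure 2: [Network diagram not reproduced. It shows the heads '£ sterling', 'market' and '*sales' with their premodifiers: '£ sterling' with 275m (S2), 500m (S2), 261 to 635m (S3), 645m (S4), 345m (S5) and 15m (S6); 'market' with 'toy', 'computer', 'games' and 'compact_disc' (sentence labels S1, S2, S2, S5, S6); '*sales' with 'hardware', 'software' and 'associated' (sentence labels S3, S4).]

* need to look up stem 'sale'

Sn : sentence number in which term occurs in abstract.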

5.2.1 Methodology

• Look up dictionary entry for each word in input abstract, and label it according to syntactic word class.


• Delete all closed class words, leaving appropriate tag. Thus, all articles, prepositions, determiners,

pronouns, conjunctions, and auxiliary verbs are deleted.

• Go through text, building up a list of all existing nominal groups. For the example abstract, this gives:

video games industry

toy market (*2)

home entertainment

1991 computer games market [footnote 4]

n million pounds sterling (*4) [footnote 4]

hardware sales

(associated) software sales

compact disc market

(main) competitors

In fact, the items listed above are all the premodified noun groups, rather than specifically compound

nominal expressions. The brackets indicate those items which are 'simply' nouns preceded by an adjective, and

thus those which are not considered as compound nominals in themselves. It is relevant to include these in the

first stages of processing, because they can potentially be used in the initial construction of compound nominal

expressions before any dictionary lookup occurs.
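The deletion and grouping steps described so far can be sketched as follows. The word-class labels are assigned by hand for one sentence of the example abstract (a real system would take them from the dictionary), and the tag format and function names are hypothetical.

```python
# Hand-labelled word classes for the first part of the example abstract.
TAGGED = [("the", "article"), ("video", "noun"), ("games", "noun"),
          ("industry", "noun"), ("is", "auxiliary"), ("growing", "verb"),
          ("fast", "adverb"), ("and", "conjunction"), ("will", "auxiliary"),
          ("dominate", "verb"), ("the", "article"), ("toy", "noun"),
          ("market", "noun")]

CLOSED = {"article", "preposition", "determiner", "pronoun", "conjunction", "auxiliary"}

def delete_closed_class(tagged):
    """Replace each closed-class word by a tag recording what was deleted."""
    return [(f"<{cls}>", "deleted") if cls in CLOSED else (word, cls)
            for word, cls in tagged]

def nominal_groups(tagged):
    """Collect runs of adjacent nouns/adjectives that end in a noun (length >= 2)."""
    groups, run = [], []
    for word, cls in tagged + [("", "end")]:   # sentinel flushes the final run
        if cls in ("noun", "adjective"):
            run.append((word, cls))
            continue
        while run and run[-1][1] != "noun":    # a group must end in its head noun
            run.pop()
        if len(run) >= 2:
            groups.append(" ".join(w for w, _ in run))
        run = []
    return groups

print(nominal_groups(delete_closed_class(TAGGED)))
```

Keeping the deletion tags in place, rather than dropping the words outright, preserves the information needed later to decide which deleted items block CN formation.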

• Translate verb forms into corresponding noun forms then repeat the last stage. For the example abstract,

this gives additional CNs:

video games industry growth

toy market domination

(home entertainment part establishment)

( 275m pounds sterling growth)

hardware sales rise

software sales forecast

(main) market competitors

Note the problem of specifying the correct ordering.

[Footnote 4] It is anticipated that cardinal numbers and dates will require special treatment: a) to facilitate their

recognition; b) to prevent their combination as multiple premodifiers of the same head (eg 'the 1990 1991 1992

computer games market').


Figure 3: [Network diagram not reproduced. It shows the heads '£ sterling', 'market' and '*sales' with their premodifiers and sentence labels, together with the added intermediate terms GOODS, EXCHANGE, MONEY and SERVICES.]

Sn : sentence number in which term occurs in abstract.


Figure 4: first linkage. [Network diagram not reproduced. It shows the heads '£ sterling', 'market' and '*sales' with their premodifiers, and the intermediate terms GOODS, EXCHANGE, MONEY and SERVICES, with each link labelled by lookup level. The link labels read: L1; L2; L2 (via 'sell'); L2 (via 'm'dise'); L2->L0 (*2); L2->L2.]

• Using original abstract, make a frequency list for all head nouns and label them L0 (because zero

definitional lookups have occurred). This gives:

Head noun        Frequency
industry         1
market           5
part             1
entertainment    1
sterling         5
sales            2
competitors      1
advertising      1

• Identify L0 heads having frequency >1. (ie 'market', 'sterling', 'sales').

Represent these in an initial network, as shown in Figure 2.

• Look up the first entry for each term in the dictionary.

(Note: the assumption is that the first entry is more representative of the meaning than are subsequent entries.)

This gives:

'market': an event or occasion where people meet ... buying and selling merchandise.

'sterling': British money

'sales' (lookup 'sale'): exchange of goods, property or services for ... money or credit.

• Eliminate all closed class words from definitions and store remaining terms as L1 definitions for the

respective L0 terms.

• Go through L1 definitions, searching for linkage with L0 and then other L1 terms. Identify links and label

accordingly (eg 'money' in L1 definitions from both 'sterling' and 'sale' gives an [L1 -> L1] link between the L0

parents).

• For L0 heads which show no linkage (here only 'market'), look for L2 terms, as follows:

Go to the definition of the L0 head (ie 'market') and count the occurrences of all the open class words.

(NB count the words occurring for all entry numbers listed in the definition - in the case of 'market', there

are 6.) Make a list of all those words with frequency >1 : these are for further (L2) lookup.

Here we get L1 terms:

merchandise 3

sell (stem) 5

Definitions (first entry of each):

'merchandise' (L1 term): commercial goods, commodities; to engage in commercial purchase & sale of goods or services. (L2 terms).

'sell' (L1 term): to dispose of or transfer ... to a purchaser in exchange for money etc.; put or be on sale. (L2 terms).

• Store all these L2 lookup definitions in lists. Identify intermediate matches, ie those which enable linkage

between L0 heads to be made. Relevant intermediate links here are 'goods', 'exchange' and 'services'. Add these

intermediate terms to the network: see Figure 3.

• Analyse lists L0, L1 and L2 for presence of links, noting type of linkage: eg [L1 -> L1] (money -> money

as above); [L2 -> L1] etc.

• Express linkage in a network, with associated weightings.

Figure 4 shows the state of this network at this stage.

• Search for linkage between modifiers in the same way as that described above for linkage between heads,

incorporating data.

• Partition the network - details of the precise methodology involved in this stage remain to be clarified,

although clearly the weighting values of the links will be heavily involved.

• Order modifiers.

• Express 'clusters' in compound nominal form.
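The frequency and first-level linkage steps of the list above can be traced in a short sketch. The head frequencies and the L1 definition terms are those given in the worked example; the function name and the data layout are assumptions, and stemming and weighting are left out.

```python
from collections import Counter
from itertools import combinations

# Head-noun frequencies from the worked example (L0 terms).
HEAD_FREQUENCY = Counter({"industry": 1, "market": 5, "part": 1, "entertainment": 1,
                          "sterling": 5, "sales": 2, "competitors": 1, "advertising": 1})

# Open-class words of the first dictionary entry for each repeated head (L1 terms).
L1_DEFINITIONS = {
    "market": {"event", "occasion", "people", "meet", "buying", "selling", "merchandise"},
    "sterling": {"british", "money"},
    "sales": {"exchange", "goods", "property", "services", "money", "credit"},
}

def l1_links(heads, definitions):
    """Return {(head_a, head_b): shared L1 terms} for every pair sharing a term."""
    links = {}
    for a, b in combinations(heads, 2):
        shared = definitions[a] & definitions[b]
        if shared:
            links[(a, b)] = shared
    return links

repeated = [h for h, n in HEAD_FREQUENCY.items() if n > 1]
links = l1_links(repeated, L1_DEFINITIONS)
unlinked = [h for h in repeated if not any(h in pair for pair in links)]

print(repeated)   # heads with frequency > 1
print(links)      # [L1 -> L1] links between L0 heads
print(unlinked)   # heads that go forward to L2 lookup
```

As in the hand analysis, 'sterling' and 'sales' are linked through 'money' at L1, while 'market' shows no linkage and is the only head passed on to the L2 stage.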

5.3 Assessment of output expressions

Given that the overall aim is the implementation of a system which can be used in the production of appropriate

compound nominals, there must be some assessment of the output compound nominal expressions. It is

expected that this evaluation will occur in two stages: an initial stage in which the output expressions are judged

informally, in a crude sense, for 'understandability'; and a second, final stage involving the appraisal of the

degree to which the output expressions are judged to be truly representative of the main aboutness of the

abstract. It is expected that the former stage will not require expert opinion, but that the latter stage will need to

be judged by professional abstractors.

There are a number of possible formats that the final evaluation could take, such as:

• simply asking willing abstractors to give acceptability judgements on the output expressions from

particular input abstracts;

• asking professional abstractors to generate a number of compound nominals to express the main

aboutness of particular abstracts;

• presenting professional abstractors with a multiple choice questionnaire which gives alternative possible

aboutness terms.


The initial evaluation, which will not require professional judgement, is likely to be based on intuitive

judgements of the output expressions. The precise form of the final evaluation selected will depend partly on the

range of output expressions actually generated by the system, and the degree to which the strong hypothesis

appears to be supported, as judged by the initial assessment.

6 Problems envisaged

• Lexical ambiguity.

The problem of one lexical item having multiple meanings (eg 'bank' as in 'river bank' versus 'high street

bank') is not a new one. It is expected that such ambiguity will lead to the formation of unjustified linkages.

However, it may be that, in cases where such polysemy exists, the dictionary definitions of words occurring in

the same sentence as each other could be used as an indication of which meaning should be assumed.
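A minimal sketch of this idea is to choose the sense whose definition shares most words with the definitions of the other words in the sentence. The sense inventories below are invented for illustration, and the function name is hypothetical; whether simple overlap counting is adequate remains to be tested.

```python
def choose_sense(senses, context_definitions):
    """Pick the sense whose definition overlaps most with the context definitions.

    senses: dict mapping a sense name to the set of words in its definition.
    context_definitions: list of word sets, one per co-occurring word.
    """
    context = set().union(*context_definitions)
    return max(senses, key=lambda name: len(senses[name] & context))

# Toy definitions, invented for illustration only.
BANK_SENSES = {
    "river bank": {"sloping", "land", "beside", "river"},
    "high street bank": {"institution", "money", "deposit", "loan"},
}
context = [{"british", "money"}, {"money", "lent", "interest"}]
print(choose_sense(BANK_SENSES, context))
```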

• Ambiguity of word class.

This is an implementation problem, the effect of which is not likely to be as serious as might at first be

envisaged. If we consider the fact that, after deletion, we are left with nouns, verbs, adjectives and adverbs, then

the main ambiguities of word class after the deletions are going to be between noun and verb status. However,

since all lexical verbs undergo conversion into their noun forms, it seems reasonable to state that where cases of

noun / verb ambiguity arise, the default assignment of word class should be that of noun.

• Nominalisation of verb forms.

The ideal in this regard would be to use a dictionary which lists the noun forms of verbs. If this were the

case, then the problem of nominalising the verb form would be restricted to selecting the appropriate one in

those cases where multiple noun forms exist. Since it is likely, however, that the dictionary to be used does not

list the noun forms, it has been necessary to specify a method for the conversion. The following method has been

adopted:- for a given verb, identify its stem and search for an item which is listed as a noun, which has the

same stem as the verb. In cases where there is no success, assume that the noun form comprises the verb stem

plus the suffix 'ing'. The main problem with this crude method is the fact that the verb form and noun form

often do not share a common stem, resulting in the selection of unrelated noun forms. It may be that this

problem proves prohibitively large, in which case the default method (stem + 'ing') would need to be adopted

universally. Although this method has its shortcomings, a better one remains to be found.
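The conversion method can be sketched as follows, assuming that the verb stem has already been identified and that 'having the same stem' is approximated by prefix matching against the listed nouns. Both simplifications, the tie-breaking rule (shortest noun first) and the function name are assumptions.

```python
def nominalise(verb_stem, noun_entries):
    """Return a listed noun that begins with the verb stem, else stem + 'ing'."""
    candidates = [n for n in noun_entries if n.startswith(verb_stem)]
    if candidates:
        # Where several noun forms exist, prefer the shortest (an arbitrary choice).
        return min(candidates, key=lambda n: (len(n), n))
    return verb_stem + "ing"

NOUNS = {"growth", "domination", "rise", "forecast", "market"}
for stem in ["grow", "dominat", "ris", "forecast", "spend"]:
    print(stem, "->", nominalise(stem, NOUNS))
```

The crudeness noted above shows up directly: with this noun list the stem 'mark' would be matched to the unrelated noun 'market'.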

• Synonymy.

This problem arises largely because of the lack of objectivity involved in the particular terms employed by

different lexicographers in dictionary definitions. Different synonyms may be employed by different

lexicographers, with the result that some linkages which should be formed are not in fact formed, even at the L2

lookup stage. The clearest solution to this problem would be to include an on-line thesaurus, to be consulted at

the pre-dictionary lookup stage, as discussed above.


• Ordering of sequences of modifiers.

Once the 'clusters' within a network have been identified, the premodifiers must be ordered so as to

produce meaningful, well-formed compound nominal expressions. The identification and specification of the

constraints which govern this ordering is another stage of the research that remains to be done. It may be that

the initial lists of compound nominals (labelled according to the sentence in which they occurred) are helpful in

this regard.

7 Summary

This report describes the work in progress which is scheduled for completion at the end of July 1994. It describes

the general aims of the project, which are: the implementation of a system which generates novel compound

nominal expressions from free text; and the use of that system as a tool in studying the constraints governing the

acceptability of such expressions.

The motivation behind the system can be seen from both practical and theoretical viewpoints. From the

practical point of view, the general motivation is that of improved access to relevant database material. From

the theoretical standpoint, the incentive is to test the hypothesis that the semantic information implicit in the

dictionary lookup terms of individual words is sufficient to produce meaningful compound nominal expressions

representing the aboutness of the text.

The current work also aims to produce a system that takes 'real life' input (abstracts) and generates useful

novel output expressions (representing the aboutness of the abstract) in a variety of domains, rather than

requiring the detailed specification of a highly specific microworld. The disadvantages of this latter requirement

are discussed by McDonald (1993).

Bibliography

Baxendale, P.B. (1958) Machine-Made Index for Technical Literature - An Experiment. IBM Journal of

Research and Development, 2 (4), pp. 354-361.

Beardon, C. & Turner, K. (1993) An analysis of the problems involved in understanding complex nominals.

Research Report (RSRC-93001), RSRC, University of Brighton.

Cumming, S. (1991) 'Nominalization in English and the organization of grammars'. In IJCAI 1991, pp 42-51.

van Deemter, K. (1991) On the Compositionality of Meaning: Four variations on the Theme of

Compositionality in Natural Language Processing. PhD Thesis. University of Amsterdam.

van Dijk, T.A. (1980) Macrostructures. Lawrence Erlbaum Associates, New Jersey.

Downing, P. (1977) On the creation and use of English Compound Nouns.

Language, 53, pp 810 - 842.


Endres-Niggemeyer, B. (1990) A Procedural Model of an Abstractor at work. International Forum of

Information and Documentation (IFID), 15 (4).

Finin, T.W. (1986) The Semantic Interpretation of Compound Nominals. PhD Thesis, University of Illinois,

Urbana, Illinois.

Gay, L.S. & Croft, W.B. (1990) Interpreting Nominal Compounds for Information Retrieval. Information

Processing and Management, 26 (1), pp 21-38.

Gladwin, P., Pulman, S. & Sparck Jones, K. (1991) Shallow Processing and Automatic Summarising: a First

Study. Technical Report No. 223. Computer Laboratory, University of Cambridge, Cambridge.

Halliday, M.A.K. (1988) 'On the language of physical science'. In Ghadessy, M. (ed) (1988) Registers of

Written English: situational factors and linguistic features. London, Pinter. pp 162-178.

Hintikka, J. (1980) 'Theories of Truth and Learnable Languages' In S. Kanger & S. Ohman (eds.) Philosophy

and Grammar, D. Reidel. Dordrecht, pp 37-57.

Kieras, D.E. (1980) Initial Mention as a Signal to Thematic Content in Technical Passages. Memory and

Cognition, 8, 345-353.

Katz, J.J. (1981) Language and Other Abstract Objects. Rowman & Littlefield. Totowa, New Jersey.

Lahav, R. (1989) 'Against Compositionality: the case of adjectives' . Philosophical Studies, 57 (3) pp 261-

279.

Langacker, R.W. (1987) Foundations of Cognitive Grammar, Volume 1: Theoretical Prerequisites. Stanford

University Press. Stanford, Ca.

Langacker, R.W. (1990) Concept, Image and Symbol: The Cognitive Basis of

Grammar. Mouton de Gruyter. New York.

Lees, R.B. (1960) The Grammar of English Nominalizations. Indiana University, Bloomington, IN.

Leonard, R. (1984) The Interpretation of English noun sequences on the computer. North-Holland.

Amsterdam.

Levi, J. (1978) The Syntax and Semantics of Complex Nominals.

Academic Press. New York.

Liddy, E.D. (1991) The Discourse-Level Structure of Empirical Abstract: an Exploratory Study. Information

Processing and Management, 27 (1), pp. 55-81.

Luhn, H.P. (1958) The Automatic Creation of Literature Abstracts. IBM Journal of Research and

Development, 2 (2), pp. 159-165.

McDonald, D.D. (1993) Issues in the Choice of a Source for Natural Language Generation. Association for

Computational Linguistics, 19 (1), pp 191-197.

Montague, R. (1970) Universal Grammar. Theoria, 36 , pp.373-398.

Paice, C.D. (1990) Constructing Literature Abstracts by Computer: Techniques and Prospects. Information

Processing and Management, 26 (1), pp. 171-186.

Partee, B.H. (1984) 'Compositionality' In F. Landman & F. Veltman (eds.), Varieties in Formal Semantics,

Foris, Dordrecht, pp 281-311.

Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985) A Comprehensive Grammar of the English

Language. Longman. London.


Rau, L.F., Jacobs, P.S. & Zernik, U. (1989) Information Extraction and Text Summarization using Linguistic

Knowledge Acquisition. Information Processing and Management, 25 (4), pp. 419-428.

Sager, J.C. (1990). A Practical Course in Terminology Processing. John Benjamins Pub. Co., Amsterdam.

Schank, R.C. (1972) Conceptual Dependency: A Theory of Natural Language Understanding. Cognitive

Psychology, 3 (4), pp 552-631.

Smadja, F. (1993) Retrieving Collocations from Text: Xtract. Association for Computational Linguistics, 19

(1), pp 143-176.

Sparck Jones, K. (1983) Compound Noun Interpretation Problems. Technical Report No. 45, Computer

Laboratory, University of Cambridge, Cambridge.

Sparck Jones, K. (1983a) 'So What about Parsing Compound Nouns?' In Sparck Jones, K. & Wilks, Y.

Automatic natural language parsing. Ellis Horwood. Chichester. pp 164-168.

Sparck Jones, K. & Tait, J. (1984) Automatic Search Term Variant Generation. Journal of Documentation, 40,

pp 50-66.

Sparck Jones, K. (1993) Discourse Modelling for Automatic Summarising. Technical Report No. 290,

Computer Laboratory, University of Cambridge, Cambridge.

Winograd, T. (1972) Understanding Natural Language. Academic Press. New York.
