Jennifer Norris
The automatic building and expression of complex concepts: the generation of novel compound nominals to express the 'aboutness' concepts of a text
ITRI-93-2
September 1993
Information Technology Research Institute Technical Report Series
ITRI, University of Brighton, Lewes Road, Brighton, East Sussex BN2 4AT
Rediffusion Simulation Research Centre
58-64 Grand Parade, Brighton, BN2 2JY
and
Information Technology Research Institute
Lewes Road, Brighton, BN2 4AT
10 September, 1993
Abstract
The work in progress described here concerns the problem of generating compound nominals (such as
'electronic games industry growth', 'electronic games company advertising budgets') in an appropriate
context. Past approaches to the problems presented to linguists by compound nominals (CNs) have
had limited success. This report presents a new way of looking at CNs, with the emphasis on their
construction and expression from a piece of text. The means of construction is via the construction of
a network of heads and premodifiers, incorporating nominal forms of verbs and establishing links via
semantic relatedness. The linkage between constituent items of a CN constructed in this manner is
based on information contained within the dictionary definitions of terms which occur within the text.
This linkage is exploited here in the generation of novel CNs for use as search terms in the field of
information retrieval.
1 Introduction: hypothesis and motivation
There are two main strands to this work: the 'theoretical' and the 'practical'. It is the intention to pursue a specific
theoretical viewpoint within the context of a practical system, with the twofold aims of producing a workable
system which is useful in its own right, as well as providing a tool for the subsequent study of the constraints
which govern the use of compound nominal expressions within the context of abstracts.
1.1 Theoretical motivation
The theoretical thrust of this work concerns the following (strong) hypothesis:
most of the information necessary for the generation of appropriate compound nominals is present within
the dictionary definitions of their key composite terms. On the basis of commonalities within the
dictionary definitions of key terms occurring within the same piece of text, semantic links may be
assumed to hold between items.
There is a weak version of this hypothesis which states that:
some interesting compound nominals may be produced by linking key terms solely on the basis of the
semantic information existing within their dictionary definitions.
It is an aim of this project to investigate the extent to which the strong hypothesis holds, bearing in mind the
alternative weak form. In practical terms, the system under development takes as input an abstract and generates
as output a number of novel compound nominals which represent the 'aboutness concepts' (ie the main concepts
developed within the text, which reflect the essence of what the text is about). The ultimate research aim is to
use this system as a tool for investigating the appropriateness of the compound nominals generated by this
process, and for the subsequent specification of general pragmatic constraints on the production of appropriate
compound nominals.
Since this project grew out of the very general aim of tackling the problematic field of compound nominals,
it has become a relevant theoretical aim to investigate a new approach to the problems presented by compound
nominals to computational linguists.
Previous work has shown both the inadequacy of a syntactic approach and the need for extensive semantic
and pragmatic knowledge (eg Downing 1977). In addition, there are problems associated with the semantic
representations used, which commonly centre around features and frames. Their shortcomings are briefly
discussed in section 4.
The essence of the new approach adopted here is that, at least from the perspective of the generation of
compound nominals, semantic information can remain implicit. Whilst ambiguities may arise at the surface
level, they are not necessarily problematic, because they are reflected at the conceptual level.
This is contrary to popular methods of representing semantic and pragmatic knowledge, which rely on the
explicit specification of all relationships that hold between items which are linked in a network. It may be that
this disposition for precision can be traced back to Schank's 'Conceptual Dependency' (Schank, 1972), which
requires that concepts differing in meaning have different (and therefore unambiguous) conceptual
representations. The fact that ambiguity can be tolerated within this implementation is not based on any
cognitive claims about the nature of conceptual representations in humans: in other words, I do not intend to
imply that human cognitive representations of complex concepts are, or are not, ambiguous. The method of
deliberately leaving information implicit (rather than explicitly specified) is thus not necessarily cognitively
motivated and is, I believe, novel within the field of computational linguistics.
1.2 The application area: practical motivation
Most of us are often frustrated by the amount of information that we do not have time to gather from the wealth
of articles in journals, newspapers and other sources now available.
If it is specific information that we need for a particular purpose, hours may be spent fruitlessly searching
through library books and databases for appropriate sources.
Databases which previously comprised bibliographic details only are now beginning to include abstracts,
with the result that users can save hours previously wasted on looking up articles with likely-sounding titles, but
inappropriate actual content.
Current procedures for accessing the information in databases leave a lot to be desired. They rely on the
Boolean combination of sets, each of which is formed by searching for a direct match with the search term given
by the user. Searching is not limited to individual words, as terms may include multi-word expressions, but the
form of such multi-term search expressions must match directly with the text.
In many cases the user has a highly specific concept about which they wish to access information, and it is
likely that many articles which are highly relevant to the user's query will fail to be accessed because they do not
contain a direct match for the precise query term, or combination of terms, used. Similarly, the user may have a
very general query and type in a correspondingly general search term, but fail to access articles which are
relevant but whose constituent terms are too specific. Articles (and their abstracts) whose coverage is very
specific may not necessarily contain the more general terminology to be matched against a user's query term.
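The limitation of direct-match searching described above can be illustrated with a minimal sketch. The toy abstracts, the function names direct_match and boolean_and, and the use of plain substring matching are all assumptions made for illustration; they do not describe any particular database system.

```python
# Toy illustration of Boolean retrieval over direct-match sets.
# The abstracts and query terms are invented for this example.

abstracts = {
    1: "the video games industry is growing fast",
    2: "growth of the industry relating to electronic games",
    3: "toy market sales are forecast to rise",
}

def direct_match(term, docs):
    """Return the set of document ids whose text contains `term` verbatim."""
    return {doc_id for doc_id, text in docs.items() if term in text}

def boolean_and(*sets):
    """Boolean AND: the intersection of several direct-match sets."""
    result = set(sets[0])
    for s in sets[1:]:
        result &= s
    return result

# A multi-word term must match the text exactly, so abstract 2 is missed
# even though it concerns the same concept.
phrase_hits = direct_match("electronic games industry", abstracts)

# A Boolean combination of single words does find abstract 2, but only
# if the user guesses which individual words to combine.
word_hits = boolean_and(
    direct_match("electronic", abstracts),
    direct_match("games", abstracts),
    direct_match("industry", abstracts),
)
```

In this sketch the phrase query retrieves nothing, while the word-by-word combination retrieves only the second abstract; the first abstract, which uses 'video' rather than 'electronic', is missed by both.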
The facility of 'key terms' may overcome some of these problems. These are terms, usually used for the
purpose of indexing, which abstractors (whether or not they are the author of the original article) are required to
provide alongside the abstract. They typically include one or two terms indicating the general subject area, and
may include a specific term or two which indicate the actual concepts being discussed, although these are often
restricted to terms already present in the text of the abstract.
From the point of view of a user with a specific query, it would be useful to have a facility which
automatically processes the texts of abstracts within a database, and produces novel 'aboutness' terms which
refer to the main concepts developed within them. The 'aboutness' of an article or abstract may be represented
by a collection of terms representing the concepts which are referred to, and which are developed throughout the
course of the text. Such terms are well expressed by compound nominal expressions which comprise the main
noun being referred to, preceded by a modifying phrase which may be very complex.
It is the aim of this project to automate the process of generating novel, appropriate compound nominals
from abstracts, in order to give a more accurate linguistic representation of concepts which the abstract, and
therefore the original article, are about. Such a facility would thus:
• enable specific 'aboutness' terms (generated by the system) to be matched against specific user-defined search
terms corresponding to the 'aboutness' concepts of an abstract;
• decrease the current reliance on 'direct match' techniques and Boolean operations on sets (the
Boolean Combination Problem).
There is currently much interest in the abstracting process, particularly in regard to its automation, although
this interest is not new (eg Luhn, 1958; Baxendale, 1958; Rau et al, 1989; Paice, 1990; Gladwin et al, 1991). Other
work is concerned with specifying the discourse structure of abstracts (eg Liddy, 1991) or using discourse
modelling as the basis for the automation of summarising (eg Sparck Jones, 1993). The current work may be seen
as a continuation of the summarising / abstracting process, in that it takes an abstract as the starting point for the
production of more succinct aboutness expressions.
1.3 Outline of this report
Section 2 of this report discusses the linguistic phenomenon of compound nominals. It emphasises the
importance of a broad conceptual view, and gives examples of restricted definitions used by a variety of
authors. It discusses the particular subset of compound nominals dealt with in the implementation, giving the
reasons behind this restriction.
Section 3 of this report gives a general description of the system under development, with specific examples
of the input and desired output.
Section 4 contains a discussion of other approaches to compound nominals, a description of the general
approach taken, and a specification of the assumptions which underlie this approach.
In Section 5, the methodology is explained using a worked example for clarity. This section also discusses
the overall aim for which this tool is being developed: as an aid in the specification of constraints on the
generation of compound nominals.
The problems which are anticipated are discussed in section 6.
The report is summarised in section 7.
2 Compound nominals
2.1 The conceptual viewpoint
From the conceptual point of view, the approach adopted in this work relies on the assumption that information
about a specific entity can be incorporated into its conceptual representation, yielding increasingly larger
representations [1] which have individual conceptual status as 'entities' [2]. The linguistic corollary is that
information which modifies a particular noun, or nominal expression, can be incorporated into the nominal
expression to form a larger nominal expression. It has been argued (eg Halliday, 1988) that there are occasions
(such as arise in the language of Physical Science) in which nominalisation is actually necessary rather than an
option. There are different means of representing such nominal expressions linguistically: thus we can have the
'loose' nominal clause: 'growth of the industry relating to electronic games'; or a compacted version: 'electronic
games industry growth'. There has been much discussion recently (eg Cumming, 1991) regarding the definition
and classification of nominalizations. In regard to compound nominal expressions, however, the focus of interest
is on the degree of compaction of information: the greater the degree of compaction, the more compounded is
the nominal expression.
[1] The complex conceptual structures described here are similar to the 'Macrostructures' described by Van
Dijk (1980).
[2] The notion of conceptual entities is similar to the stance taken by Langacker (1987, 1990).
2.2 Different definitions of the phenomenon
As a general phenomenon seen from the point of view of compact expression of conceptual entities, the term
'compound nominal' has enormous coverage. Some examples of the kinds of things we want a definition to cover
appear below:
chicken and egg situation
what you see is what you get approach
a 'you scratch my back and I'll scratch yours' attitude
it's 'speak now or forever hold your tongue' time again
fan belt drive motor
no quibble 30 day money back guarantee
hand bag
plastic bag
toilet bag
shoulder bag
shopping bag
Previous researchers working on specific types of compounds have described and defined particular
subsets of the more general phenomenon, for example:-
• Winograd (1972) refers to a main noun preceded by classifiers, which may often be other nouns;
• Downing (1977) considers "the simple concatenation of any two or more nouns functioning as a third
nominal".
• Sparck Jones (1983) restricts her study to strings of nouns, adopting the term 'compound nouns'.
• Levi (1978) states that a "complex nominal [is] a head noun preceded
by a modifier which is either another noun or a nominal adjective". This definition has subsequently been
adopted by other authors (eg Finin (1986)).
• Quirk et al (1985) make the more general point that "a compound [is] a lexical unit consisting of more
than one base and functioning both grammatically and semantically as a single word."
It is this unitary nature of compounds, referred to by Quirk, that this work aims to reflect.
What is required is a definition which is sufficiently broad to include the range of examples given above,
but which also eliminates those nominal expressions which consist merely of a head noun preceded by one or
more adjectives. In the absence of any existing adequate definition, we can offer the following suggestion:
A compound nominal (CN) is a complex linguistic nominal expression comprising a head
noun preceded by a modifying phrase of any (well-formed) syntactic description. A CN is
distinguished from a simple 'ADJ-NOUN' nominal phrase by the presence of some complexity,
which can exist within the premodifying phrase or in the relationship that holds between the
premodifier/s and head.
There is a problem with this definition in regard to the notion of complexity. The topics of predication and
compositionality (eg van Deemter, 1991; Hintikka, 1980; Katz, 1981; Partee, 1984; Montague, 1970; Sager, 1990)
are relevant here and have featured in previous treatments of compound nouns (eg Levi 1978 and Leonard 1984
on using the distinction between predicating and non-predicating adjectives as a means of distinguishing
'complex nominals'). These are, however, controversial areas of research in themselves (eg Lahav, 1989 on
compositionality; Beardon & Turner, 1993 on predication).
2.3 The precise area of coverage
Whilst it is important from the theoretical perspective to recognise the broader category included in the coverage
of the above definition, the practical side of this work only deals with a subset of CNs, in which nouns,
adjectives, adverbs and nominalised verbs are the only premodifiers. There are two main reasons for this
restriction:-
1) The domain of abstracting (particularly the genre of expository text) is sufficiently formal in writing style
as to render the use of the relatively informal, sententially-premodified CNs, inappropriate. This notion of
formality has not been the subject of any rigour in the current work, but merely reflects the views of an
individual professional abstractor (Heather Downy, personal communication) who has kindly given her
opinions based on her own experience;
2) As discussed in section 4.2 below, an essential part of the hypothesis underlying this work is to rely as
far as possible on semantic and pragmatic information within dictionary definitions of terms occurring in a piece
of text. If it were an aim to encompass sententially premodified CNs within the practical implementation,
syntactic parsing would be required. This would be perfectly feasible, but would present problems of time and
resources if implemented within the course of this project. In addition, the inclusion of a strong syntactic element
at this point would be counter to the main thrust of the work. It is, of course, entirely feasible that future work
could extend the implementation to include this subset of CNs.
3 An overview of the system under development
3.1 Overall aim
The overall aim of this work is the implementation of a system which will take as input a section of naturally
produced text, and generate as output a number of novel compound nominals (ie that do not necessarily appear
in the original text) which give comprehensive expression to the main concepts developed within the course of
the text. The acceptability of the output expressions will be used to assess the degree to which the strong
hypothesis holds, and thus the degree to which the approach can be usefully pushed.
The system could subsequently be used in two areas: firstly, as a tool for studying and specifying the
constraints which govern the generation of compound nominals within a particular genre (such as expository
writing); secondly, it could be integrated within working bibliographic databases, as an actual aid to information
retrieval.
3.2 Domain
The domain selected for use is that of abstracts written by professional abstractors. If an abstract is seen as
comprising the essence of the article from which it is taken, then we may see the subsequent construction of
'aboutness' terms for abstracts as an elaborate indexing facility. The practical aim, in effect, is to produce the
ultimate in compaction of meaning into 'unitary' representations, and to express these linguistically as
compound nominals.
The particular database used here is the General Academic Index, which is part of a larger set produced by
the Information Access Company. Research into a variety of sources of abstracts has shown this source to be the
most appropriate in the following aspects:-
• the abstracts are produced by professional abstractors rather than authors of the original texts. This is
considered appropriate in the current work, since abstracts produced by authors very often comprise 'cut and
paste' sections from the original text, rather than a more systematic 'objectively' produced presentation of the
salient points. Whereas professional abstractors may be seen as beginning the compaction process, author-
abstractors often tend to merely 'chunk' sections of the text together, a process which is less related, if at all, to
that of compaction;
• the constraints applying to the abstracting done by employees of this particular company are specified
explicitly;
• the database covers articles from a variety of different genres, which will be advantageous should the
project develop further into a study on genre-related constraints;
• the abstracts already form part of a database which is used on-line in several institutions.
3.3 Using 'real-life' input to produce useful output
One of the aims of this work is to try to avoid the kinds of problems typically faced by computational linguists regarding
the precise specification of the content and representation of their input source. McDonald (1993) makes some pertinent
comments in this regard. As far as the practical implementation is concerned, the intention from the start has been to use
unadulterated input to produce useful output. The actual input is thus the text of an abstract, such as the example which
appears below:
An Example Input Abstract [3]:
The video games industry is growing fast and will dominate the toy market and become an
established part of home entertainment. The 1991 computer games market was worth 275 million
pounds sterling growing to 500 million in 1992, half the toy market. Hardware sales will rise from
261 to 635 million pounds sterling in 1994. Associated software sales are forecast at 645 million
pounds sterling in 1993. The compact disc market is worth 345 million pounds sterling. The main
competitors in the market are Sega and Nintendo. Nintendo will spend 15 million pounds sterling
on advertising over Oct-Dec 1992.
The professional abstractors who produce abstracts such as the above from original text are required to
specify indexing terms which reflect the subject matter of the article. These terms are constrained in that they
must refer to specified sub-indexing headings; they are generally fairly short, and they tend to be sequences of
words which actually occur in the abstract itself. Thus, from the point of view of providing additional search
terms, they are usually redundant. For the example abstract given above, the corresponding 'key terms'
provided by the abstractor are as follows:-
Key Term equivalents currently accompanying abstract:
Subjects: Video game industry; Market share
Companies: Nintendo Company Ltd.
From the point of view of a user searching a large database for relevant articles, the kind of output that
would be far more useful would be terms referring to larger, more complex concepts developed over the course
of the text. These may be said to represent the 'aboutness' of the abstract concerned.
The kinds of compound nominal terms which would be useful as output from the above example abstract
are as follows:-
• electronic games industry / market
• electronic games industry growth
• computer games toy market domination
• rising electronic games sales
• electronic games market competitors.
[3] This example, and the associated key terms listed, are taken from the General Academic Index database.
The original article appeared in The Observer of October 11th, 1992.
4 General approaches adopted and assumptions made in this approach
4.1 Other approaches
One of the biggest problems in Natural Language Generation (NLG) concerns the representation of semantic
knowledge: what is the knowledge that needs to be represented for the task under consideration, and what is the
best way of representing it.
Semantic / pragmatic approaches commonly adopt a frame-based perspective in which semantic
knowledge must be explicitly stated within what is normally a feature-based system. In such a system, a set of
(binary) features must be selected and ascribed associated feature values. There are two main (well recognised)
problems with adopting this approach:
• the selection, specification and values of features, with the associated problem of deciding on the
granularity of the representational level;
• the 'small world' problem, in which there is potentially so much semantic and pragmatic knowledge that
is relevant to the problem domain that it must be limited to a tiny microworld.
A further problem associated with frame-based systems in general is the underlying assumption that there
are a number of predefined categories (in this case of CNs) according to which a particular CN can be
categorised. For example, Gay & Croft (1990) have implemented a system which analyses what they term
'nominal compounds'. An assumption underlying their system is that there are categories of nominal
compounds (such as 'instrument', 'transitive event') according to which they can be classified. In this system each
concept in the knowledge base has roles associated with it (eg 'agent', 'object', 'instrument', 'location', 'time'), these
roles varying according to the category of a particular compound. The roles for each concept in the knowledge
base indicate the type of relationships in which that concept can occur, with each role having semantic
preferences that limit the categories of concepts that may occur in each relationship.
The underlying problem with this kind of approach is its reliance on the basic assumption of there being a
set of specifiable categories, according to which all nominal compounds may be classified. A closed set of
relations (semantic roles) is prespecified for each type of category, and nominal compounds may be formulated
according to the specifications. It is this kind of prespecification of features, relations, roles and so on which is
inevitable in frame-based systems, and which the work described in this report aims to avoid.
4.2 The approach adopted here
One of the aims of the work presented here is to avoid the knowledge specification and representation problems
associated with the use of frames. The claim here is that there is an enormous amount of semantic and pragmatic
information implicitly present within the definitions of a normal dictionary of contemporary language (in this
case English), and that this can be exploited in the processing of an existing piece of text to generate a linguistic
representation of the salient concepts which are developed within the course of the text. An on-line dictionary
thus constitutes, in effect, a rich knowledge base which is task- and domain- independent.
The general approach adopted here is to use commonalities in the dictionary definitions of distinct terms to
build up a network which links key terms, their modifiers, and associated verbal information. The links acquire
weightings according to the salience of the linked items within the text, and the type of link involved. Section 4.3
elaborates on the notion of salience, and section 4.4 specifies the different kinds of links involved in the
construction of such networks.
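The linking of terms through commonalities in their definitions can be sketched as follows. This is only an illustration under stated assumptions: the mini-dictionary is hand-written for the example and is not taken from any real dictionary, and the stopword list and the function names content_words and build_links are hypothetical. Weightings are deliberately left out, since they depend on the salience and link-type factors discussed in later sections.

```python
from itertools import combinations

# Hypothetical, hand-written definitions: NOT entries from a real dictionary.
TOY_DEFINITIONS = {
    "computer": "an electronic machine that stores and processes data",
    "video": "the electronic recording and display of moving pictures",
    "market": "the trade in a particular kind of goods",
    "industry": "the production and trade of a particular kind of goods",
}

# Assumed stopword list, used to ignore function words in definitions.
STOPWORDS = {"a", "an", "and", "of", "in", "that", "the"}

def content_words(definition):
    """Return the set of non-stopword tokens in a definition."""
    return {w for w in definition.lower().split() if w not in STOPWORDS}

def build_links(terms, definitions):
    """Link every pair of terms whose definitions share content words.

    Returns a dict mapping (term_a, term_b), in sorted order, to the set
    of shared definition words. The relation itself stays implicit: only
    the fact of a commonality is recorded.
    """
    links = {}
    for a, b in combinations(sorted(terms), 2):
        shared = content_words(definitions[a]) & content_words(definitions[b])
        if shared:
            links[(a, b)] = shared
    return links

links = build_links(["computer", "video", "market", "industry"], TOY_DEFINITIONS)
```

With these toy entries, 'computer' and 'video' become linked through the shared word 'electronic', and 'industry' and 'market' through several shared words, while no link arises between, say, 'computer' and 'market'.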
It is expected that the strong version of the hypothesis will not be completely supported by the results
obtained from (ie the acceptability of the compound nominals generated by) the system. As mentioned briefly
above, the approach is to explore the extent to which the strong hypothesis holds, and use the expected
shortcomings of this hypothesis to formulate constraints which can be used to improve the system. It may be that
some of these constraints will be implementable using information within the dictionary listings, but additional
knowledge may be required.
4.3 Underlying assumptions
There are a number of basic assumptions that have been adopted in this work, and these are specified below:-
• for any piece of coherent text (such as an abstract) it is possible to construct a number of conceptual
entities of varying complexity to express the 'aboutness' (main content) of the text. These may be used
recursively to build up 'unitary' representations of increasingly more complex 'concepts';
• for any sentence / phrase which refers either explicitly or implicitly to some complex referent (entity), it is
possible to construct a compound nominal expression as its linguistic representation;
• compound nominals offer a linguistic means of compacting large amounts of information into
premodified nominal form, so that complex conceptual 'entities' may be expressed. Compaction may be
optional or necessary, and may be conceptually or editorially motivated, or related to a particular style (such as
expository text);
• there is a direct relationship between the salience of a term (in regard to the 'aboutness' of the overall text)
and its frequency of occurrence within a piece of text. To some extent the salience of an item is reflected in its
position in the text, particularly when it occurs in the first sentence (see eg Kieras, 1980 on the signalling of the
thematic content by items appearing in the first sentence). This is specifically true for abstracts occurring in the
database used here, since one of the specific constraints on the abstractors is to make the first sentence as
indicative as possible of the overall content of the article. The salience of a complex concept is assumed to be
directly related to the combined strengths of the linkages leading to it in the network.
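The last assumption above can be made concrete with a small sketch. It scores a term by its frequency in the text and adds a bonus if the term occurs in the first sentence. The bonus value, the simple regular-expression tokenisation and the function name term_salience are all assumptions made for illustration; the report does not fix any of them. The input consists of three sentences taken from the example abstract in section 3.3.

```python
import re
from collections import Counter

FIRST_SENTENCE_BONUS = 1.0  # hypothetical value; the report does not specify one

def term_salience(text, terms):
    """Score each term by its frequency in `text`, plus a bonus if the
    term occurs in the first sentence."""
    # Naive sentence split on whitespace following ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    first = set(re.findall(r"[a-z]+", sentences[0].lower()))
    return {
        t: counts[t] + (FIRST_SENTENCE_BONUS if t in first else 0.0)
        for t in terms
    }

abstract = (
    "The video games industry is growing fast and will dominate the toy "
    "market and become an established part of home entertainment. "
    "The compact disc market is worth 345 million pounds sterling. "
    "The main competitors in the market are Sega and Nintendo."
)
scores = term_salience(abstract, ["market", "industry", "competitors"])
```

Under these assumptions 'market' scores highest, since it occurs three times and also appears in the first sentence, whereas 'competitors' occurs once and only in the final sentence.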
4.4 Types of linkage between items in a network
As will be seen in section 5, which describes the methodology used in this implementation, terms are linked
into a network both at the pre-lookup stage (ie before any definitional lookup has occurred) and after up to two
levels of definitional lookup.
There are different types of linkage in a network, which reflect either different types or different strengths of
relationships.
4.4.1 Types of relationships
As far as compound nominals are concerned, there are two distinct types of relationships that can be described:
• Semantic relations.
These are relations of the type discussed by many authors, and typically associated with the work of, for
example, Lees (1960) and Levi (1978). They concern the relationship that holds between a head noun and its
premodifier/s and are typically things like 'cause' (eg 'flu virus'), 'part-of' (eg 'table leg') etc. It should be
emphasised that although this work specifically aims not to make these types of relations explicit (but rather
leave them as implicit), the fact that such relationships do hold between heads and premodifiers is not disputed
here. The area of dispute would be rather with suggestions that there is a closed class of such relations. Looking
at data from a variety of sources strongly suggests that this is not the case at all, and that the relationships that do
hold between items are highly context dependent, and do not form a closed set.
An example of this kind of relation in the example abstract considered in this report could be labelled
'dealing in', ie that holding between the terms 'computer games' and 'market', or 'toy' and 'market'. There is
clearly a further semantic relation holding between 'computer' and 'games' in the first example, which could be
labelled 'to be played on'.
• Synonymy or semi-synonymy relations of meaning.
The subsumption of the meaning of specific terms within more general terms (eg both 'computer' and 'video'
within the meaning of the more general 'electronic'), and the means of representing both direct property
inheritance and less distinct relations of meaning are well-established problems in the field of Artificial
Intelligence. From the point of view of the work reported here, we will want to recognise the more general terms,
such as the 'parent' term 'electronic' which subsumes 'video games', 'computer games' and 'compact disc'. Such
generic terms, which do not appear at all in the input text, will be required in order to generate generalised
aboutness terms. The approach taken in this work is to assume that the dictionary definitions of related specific
terms will each contain the more general term, and that the latter will thus be selected for inclusion in the
network on the basis of the matching between the common terms.
We are clearly dealing with degrees of synonymy, so that if we consider elements in a hierarchical
representation of meaning, two terms may be:-
• partially synonymous, where either a parent term may subsume more specific daughter terms (eg parent
'electronic' of 'computer', 'video', 'compact disc'), or an uncle / aunt may partially subsume more specific niece /
nephew terms;
• directly synonymous, where sister terms are judged as having the same meaning.
To some extent it is expected that the degree of synonymy will be reflected in the depth of the lookup level
required to give a match between definitions of terms.
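The idea that the degree of synonymy is reflected in the lookup depth can be sketched as a search for the shallowest level at which two terms come to share a word. The mini-dictionary below is hand-written and hypothetical, with definitions already reduced to lists of content words, and the function names expand and lookup_level are assumptions. Level 0 stands for a direct word match, and the default maximum of two levels follows the limit mentioned in section 4.4.

```python
# Hypothetical mini-dictionary, hand-written for illustration only.
# Each entry lists the content words of an imagined definition.
TOY_DEFINITIONS = {
    "computer": ["electronic", "machine"],
    "video": ["electronic", "picture"],
    "console": ["computer", "games"],
}

def expand(words, definitions):
    """One level of definitional lookup: add the definition words of
    every word that has a dictionary entry."""
    expanded = set(words)
    for w in words:
        expanded.update(definitions.get(w, []))
    return expanded

def lookup_level(term_a, term_b, definitions, max_level=2):
    """Return the shallowest lookup level (0 = direct word match) at which
    the two terms share a word, or None if no match occurs by max_level."""
    words_a, words_b = {term_a}, {term_b}
    for level in range(max_level + 1):
        if words_a & words_b:
            return level
        words_a = expand(words_a, definitions)
        words_b = expand(words_b, definitions)
    return None
```

With these toy entries, 'computer' and 'video' match at level 1 through 'electronic', whereas 'console' and 'video' match only at level 2, which on the expectation stated above would indicate a weaker degree of synonymy.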
4.4.1.1 Relationships within and between compound nominals
In regard to the current work, we should specify the kinds of relationships that hold both within and between
compound nominals. Consider two compound nominals of the form P1-H1 and P2-H2 respectively, where P1
and P2 are the premodifiers, and H1 and H2 are the heads of the respective CNs: it is clear that we are in fact
dealing with a mixture of the two types of linkage, which can be expressed as follows:-
• Relationship/s between a head noun and its premodifier/s: ie those holding between P1 and H1, or
between P2 and H2. These are the typical 'semantic relations' attributed to compound nouns, and remain implicit
in this approach;
• Meaning relationships of the synonymy type, holding between distinct heads (ie synonymy relations
between H1 and H2). Linkage is by direct word match or via matching of dictionary lookup terms;
• Meaning relationships of the synonymy type, holding between premodifiers of distinct heads (ie between
P1 and P2 of distinct CNs). The heads (H1 and H2) may, but need not, be related;
• Semantic relationships holding between the head of one compound nominal and the premodifier/s of
another (ie between H1 and P2 or H2 and P1), based on there being a synonymy type of meaning relation
established between the two heads (ie based on the establishment of H1 - H2 synonymy linkage).
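As an illustration only, the four kinds of relationship listed above can be enumerated for a pair of CNs of the form P1-H1 and P2-H2. The names LinkKind, CompoundNominal and candidate_links are hypothetical and are not taken from the implementation described in this report. The sketch proposes candidate pairs for checking and says nothing about whether a given link is actually established.

```python
from dataclasses import dataclass
from enum import Enum


class LinkKind(Enum):
    HEAD_PREMODIFIER = "head and its own premodifier (implicit semantic relation)"
    HEAD_HEAD = "synonymy between distinct heads"
    PREMODIFIER_PREMODIFIER = "synonymy between premodifiers of distinct heads"
    HEAD_CROSS_PREMODIFIER = "head of one CN and premodifier of another"


@dataclass(frozen=True)
class CompoundNominal:
    premodifiers: tuple  # e.g. ("toy",)
    head: str            # e.g. "market"


def candidate_links(cn1, cn2, heads_linked):
    """List (kind, item_a, item_b) triples that could hold between two CNs.

    heads_linked states whether an H1 - H2 synonymy link has been found;
    the cross links (H1 - P2, H2 - P1) are only proposed in that case.
    """
    links = []
    for cn in (cn1, cn2):
        for p in cn.premodifiers:
            links.append((LinkKind.HEAD_PREMODIFIER, p, cn.head))
    links.append((LinkKind.HEAD_HEAD, cn1.head, cn2.head))
    for p1 in cn1.premodifiers:
        for p2 in cn2.premodifiers:
            links.append((LinkKind.PREMODIFIER_PREMODIFIER, p1, p2))
    if heads_linked:
        for p2 in cn2.premodifiers:
            links.append((LinkKind.HEAD_CROSS_PREMODIFIER, cn1.head, p2))
        for p1 in cn1.premodifiers:
            links.append((LinkKind.HEAD_CROSS_PREMODIFIER, cn2.head, p1))
    return links


toy_market = CompoundNominal(("toy",), "market")
hardware_sales = CompoundNominal(("hardware",), "sales")
with_cross = candidate_links(toy_market, hardware_sales, heads_linked=True)
without_cross = candidate_links(toy_market, hardware_sales, heads_linked=False)
```

Making the cross links conditional on heads_linked mirrors the requirement in the last item of the list, that head-to-premodifier links across CNs rest on an established H1 - H2 synonymy link.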
4.4.2 Strength of relationships
The strength of the relationship that holds between two nodes in a network is reflected in the weighting value
associated with the link between them. The particular relative and actual values of weighting factors remain to
be specified, but the factors involved in the ascription of weights are discussed here. There are two principal
factors which affect the strength of links: the proximity between the items in the original text (or the
definitional descendants of these items) which become linked, and the depth of definitional lookup that has been
involved in the assignment of the link. This latter factor will be referred to as the lookup level.
4.4.2.1 Proximity
It seems reasonable to assume that to some extent the distance (in the original text) between items which become
linked in some way in a network is relevant to the strength (weighting) ascribed to that link. Items which occur
in adjacent sentences are likely to be more strongly linked than those which occur in sentences which are
separated by one or more other sentences. It may be that related work, such as that done on collocations by
Smadja (1993), can be of some help in determining the likely strengths of relations between separated words.
Smadja's work refers to 'neighbourhoods' of 5 words in length: if words recurrently occur within 5 words of one
another then they can be designated as collocates. Whether or not this inter-word distance is appropriate in
the generation of compound nominals in the current work remains to be seen.
Although this suggestion seems over-simplistic if applied to extended text, if we consider the claim in regard
to the type of short, concise expository text that we see in the abstracts used in this work, it does not seem too
unreasonable. The exception is where we find multiple occurrences of the same term within a fairly short piece
of text. In this case it seems reasonable to assume that the referent is either the same, or that the reference is to a
similar kind of entity, and thus that there is justification for combining the modifiers to form a larger referring
expression.
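The proximity factor might be sketched as a weight that falls off with the number of sentences separating two items. Since the actual weighting values remain to be specified, the function name proximity_weight, the geometric form of the fall-off and the constant decay = 0.5 are all assumptions made purely for illustration.

```python
def proximity_weight(sentence_a, sentence_b, decay=0.5):
    """Weight in (0, 1] that decreases as the sentence distance grows.

    sentence_a and sentence_b are the sentence numbers (S1, S2, ...) in
    which the two items occur. decay is an illustrative constant only.
    """
    if not 0 < decay <= 1:
        raise ValueError("decay must lie in (0, 1]")
    distance = abs(sentence_a - sentence_b)
    return decay ** distance


same_sentence = proximity_weight(2, 2)
adjacent = proximity_weight(2, 3)
separated = proximity_weight(2, 5)
```

Under these assumptions, items in the same sentence receive the maximum weight, and items in adjacent sentences are weighted more heavily than items separated by intervening sentences.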
4.4.2.2 Lookup Level
We need to be able to distinguish between terms which occur in the original abstract text and those which are
inserted into the network by virtue of their appearing in the dictionary definitions of the original terms. There
are also instances where the terms are from a secondary level of lookup. Linkages which are formed on the basis
of information obtained from successively 'deeper' levels of lookup must reflect the assumption that the
justification for forming a link decreases with successive lookup levels. Linkages must therefore become
successively weaker with increasing depth of dictionary lookup level.
For the purposes of distinguishing the origin of terms and links, the following terminology is used:-
• terms which occur in the original abstract are referred to as L0 terms (because zero lookup procedures
have occurred to obtain them);
• terms which come from the first level of dictionary lookup are referred to as L1 terms (since they have
been yielded by one lookup level);
• terms which come from the second level of lookup are referred to as L2 terms (since they have been
yielded by a second lookup level).
Links are labelled according to the labelling of the terms they join, so that a link between two terms
occurring in the original text of an abstract would be designated an [L0 ->L0] link, whereas one which linked a
term in the original text to one occurring in the definition of a definition would be designated an [L0 -> L2] link,
and so on.
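The labelling convention, and the requirement that links weaken with lookup depth, can be illustrated as follows. The function names are hypothetical. The choice of the summed lookup levels as the measure of depth, and of a geometric decay, are assumptions; the report fixes only the direction of the effect, not its size.

```python
MAX_LEVEL = 2  # the report pursues lookup no deeper than L2


def link_label(level_a, level_b):
    """Label a link by the lookup levels of the two terms it joins."""
    for level in (level_a, level_b):
        if not 0 <= level <= MAX_LEVEL:
            raise ValueError("lookup level must be 0, 1 or 2")
    return f"[L{level_a} -> L{level_b}]"


def lookup_weight(level_a, level_b, decay=0.5):
    """Illustrative weight that shrinks as the total lookup depth grows."""
    return decay ** (level_a + level_b)


label = link_label(0, 2)
direct = lookup_weight(0, 0)
first_level = lookup_weight(0, 1)
second_level = lookup_weight(0, 2)
```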
5 Methodology
5.1 General methodology used in the implementation
In very general terms, the methodology used follows the steps outlined below. In effect, an on-line dictionary is
used as the main knowledge base. It is hoped that an on-line thesaurus will also be available to provide more
information on synonymy than typically appears in a standard dictionary. If it does become available, the
thesaurus will be consulted before each dictionary lookup procedure.
The system takes as input the text of an abstract, deletes all closed-class words, and builds up a network of
heads and premodifiers. There is an initial stage of linkage which occurs before any lookup procedures at all,
and it is at this stage that syntactic information is taken into consideration. In this stage the aim is to incorporate
all 'key' verbal information (ie from lexical verbs) into the relevant CNs which can initially be constructed from
the abstract. This is achieved by converting all lexical verbs into noun form, and then using syntactic
information (regarding the items which have been deleted), to compact strings of nouns and nominalised verbs
so as to construct additional compound nominals. Thus, in the example abstract which appears in section 5.2, the
system would initially form the compound nominal 'video games industry growth' from the actual input 'the
video games industry is growing ...'. The set thus formed is therefore an addition to the initial list of compound
nominals which appeared in the unprocessed abstract.
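A minimal sketch of this pre-lookup stage is given below. It is restricted to the subject-verb case of the example sentence. The word lists CLOSED_CLASS, BLOCKERS, NOUNS and NOMINALISED are tiny hand-made stand-ins for the dictionary's word-class labels, and treating 'and' as a block is an assumption. The sketch does not address the ordering problem that arises when a nominalised verb precedes its object (as in 'toy market domination').

```python
# Hand-made stand-ins for dictionary word-class information (assumptions).
CLOSED_CLASS = {"the", "is", "and", "will"}
BLOCKERS = {"and"}                     # deleted items assumed to block CN formation
NOUNS = {"video", "games", "industry"}
NOMINALISED = {"growing": "growth"}    # lexical verb -> noun form


def compact(tokens):
    """Build CNs from runs of nouns and nominalised verbs.

    Deletable closed-class items are skipped, so a run continues across
    them; blockers and other open-class words end the current run.
    """
    groups, current = [], []
    for token in tokens:
        word = token.lower()
        if word in NOMINALISED:
            current.append(NOMINALISED[word])
        elif word in NOUNS:
            current.append(word)
        elif word in CLOSED_CLASS and word not in BLOCKERS:
            continue
        else:
            if len(current) > 1:
                groups.append(" ".join(current))
            current = []
    if len(current) > 1:
        groups.append(" ".join(current))
    return groups


result = compact("The video games industry is growing fast".split())
```

With these word lists, the input 'The video games industry is growing fast' yields the single compound nominal 'video games industry growth', as in the example above.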
It should be noted that some of the deleted items are expected to act as 'blocks' to the formation of
compound nominals (ie across the boundaries indicated by the deletion tags), whereas others (such as
prepositions) seem to be universally deletable in the formation of CNs. Although some initial generalisations
about such 'enhancers' and 'blocks' to CN formation have been made, a more precise, widespread analysis
remains to be done.
After this initial pre-definitional lookup stage, the main linking stages begin, with items being linked by
virtue of either direct matching of terms, or implied semantic, or synonymy relationships (as discussed in section
4.4.1 above).
The links between items have associated weighting values, which are ascribed according to the factors of
proximity and lookup level, as discussed above in section 4.4.2. The linkage processes are repeated until all items
have been pursued for possible linkage, to the L2 level of lookup. Note that it is only if L1 linkage fails that L2
linkage is investigated.
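The order in which matches are sought (direct match first, then L1, and L2 only if L1 fails) might be sketched as follows. The function names expand and find_link are hypothetical, the thesaurus step is omitted, and the small DICTIONARY holds hand-reduced, hand-stemmed versions of the definitions quoted in the worked example of section 5.2.

```python
# Open-class terms from the definitions quoted in section 5.2,
# reduced and stemmed by hand for this sketch (an assumption).
DICTIONARY = {
    "market": {"event", "occasion", "people", "meet", "buy", "sell", "merchandise"},
    "sterling": {"british", "money"},
    "sale": {"exchange", "goods", "property", "services", "money", "credit"},
    "merchandise": {"commercial", "goods", "commodities", "purchase", "sale", "services"},
    "sell": {"dispose", "transfer", "purchaser", "exchange", "money", "sale"},
}


def expand(terms, dictionary):
    """One lookup level: all terms found in the definitions of `terms`."""
    found = set()
    for term in terms:
        found |= dictionary.get(term, set())
    return found


def find_link(a, b, dictionary, max_level=2):
    """Return (level_a, level_b, shared_terms) for the shallowest match.

    Level 0 is a direct match of the terms themselves. Deeper levels are
    tried only when every shallower level has failed. Returns None if no
    match is found down to max_level.
    """
    layers_a, layers_b = [{a}], [{b}]
    for level in range(max_level + 1):
        for i in range(level + 1):
            for j in range(level + 1):
                if max(i, j) != level:
                    continue  # this pair was already tried at a shallower level
                shared = layers_a[i] & layers_b[j]
                if shared:
                    return i, j, shared
        if level < max_level:
            layers_a.append(expand(layers_a[-1], dictionary))
            layers_b.append(expand(layers_b[-1], dictionary))
    return None


sterling_sale = find_link("sterling", "sale", DICTIONARY)
market_sale = find_link("market", "sale", DICTIONARY)
```

With this data, 'sterling' and 'sale' are linked at [L1 -> L1] through 'money', whereas 'market' reaches 'sale' only at the second lookup level, through the definitions of 'merchandise' and 'sell'.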
Once all the potential linkages have been pursued, the resulting network, consisting of weighted links
between different heads (including novel terms) and their associated modifiers must be partitioned into a
number of discrete (possibly overlapping) representations of complex concepts. Although the methodology of
this stage is still to be determined, it is clear that the weighting values associated with 'clusters' will be taken as
reflecting the relative salience of the concepts.
It then remains to express each grouping in compound nominal form, as the main head preceded by an
ordered list of the modifiers. The factors which determine this ordering are complex, and remain to be specified
at this stage.
5.2 A more specific description of implementation methodology: a worked example
This section contains a more detailed account of the steps taken in the processing of an individual abstract. It
consists of a worked example, analysed by hand, for which the Collins English Dictionary (2nd edition, 1986)
was used. For the purposes of clarity this section is presented in note form, with the addition of some explanatory
sections of text where necessary. It should be noted that the methodology described here makes no
reference to an on-line thesaurus. If this facility is added, it would entail an additional lookup procedure (ie in
the thesaurus) to precede each dictionary lookup. Figure 1: 'establishing a new link', shows this ordering in the
form of a flow chart.
The example input abstract:
The video games industry is growing fast and will dominate the toy market and become an established part
of home entertainment. The 1991 computer games market was worth 275 million pounds sterling growing to 500
million in 1992, half the toy market. Hardware sales will rise from 261 to 635 million pounds sterling in 1994.
Associated software sales are forecast at 645 million pounds sterling in 1993. The compact disc market is worth
345 million pounds sterling. The main competitors in the market are Sega and Nintendo. Nintendo will spend 15
million pounds sterling on advertising over Oct-Dec 1992.
[Figure 1 (flow chart). Box labels: Check new potential link, set L = 0 *; Do terms match directly?; Check
thesaurus for synonymy link; Direct match?; Dictionary lookup, L+1 -> L; L = 2?; Establish link with weightings;
Have all potential links been checked?; Check next potential link; END. Decision branches are labelled YES / NO.]
* L = Lookup Level.
Figure one: establishing a new link
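[Figure 2 (network diagram). Head nodes with their premodifiers: '£ sterling' (275m S2, 500m S2, 261 to 635m S3,
645m S4, 345m S5, 15m S6); 'market' (toy, computer games, compact_disc; sentence labels S1, S2, S2, S5, S6);
'*sales' (hardware, software, associated; sentence labels S3, S4).]
* need to look up stem 'sale'
Sn : sentence number in which term occurs in abstract.
Figure two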
5.2.1 Methodology
• Look up dictionary entry for each word in input abstract, and label it according to syntactic word class.
• Delete all closed class words, leaving appropriate tag. Thus, all articles, prepositions, determiners,
pronouns, conjunctions, and auxiliary verbs are deleted.
• Go through text, building up a list of all existing nominal groups. For the example abstract, this gives:
video games industry
toy market (*2)
home entertainment
1991 computer games market (see footnote 4)
n million pounds sterling (see footnote 4) (*4)
hardware sales
(associated) software sales
compact disc market
(main) competitors
In fact, the items listed above are all the premodified noun groups, rather than specifically compound
nominal expressions. The brackets indicate those items which are 'simply' nouns preceded by an adjective, and
thus those which are not considered as compound nominals in themselves. It is relevant to include these in the
first stages of processing, because they can potentially be used in the initial construction of compound nominal
expressions before any dictionary lookup occurs.
• Translate verb forms into corresponding noun forms then repeat the last stage. For the example abstract,
this gives additional CNs:
video games industry growth
toy market domination
(home entertainment part establishment)
( 275m pounds sterling growth)
hardware sales rise
software sales forecast
(main) market competitors
Note the problem of specifying the correct ordering.
4 It is anticipated that cardinal numbers and dates will require special treatment: a) to facilitate their
recognition; b) to prevent their combination as multiple premodifiers of the same head (eg 'the 1990 1991 1992
computer games market').
[Figure 3 (network diagram). The network of Figure 2 ('£ sterling', 'market' and '*sales' with their premodifiers
and sentence labels), with the intermediate terms GOODS, EXCHANGE, MONEY and SERVICES added.]
* need to look up stem 'sale'
Sn : sentence number in which term occurs in abstract.
Figure three
[Figure 4 (network diagram). The network of Figure 3, including the intermediate terms GOODS, EXCHANGE, MONEY
and SERVICES, with the links between nodes labelled by lookup level. Labels shown: L1; L2; L2 (via 'sell');
L2 (via 'm'dise'); L2->L0 (*2); L2->L2.]
* need to look up stem 'sale'
Figure four: first linkage
• Using original abstract, make a frequency list for all head nouns and label them L0 (because zero
definitional lookups have occurred). This gives:
Head noun       Frequency
industry        1
market          5
part            1
entertainment   1
sterling        5
sales           2
competitors     1
advertising     1
• Identify L0 heads having frequency > 1 (ie 'market', 'sterling', 'sales').
Represent these in an initial network, as shown in Figure 2.
• Look up the first entry for each term in the dictionary.
(Note the assumption that the first entry is more representative of the meaning than are subsequent entries.)
This gives:
'market': an event or occasion where people meet ... buying and selling merchandise.
'sterling': British money.
'sales' (lookup 'sale'): exchange of goods, property or services for ... money or credit.
• Eliminate all closed class words from definitions and store remaining terms as L1 definitions for the
respective L0 terms.
• Go through L1 definitions, searching for linkage with L0 and then other L1 terms. Identify links and label
accordingly (eg 'money' in L1 definitions from both 'sterling' and 'sale' gives an [L1 -> L1] link between the L0
parents).
• For L0 heads which show no linkage (here only 'market'), look for L2 terms, as follows:
Go to the definition of the L0 head (ie 'market') and count the occurrences of all the open class words.
(NB count the words occurring for all entry numbers listed in the definition - in the case of 'market', there
are 6.) Make a list of all those words with frequency >1 : these are for further (L2) lookup.
Here we get L1 terms:
merchandise     3
sell (stem)     5
Definitions (first entry of each):
'merchandise' (L1 term): commercial goods, commodities. to engage in commercial purchase & sale of goods or
services. (L2 terms).
'sell' (L1 term): to dispose of or transfer ... to a purchaser in exchange for money etc.; put or be on sale.
(L2 terms).
• Store all these L2 lookup definitions in lists. Identify intermediate matches, ie those which enable linkage
between L0 heads to be made. Relevant intermediate links here are 'goods', 'exchange' and 'services'. Add these
intermediate terms to the network: see Figure 3.
• Analyse lists L0, L1 and L2 for presence of links, noting type of linkage: ie [L1-> L1] (ie money -> money
as above); [L2 -> L1] etc.
• Express linkage in a network, with associated weightings.
Figure 4 shows the state of this network at this stage.
• Search for linkage between modifiers in the same way as that described above for linkage between heads,
incorporating data.
• Partition the network - details of the precise methodology involved in this stage remain to be clarified,
although clearly the weighting values of the links will be heavily involved.
• Order modifiers.
• Express 'clusters' in compound nominal form.
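The first steps of the worked example (the head frequency list, the selection of repeated heads, and the construction of L1 definitions) can be sketched as follows. The list HEADS was extracted by hand from the example abstract so as to agree with the frequency table above, the definitions are those quoted above (elisions included), and CLOSED_CLASS covers only the closed-class words that occur in those definitions. All names are hypothetical.

```python
import re
from collections import Counter

# Head nouns of the example abstract, in order of occurrence (by hand).
HEADS = [
    "industry", "market", "part", "entertainment",
    "market", "sterling", "market",
    "sales", "sterling",
    "sales", "sterling",
    "market", "sterling",
    "competitors", "market",
    "sterling", "advertising",
]

# First-entry definitions as quoted in the worked example.
DEFINITIONS = {
    "market": "an event or occasion where people meet ... buying and selling merchandise",
    "sterling": "British money",
    "sale": "exchange of goods, property or services for ... money or credit",
}

CLOSED_CLASS = {"an", "or", "where", "and", "of", "for"}


def l1_terms(definition):
    """Open-class words of a definition, stored as the L1 terms of its head."""
    words = re.findall(r"[a-z]+", definition.lower())
    return {w for w in words if w not in CLOSED_CLASS}


frequencies = Counter(HEADS)
repeated_heads = sorted(h for h, n in frequencies.items() if n > 1)
l1 = {head: l1_terms(text) for head, text in DEFINITIONS.items()}
shared = l1["sterling"] & l1["sale"]
market_overlap = l1["market"] & (l1["sterling"] | l1["sale"])
```

The shared L1 term 'money' gives the [L1 -> L1] link between 'sterling' and 'sale', while 'market' shows no L1 linkage and so goes forward to L2 lookup, as described in the steps above.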
5.3 Assessment of output expressions
Given that the overall aim is the implementation of a system which can be used in the production of appropriate
compound nominals, there must be some assessment of the output compound nominal expressions. It is
expected that this evaluation will occur in two stages: an initial stage in which the output expressions are judged
informally, in a crude sense, for 'understandability'; and a second, final stage involving the appraisal of the
degree to which the output expressions are judged to be truly representative of the main aboutness of the
abstract. It is expected that the former stage will not require expert opinion, but that the latter stage will need to
be judged by professional abstractors.
There are a number of possible formats that the final evaluation could take, such as:
• simply asking willing abstractors to give acceptability judgements on the output expressions from
particular input abstracts;
• asking professional abstractors to generate a number of compound nominals to express the main
aboutness of particular abstracts;
• presenting professional abstractors with a multiple choice questionnaire which gives alternative possible
aboutness terms.
The initial evaluation, which will not require professional judgement, is likely to be based on intuitive
judgements of the output expressions. The precise form of the final evaluation selected will depend partly on the
range of output expressions actually generated by the system, and the degree to which the strong hypothesis
appears to be supported, as judged by the initial assessment.
6 Problems envisaged
• Lexical ambiguity.
The problem of one lexical item having multiple meanings (eg 'bank' as in 'river bank' versus 'high street
bank') is not a new one. It is expected that such ambiguity will lead to the formation of unjustified linkages.
However, it may be that, in cases where such polysemy exists, the dictionary definitions of words occurring in
the same sentence as each other could be used as an indication of which meaning should be assumed.
• Ambiguity of word class.
This is an implementation problem, the effect of which is not likely to be as serious as might at first be
envisaged. If we consider the fact that, after deletion, we are left with nouns, verbs, adjectives and adverbs, then
the main ambiguities of word class after the deletions are going to be between noun and verb status. However,
since all lexical verbs undergo conversion into their noun forms, it seems reasonable to state that where cases of
noun / verb ambiguity arise, the default assignment of word class should be that of noun.
• Nominalisation of verb forms.
The ideal in this regard would be to use a dictionary which lists the noun forms of verbs. If this were the
case, then the problem of nominalising the verb form would be restricted to selecting the appropriate one in
those cases where multiple noun forms exist. Since it is likely, however, that the dictionary to be used does not
list the noun forms, it has been necessary to specify a method for the conversion. The following method has been
adopted:- for a given verb, identify its stem and search for an item which is listed as a noun, which has the
same stem as the verb. In cases where there is no success, assume that the noun form comprises the verb stem
plus the suffix 'ing'. The main problem with this crude method is the fact that the verb form and noun form
often do not share a common stem, resulting in the selection of unrelated noun forms. It may be that this
problem proves prohibitively large, in which case the default method (stem + 'ing') would need to be adopted
universally. Although this method has its shortcomings, a better one remains to be found.
• Synonymy.
This problem arises largely because of the lack of objectivity involved in the particular terms employed by
different lexicographers in dictionary definitions. Different synonyms may be employed by different
lexicographers, with the result that some linkages which should be formed are not in fact formed, even at the L2
lookup stage. The clearest solution to this problem would be to include an on-line thesaurus, to be consulted at
the pre-dictionary lookup stage, as discussed above.
• Ordering of sequences of modifiers.
Once the 'clusters' within a network have been identified, the premodifiers must be ordered so as to
produce meaningful, well-formed compound nominal expressions. The identification and specification of the
constraints which govern this ordering is another stage of the research that remains to be done. It may be that
the initial lists of compound nominals (labelled according to the sentence in which they occurred) are helpful in
this regard.
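The nominalisation method described earlier in this section (search for a listed noun sharing the verb's stem, otherwise fall back on stem + 'ing') might be sketched as below. The suffix-stripping stem function is a deliberately crude stand-in for proper stem identification, and the rule for choosing among several candidate nouns (shortest first, then alphabetical) is an assumption, since the report leaves that choice open.

```python
def stem(word):
    """Crude stem: strip one common suffix, keeping at least 3 letters."""
    for suffix in ("ing", "es", "ed", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def nominalise(verb, nouns):
    """Return a noun form for `verb`.

    nouns is the set of items listed as nouns in the dictionary. A noun
    qualifies if it begins with the verb's stem; a noun identical to the
    verb (eg 'rise') also qualifies. With no candidate, use stem + 'ing'.
    """
    verb_stem = stem(verb)
    candidates = sorted(
        (n for n in nouns if n.startswith(verb_stem)),
        key=lambda n: (len(n), n),
    )
    if candidates:
        return candidates[0]
    return verb_stem + "ing"


NOUNS = {"growth", "domination", "rise", "market"}
growth = nominalise("growing", NOUNS)
domination = nominalise("dominate", NOUNS)
rise = nominalise("rise", NOUNS)
spending = nominalise("spend", NOUNS)
```

Because the match is made on the stem alone, the sketch shares the weakness noted above: it cannot find a noun form that has a different stem from the verb, and it may select an unrelated noun that happens to begin with the same letters.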
7 Summary
This report describes the work in progress which is scheduled for completion at the end of July 1994. It describes
the general aims of the project, which are: the implementation of a system which generates novel compound
nominal expressions from free text; and the use of that system as a tool in studying the constraints governing the
acceptability of such expressions.
The motivation behind the system can be seen from both practical and theoretical viewpoints. From the
practical point of view, the general motivation is that of improved access to relevant database material. From
the theoretical standpoint, the incentive is to test the hypothesis that the semantic information implicit in the
dictionary lookup terms of individual words is sufficient to produce meaningful compound nominal expressions
representing the aboutness of the text.
The current work also aims to produce a system that takes 'real life' input (abstracts) and generates useful
novel output expressions (representing the aboutness of the abstract) in a variety of domains, rather than
requiring the detailed specification of a highly specific microworld. The disadvantages of the latter requirement
are discussed by McDonald (1993).
Bibliography
Baxendale, P.B. (1958) Machine-Made Index for Technical Literature - An Experiment. IBM Journal of
Research and Development, 2 (4), pp. 354-361.
Beardon, C. & Turner, K. (1993) An analysis of the problems involved in understanding complex nominals.
Research Report (RSRC-93001), RSRC, University of Brighton.
Cumming, S. (1991) 'Nominalization in English and the organization of grammars'. In IJCAI 1991, pp 42-51.
van Deemter, K. (1991) On the Compositionality of Meaning: Four variations on the Theme of
Compositionality in Natural Language Processing. PhD Thesis. University of Amsterdam.
van Dijk, T.A. (1980) Macrostructures. Lawrence Erlbaum Associates, New Jersey.
Downing, P. (1977) On the creation and use of English Compound Nouns.
Language, 44, pp 810 - 842.
Endres-Niggemeyer, B. (1990) A Procedural Model of an Abstractor at work. International Forum of
Information and Documentation (IFID), 15 (4).
Finin, T.W. (1986) The Semantic Interpretation of Compound Nominals. PhD Thesis, University of Illinois,
Urbana, Illinois.
Gay, L.S. & Croft, W.B. (1990) Interpreting Nominal Compounds for Information Retrieval. Information
Processing and Management, 26 (1), pp 21-38.
Gladwin, P., Pulman, S. & Sparck Jones, K. (1991) Shallow Processing and Automatic Summarising: a First
Study. Technical Report No. 223. Computer Laboratory, University of Cambridge, Cambridge.
Halliday, M.A.K. (1988) 'On the language of physical science'. In Ghadessy, M. (ed) (1988) Registers of
Written English: situational factors and linguistic features. London, Pinter. pp 162-178.
Hintikka, J. (1980) 'Theories of Truth and Learnable Languages' In S. Kanger & S. Ohman (eds.) Philosophy
and Grammar, D. Reidel. Dordrecht, pp 37-57.
Kieras, D.E. (1980) Initial Mention as a Signal to Thematic Content in Technical Passages. Memory and
Cognition, 8, 345-353.
Katz, J.J. (1981) Language and Other Abstract Objects. Rowman & Littlefield. Totowa, New Jersey.
Lahav, R. (1989) 'Against Compositionality: the case of adjectives'. Philosophical Studies, 57 (3), pp 261-279.
Langacker, R.W. (1987) Foundations of Cognitive Grammar, Volume 1: Theoretical Prerequisites. Stanford
University Press. Stanford, Ca.
Langacker, R.W. (1990) Concept, Image and Symbol: The Cognitive Basis of
Grammar. Mouton de Gruyter. New York.
Lees, R.B. (1960) The Grammar of English Nominalizations. Indiana University, Bloomington, IN.
Leonard, R. (1984) The Interpretation of English noun sequences on the computer. North-Holland.
Amsterdam.
Levi, J. (1978) The Syntax and Semantics of Complex Nominals.
Academic Press. New York.
Liddy, E.D. (1991) The Discourse-Level Structure of Empirical Abstract: an Exploratory Study. Information
Processing and Management, 27 (1), pp. 55-81.
Luhn, H.P. (1958) The Automatic Creation of Literature Abstracts. IBM Journal of Research and
Development, 2 (2), pp. 159-165.
McDonald, D.D. (1993) Issues in the Choice of a Source for Natural Language Generation. Association for
Computational Linguistics, 19 (1), pp 191-197.
Montague, R. (1970) Universal Grammar. Theoria, 36 , pp.373-398.
Paice, C.D. (1990) Constructing Literature Abstracts by Computer: Techniques and Prospects. Information
Processing and Management, 26 (1), pp. 171-186.
Partee, B.H. (1984) 'Compositionality' In F. Landman & F. Veltman (eds.), Varieties in Formal Semantics,
Foris, Dordrecht, pp 281-311.
Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985) A Comprehensive Grammar of the English
Language. Longman. London.
Rau, L.F., Jacobs, P.S. & Zernik, U. (1989) Information Extraction and Text Summarization using Linguistic
Knowledge Acquisition. Information Processing and Management, 25 (4), pp. 419-428.
Sager, J.C. (1990). A Practical Course in Terminology Processing. John Benjamins Pub. Co., Amsterdam.
Schank, R.C. (1972) Conceptual Dependency: A Theory of Natural Language Understanding. Cognitive
Psychology, 3 (4), pp 552-631.
Smadja, F. (1993) Retrieving Collocations from Text: Xtract. Association for Computational Linguistics, 19
(1), pp 143-176.
Sparck Jones, K. (1983) Compound Noun Interpretation Problems. Technical Report No. 45, Computer
Laboratory, University of Cambridge, Cambridge.
Sparck Jones, K. (1983a) 'So What about Parsing Compound Nouns?' In Sparck Jones, K. & Wilks, Y.
Automatic natural language parsing. Ellis Horwood. Chichester. pp 164-168.
Sparck Jones, K. & Tait, J. (1984) Automatic Search Term Variant Generation. Journal of Documentation, 40,
pp 50-66.
Sparck Jones, K. (1993) Discourse Modelling for Automatic Summarising. Technical Report No. 290,
Computer Laboratory, University of Cambridge, Cambridge.
Winograd, T. (1972) Understanding Natural Language. Academic Press. New York.