Computing Lexical Cohesion as a Tool for Text Analysis
Hideki Kozima
Course in Computer Science and Information Mathematics
Graduate School of Electro-Communications
University of Electro-Communications
Doctoral Thesis, December 13, 1993
Abstract
Recognizing coherent structure of a text is an essential task in natural language understanding.
It is necessary, for example, to resolve anaphora, ellipsis, and ambiguity. One of the dominant
factors of coherence of the text structure is lexical cohesion, namely the dependency relationship
between words based on associative relations in common knowledge.
This thesis proposes an objective and computationally feasible method for measuring lexical
cohesion, especially semantic relations, between words. Lexical cohesion between words is com-
puted on a semantic network constructed systematically from a subset of an ordinary English
dictionary. Spreading activation on the semantic network analyses the meaning of a word into a
2,851-dimensional semantic space and computes the strength of lexical cohesion between any two
words in the dictionary.
As an evaluation of the measurement of lexical cohesion, this thesis then presents a quantitative indicator, Lexical Cohesion Profile (LCP), for segmenting narratives into scenes, the smallest domain in which text coherence can be defined. LCP is a record of the density of lexical cohesion of words in a window (51 words long in one example) that moves forward word by word over the text. Hills and valleys in a graph of LCP plotted against word position indicate alternations of scenes in the text.
A psychological experiment shows that LCP correlates closely with human judgements.
The evaluation through the text-level application reveals that the proposed measurement of lexical
cohesion works well as an indicator of coherent structure of a text.
The measurement of lexical cohesion provides semantic information for text analysis. The segmentation scheme provides the framework for recognizing coherent text structure. Both can be applied to various studies in a broad range of fields in natural language processing.
Contents
Chapter 1. Introduction
2. Related Work and the Strategy of This Thesis
3. Computing Lexical Cohesion
4. Segmenting Narratives into Scenes
5. Retrospects and Prospects
6. Conclusion
1 Introduction
Words and phrases in a text display a kind of
mutual dependence which creates a coherent tex-
ture: they do not occur at random. The tex-
ture is what distinguishes a text from something
that is not a text. Let us refer to the texture un-
der the heading of text structure, following re-
cent studies of text understanding [Hobbs, 1979;
Beaugrande and Dressler, 1981; Grosz and Sid-
ner, 1986; Mann and Thompson, 1987; Morris and
Hirst, 1991; Hahn, 1992].
Recognizing the coherent text structure is an
essential task in text understanding [Grosz and
Sidner, 1986; Mann and Thompson, 1987]. The specific meaning of a lexical item in a text, especially of a pronoun (e.g. she) or of a definite noun phrase (e.g. the box), can only be deter-
mined when placed in the whole structure of the
text. One needs to recognize the text structure,
for instance, in resolving anaphora, ellipsis, and
ambiguity.
The threads of the textual structure are called
cohesion or cohesive relations [Halliday and
Hasan, 1976]. Cohesive relations within a text are
relationships between items of any size, from sin-
gle words to lengthy passages, over gaps of any
distance. They are established where the inter-
pretation of some items in the text is dependent
on that of another. Let us consider the following
text.
Molly came to a theatre. But she
couldn't see Desmond. The film had
already started. She decided to wait
for the next one. For two hours!
Several types of cohesive factors can be seen in the text: conjunction (… But …), coreference (Molly = she), substitution (film = one), ellipsis (^ For two hours), and lexical cohesion (theatre = film).
Lexical cohesion is the aspect on which this thesis focuses. Lexical cohesion is the dependency relationship between words (or lexical items) based on associative relations in common knowledge. Lexical cohesion plays a dominant role in text structure, yet it has no clear computational definition. There have been several attempts to compute lexical cohesion, for example [Osgood, 1952; Morris and Hirst, 1991]. These attempts, however, face difficulties in managing the common knowledge objectively. (Details of lexical cohesion and the related work are described in Chapter 2.)
This thesis has two topics: (1) a proposal for
an objective and computational measurement of
lexical cohesion between words [Kozima and Furu-
gori, 1993a, 1993b, 1993c] (described in Chapter
3), and (2) its application to analysing the text
structure (described in Chapter 4), namely seg-
menting narratives into coherent scenes [Kozima,
1993; Kozima and Furugori, 1993d]. The latter,
text segmentation, is intended as the evaluation
of the proposed measurement. The rest of this chapter briefly outlines these two topics.
1.1 Computing Lexical Cohesion — An Outline
The �rst topic in this thesis is computing lexical
cohesion. Lexical cohesion is a relationship be-
tween words which makes the words signify iden-
tical or semantically related concepts in common
knowledge. From the viewpoint of recognizing it, lexical cohesion is classified into two major types: reiteration (or repetition) and semantic relations.
• Reiteration
  Molly likes cats very much.
  She keeps a cat in her room.
• Semantic relations
  Desmond saw a cat in the street.
  It was Molly's pet.
  Molly goes to the north.
  Desmond goes to the east.
  Desmond often goes to a theatre.
  He likes films very much.
Reiteration of words is easy to capture by morphological analysis. Recognizing semantic relations is difficult for computers, since it requires dealing with a large body of objective common knowledge.
The strategy of this thesis is to use an English
dictionary as the common knowledge for recog-
nizing lexical cohesion. A dictionary is the lexical
knowledge shared by people in a linguistic com-
munity. Each of its headwords is defined by a
phrase which is composed of the headwords and
their derivations. A dictionary is a closed para-
phrasing system, or a tangled network of words.
Lexical cohesiveness σ(w, w′) ∈ [0, 1], namely the strength of lexical cohesion between words w and w′, is computed on a semantic network which is systematically constructed from the English dictionary. Each node of the semantic network represents a headword of the dictionary and has links to other nodes — links to the words in the dictionary definition of the headword. As illustrated in Figure 1.1, spreading activation [Waltz and Pollack, 1985; Rumelhart et al., 1986] on the network computes lexical cohesion between any two words in the dictionary.

Figure 1.1 Computing the lexical cohesiveness between words w and w′ by spreading activation on the semantic network (activate w, then observe w′).

The following examples illustrate the behaviour of the lexical cohesiveness σ(w, w′).

    w    w′   σ(w, w′)
    cat  pet  0.133722  (cohesive)
    cat  mat  0.002692  (incohesive)

The value of σ(w, w′) increases with the strength or tightness of the semantic relation between w and w′.
1.2 Segmenting Narratives into Scenes — An Outline
The second topic in this thesis is text segmentation — segmenting a text into coherent units of the text structure. Analysing the coherent text
structure is the most important purpose of com-
puting the lexical cohesion between words. This
text-level evaluation will reveal the nature of the
measurement of lexical cohesion.
Most studies on text structure assume that a
text can be segmented into units that then form
a hierarchical structure [Grosz and Sidner, 1986;
Mann and Thompson, 1987]. It is also commonly agreed that each unit plays its own role (as introduction or conclusion, for instance) in the whole text. However, no clear account has been given of how to segment a text into such units computationally.
This thesis deals with scenes, namely contigu-
ous and non-overlapping units of a narrative text.
A scene, whether or not it is explicitly realized in a device like a paragraph, is defined as a sequence of sentences which displays local coherence. A scene describes, just as in a movie, certain objects (characters and properties) in a situation (time, place, and backgrounds). This suggests that a scene is the smallest domain in which text coherence can be defined.
Figure 1.2 Correlation between LCP (mutual lexical cohesiveness in the moving window) and a boundary of coherent scenes.
Lexical Cohesion Profile (LCP) is a quan-
titative indicator proposed here for marking scene
boundaries in narrative texts. LCP is a record of
mutual lexical cohesiveness of the words in a window (51 words long, for instance) that moves forward word by word over the text. Since a coher-
ent portion of a text tends to be lexically cohe-
sive [Halliday and Hasan, 1976; Morris and Hirst,
1991], the mutual lexical cohesiveness of a text portion suggests its local coherence.
A graph of LCP plots local coherence estimated
from the mutual lexical cohesion at every point
of a text. Hills and valleys of the graph indicate
alternations of scenes in the text, as illustrated in
Figure 1.2. Here lies the basic idea of LCP:
• When the window is inside a scene, the words in the window tend to be cohesive, making LCP high.
• When the window is crossing a scene boundary, the words in the window tend to vary lexically, making LCP low.
So, the valleys (or minimum points) of the LCP
can be considered as marking scene boundaries.
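For illustration, the windowing idea can be sketched in a few lines of Python. This is a minimal sketch, not the thesis implementation; it assumes a function cohesiveness(w1, w2) returning the lexical cohesiveness computed as in Chapter 3.

    # A minimal sketch of LCP, assuming cohesiveness(w1, w2) returns
    # the lexical cohesiveness sigma(w1, w2) computed as in Chapter 3.
    def lcp_profile(words, cohesiveness, window=51):
        """Record the mutual lexical cohesiveness of a moving window."""
        profile = []
        for i in range(len(words) - window + 1):
            w = words[i:i + window]
            # mutual cohesiveness: average over all ordered pairs in the window
            total = sum(cohesiveness(w[j], w[k])
                        for j in range(window) for k in range(window) if j != k)
            profile.append(total / (window * (window - 1)))
        return profile  # valleys (local minima) suggest scene boundaries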
Comparison with the scene boundaries, marked
by a number of subjects, shows that valleys of LCP
closely correlate with the dominant scene bound-
aries on which most subjects have agreed. This
also suggests the validity of the lexical cohesiveness, which is the most significant factor of scene coherence.
2 Related Work and
the Strategy of This Thesis
The necessity for recognizing coherent text struc-
ture has been noticed in recent studies of text
understanding. For example, Hobbs [1979] pro-
posed a set of coherence relations (e.g. elabo-
ration, parallel, and contrast) based on inferences
between successive portions of a text. Mann and
Thompson [1987] proposed rhetorical structure
theory which characterizes hierarchical structure
of a text in terms of unstated but inferred propo-
sitions (e.g. motivation, enablement, and solution-
hood) between clauses in the text.
Grosz and Sidner [1986] proposed discourse
structure theory, a general theory common to
all discourses. It assumes that a discourse structure is composed of three separate but interrelated components: (1) linguistic structure — segmentation of a discourse into segments, (2) intentional structure — the purpose of each segment with respect to the overall discourse, and (3) attentional state — a stack-based model of the topics to which participants of the discourse pay attention.
The discourse structure theory and other re-
lated studies presuppose that a text being anal-
ysed has already been partitioned into segments,
namely the linguistic structure where each seg-
ment displays local coherence and plays its own
role. While the need for text segmentation is generally agreed upon, there is little consensus on computational definitions of the local coherence of a segment or on how a text is partitioned into segments.
This chapter briefly reviews related work on co-
hesion, especially on lexical cohesion, and also
describes the strategy of this thesis for comput-
ing lexical cohesion. Section 2.1 makes clear the
nature of cohesion and the relationship between
lexical cohesion and common knowledge. Section
2.2 reviews two major approaches to computing
lexical cohesion: a psycholinguistic approach and
a thesaurus-based approach. Section 2.3 de-
scribes the strategy of this thesis: a dictionary-
based approach.
2.1 Cohesion and Lexical Cohesion
Cohesion is what makes a sequence of lexical
items into a coherent texture. Cohesive relations
are dependency relationships of interpretation of
the lexical items. This section briefly reviews the
function and structure of cohesion and also of lex-
ical cohesion.
2.1.1 Major Types of Cohesion
Several types of cohesive factors have been recog-
nized, as exemplified in the preceding chapter. Described below are five major types of cohesive fac-
tors [Halliday and Hasan, 1976], namely conjunc-
tion, coreference, substitution, ellipsis, and lexical
cohesion.
Conjunction covers a cohesive bond between
what has been said before and what is about to
be said, expressed by a conjunction (e.g. but) or
a conjunctive adverb (e.g. accordingly). For ex-
ample:
• Wash and core six apples.
  Then put them into a bowl.
• Molly came to a theatre.
  However, she couldn't see Desmond.
This cohesion type includes additive, adversative,
causal, and temporal relations between clauses.
Coreference is formed by features that cannot
be semantically interpreted without referring to
some other features in the text. For example:
• Molly came to a theatre.
  But the girl couldn't see Desmond.
• No one knows that.
  Desmond is getting married.
Two subtypes of coreference are recognized:
anaphora (in the first example) referring back-
ward, and cataphora (in the latter) referring for-
ward. Yet another subtype is exophora (or deixis)
referring to something out of the text (e.g. Look
at that.), whose interpretation requires broader
context.
Ellipsis and substitution are variants of the same type of cohesion; both require that the missing expression be grammatically appropriate for insertion in its place. Substitution serves as a place-holding device, showing where something has been omitted.
• The film had already started.
  She waited for the next one.
• Desmond will come here on time.
  I think so.
Ellipsis, by contrast, is the complete omission of an expression which can be recovered through syntactic or semantic expectations from the preceding or succeeding text:
• Desmond ordered apple juice,
  and Molly ^ orange juice.
• Put the apples into a bowl.
  Now add some sugar ^.
Lexical cohesion semantically relates a word
with another in the text; it is classified into two subtypes: reiteration and semantic relation. Re-
iteration is repetition of a word by the same word
or its derivations:
• I saw a cat in the street.
  But I hate cats.
• Driving a car is interesting.
  But I can't drive by myself.
Semantic relation between words is the semantic relationship between the concepts referred to by the words:
• A cat was running along the street.
  It was Molly's pet.
• Molly often goes to a theatre.
  She likes films very much.
Note that lexical cohesion occurs not only between
pairs of words but also over a succession of a num-
ber of related words and thus forms a lexical
chain (or a thread of texture) in a text.
2.1.2 Lexical Cohesion and
Common Knowledge
Lexical cohesion, especially semantic relation, between words (or lexical items) is the relationship between the concepts referred to by the words; this conceptual relationship lies in the common knowledge shared by people in a linguistic community. In view of the traditional frame-based knowledge representation [Minsky, 1975; Schank, 1980], semantic relation is classified into two categories: systematic semantic relation and non-systematic semantic relation [Morris and Hirst, 1991].
Systematic semantic relation is the seman-
tic relationship logically classifiable in the struc-
ture of common knowledge. For example:
• A cat was running along the street.
  It was Molly's pet.
• I saw a white cat.
  However, there were no black ones.
Such structural relationships can be analysed by the following logical relationships: synonymy of close similarity (e.g. hear = listen), hyponymy of general and specific (e.g. animal = cat), metonymy of whole and part (e.g. room = window), and antonymy of opposites (e.g. weak = strong).
Non-systematic semantic relation is the
other semantic relationship that is not logically classifiable in the knowledge structure. For example:
• Molly often goes to a theatre.
  She likes films very much.
• Desmond is working at the restaurant.
  He is a good waiter.
Such non-structural relationship includes colloca-
tion [Firth, 1957], i.e. the tendency of co-occurrence in similar situations.

Figure 2.1 An example of semantic differential: average ratings of the word polite on ten scales (angular–rounded, weak–strong, rough–smooth, active–passive, small–large, cold–hot, good–bad, tense–relaxed, wet–dry, fresh–stale).
Recent studies of knowledge representation and
parallel distributed processing [Minsky, 1986;
Waltz and Pollack, 1985; Rumelhart et al., 1986]
have claimed that the conceptual relations in com-
mon knowledge do not have such names as syn-
onymy, antonymy, etc. So, both categories of se-
mantic relation should be treated as unnamed as-
sociative relations between concepts in common
knowledge.
2.2 Two Approaches
to Lexical Cohesion
There have been two major approaches to com-
puting lexical cohesion, especially associative rela-
tions, between words. One is a psycholinguistic
approach, which plots differences and quantifies the psychological distance between words. The
other is a thesaurus-based approach that re-
gards thesauri as the common knowledge on which
lexical cohesion is de�ned.
2.2.1 Semantic Differential
Psycholinguists have proposed methods for mea-
suring associative relations between words. One
of the pioneering studies is semantic differential [Osgood, 1952], which analyses the meaning of a word into a range of different dimensions. Sub-
jects are asked to rate a word in terms of where
it would fall on the 50 dimensions with the op-
posed adjectives at both ends. For example, if the
subjects feel that the word polite is good, they
place a mark towards the `good' end in the `good-
or-bad' dimension. Figure 2.1 illustrates ten of
the dimensions, giving the average responses from
40 subjects to the word polite (after [Osgood,
1952]).
Figure 2.2 Examples of word analysis into microfeatures for the words hunting, gambling, and dollar: each word is marked for strong, mild, or negative association with microfeature dimensions such as time words (second, minute, hour, day, week, month, year, decade) and place words (school, restaurant, street, theatre, casino, lake, mountain, and so on).

Recent studies of knowledge representation, especially of distributed knowledge representation, are somewhat related to Osgood's semantic differential. Most of them describe meanings of words or sentences using special symbols
like semantic primitives (e.g. ATRANS and MBUILD,
in [Schank, 1980]) and microfeatures (e.g. animal
and plant, in [Waltz and Pollack, 1985; Hendler,
1989]) that correspond to the semantic dimen-
sions. Figure 2.2 illustrates analysis of the words
hunting, gambling, and dollar into the patterns
on microfeatures (after [Waltz and Pollack, 1985]).
The semantic differential procedure provides quantitative data which is presumably verifiable. However, the following problems arise when the semantic differential procedure is used as a measurement of word meaning and word association.
• Connotation vs denotation
  The procedure is not based on the denotative meaning of words, but only on connotative emotions attached to the words.
• Coverage of meaning
  It is difficult to choose the relevant dimensions required for a semantic space sufficient to analyse any English word. The procedure selects the representative dimensions in terms of the frequency of their use rather than in terms of their logically exhaustive coverage, as given in thesauri.
For example, the procedure will draw out the very good and slightly strong connotations of the word mother, but it will not indicate the definition of mother: a female parent of a child or animal.
2.2.2 Thesaurus-based Analysis
A thesaurus is a book which classifies a large number of words into categories according to the logical relations between their meanings, rather than arranging them in alphabetical order. Roget's thesaurus [1911] is composed of 1000 basic categories; each category, as shown in Figure 2.3, contains a series of paragraphs grouping closely related words. Within each paragraph, still finer groups are marked by semicolons; in addition, a semicolon group may have pointers, shown as `&c. …', to other related categories or paragraphs.

    Word (#562)
    N. word, term, vocable; name &c. 564;
    phrase &c. 566; root, etymon; derivative; part
    of speech &c. (grammar) 567; ideophone.
      dictionary, vocabulary, lexicon, index,
    glossary, thesaurus, gradus, delectus, concordance.
      etymology, derivation; glossology, terminology,
    orismology; paleology &c. (philology) 560.
      lexicography; glossography &c. (scholar) 492;
    lexicologist, verbarian.
    …

Figure 2.3 A sample category in Roget's thesaurus [1911].
A thesaurus has an index, which allows for re-
trieval of categories related to a given word. For
example, the word dictionary has the following
index entry:
dictionary : List (#86), School (#542),
Word (#562)
which indicates that each of the categories List,
School, and Word includes the word dictionary.
(See also Figure 2.3.)
Morris and Hirst [1991] used Roget's thesaurus
as the common knowledge for determining
whether or not two words are associatively related.
Their method captures several types of thesaural
relations between words. Two major types are de-
scribed as follows:
• car ∈ Vehicle (#272) ∋ truck
  (The two words have a category in common in their index entries.)
• drive ∈ Journey (#266) → Vehicle (#272) ∋ car
  (A category of one word contains a pointer to a category of the other word.)
Note that the examples above are computed on
the machine-readable version of Roget's thesaurus
[1911], not on the printed version used in [Morris
and Hirst, 1991].
The thesaurus-based method is quite objective
and computationally feasible, since it regards the
thesaurus as the common knowledge shared by
people. The method can capture almost all types
of semantic relations between words. For ex-
ample, in systematic semantic relations, polite
= courteous (synonymy), plant = flower (hy-
ponymy), hand = finger (metonymy), and good
= bad (antonymy), and in non-systematic seman-
tic relations, post = letter and drink = coffee.
However, thesauri are designed to help writers find the words that best express their ideas, not to provide the meanings of words. The nature of thesauri poses the following problems: (1) thesauri do not provide information about the semantic difference between words juxtaposed in a category,
and (2) thesaural relations indicate only whether
or not two words are semantically related, not the
strength of the semantic relations. These points
are crucial for computing lexical cohesion. The fol-
lowing section will provide preliminary solutions
to these problems.
2.3 Dictionary-based Analysis — The Strategy of This Thesis
A method for computing lexical cohesion between words as an indicator of text coherence must satisfy the following requirements, recognized through the discussion of related studies.
• Denotation
  The denotational meaning of words, not the connotational or emotional meaning, should be measured.
• Coverage and sensitivity
  The semantic difference between any two words should be computable, regardless of their categories in a thesaurus.
• Scalability
  The strength of lexical cohesion, not only its existence (all-or-nothing), should be computable.
This section outlines the strategy of this thesis
for coping with these requirements. In short,
the strategy is to use a dictionary as the com-
mon knowledge in which lexical cohesion between
words is defined.
2.3.1 Dictionaries and
Common Knowledge
Recent studies of knowledge representation describe the meaning of texts in terms of artificial symbols like semantic primitives [Schank, 1980] or microfeatures [Waltz and Pollack, 1985], as we have seen in the preceding section. However, Hjelmslev [1943], the leading theoretician of the Copenhagen School of linguistics, claimed a theoretical limitation of artificial languages:

• Any text in any natural language can be described not by subjective artificial languages, but only by the natural language itself. On the other hand, any text in an artificial language can be translated into a natural language.
• Artificial languages are meta-languages dependent on the knowledge system of a natural language. A natural language, in contrast, is dependent only on the knowledge system of the natural language itself.
Each natural language (e.g. English or Japanese) works as a self-contained and self-sufficient device for describing the meaning of texts written in any language. Any natural language is a system of signs which can articulate the real world entirely; no other system of signs can. Any knowledge or ideas for certain purposes can be represented by texts written in a natural language, as opposed to artificial languages. Therefore, the common knowledge for text understanding can be represented by, and only by, texts in a natural language. One form of such texts is
a dictionary, which provides the knowledge of
words shared in the minds of individuals. One
may draw a distinction between the knowledge of
a natural language and the knowledge of the real
world. However, they are not ultimately separa-
ble, just as dictionaries and encyclopedias are not
separable.
A dictionary is a reference book that lists words,
usually in alphabetical order, along with infor-
mation about their spelling, pronunciation, gram-
matical status, meaning, and use. A mono-
lingual dictionary can be considered as a para-
phrasing system of a natural language. Each of
its headwords is paraphrased by a phrase which
is composed of its headwords and their deriva-
tions. So, a dictionary is a self-contained and self-sufficient system in which every element is defined in terms of the relationships with other elements.
In view of structural linguistics and semiology
[Saussure, 1916; Sapir, 1921; Hjelmslev, 1943], any
language is characterized as a system based en-
tirely on the associative relations (or paradig-
matic relations) between signs (i.e. words or lexical
items). In other words, the meaning of a sign is defined only by the associative relations with other signs in the system, without being dependent on
entities in the real world. A mono-lingual dictio-
nary is an example of such closed systems of
signs. Viewed as a whole, it looks like a cross-
reference network of words.
2.3.2 Semantic Differential on a Dictionary
The strategy of this thesis for computing lexical cohesion is semantic differential on a dictionary (hereafter, SDD), which analyses the meaning of a word into the strengths of associative relations with the headwords of a mono-lingual dictionary. SDD is somewhat similar to Osgood's semantic differential [Osgood, 1952]. However, it differs from his method in the following points.
• Source of linguistic data
  In SDD, the dictionary works both as a semantic space and as the source of linguistic data for semantic differential. Osgood, by contrast, obtained his linguistic data from psychological experiments on native speakers of the language (i.e. informants).
• Semantic dimensions
  SDD uses the headwords of the dictionary as semantic dimensions. Osgood used 50 dimensions (with pairs of opposed adjectives) in his semantic differential procedure, while SDD uses all headwords as the semantic dimensions into which the meaning of a word is analysed.

These points guarantee the objectivity and completeness of the semantic space of SDD as a field for analysing the meanings of words.
SDD satisfies the requirements for computing lexical cohesion that are described at the beginning of this section. The first and the second requirements can obviously be handled as follows:
• Denotation
  SDD deals with the denotational meaning of words as described in the dictionary definitions of the words, not the connotations attached to them. Dictionary definitions are the common lexical knowledge shared by the people.
• Coverage and sensitivity
  SDD maps each word in the dictionary onto a point in the semantic space spanned by the dimensions of all headwords in the dictionary. Different words are mapped onto different points; two words are mapped onto the same point only if their definitions are identical.
The third requirement, the scale or strength of associative relations, can be treated in the following manner:

• Scalability
  In SDD, each dimension of the semantic space is a continuous scale (for instance, the interval [0, 1] of real numbers), not a discrete scale (for instance, all-or-nothing).

Each dimension represents the strength of the associative relation between the word w being analysed and the headword w′ of the dimension.
SDD thus analyses the meaning of a word w into an N-dimensional vector of continuous scales, where N is the number of headwords in the dictionary. The semantic vector represents the strengths of the associative relations between w and the headwords in the dictionary; in other words, the semantic vector represents the meaning of w. The following chapter describes the method for computing the semantic vector of a given word and the method for computing the strength of lexical cohesion between words on a continuous scale.
3 Computing
Lexical Cohesion
A computational method for measuring lexical co-
hesiveness [Kozima and Furugori, 1993a, 1993b,
1993c] is described in this chapter. The lexical co-
hesiveness is computed on a semantic network,
called Paradigme, which is systematically con-
structed from a subset of the English dictio-
nary: Longman Dictionary of Contemporary En-
glish (hereafter, LDOCE). Section 3.1 describes
how the network Paradigme is constructed from
LDOCE.
Spreading activation [Waltz and Pollack,
1985; Rumelhart et al., 1986] on the network can
compute the lexical cohesiveness between any two
words in LDOCE — directly 2,851 core words and
their derivations, and indirectly all the other head-
words of LDOCE and their derivations. Section
3.2 describes how to compute the lexical cohesive-
ness on Paradigme. As an application, Section
3.3 describes a measurement of lexical cohesive-
ness between texts.
The lexical cohesiveness σ(w, w′) ∈ [0, 1] between words w and w′ is an objective and computationally feasible measurement of lexical cohesion. Section 3.4 discusses the nature and the limits of Paradigme and of the lexical cohesiveness computed on it. Finally, Section 3.5 gives a brief conclusion of this chapter.

red¹ /red/ adj -dd- 1 of the colour of blood or fire: a red rose/dress | We painted the door red. — see also like a red rag to a bull (rag¹) 2 (of human hair) of a bright brownish orange or copper colour 3 (of the human skin) pink, usu. for a short time: I turned red with embarrassment/anger. | The child's eyes (= the skin round the eyes) were red from crying. 4 (of wine) of a dark pink to dark purple colour — ~ness n [U]

    (red adj                  ; headword, word-class
     ((of the colour)         ; unit 1 - head-part
      (of blood or fire) )    ;          rest-part
     ((of a bright brownish orange
       or copper colour )
      (of human hair) )
     (pink                    ; unit 3 - head-part
      (usu for a short time)  ;          rest-part 1
      (of the human skin) )   ;          rest-part 2
     ((of a dark pink to dark purple colour)
      (of wine) ))

Figure 3.1 A sample entry (of red/adjective) of LDOCE and the corresponding entry of Glossème (in S-expression).
3.1 Paradigme: A Field for Measuring Lexical Cohesion

The semantic network Paradigme is a field for measuring the lexical cohesiveness. It provides a semantic space in which the meaning of a word is analysed. Paradigme is systematically constructed from a small English dictionary, called Glossème, that is a subset of LDOCE.
3.1.1 Glossème: A Closed Subsystem of English

LDOCE is an English dictionary with a unique feature — each of its 56,000 headwords is defined by using the words in the Longman Defining Vocabulary (hereafter, LDV) and their derivations. The use of LDV by the lexicographers is restricted in that only the most frequent senses of words, self-explanatory compounds, and phrasal verbs are permitted [LDOCE, 1987; Carter and McCarthy, 1988].
LDV consists of 2,851 words (counted as headwords in LDOCE, distinguishing homographs like red = adjective and red = noun) and 48 affixes (10 prefixes and 38 suffixes) that make derivations from the 2,851 core words. LDV is originally based on a survey of word frequency and restricted vocabulary for English language teaching [West, 1953], and has been updated by Longman with reference to more recent frequency information [LDOCE, 1987].
Glossème is a reduced version of LDOCE. It consists of every entry of LDOCE whose headword is included in LDV, so each word in LDV is defined by Glossème. Obviously, all words in Glossème (its headwords and the words in their definitions) are included in LDV and its derivations. It is worth noting that Glossème is a closed subsystem of English: each of its headwords is paraphrased into a phrase which is composed of the headwords and their derivations.
Glossème has 2,851 entries (the same size as that of LDV) that consist of 101,861 words (35.73 words/entry on average). As shown in Figure 3.1, an entry of Glossème has a headword, a word-class, and one or more units corresponding to the numbered definitions in the entry of LDOCE. Note that Glossème is described in S-expression notation.
Each unit has one head-part and several rest-parts. For example, the first unit in the entry red = adjective of LDOCE:

    1 of the colour of blood or fire

is converted into the following unit. (This conversion is partly done by hand.)

    ((of the colour)
     (of blood or fire) )

A head-part (e.g. (of the colour)), which corresponds to the first phrase in a definition, provides the broader meaning of the headword; rest-parts (e.g. (of blood or fire)), which correspond to the succeeding subordinates, restrict the meaning of the head-part to a specific one for the headword.
The structure of a unit is based on the structure of definitions in the dictionary: (1) a definition first provides the broader meaning of the headword, (2) then imposes several restrictions on the meaning. The following schemes illustrate the major types of the structure of dictionary definitions.

    noun      = noun-phrase + adjectival-phrase/clause ...
    verb      = verb-phrase + adverbial-phrase/clause ...
    adjective = adjectival-phrase + adverbial-phrase/clause ...
(red_1 (adj) 0.000000           ;; headword, word-class, and activity-value
 ;; référant
 (+
  ;; subréférant 1
  (0.333333                     ;; weight of subréférant 1
   (* (0.001594 of_1) (0.001733 the_1) (0.001733 the_2) (0.042108 colour_1)
      (0.042108 colour_2) (0.000797 of_1) (0.539281 blood_1) (0.000529 or_1)
      (0.185058 fire_1) (0.185058 fire_2) ))
  ;; subréférant 2
  (0.277778
   (* (0.000278 of_1) (0.000196 a_1) (0.030997 bright_1) (0.065587 brown_1)
      (0.466411 orange_1) (0.000184 or_1) (0.385443 copper_1) (0.007330 colour_1)
      (0.007330 colour_2) (0.000139 of_1) (0.009868 human_1) (0.009868 human_2)
      (0.016372 hair_1) ))
  ;; subréférant 3
  (0.222222
   (* (0.410692 pink_1) (0.410692 pink_2) (0.003210 for_1) (0.000386 a_1)
      (0.028846 short_1) (0.006263 time_1) (0.000547 of_1) (0.000595 the_1)
      (0.000595 the_2) (0.038896 human_1) (0.038896 human_2) (0.060383 skin_1) ))
  ;; subréférant 4
  (0.166667
   (* (0.000328 of_1) (0.000232 a_1) (0.028368 dark_1) (0.028368 dark_2)
      (0.123290 pink_1) (0.123290 pink_2) (0.000273 to_1) (0.000273 to_2)
      (0.000273 to_3) (0.028368 dark_1) (0.028368 dark_2) (0.141273 purple_1)
      (0.141273 purple_2) (0.008673 colour_1) (0.008673 colour_2) (0.000164 of_1)
      (0.338512 wine_1) )))
 ;; référé
 (* (0.031058 apple_1) (0.029261 blood_1) (0.008678 colour_1) (0.009256 comb_1)
    (0.029140 copper_1) (0.009537 diamond_1) (0.003015 fire_1) (0.073762 flame_1)
    (0.005464 fox_1) (0.005152 heart_1) (0.098349 lake_2) (0.007025 lip_1)
    (0.029140 orange_1) (0.007714 pepper_1) (0.196698 pink_1) (0.012294 pink_2)
    (0.098349 pink_2) (0.018733 purple_2) (0.028100 purple_2) (0.098349 red_2)
    (0.196698 red_2) (0.004230 signal_1) ))

Figure 3.2 A sample node of Paradigme (in S-expression).
The markers in bold-face indicate the head-parts of the definitions; the other markers indicate rest-parts. (See [Markowitz, 1986; Alshawi, 1987; Nakamura and Nagao, 1988] for details of the structure of dictionary definitions.)
3.1.2 Paradigme: A Semantic Network
The closed sub-dictionary Glossème is then translated into a semantic network Paradigme. Each entry of Glossème is mapped onto a node in Paradigme. Paradigme has 2,851 nodes (the same size as that of Glossème) which include 295,914 unnamed links between the nodes (103.79 links/node on average). Figure 3.2 shows a sample node red_1 (corresponding to the entry of Glossème shown in Figure 3.1). Each node consists of a headword, a word-class, an activity-value, and two structures: a référant and a référé.
A référant provides information about the intension (i.e. definition) of the headword. It consists of several subréférants, each containing a set of links that is mapped from the corresponding unit in the entry of Glossème. For example, the second unit in the entry red = adjective:

    ((of a bright brownish orange
      or copper colour )
     (of human hair) )

is mapped onto the following subréférant.
(0.277778
(* (0.000278 of_1) (0.000196 a_1)
(0.030997 bright_1) (0.065587 brown_1)
(0.466411 orange_1) (0.000184 or_1)
(0.385443 copper_1) (0.007330 colour_1)
(0.007330 colour_2) (0.000139 of_1)
(0.009868 human_1) (0.009868 human_2)
(0.016372 hair_1) ))
Each subréférant has a weight, e.g. 0.333333 or 0.277778, which is computed from its position in the sequence of units arranged in order of their significance.
A morphological analysis of the affixes defined in LDV maps all the derivations of LDV onto their root forms (i.e. the headwords of the nodes in Paradigme). For example, the word brownish in the unit shown above is mapped onto the link to brown_1, and the word colour onto two links, to colour_1 = adjective and colour_2 = noun. So, a word can be identified with the corresponding node or nodes, and vice versa.
Figure 3.3 Computing the lexical cohesiveness σ(w, w′) on Paradigme: (1) start activating w, (2) produce a pattern, (3) observe the activity of w′.

Each link in a subréférant, e.g. (0.065587 brown_1), consists of a weight and the headword of the node to which the link refers. A weight h_k ∈ [0, 1] of a link to a node w_k is computed from the frequency of the word w_k in Glossème and other information (such as whether the word is in a head-part or a rest-part), and is normalized so that Σ_k h_k = 1 within each subréférant.
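The normalization can be sketched as follows; the raw weights (computed from word frequency and head-/rest-part information, whose exact form is not given here) are taken as input:

    # A sketch of link-weight normalization within one subréférant, so
    # that the weights satisfy sum_k h_k = 1. The raw weights stand in
    # for the frequency-based computation described above.
    def normalize_links(links):
        # links: list of (raw_weight, headword) pairs of one subréférant
        total = sum(raw for raw, _ in links)
        return [(raw / total, node) for raw, node in links]

    # normalize_links([(2.0, "blood_1"), (1.0, "fire_1"), (1.0, "fire_2")])
    # -> [(0.5, "blood_1"), (0.25, "fire_1"), (0.25, "fire_2")]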
A référé, by contrast, provides information about the extension (i.e. examples) of the headword; it is the converse of the référant, which shows the intension. The référé of a node w has links to the nodes referring to w. For example, the référé of red_1 (shown in Figure 3.2) describes examples of red things. The link to apple_1 in the référé of red_1 means that apple_1 has a link to red_1 in its référant; in other words, the entry apple in Glossème contains the word red. Each link in a référé, e.g. (0.031058 apple_1), also has a weight, which is computed from the weight of the corresponding link (e.g. that from apple_1 to red_1).

Details of the structure of Paradigme, especially of the translation procedure from Glossème, are described in Appendix A.
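The node layout of Figure 3.2 can be mirrored by a small data type. The following Python sketch uses names of my own choosing; the thesis programs themselves were written in C and Common Lisp:

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Link:
        weight: float          # h_k, normalized within its subréférant
        node: str              # headword of the referred node, e.g. "brown_1"

    @dataclass
    class Node:
        headword: str          # e.g. "red_1"
        word_class: str        # e.g. "adj"
        activity: float = 0.0  # activity value v_i(T)
        # référant: one (weight, links) pair per unit of the entry
        referant: List[Tuple[float, List[Link]]] = field(default_factory=list)
        # référé: links from the nodes whose definitions mention this headword
        refere: List[Link] = field(default_factory=list)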
3.2 Computing Lexical Cohesion
between Words
The lexical cohesiveness between words is computed by spreading activation on the semantic network Paradigme. At each point T in time, each node w_i in Paradigme has an activity value v_i(T). The activity value can be seen as passing through a set of uni-directional links to other nodes in Paradigme. The weight of each link determines the amount of effect that the referred node has on the referring node. Note that the weights of the links are fixed for all time, since this thesis does not deal with learning or evolution of the network.
Each node w_i computes its activity value v_i(T) at every point T (of discrete steps) in time. The spreading activation rule is given by

    v_i(T) = φ(R_i(T−1), R′_i(T−1), e_i(T−1)),

where R_i(T) is the sum of the weighted activity values (at time T) of the nodes referred to in the référant, and R′_i(T) is the sum of those referred to in the référé. And e_i(T) is the activity value given to the node w_i from outside (at time T); to activate a node is to let e_i(T) > 0. The function φ sums up the three activity values in appropriate proportion and limits the output value to [0, 1]. Appendix B describes the spreading activation rule in detail.
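The shape of one synchronous update can be sketched over the node type above. The mixing proportions and the limiting function below are placeholders, not the actual rule of Appendix B:

    # One synchronous spreading-activation step (a sketch; the real
    # proportions and limiting function are defined in Appendix B).
    def phi(r, r_prime, e, c1=0.5, c2=0.3, c3=0.2):
        v = c1 * r + c2 * r_prime + c3 * e   # sum in some fixed proportion
        return max(0.0, min(1.0, v))         # limit the output to [0, 1]

    def step(nodes, external):
        # nodes: dict headword -> Node; external: dict headword -> e_i(T)
        new = {}
        for name, node in nodes.items():
            # R_i: weighted activity of nodes referred to in the référant
            r = sum(sub_w * link.weight * nodes[link.node].activity
                    for sub_w, links in node.referant for link in links)
            # R'_i: weighted activity of nodes referred to in the référé
            r_prime = sum(link.weight * nodes[link.node].activity
                          for link in node.refere)
            new[name] = phi(r, r_prime, external.get(name, 0.0))
        for name, v in new.items():          # synchronous update
            nodes[name].activity = v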
3.2.1 Computing the Lexical Cohesiveness
The lexical cohesiveness σ(w, w′) between words w and w′ is computed by spreading activation on the semantic network Paradigme. As illustrated in Figure 3.3, the computing procedure (1) activates the node w, (2) produces an activated pattern on Paradigme, and (3) observes the activity value of the node w′, which indicates the strength of association from w to w′.
Activating a node w for a certain period of time causes the activity to spread over Paradigme and produces an activated pattern on it. Figure 3.4 shows an activated pattern produced from the word red. The graph plots the activity values of the 10 dominant nodes at each step of time. I empirically found that the activated pattern approximately reaches equilibrium after 10 steps, though it never reaches exact equilibrium. The activated pattern thus produced can be considered as a 2,851-dimensional vector. Each of its dimensions, i.e. the activity value of one node, represents the strength of association with the node w.
The procedure for computing the lexical cohesiveness σ(w, w′) ∈ [0, 1] between words w and w′ is as follows.

1. Activate the node w with strength s(w) for 10 steps of time, where s(w) is the significance of w (defined below).
2. Then (at T = 10), an activated pattern P(w) is produced on Paradigme, as shown in Figure 3.4.
3. Observe a(P(w), w′), the activity value of the node w′ in P(w). Finally, the lexical cohesiveness σ(w, w′) is given by s(w′)·a(P(w), w′).

Figure 3.4 An activated pattern produced from the word red. (Changes in the activity values of the 10 nodes holding the highest activity at T = 10: red_2, red_1, orange_1, pink_1, pink_2, blood_1, copper_1, purple_1, purple_2, rose_2.)

Note that each node has no activity at the begin-
ning of this procedure, and that a word and the corresponding node or nodes can be identified with the help of the morphological analysis (cf. Section 3.1).
The significance s(w) ∈ [0, 1] is defined as the normalized information of the word w in West's corpus [West, 1953]. For example, the word red appears 2,308 times in the 5,487,056-word corpus, and the word and 106,064 times. So, s(red) and s(and) are computed as follows:

    s(red) = −log(2308/5487056) / −log(1/5487056) ≈ 0.500955,
    s(and) = −log(106064/5487056) / −log(1/5487056) ≈ 0.254294.

Note that the estimation of the words excluded from West's word list [West, 1953] virtually enlarges the original 5,000,000-word corpus. The frequency of each extra word (9.65% of LDV) is estimated as the average frequency of its word class.
For example, let us consider the lexical cohesiveness between red and orange. First, we produce an activated pattern P(red) on Paradigme (as shown in Figure 3.4). In this case, both of the nodes red_1 = adjective and red_2 = noun are activated with strength s(red) = 0.500955. Then, we compute s(orange) = 0.676253, and observe a(P(red), orange) = 0.390774. Finally, we obtain the lexical cohesiveness σ(red, orange) as follows.

    σ(red, orange) = s(orange) · a(P(red), orange)
                   = 0.676253 × 0.390774
                   = 0.264262.

Note that the fractions are rounded off to six decimal places.
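The worked example can be reproduced in code; the corpus counts are those given above, and the activity value a(P(red), orange) is taken from the pattern of Figure 3.4 rather than recomputed:

    import math

    CORPUS = 5_487_056   # size of West's corpus (after estimation)

    def significance(freq, corpus=CORPUS):
        # normalized information of a word: -log(f/N) / -log(1/N)
        return math.log(freq / corpus) / math.log(1 / corpus)

    s_red = significance(2308)      # ~ 0.500955
    s_and = significance(106064)    # ~ 0.254294

    s_orange = 0.676253             # s(orange), from the text
    a_obs = 0.390774                # a(P(red), orange), observed on Paradigme
    sigma = s_orange * a_obs        # sigma(red, orange) ~ 0.264262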
3.2.2 Examples of the Computation
The procedure described above can compute the lexical cohesiveness σ(w, w′) ∈ [0, 1] between any two words w and w′ in LDV and its derivations. Computer programs implementing the procedures — spreading activation (written in the C programming language and compiled on SunOS 4.1.3), and morphological analysis and others (written in Common Lisp and executed on KCL) — can compute σ(w, w′) within 2.5 seconds on a workstation (SPARCstation 2, SunOS 4.1.3). Note that most of the time is used for spreading activation.
The lexical cohesiveness σ(w, w′) increases with the strength of the systematic semantic relation between the words w and w′, as shown in the following examples.

    w      w′       σ(w, w′)
    wine   alcohol  0.118078
    wine   line     0.002040
    big    large    0.120587
    clean  large    0.004943
    buy    sell     0.135686
    buy    walk     0.007993
Also, the lexical cohesiveness σ increases with the strength of the non-systematic semantic relation between words, as shown in the following examples.

    w         w′          σ(w, w′)
    waiter    restaurant  0.175699
    computer  restaurant  0.003268
    red       blood       0.111443
    green     blood       0.002268
    dig       spade       0.116200
    fly       spade       0.003431
Note that σ(w, w′) has direction (from w to w′), so that σ(w, w′) may not be equal to σ(w′, w). For example:

    w       w′      σ(w, w′)
    cow     cattle  0.303977
    cattle  cow     0.379470
The lexical cohesiveness σ(w, w′) increases with the significance values s(w) and s(w′), which represent the meaningfulness of w and w′. The reason is that σ suggests the strength of the associative relation between words, so meaningful words should have higher lexical cohesiveness, while meaningless words (especially function words) should have lower values. For example:
    w      w′       σ(w, w′)
    north  east     0.100482
    to     theatre  0.007259
    films  of       0.005914
    to     the      0.002240

Figure 3.5 Computing lexical cohesiveness of extra words. (An extra word is treated as a list of the words in its definition.)

Figure 3.6 A pattern produced from the word list {red, alcoholic, drink}. (Changes in the activity values of the 10 nodes holding the highest activity at T = 10: alcohol_1, drink_1, red_2, drink_2, red_1, bottle_1, wine_1, poison_1, swallow_1, spirit_1.)
Also, the reflective lexical cohesiveness σ(w, w), i.e. the lexical cohesiveness of a word with itself, depends on the significance s(w), so that σ(w, w) ≤ 1. For example:

    w       w′      σ(w, w′)
    waiter  waiter  0.596803
    of      of      0.045256
3.2.3 Lexical Cohesiveness of Extra Words
The lexical cohesiveness of words in LDV and its derivations is directly computed on Paradigme, as we have seen above; the lexical cohesiveness of extra words (i.e. those excluded from LDV) is indirectly computed by treating an extra word as the list of the words in its LDOCE definition, as illustrated in Figure 3.5. Note that each word in the definition is included in LDV or its derivations.
The lexical cohesiveness between two word lists, W = {w_1, …, w_n} and W′ = {w′_1, …, w′_m}, is defined as follows:

    σ(W, W′) = Ψ( Σ_{w′ ∈ W′} s(w′) · a(P(W), w′) ),

where P(W) is an activated pattern produced by activating each word w_i in W with strength

    s(w_i)² / Σ_k s(w_k),

and Ψ is a function which limits the output value to [0, 1].
Figure 3.6 illustrates the activated pattern P(W) produced from the word list W = {red, alcoholic, drink}. It is worth noting that the nodes bottle_1 and wine_1 are highly activated in the pattern P(W), whereas those nodes never get such high activity in any pattern produced from a single word in W. So, we may say that the overlapped pattern implies a bottle of wine.
For example, the lexical cohesiveness between linguistics and stylistics — both extra words — is computed as follows.

    σ(linguistics, stylistics)
      = σ({ the, study, of, language, in, general,
            and, of, particular, languages, and,
            their, structure, and, grammar, and,
            history },
          { the, study, of, style, in, written,
            or, spoken, language })
      = 0.140089.
Obviously, both σ(w, W) and σ(W, w), where w is included in LDV or its derivations and W is not, are also computable in the same scheme (by replacing w with the word list {w}). Therefore, we can compute the lexical cohesiveness between any two headwords in LDOCE and their derivations.
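The scheme of this section can be sketched as follows, assuming an activate(strengths, steps) helper that runs the update step shown earlier and returns the pattern P(W) as a mapping from nodes to activity values, and a significance(w) lookup giving s(w) for a word; min(1.0, …) stands in for the limiting function:

    def list_cohesiveness(W, W_prime, activate, significance):
        s = {w: significance(w) for w in W}
        total = sum(s.values())
        # each w_i in W is activated with strength s(w_i)^2 / sum_k s(w_k)
        strengths = {w: s[w] ** 2 / total for w in W}
        P = activate(strengths, steps=10)   # activated pattern P(W)
        value = sum(significance(w2) * P.get(w2, 0.0) for w2 in W_prime)
        return min(1.0, value)              # limit the output to [0, 1]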
3.3 Computing Lexical Cohesion
between Texts
This section describes an application of the lexical cohesiveness between words, namely computing the lexical cohesiveness between texts. Let us assume that a text is a simple word list without any syntactic structure or punctuation. Then, the lexical cohesiveness σ(X, X′) between two texts X = {w_1, …, w_n} and X′ = {w′_1, …, w′_m} can be computed as follows. (See also Figure 3.7.)

    σ(X, X′) = Ψ( Σ_{w′ ∈ X′} s(w′) · a(P(X), w′) ).

Figure 3.7 Computing lexical cohesiveness between texts. (An overlapped pattern makes implicit inferences.)
The lexical cohesiveness between texts is com-
puted in the very same way as the lexical cohe-
siveness of extra words described above.
3.3.1 Text Cohesiveness and
Implicit Inferences
The lexical cohesiveness between portions of a text or discourse suggests their coherence — how naturally and reasonably they are connected. The following examples suggest that σ(X, X′) indicates the strength of the coherence relation between text portions X and X′.
    X                    X′                     σ(X, X′)
    "I have a hammer."   "Take some nails."     0.100611
    "I have a hammer."   "Take some apples."    0.005295
    "I have a pen."      "Where is ink?"        0.113140
    "I have a pen."      "Where do you live?"   0.007676
Note that the lexical cohesiveness between texts has direction, so that σ(X, X′) may not be equal to σ(X′, X); and the reflective lexical cohesiveness σ(X, X) must be less than 1. Compare the following examples with the ones above.
    X                     X′                    σ(X, X′)
    "Where is ink?"       "I have a pen."       0.103681
    "Take some apples."   "Take some apples."   0.434443
The directly activated nodes interact with each other and produce an overlapped pattern which includes other nodes indirectly associated or implicitly inferred. For example, the phrase "red alcoholic drink" (whose activated pattern is shown in Figure 3.6) has strong coherence with "a bottle of wine" and weak coherence with "fresh orange juice", as follows.
    X                        X′                         σ(X, X′)
    "Red alcoholic drink."   "A bottle of wine."        0.280683
    "Red alcoholic drink."   "Fresh orange juice."      0.096469
    "Red alcoholic drink."   "An English dictionary."   0.008166
3.3.2 Text Cohesiveness and
Word Significance
The lexical cohesiveness between texts reflects the significance of the words in the texts. Each word in a text has its own weight for activation and observation. As a result, meaningless iteration of words (especially of function words) has little influence on the lexical cohesiveness between texts.
Let us consider the following examples of the lexical cohesiveness between sentences:

    X                X′                         σ(X, X′)
    "It is a dog."   "That must be your dog."   0.252536
    "It is a dog."   "It is a log."             0.053261

where the significance s of the words in the examples above is as follows.

    w     s(w)        w     s(w)
    it    0.280136    that  0.253374
    is    0.297779    must  0.421726
    a     0.274085    be    0.297779
    dog   0.589734    your  0.382722
                      dog   0.589734

    w     s(w)        w     s(w)
    it    0.280136    it    0.280136
    is    0.297779    is    0.297779
    a     0.274085    a     0.274085
    dog   0.589734    log   0.621410
The sentences in the first pair have only one word (namely, dog) in common; those in the latter have three words (namely, it, is, and a) in common. However, the significant words, or focuses, of the sentences (shown in bold-face) play the dominant role in computing the lexical cohesiveness between them, so that the lexical cohesiveness between sentences reflects the semantic coherence between them.
3.4 Discussion
The lexical cohesiveness computed on Paradigme works as an indicator of lexical cohesion between words and also between texts, as we have seen in Sections 3.2 and 3.3. This section discusses the nature of Paradigme, the limits of the lexical cohesiveness computed on it, and possible applications of the lexical cohesiveness.
3.4.1 Paradigme and the Semantic Space
Paradigme works as a field for semantic differential of a word or a set of words. The set of activity values of the nodes in Paradigme spans a 2,851-dimensional semantic space, or a 2,851-dimensional hypercube, where an activated pattern is represented as a point. Each edge of the hypercube corresponds to a word in the defining vocabulary LDV.
LDV is originally based on the survey of word
frequency [West, 1953]. The frequency is a count
of the occurrence of words in the 5,000,000-word
corpus of written English, and has been updated
by Longman with reference to more recent fre-
quency information [LDOCE, 1987]. This crite-
rion implies objectivity of LDV. Also the follow-
ing criteria provide the basis for the selection of
LDV.
• Necessity
  An indispensable word, which alone covers a certain range of meaning, should be adopted regardless of its frequency.
• Efficiency
  The semantic range of a word should be as wide as possible, so as to reduce the cost of learning. This criterion is the converse of the first one.

These criteria imply completeness of LDV — a potential for covering all the concepts commonly found in the world.
The objectivity and completeness of LDV as the defining vocabulary suggest the sufficiency of the semantic space. Osgood [1952] used 50 dimensions in his semantic differential procedure; SDD (semantic differential on a dictionary) uses 2,851 dimensions with objectivity and completeness. Obviously, SDD can be applied to construct a semantic network from an ordinary dictionary whose defining vocabulary is not restricted. However, such a network is too large for computing spreading activation on ordinary sequential computers. Paradigme is a small but objective and complete network for analysing the meaning of words.
The lexical cohesiveness computed by SDD is not a distance or closeness between two activated patterns in the semantic space. Osgood [1952] measured similarity between words in terms of the distance between two vectors in the 50-dimensional semantic space. In SDD, the lexical cohesiveness between words w and w′ is calculated from a(P(w), w′) — the activity value of the word w′ in the activated pattern P(w) produced from the word w. The reason is that the activated pattern P(w) directly represents the associative relations from w to the other words in LDV, i.e. the definition of lexical cohesion.
3.4.2 Limits of Paradigme
The proposed lexical cohesiveness is based only on the denotational and intensional definitions in the English dictionary LDOCE. The lack of connotational and extensional knowledge causes some unexpected effects on the lexical cohesiveness. For example, consider:

    σ(tree, leaf) = 0.008693.

We can recognize the apparent relationship between tree and leaf. However, the lexical cohesiveness between them is much lower than our intuition would estimate.

The reason for this disagreement lies in the nature of dictionary definitions: they indicate only sufficient conditions for the headwords. For example, the definition of tree in LDOCE tells us nothing about leaves.
tree n 1 a tall plant with a wooden trunk
and branches, that lives for many years 2 a
bush or other plant with a treelike form 3 a
drawing with a branching form, esp. as used
for showing family relationships
However, the definition is followed by pictures of leafy trees, which provide readers with connotational and extensional stereotypes of tree.

In SDD, each definition in LDOCE is treated as a list of words, though it is a phrase with syntactic structure. Let us consider the following definition of the verb lift.
lift v 1 to bring from a lower to a higher level; raise 2 (of movable parts) to be able to be lifted 3 …

Figure 3.8 Text retrieval by the cohesiveness between texts. (Recalling the episode in the memory most similar to the given text.)
Anyone can imagine that something is moving upwards. However, such a movement cannot be expressed by the corresponding word list, nor by the activated pattern produced from the word list. The measurement of the lexical cohesiveness between words is intended to provide bottom-up information for analysing the semantic and syntactic structure of a phrase, sentence, or text; however, the measurement itself requires such higher-level structure. As far as the lexical cohesiveness between words is concerned, I assume that an activated pattern on Paradigme will approximate the meaning of a word w, just as a still picture can express a story.
3.4.3 Application to Text Retrieval

The lexical cohesiveness between texts computed on Paradigme can be applied to text retrieval, that is, to recall the most similar episode e to the given text t:

e = argmax_{e_i ∈ E} ρ(t, e_i),

where E = {e_1, ..., e_n} is a set of episodes (i.e. texts) stored in the memory. Figure 3.8 illustrates this mechanism: once P(t) is produced on Paradigme, the cohesiveness values ρ(t, e_1), ..., ρ(t, e_n) can immediately be computed and compared. This text retrieval scheme is a mapping,

t ↦ P(t) ↦ e,

in other words, a mapping from the given text t to another text e in the memory.
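As a rough illustration of this mapping (a sketch under assumed helpers, not the thesis's code), the retrieval can be written as follows; produce_pattern and the stored episode word lists are invented stand-ins for Paradigme and the episodic memory.

    # A sketch of the retrieval mapping t |-> P(t) |-> e. Patterns are
    # represented as dicts of node activities.

    def produce_pattern(words):
        # Stand-in for pattern production on Paradigme: uniform activity
        # over the given words, just to make the example runnable.
        return {w: 1.0 / len(words) for w in words}

    def cohesiveness(pattern, episode_words):
        # rho(t, e_i) approximated here as the summed activity of the
        # episode's words in the pattern P(t).
        return sum(pattern.get(w, 0.0) for w in episode_words)

    episodes = {  # hypothetical episodic memory E = {e_1, ..., e_n}
        "e1": ["spring", "menu", "restaurant"],
        "e2": ["farm", "country", "love"],
    }
    t = ["sarah", "restaurant", "menu"]
    P_t = produce_pattern(t)                     # P(t) is produced once
    e = max(episodes, key=lambda ei: cohesiveness(P_t, episodes[ei]))
    print(e)  # 'e1': the episode most cohesive with t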
Figure 3.9 Context-sensitive text retrieval. (Recalling the most similar episode to the given text and context.)

Also a text set T = {t_1, ..., t_m} can associate an episode e in the memory which is most similar to
T. This association scheme can be written in the following form:

T ↦ P(T) ↦ e,

where P(T) is an overlapped pattern of the texts t_i ∈ T. Pattern overlapping provides interaction between texts and produces an activated pattern which includes novel nodes indirectly associated or implicitly inferred, as we have seen in Section 3.3. If necessary, each text t_i ∈ T can be weighted according to its significance in T.
The mapping from a text set to an episode (T ↦ P(T) ↦ e) works as context-sensitive text retrieval. As illustrated in Figure 3.9, P(t_1) and P({t_2, ..., t_m}) are overlapped on Paradigme. The main key t_1 is strongly activated so as to produce the figure, and the others {t_2, ..., t_m} are weakly activated so as to produce the ground, or context, for text retrieval.
This text retrieval scheme provides a new method for semantic retrieval which recalls the most semantically similar episodes in the memory, regardless of the typographical identity of the keywords. Moreover, it can be applied to automatic text classification, which determines categories or genres of given texts. Each category is defined not by its intension (or attributes), but by its extension (or members). This suggests that the scheme can provide flexibility for EBR (example-based reasoning) and EBL (example-based learning) systems.
3.5 Summary

This chapter described the computation of the lexical cohesiveness between words, i.e. a measurement of the strength of lexical cohesion. The lexical cohesiveness between words is computed by spreading activation on the semantic network Paradigme, which is systematically constructed from a subset of the English dictionary LDOCE (Longman Dictionary of Contemporary English). Paradigme can directly compute the lexical cohesiveness between any two words in LDV (Longman Defining Vocabulary, consisting of 2,851 words) and their derivations, and indirectly the lexical cohesiveness of all other headwords of LDOCE and their derivations. The lexical cohesiveness provides a new method for analysing coherent text structure. It can be applied to capture coherent relations between sentences or text portions.
I regard Paradigme as a field for the interaction between texts and episodes in memory, i.e. the interaction between what one is reading or listening to and what one knows [Minsky, 1980, 1986; Schank, 1990]. The meaning of words, sentences, or even texts can be projected in a uniform way onto Paradigme, as we have seen in Section 3.2 and Section 3.3. Similarly, we can overlap the figure and ground, and recall the most relevant episode for interpretation of the figure; the recalled episode will change the ground for the next step. A preliminary model for this episode association cycle is described in [Kozima and Furugori, 1991a, 1991b, 1991c].
In future research, I intend to deal with syntagmatic relations between words. The meaning of a text lies in the texture of paradigmatic and syntagmatic relations of lexical items [Hjelmslev, 1943]. Paradigme provides the former dimension: the associative system of words that works as a screen onto which the meaning of a word is projected like a still picture. The latter dimension, the syntactic process, will be treated as a pattern changing in time, like a film projected dynamically onto Paradigme. This enables us to compute the coherent relation between texts as syntactic and semantic processes, not as the static cohesiveness between lists of words.
The next chapter describes an application of the
lexical cohesiveness to text segmentation [Grosz
and Sidner, 1986; Youmans, 1991], as the evalua-
tion of the lexical cohesiveness proposed here.
4 Segmenting Narratives
into Scenes
This chapter describes a computationally feasible method for text segmentation. It is an application of the lexical cohesiveness proposed in Chapter 3, and also serves as an evaluation of that cohesiveness.

Figure 4.1 Correlation between LCP (mutual lexical cohesiveness in the moving window) and a boundary of coherent scenes.
Most studies on text structures assume that a
text can be partitioned into units that form a co-
herent structure [Grosz and Sidner, 1986; Mann
and Thompson, 1987], and recognizing the text
structure is an essential task in text understand-
ing, as we have seen in Chapter 1 and Chapter
2. However, there is no clear discussion on how to
segment a text into such units computationally.
This thesis focuses its effort on scenes, i.e. con-
tiguous and non-overlapping units of a narrative
text. A scene is a sequence of sentences which dis-
plays local coherence or semantic continuity
on objects (characters and properties) and situa-
tions (time, place, and backgrounds).
Lexical Cohesion Profile (LCP) is a quan-
titative indicator proposed here for marking scene
boundaries of narratives. LCP is a record of the
mutual lexical cohesiveness of words in a win-
dow (of 51 words long, for instance) that moves
forwards word by word on a text. Since a coher-
ent text tends to be lexically cohesive [Halliday
and Hasan, 1976; Morris and Hirst, 1991], LCP
indicates local coherence and therefore continu-
ity of scenes in the text. Figure 4.1 (same as
Figure 1.2) illustrates the basic idea of LCP.
Section 4.1 reviews related work on text seg-
mentation. Section 4.2 describes how to com-
pute LCP, the mutual lexical cohesiveness of
words in the moving window. Section 4.3 com-
pares LCP with scene boundaries marked by a hu-
man experiment. Section 4.4 discusses the na-
ture and limits of LCP, and Section 4.5 gives a
summary of this chapter.
4.1 Related Work on
Text Segmentation
A number of methods for segmenting a text into
coherent units have been proposed in the studies
of text structure. One of the valuable indicators is
a cue phrase [Grosz and Sidner, 1986] (or clue
words [Reichman-Adar, 1984]). For example, "by
the way" and "anyway" indicate the beginning of
new units.
In narratives, several types of cue phrases that
specify time or place at the beginning of sen-
tences are recognized. For example, a new scene
begins with the cue phrase "In the summer of
last year" in the following text portion.
� � � she could see the windowless brick
wall of the box factory in the next
street. But she thought of grassy
walks and trees and bushes and roses.
In the summer of last year Sarah had
gone into the country and fallen in
love with a farmer. � � �
Note that paragraph breaks explicit in the original
text (O.Henry's Springtime à la Carte [Thornley,
1960]) are discarded in the examples here.
Scenes in narratives do not always begin with
cue phrases, however. Let us consider the follow-
ing text portion.
� � � Sarah knew that it was time for
her to read. She got out her book,
settled her feet on her box, and began.
The front-door bell rang. The landlady
answered it. Sarah left the book and
listened. � � �
Anyone can capture the discontinuity of scenes
at the sentence "The front-door bell rang".
However, this is not a cue phrase; my assertion is that we need a stronger device to capture scene alternations like this.
LCP is a quantitative device to mark the conti-
nuity and discontinuity of objects and situations
described in a text. But before going on to de�ne
LCP, let us briefly review two related studies that
have intended to capture scene coherence.
4.1.1 Word Reiteration and
Scene Coherence
Youmans [1991] has proposed Vocabulary Management Profile (VMP) as a quantitative indi-
cator of scene alternation of written texts. VMP
is a record of the proportion of new words intro-
duced in a window (of 35 words long, for instance)
moving word by word on a text. For example,
the underlined words in the following text are new words.

A new word is a word in the text
which never appears in the
preceding text. VMP counts the new
words in the window moving on the
text.

Figure 4.2 An example of VMP [Youmans, 1991]. (Text: O.Henry's Springtime à la Carte [Thornley, 1960].)
Figure 4.2 shows the VMP of O.Henry's short story, Springtime à la Carte [Thornley, 1960], plotted against window position (in word numbers). Note that VMP treats the given text as a list of words without any punctuation or paragraph breaks.
The principle of VMP is based on information flow in a text, which suggests the introduction and succession of scenes.

• Introduction
At the beginning of a scene, new vocabulary (for objects and situations) will be introduced into the scene.

• Succession
Once a scene is created by vocabulary introduction, the rest of the scene will reuse the introduced vocabulary.
VMP presented in a graph has hills and valleys,
as shown in Figure 4.2. They suggest the scene
alternation: (1) an ascending slope suggests the
introduction of a new scene, (2) a descending slope
suggests the succession of the scene thus intro-
duced.
VMP is a neat but rather simple indicator for segmenting narrative texts. However, the method, based on word reiteration, has problems in dealing with various aspects of scene coherence. My experiments on VMP have revealed that it does not work well on high-density texts rich in vocabulary. The reason for this seems obvious: the words assumed to be reiterated in a scene are often restated (or paraphrased) by using different words or phrases.
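A minimal sketch of VMP, assuming a pre-tokenized text; the window width and the tokenization here are simplifications of Youmans's procedure, and the miniature text is invented.

    # A sketch of VMP: the proportion of new words in a window moving
    # word by word over the text (after Youmans [1991]).

    def vmp(words, width=35):
        profile = []
        for i in range(len(words) - width + 1):
            seen = set(words[:i])          # vocabulary of the preceding text
            new = 0
            for w in words[i:i + width]:
                if w not in seen:          # first occurrence: a new word
                    new += 1
                    seen.add(w)
            profile.append(new / width)    # proportion of new words
        return profile

    text = "the cat sat on the mat then the dog chased the cat".split()
    print(vmp(text, width=4))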
4.1.2 Lexical Cohesion and
Scene Coherence
A better way to capture scene coherence is using
lexical cohesion, especially semantic relations, be-
tween words in a text. Morris and Hirst [1991],
as we have seen in Section 2.2, used Roget's
thesaurus as the knowledge base for determining
whether or not two words are semantically related.
They also proposed lexical chains, i.e. chains of
the thesaural relations between words in a text,
as an indicator of the text structure proposed by
Grosz and Sidner [1986].
A text in general has several lexical chains.
Each chain indicates a range of semantic con-
tinuity on certain objects and situations. So,
the density of the lexical chains will suggest lo-
cal coherence of the text, and minimum points
of the density can be considered as scene bound-
aries of the text. However, as Hearst and Plaunt
[1993] have claimed, the lexical chains of a lengthy
text tend to overlap so often that it is not pos-
sible to place scene boundaries of the text. In
other words, many chains would end at a particu-
lar scene boundary, while at the same time many
other chains would cross it.
Hearst and Plaunt [1993] then incorporated
the thesaural information into their segmentation
scheme based on tf.idf (i.e. an information re-
trieval measurement) between contiguous blocks
of sentences. The tf.idf value of a word is the
frequency of the word within a text divided by
the frequency of the word throughout a large cor-
pus. Words that are frequent in an individual text
but relatively infrequent throughout the corpus
are considered as good indicators of the contents
of the text.
Their segmentation scheme is a two-step process: (1) all pairs of adjacent blocks of a text (where each block is usually 3-5 sentences long) are compared and assigned a similarity value computed by tf.idf of the words in common and of the words that have thesaural relations; then (2) the resulting sequence of similarity values, after being graphed and smoothed by some special algorithms, is examined for hills and valleys. The hills indicate that the adjacent blocks are coherent; the valleys indicate scene boundaries.
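The following is a much simplified sketch of the block-comparison step, following only the rough tf.idf description above; the thesaural component and the smoothing algorithms are omitted, and the corpus and text data are invented, so this should not be read as Hearst and Plaunt's actual procedure.

    # A simplified sketch of adjacent-block comparison: two blocks are
    # scored by the weighted overlap of their common words, where each
    # word's weight is its within-text frequency over its corpus
    # frequency (the rough tf.idf description above).

    from collections import Counter

    def block_similarity(block_a, block_b, corpus_freq, text_freq):
        common = set(block_a) & set(block_b)
        return sum(text_freq[w] / corpus_freq.get(w, 1) for w in common)

    corpus_freq = Counter(["the"] * 1000 + ["menu"] * 3 + ["spring"] * 5)
    text = "the menu of spring the menu the spring dishes".split()
    text_freq = Counter(text)

    a, b = text[:4], text[4:]          # two adjacent blocks
    print(block_similarity(a, b, corpus_freq, text_freq))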
This method is a pioneering attempt at text segmentation using lexical cohesion (in the thesaurus) between words. However, there is still room for improvement: (1) the size of the block is defined arbitrarily (as 3-5 sentences long), and (2) the smoothing algorithms are so complicated that they seem to have no psychological validity. These points are improved in LCP, described in the next section.
4.2 LCP: Lexical Cohesion Profile

I have devised a method to capture semantic continuity in a text and developed an objective and quantitative indicator of scene boundaries. This method segments a narrative text only by using the following lexical information.

• The mutual lexical cohesiveness of words that interact with each other in a portion of the text (i.e. the words in the window).

• The strength of each cohesive relation between words, which has its own strength of contribution to the coherence of a scene.
This section describes (1) the computation of
the mutual lexical cohesiveness of a text portion,
which estimates the strength of text coherence
from the lexical cohesiveness between words, (2)
the computation of LCP as an indicator of local
coherence of the text, and (3) the resulting scene
boundaries in a graph of LCP.
4.2.1 Mutual Lexical Cohesiveness

Coherence of a text portion is estimated by the mutual lexical cohesiveness of words in the text portion. The mutual lexical cohesiveness c(S) of the text portion S = {w_1, ..., w_n} is defined as the density of the lexical cohesiveness of the words in S:

c(S) = φ( Σ_{w_i ∈ S} s(w_i) · a(P(S), w_i) ),

where P(S) is an activated pattern produced by activating each word w_i ∈ S with strength s(w_i)² / Σ_k s(w_k) at the same time, and a(P(S), w_i) is the activity value of the node w_i in the activated pattern P(S). The function φ limits the output value to [0, 1]. Note that c(S) = ρ(S, S), cf. Section 3.3.
The activated pattern P(S) is the result of the interaction of the words w_i ∈ S. It represents the meaning of S as a whole. So, the mutual lexical cohesiveness c(S) represents how cohesively each word w_i ∈ S is related to the whole meaning P(S). In other words, c(S) represents the semantic homogeneity of S, which is closely related to distortion in clustering techniques, since P(S) can be considered as a centroid of the word cluster S.
Figure 4.3 An example of LCP. (Computed with the rectangular window of 51 words long; text: O.Henry's Springtime à la Carte [Thornley, 1960].)

The mutual lexical cohesiveness c(S) suggests how coherent S is. The following examples show the mutual lexical cohesiveness of a coherent text portion in a short story, and that of an incoherent text portion consisting of three sentences randomly selected from an English dictionary.
c ("Molly saw a cat.
It was her family pet.
She wished to keep a lion." )
= 0.403239 (coherent),
c ("There is no one but me.
Put on your clothes.
I can not walk more." )
= 0.235462 (incoherent).
Thus the mutual lexical cohesiveness c(S) works
as a quantitative indicator of coherence of the text
portion S.
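A minimal sketch of c(S); the pattern production is a stand-in for spreading activation on Paradigme, min(x, 1) stands in for the limiting function φ (whose exact form is not reproduced here), and the word significances are invented.

    # A sketch of the mutual lexical cohesiveness c(S).

    def produce_pattern(words, strengths):
        # Stand-in for P(S): each word activated with s(w)^2 / sum_k s(w_k);
        # here the "pattern" is simply that activity assignment.
        total = sum(strengths[w] for w in words)
        return {w: strengths[w] ** 2 / total for w in words}

    def mutual_cohesiveness(words, strengths):
        pattern = produce_pattern(words, strengths)          # P(S)
        raw = sum(strengths[w] * pattern.get(w, 0.0) for w in words)
        return min(raw, 1.0)                                 # limit to [0, 1]

    S = ["molly", "cat", "family", "pet", "lion"]
    s = {w: 1.0 for w in S}   # hypothetical word significances s(w)
    print(mutual_cohesiveness(S, s))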
4.2.2 Computing LCP and
Estimating Scene Boundaries
LCP is a record of the mutual lexical cohesiveness c(S_i) of the local text S_i at every position i in a text. Let us assume the text T is a word list {w_1, ..., w_N} without any punctuation or paragraph breaks. Then the local text S_i at position i = 1, ..., N in the text T is defined as follows:

S_i = {w_l, w_{l+1}, ..., w_{i-1}, w_i, w_{i+1}, ..., w_{r-1}, w_r},

where w_i is the i-th word of T. The indices l and r are defined as follows:

l = i - Δ (if i > Δ); 1 (otherwise),
r = i + Δ (if i ≤ N - Δ); N (otherwise).

The local text S_i is a text portion which can be seen through a window whose center is w_i. The constant Δ determines the width of the window (as 2Δ+1).
Figure 4.4 Various types of windows (rectangular, triangular, Hanning).

Figure 4.5 LCP computed with the Hanning window (of 51 words long).

Figure 4.3 shows a graph of the LCP computed on O.Henry's short story, Springtime à la Carte
[Thornley, 1960]. The mutual lexical cohesiveness c(S_i) is plotted against the position i of the window. The graph has hills and valleys that suggest scene alternation in the text. Large valleys can be considered as scene boundaries. However, the graph has unnecessary noise that makes it difficult to determine which minimum points should be considered as scene boundaries.
In order to eliminate the noise from the graph of LCP, a window function is introduced into pattern production of the text portion S_i = {w_l, ..., w_i, ..., w_r}. The window function W(i, j) defines the weight of w_j ∈ S_i. The activated pattern P(S_i) is produced by activating each w_j ∈ S_i with strength s'(w_j)² / Σ_k s'(w_k), where s'(w_j) is defined as s(w_j) · W(i, j) / W(i, i). Comparing various types of windows, such as those shown in Figure 4.4, I empirically found that the Hanning window,

W(i, j) = (1/2) (1 + cos(π |i-j| / Δ)),

gives the most remarkable effect in eliminating the noise. Figure 4.5 shows that the Hanning window illuminates the macroscopic features of LCP better than the rectangular window used in the LCP shown in Figure 4.3.
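A sketch of the LCP computation with the Hanning window; the cohesiveness of a window is delegated to a stand-in function (a weighted type/token ratio), since the real c(S_i) requires spreading activation on Paradigme.

    # A sketch of LCP with the Hanning window W(i, j).

    import math

    def hanning(i, j, delta):
        """W(i, j) = (1/2) (1 + cos(pi |i - j| / delta))."""
        return 0.5 * (1.0 + math.cos(math.pi * abs(i - j) / delta))

    def lcp(words, delta=25, cohesiveness=None):
        if cohesiveness is None:
            # Stand-in for c(S_i): weighted type/token ratio of the window.
            def cohesiveness(window, weights):
                return (len(set(window)) / len(window)
                        * sum(weights) / len(weights))
        profile = []
        for i in range(len(words)):
            l, r = max(0, i - delta), min(len(words) - 1, i + delta)
            window = words[l:r + 1]
            weights = [hanning(i, j, delta) for j in range(l, r + 1)]
            profile.append(cohesiveness(window, weights))
        return profile

    text = ("it was a day in march never never begin a story this way "
            "when you write one").split()
    print(lcp(text, delta=3))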
Window width is also an important factor in clarifying the macroscopic features of LCP. If the window is too wide, LCP cannot detect short scene alternation. If the window is too narrow, on the other hand, it introduces much noise into LCP.
Figure 4.6 compares the LCPs computed with the Hanning windows of 25 words long and 75 words long.

Figure 4.6 LCP and window width. (Comparison between the Hanning windows of 25 and 75 words long.)
By experimenting with 18 window widths (from 11 to 121 words), I empirically found that the Hanning window of 51 words long (Δ = 25) gives the best correlation with the actual scene boundaries.
4.3 Verification of LCP with Human Judgements

LCP seems to provide a reasonable measurement for segmenting scenes in a text. To examine this point, LCP has been compared with a human experiment of marking scene boundaries of O.Henry's short story, Springtime à la Carte [Thornley, 1960]. The whole text is given in Appendix C with its LCP graphs. Another experiment on a biography, Mahatma Gandhi [Leavitt, 1958], is also described in Appendix D with its LCP graphs.
Figure 4.7 Histogram of human judgements. (The solid bars represent the histogram of human judgements; the dotted lines represent the original paragraph breaks.)
4.3.1 Human Judgements
In the human experiment, the text given to subjects contains no original paragraph breaks. Sentences in the text are aligned line by line, as shown in the following example, the head part of the text given to the subjects.
It was a day in March.
Never, never begin a story this way
when you write one.
No opening could possibly be worse.
There is no imagination in it.
It is flat and dry.
But it is allowable here.
The instruction to the subjects is "Putting yourself in the position of a film director, place scene boundaries wherever you think there may be a cut".
Figure 4.7 shows the histogram of scene
boundaries marked by 16 subjects. The solid bars
indicate the number of the subjects who placed a
scene boundary at the text position i, and the dot-
ted lines indicate the original paragraph breaks.
The total number of scene boundaries is 214 (13.38 boundaries per subject on average); the number of their types (distinct positions) is 50. The histogram suggests the following points.
• Agreement
The subjects segment the text in a similar way. 158 scene boundaries (of 16 types), 73.83% of the total, are the dominant scene boundaries on which more than 1/3 of the subjects agreed.

• Correlation with paragraphs
The reported scene boundaries closely correlate with the original paragraph breaks. 179 scene boundaries (of 29 types), 83.64% of the total, correspond with the original paragraph breaks.
Figure 4.8 LCP and human judgements. (LCP is computed with the Hanning window of 51 words long.)
4.3.2 Correlation between LCP and
Human Judgements
The LCP computed with the Hanning window of 51 words long (shown in Figure 4.5) and the histogram of human judgements (shown in Figure 4.7) are overlapped in Figure 4.8. It is clear that the minimum points of the LCP correspond mostly to the dominant scene boundaries reported by the subjects.
In order to examine the correlation between the dominant scene boundaries and the minimum points of LCP, let us define a break point as a sentence break which is nearest to one of the dominant minimum points of LCP. A dominant minimum point of LCP is a text position i which satisfies

∀j ∈ [i-δ, i+δ] (j ≠ i): L_j > L_i,

where δ is a constant which determines the degree of localization of minimum points, and L_i is the value of LCP, i.e. c(S_i), at text position i. In the case of δ = 20, the set of dominant minimum points

{51, 112, 194, 235, 273, 374, 442, 501, 533, 664, 759, 833, 909, 975, 1016, 1080, 1152, 1208, 1263, 1302, 1360, 1432}
is obtained. Then, coercing each of the dominant minimum points into the nearest sentence break, the set B_20 of break points is obtained as follows:

{39, 110, 192, 242, 281, 381, 449, 511, 537, 652, 749, 834, 900, 974, 1012, 1076, 1155, 1210, 1275, 1301, 1350, 1433}
Figure 4.9 Correlation between LCP and human judgements. (The recall rate and the precision rate of estimating the dominant scene boundaries by the break points of LCP, plotted against δ.)

I have computed the following sets of break points, B_5, B_10, B_15, ..., B_100, and compared them with the dominant scene boundaries. As shown in Figure 4.9, the recall rate and the precision rate of estimating the dominant scene boundaries by B_δ indicate that the sets of break points (especially B_20) closely correlate with the dominant
scene boundaries. The recall and precision rates are defined as follows:

Recall = Hit / Human,
Precision = Hit / Machine,

where Human is the number of the dominant scene boundaries; in the human experiment, the 16 dominant scene boundaries

{65, 110, 192, 227, 281, 465, 537, 652, 749, 834, 974, 1076, 1155, 1210, 1301, 1346}

are observed. Machine is the number of the break points (i.e. |B_δ|), and Hit is the number of dominant scene boundaries correctly estimated by B_δ.
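The evaluation above can be sketched as follows; the LCP values, the sentence breaks, and the tolerance for counting a Hit are invented miniatures (the thesis does not state a matching tolerance), not the experimental data.

    # A sketch of the evaluation: dominant minimum points of LCP
    # (L_j > L_i for all j within distance delta, j != i), coercion to
    # the nearest sentence break, and the recall/precision rates.

    def dominant_minima(L, delta):
        return [i for i in range(len(L))
                if all(L[j] > L[i]
                       for j in range(max(0, i - delta),
                                      min(len(L), i + delta + 1)) if j != i)]

    def to_break_points(minima, sentence_breaks):
        return sorted(set(min(sentence_breaks, key=lambda b: abs(b - i))
                          for i in minima))

    def recall_precision(break_points, human, tolerance=10):
        hit = sum(1 for h in human
                  if any(abs(h - b) <= tolerance for b in break_points))
        return hit / len(human), hit / len(break_points)

    L = [0.5, 0.4, 0.3, 0.4, 0.5, 0.6, 0.5, 0.35, 0.45, 0.55]
    breaks = to_break_points(dominant_minima(L, delta=2), [0, 3, 7, 9])
    print(breaks, recall_precision(breaks, human=[3, 7]))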
4.3.3 Comparing LCP and
Human Judgements with the Text
Let us look concretely at the close relationship between (1) the graph of the LCP, (2) the human judgements (both are shown in Figure 4.8), and (3) the text used in the experiment (namely, Springtime à la Carte [Thornley, 1960]).

The clear valley at i=192, for instance, exactly corresponds to the dominant scene boundary (and also to the paragraph break). The following is the portion of the original text from i=157 to 227.
Sarah had managed to [160] open the world a little with her typewriter. That was [170] her work --- typing. She did not type very quickly, and [180] so she had to work alone, and not in a [190] great office.
The most successful of Sarah's battles with the [200] world was the arrangement that she made with Schulenberg's Home [210] Restaurant. The restaurant was next door to the old red-brick [220] building in which she had a room. ...
Note that the original paragraph breaks are shown in the examples here, and the bracketed numbers in the examples indicate the text position i (words). We can see the discontinuity of scenes: the first part (before the paragraph break) is focused on Sarah's job, and the second (after the paragraph break) on Schulenberg's restaurant. (Sarah is the heroine of the story.)
It is worth noting that LCP can detect scene
alternation irrespective of the paragraph breaks
placed by the author of the story. For example,
the paragraph break at i=156 is not a minimum
point of the LCP. And the likely continuation of a
scene indicated by the LCP at that point is sup-
ported by the human judgements. The following
is the text portion from i=111 to 192.
The gentleman who said the world was an oyster which [120] he would open with his sword became more famous than [130] he deserved. It is not difficult to open an oyster [140] with a sword. But did you ever notice anyone try [150] to open it with a typewriter?
Sarah had managed to [160] open the world a little with her typewriter. That was [170] her work --- typing. She did not type very quickly, and [180] so she had to work alone, and not in a [190] great office.
On the other hand, the author of the story did not place a paragraph break at i=228, but the LCP and half of the 16 subjects mark a scene boundary at that point. The following is the text portion from i=193 to 281.
The most successful of Sarah's battles with the [200] world was the arrangement that she made with Schulenberg's Home [210] Restaurant. The restaurant was next door to the old red-brick [220] building in which she had a room. One evening, after [230] dining at Schulenberg's, Sarah took away with her the bill [240] of fare.
It was written in almost unreadable handwriting, neither [250] English nor German, and was so difficult to understand that [260] if you were not careful you began with the sweet [270] and ended with the soup and the day of the [280] week.
It is obvious that a new scene begins with the cue phrase "One evening", which indicates discontinuity of time.
There are some discrepancies between the LCP
and the human judgements, however. For exam-
ple, the minimum point at i=450 disagrees with
the dominant scene boundary at i=465. The fol-
lowing is the portion in question.
Both were satisfied with the agreement. Those who ate [430] at Schulenberg's now knew what the food they were eating [440] was called, even if its nature sometimes puzzled them. And [450] Sarah had food during a cold dull winter, which was [460] the main thing with her.
When the spring months arrived [470], it was not spring. Spring comes when it comes. The [480] frozen snows of January still lay hard in the streets [490]. ...
The first part (before the paragraph break) is focused on the agreement made between Sarah and Schulenberg, and the second (after the paragraph break) on the severe weather of winter. This disagreement between the LCP and the human judgements may be accounted for by the lexical similarity between the last part of the first paragraph and the first part of the second paragraph: the words used there are related to the severe weather of winter.
4.4 Discussion
LCP is based on the hypothesis that a local text
tends to be coherent when the local text is lex-
ically cohesive [Halliday and Hasan, 1976; Mor-
ris and Hirst, 1991]. This section discusses (1)
the relationship between the lexical cohesiveness
and coherence of a text, and (2) the width of the
window used in computing LCP.
4.4.1 Lexical Cohesiveness and
Text Coherence
LCP deals with the lexical cohesiveness of words in a text, and leaves out any syntactic structure or punctuation in the text. The mutual lexical cohesiveness c(S) does not work well on an ill-structured (or incoherent) but lexically cohesive text. Compare the following example with those in Section 4.2.
c ("I saw cats.
A lion belongs to the cat family.
My family keeps a pet." )
= 0.653580 (incoherent, but cohesive).
Figure 4.10 Role of syntagmatic and paradigmatic relations on text coherence. (Role on coherence plotted against word distance.)

The reason for this lies in the shortcomings of the lexical cohesiveness between words defined on the English dictionary. For instance, it ignores the connotational and extensional meaning of words
and any syntactic structure in the dictionary def-
initions.
Syntagmatic relations between words can
make up for the limits of LCP which is based only
on paradigmatic relations between words. As il-
lustrated in Figure 4.10, coherence of a scene is
maintained by (1) syntagmatic relations between
words closely positioned and (2) paradigmatic re-
lations between distant words. Syntagmatic rela-
tions can be computed as the co-occurrence prob-
ability of words in corpora [Church and Hanks,
1990] or in dictionary definitions [Wilks et al.,
1989].
4.4.2 Adapting Window Width
The width of the window should be as narrow as possible, if noise were not present, since a narrow window can capture alternations of both short and long scenes. The experiments on the various widths of the window revealed that the Hanning window of 51 words long gives the best correlation with human judgements, as we have seen in Section 4.2. However, this window width is obviously applicable only to the text examined in the experiment. The best window width will depend on genres and styles of texts. For example, the following factors may affect the best window width.

• Average length of scenes or minimum/maximum length of scenes of the text (however, these depend on the result of human judgements), or those computed on a large corpus.

• Lexical density of the text, i.e. the proportion between the number of types (the size of the vocabulary used) and the number of tokens (the length of the text).

At present, I have no effective method for adapting the window width to these data.
In the present stage of my research, I am trying to adapt the window width to the total significance of words in the window. In this scheme, the window width is dynamically determined so as to make the total significance of the words w ∈ S_i in the window a certain constant value G. In other words, the scheme is to find the Δ that minimizes

| G - Σ_{w ∈ S_i} s(w) |.

If this is the case, we can apply the window function to the computation described above. It seems that G can be derived from corpus analysis. However, this is an unsolved problem.
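A sketch of this adaptive scheme; the significance measure s(w) and the constant G are assumed inputs, since the thesis leaves their derivation open.

    # A sketch of adaptive window width: choose the half-width delta that
    # brings the total significance of the window closest to G.

    def adaptive_delta(words, i, s, G, max_delta=100):
        best_delta, best_diff = 1, float("inf")
        for delta in range(1, max_delta + 1):
            window = words[max(0, i - delta):min(len(words), i + delta + 1)]
            diff = abs(G - sum(s(w) for w in window))
            if diff < best_diff:
                best_delta, best_diff = delta, diff
        return best_delta

    text = "but did you ever notice anyone try to open it".split()
    s = lambda w: 1.0 / len(w)          # hypothetical significance measure
    print(adaptive_delta(text, i=5, s=s, G=2.0, max_delta=4))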
4.4.3 Structure of Scenes
LCP partitions a text into scenes, i.e. contigu-
ous and non-overlapping units of the text. How-
ever, LCP tells us nothing about hierarchical
structure of the scenes; it provides only push/pop
clues for constructing such structures as those discussed in [Grosz and Sidner, 1986; Mann and Thompson, 1987].
One may say that we can capture super-scenes of a higher level by using a wider window and then construct a tree-like structure of a text. However, some anticipated problems in this method are:

• Definition of super-scenes
It seems difficult to define the super-scenes in terms of local coherence or the lexical cohesiveness between words. (What is the linguistic definition of a super-scene?)

• Tree vs network
Structure of scenes sometimes would not fit in a tree-like structure; it may have a network-like structure. (Consider a text ABA'B'..., where A and A' are scenes on a hero, B and B' on a heroine.)
I take the position that the structure of a text is considered as a network of scenes. The network is based on coherence relations between scenes, i.e. the lexical cohesiveness between two texts described in Section 3.3. It is assumed here that:

• Cohesive scenes tend to share a topic or semantically related topics.

• Anaphora and ellipsis beyond one scene can be resolved in a set of adjacent scenes.
In the next stage of my research, I intend to in-
corporate the idea of a scene network in the study
of text segmentation and of text structure.
4.5 Summary
This chapter proposed Lexical Cohesion Profile (LCP) as a quantitative indicator of scene
boundaries of narratives. LCP is a record of the
mutual lexical cohesiveness of words in a win-
dow moving word by word on the text. The mu-
tual lexical cohesiveness is computed by spreading
activation on the semantic network of the subset
of LDOCE, as described in Chapter 3. Hills and
valleys of a graph of LCP closely correlate with
scene alternation. The hills indicate continuity of
each scene, and the valleys indicate scene bound-
aries.
LCP deals only with lexical cohesion between
words; it ignores any grammatical information
(even that of sentence or paragraph breaks) or
other linguistic devices (such as cue phrases). The
reason for this is the purpose of this thesis: to see the role of the lexical cohesiveness of words in local coherence of a text. As examined in Section
4.3, LCP closely correlates with human judge-
ments. This means that (1) local coherence of
a text is a valuable indicator of scene alternation,
and (2) the local coherence can be estimated by
the lexical cohesiveness between words in the local
text.
The text segmentation scheme described in this
chapter is a bottom-up analysis of coherent
structure of a text. The information provided by
this analysis works as top-down clues for further
text analysis, for example:
• Resolving anaphora and ellipsis
A scene is the smallest domain in which text coherence can be defined. In other words, a scene is a text portion which describes certain objects in a situation. This suggests that most of the referents of anaphoric or elliptic expressions can be found inside the scene.

• Information retrieval
Each scene has a topic phrase (or sentence), that is, the semantic center of the scene computed from an activated pattern produced from the scene. A set of topic phrases works as a key for text retrieval and also for text summarization.
In the meantime, I have to make clear the relationship between the window width and the word significance that we have discussed in Section 4.4, and examine the validity of LCP on other genres and styles of texts. Also, to make the segmentation scheme more robust, it is necessary to incorporate syntagmatic relations (i.e. co-occurrence probability) in computing text coherence.
5 Retrospects and Prospects
The lexical cohesiveness, described in Chapter 3, objectively and computationally measures the strength of lexical cohesion between words in terms of their associative relations in the English dictionary. As evaluated through text segmentation, described in Chapter 4, the lexical cohesiveness works as an indicator of text coherence, and also provides valuable information for further analysis of text structure.

This chapter discusses various theoretical aspects of the proposed measurement of lexical cohesion, in view of the past, the present, and the future. Section 5.1 discusses the relationships with recent work in other fields. Section 5.2 describes how to capture the syntagmatic relations between words, which are the converse of the paradigmatic relations captured by the method proposed here. Section 5.3 puts this thesis in perspective for future research.
5.1 Relations with Other Fields

The proposed method for measuring lexical cohesion has been constructed from the evaluation of related studies in other fields. The idea of Glossème, the closed subsystem of English, is based on the studies of core vocabulary and dictionaries. Knowledge and semantic representation on the semantic network Paradigme are based on recent developments in psychology.
5.1.1 Lexicology and Lexicography: Backgrounds of Glossème

Several methods to construct a basic minimum language and its core vocabulary, like Glossème, have been proposed. The proposal of Basic English was first put forward in the early 1930s [Ogden, 1968]. Basic English is English as a secondary world language, simplified by restricting the vocabulary to 850 words and by reducing the rules for using them to the smallest number necessary to clearly state ideas.

Basic English is designed as a basis for learning general English; it is based on the minimum learning cost for communication, not on the frequency of word use in general English. However, the following points suggest Basic English is not a subsystem of general English but another independent one.
Figure 5.1 Coverage of frequent words in a corpus. (Cumulative frequency plotted against the vocabulary size. Computed on the LOB corpus (1,006,815 words; 47,888 types).)

• Vocabulary selection
The criteria for vocabulary selection are sub-
jective and unclear. For example, the 850-word basic vocabulary contains only 18 verbs, so that even common verbs in general English have to be paraphrased as follows.

  general English    Basic English
  ask                put a question
  walk               have a walk

• Sense selection
Learning 850 words is not the same thing as learning 850 senses, since each of the words may have several senses. However, Basic English offers no guidance for this. (One calculation is that the 850 words have 12,425 meanings. [Carter and McCarthy, 1988])
Basic English is a consistent language system
which works as a useful tool for communication.
However, it is not the core of general English in
everyday use.
The most remarkable proposal for core vocabu-
lary after Basic English is A General Service List
of English Words [West, 1953] (hereafter, GSL),
which is the outcome of major studies of the
1930's on vocabulary selection for language teach-
ing. GSL consists of 2,000 words drawn from a
corpus of 5,000,000 words. The main criteria for the selection of GSL are: (1) the frequency of words (not only the occurrence of words but also the proportion of different meanings of each word), and (2) coverage and granularity of meaning, which determine semantic range and separability, respectively. GSL can be seen as resulting from a mixture of the objective frequency (as shown in Figure 5.1) and subjective criteria on meaning.
GSL has had the most lasting influence among core vocabulary proposals, and it is widely used today in forming the basis of the principles underlying the Longman Simplified English Series and Longman Structural Readers of simplified fiction, non-fiction, poems, and plays. The narrative texts used in text segmentation (described in Chapter 4), namely Springtime à la Carte [Thornley, 1960] and Mahatma Gandhi [Leavitt, 1958], are adopted from these series.
GSL has also been applied to lexicography: techniques for compiling dictionaries. LDOCE (Longman Dictionary of Contemporary English) [1987, first ed. 1978] is one of the remarkable outcomes of GSL. All the definitions and examples in LDOCE are written in the restricted vocabulary LDV (Longman Defining Vocabulary), which is originally based on GSL and updated by Longman. LDV consists of 2,191 words (corresponding to 2,851 headwords of LDOCE, with distinguishing homographs) and 48 affixes. LDV covers 83.07% of the 1,006,815 words in the Lancaster-Oslo/Bergen corpus (hereafter, LOB corpus) with the help of a morphological analysis.
The result of using LDV as the defining vocabulary is the fulfilment of the most basic lexicographic principle: the definitions of headwords are always written by using simpler words than the headwords they describe. This principle provides the basis of the work described in this thesis: Glossème, since it is based on the defining vocabulary LDV and their definitions in LDOCE, works as a closed subsystem of English.
5.1.2 Psychology of Memory: Backgrounds of Paradigme
Psychological studies of organization of human
memory have revealed the functional distinction
between semantic and episodic memory [Tulv-
ing, 1972]. Semantic memory is the knowledge
shared by the people, while episodic memory
stores personal experiences. This distinction is
summarized as follows.
  Semantic memory
    contents:  socially shared codes
    elements:  linguistic concepts
    relations: associative relations

  Episodic memory
    contents:  personal experiences
    elements:  episodes and events
    relations: temporal/spatial relations
These two functions of memory are realized in biologically different ways, as has been shown through recent studies of amnesia and aphasia [Squire, 1986].
The work described in this thesis deals mainly with the semantic memory. The reason for this is that the common knowledge, on which lexical cohesion is defined, corresponds to the semantic memory. In view of structural linguistics [Saussure, 1916], the semantic memory corresponds to langue, i.e. the knowledge for using one's first language (mother tongue), while the episodic memory corresponds to parole, i.e. one's whole use of the language.
There have been a number of arguments regarding the way of representing the semantic memory. Even recent work on network representation has two mutually exclusive but complementary approaches.

• Local representation
Different concepts are embodied in different nodes. Each node is individual and self-explanatory on the meaning or value of the concept it represents. (For example, see frame-based models [Minsky, 1975; Schank, 1980].)

• Distributed representation
Different concepts correspond to different patterns of activity over the very same nodes. Each node is involved in representing a number of (almost all) concepts. (For example, see PDP models [Rumelhart et al., 1986].)
Both approaches have their own advantages: (1) local representation is an explicit and well-articulated representation which can perform logical and sequential inferences and reasoning, while (2) distributed representation can perform implicit analogical and metaphoric inferences, and it also has tolerance of noise in input texts and statistical learnability.
The semantic network Paradigme is a result of a mixture of local and distributed representation: (1) each node in Paradigme corresponds to one headword in the dictionary Glossème, and (2) the meaning of a headword is represented by an activated pattern distributed over the nodes (or the headwords of Glossème). In other words, different headwords correspond to different nodes, while the meaning of a word is represented by using all the nodes of Paradigme.
Figure 5.2 Syntagmatic and paradigmatic relations between words. (Syntagmatic relations run horizontally along each sequence, paradigmatic relations vertically across the alternatives: I + can + go; We + must + walk; Boys + will + run.)

The essence of knowledge and semantic representation on Paradigme lies in one of the principles of structural linguistics and semiology [Saussure, 1916; Hjelmslev, 1943]: the value of a word is defined only by its relationships with other words in the language. Each word has no value or meaning by itself; structural relations with other words define the value or meaning of the word. This means that the language is the system of
signs (or of words).
5.2 Syntagmatic Relations
between Words
Words in a text display mutual dependence which creates coherent textual structure, as outlined at the beginning of Chapter 1. The mutual dependence can be classified into two categories of relationships between lexical items, namely paradigmatic and syntagmatic relations. As illustrated in Figure 5.2, these two kinds of thread can be recognized in texts. Paradigmatic relations are based on association between concepts, while syntagmatic relations are based on the co-occurrence of lexical items in actual texts.
The focus of this thesis has been mainly on
paradigmatic relations, not on syntagmatic rela-
tions. The reason for this is obvious: the com-
mon knowledge for measuring lexical cohesion
is mainly maintained by paradigmatic relations.
However, as we have seen in Section 4.4, paradig-
matic relations are not enough to cover all the
aspects of lexical cohesion and text coherence, and
syntagmatic relations can make up for this limita-
tion.
This section describes two experiments for
extracting syntagmatic relations between words
from a machine-readable corpus. A corpus is a
representative sample of a language that consists
of massive quantities of texts. For example, the
LOB corpus, one of the standard corpora, con-
sists of about one million words of British En-
glish. (Cf. the Bible has approximately one million
words.)
5.2.1 Extracting n-gram Data
The increasing availability of machine-readable
corpora has suggested new statistical and prob-
abilistic methods for capturing linguistic informa-
tion [Church and Mercer, 1993], especially col-
locations. Collocation is the co-occurrence ten-
dency of words to work together in predictable ways. This approach is summarized with the memorable line: "You shall know a word by the company it keeps" [Firth, 1957].

Table 5.1 The most frequent trigrams and tetragrams computed on the LOB corpus (total 1,006,815 words; 47,888 types). (Note that the total number of n-grams is 1,006,815 - n + 1.)

  trigram (n=3)         frequency
  one of the            390
  there was a           204
  out of the            192
  the end of            185
  some of the           184
  part of the           182
  there is no           170
  it was a              167
  there is a            165
  the fact that         165

  tetragram (n=4)       frequency
  the end of the        102
  at the same time       95
  on the other hand      94
  in the case of         77
  at the end of          74
  for the first time     72
  as a result of         48
  in the form of         41
  the fact that the      41
  the rest of the        38
One of the traditional indicators of the co-
occurrence tendency of words is n-gram data
[Brown, et al., 1992; Church and Mercer, 1993].
The n-gram analysis is quite similar to word-
frequency analysis which counts the occurrence of
words; the n-gram analysis counts the occurrence
of tuples of adjacent n words. For example, the text "We need to provide the solution" produces the following trigrams (n = 3), {t_i | t_i = ⟨w_i, w_{i+1}, w_{i+2}⟩}:

  t_1  we need to
  t_2  need to provide
  t_3  to provide the
  t_4  provide the solution
As shown in Table 5.1, frequent n-grams display
collocative relations between words, and they can
be considered as phrases or phrasal lexemes.
The n-gram analysis provides syntagmatic prediction, i.e. the probability of occurrence of a word w immediately after two given contiguous words w_1, w_2. For example, when we observe the two adjacent words "we need", the trigram data computed on the LOB corpus can predict which words tend to follow: a (22.73%), not (18.18%), to (18.18%), etc. Table 5.2 shows that the n-gram prediction captures a number of important frequency-based relations between words. However, this method cannot capture syntagmatic relations of long range; it can only detect co-occurrence of words within a window of n words long.

Table 5.2 The trigram prediction of the third word. (The most probable sequences with their probabilities.)

  w_1 w_2        w          probability
  we need        a          22.73%
                 not        18.18%
                 to         18.18%
                 more        4.55%
                 ...
  need to        be         17.24%
                 provide     3.45%
                 make        3.45%
                 keep        3.45%
                 ...
  to provide     a          18.18%
                 the        12.99%
                 for         6.49%
                 such        3.90%
                 ...
  provide the    means      12.50%
                 solution    6.25%
                 money       6.25%
                 food        6.25%
                 ...
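A minimal sketch of n-gram extraction and trigram prediction; the miniature corpus is invented, and real tables like 5.1 and 5.2 would be computed on the LOB corpus.

    # A sketch of n-gram extraction and trigram next-word prediction.

    from collections import Counter, defaultdict

    def ngrams(words, n):
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    corpus = ("we need to provide the solution we need a plan "
              "we need to be sure").split()

    trigram_counts = Counter(ngrams(corpus, 3))

    # P(w | w1 w2): relative frequency of w after the bigram (w1, w2).
    prediction = defaultdict(Counter)
    for (w1, w2, w) in ngrams(corpus, 3):
        prediction[(w1, w2)][w] += 1

    pair = ("we", "need")
    total = sum(prediction[pair].values())
    for w, c in prediction[pair].most_common():
        print(f"we need {w}: {100 * c / total:.2f}%")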
5.2.2 Mutual Information
A wide window might be able to capture the long-range relationships between words. For example, mutual information [Church and Hanks, 1990] computed by a wider window works as a more efficient indicator. The mutual information I(w, w') between words w, w' is defined as follows:

I(w, w') = log( Pr(w, w') / (Pr(w) · Pr(w')) ).

It compares the probability Pr(w, w') of observing w and w' together in the window (i.e. the joint probability) with the probabilities Pr(w) and Pr(w') of observing w and w' independently (i.e. chance). If there is a strong relationship between w and w', then I(w, w') ≫ 0. If there is no interesting relationship between w and w', then I(w, w') ≈ 0. If w and w' are in complementary distribution, then I(w, w') ≪ 0.
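A sketch of the windowed mutual information following the formula above; the toy corpus and window width are invented (the experiment below uses the LOB corpus and a window of 16 words), and log base 2 is an assumption, since the base is not stated here.

    # A sketch of windowed mutual information I(w, w').

    import math
    from collections import Counter

    def mutual_information(words, w, w_prime, width=16):
        n = len(words)
        freq = Counter(words)
        pr_w, pr_w2 = freq[w] / n, freq[w_prime] / n
        # Joint probability: w' observed within `width` words after w.
        joint = sum(1 for i, x in enumerate(words) if x == w
                    and w_prime in words[i + 1:i + width]) / n
        if joint == 0:
            return float("-inf")          # never co-occur in the window
        return math.log2(joint / (pr_w * pr_w2))

    corpus = ("her long fair hair was wavy and her hair shone "
              "the dog ran across the road").split()
    print(mutual_information(corpus, "hair", "wavy", width=5))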
The following list shows the words w_i that have the highest mutual information I(w, w_i), where w is the word hair, computed on the LOB corpus by using a window of 16 words long.

  I(w, w_i)   w_i
  4.513954    cuticle
  4.420844    coal-black, crinkly, fairish, flaxen, frizzled, iron-grey, itched, red-gold, tufting, volos, waist-long, wavy
  4.321308    falkirk, hugs, imprudently, looped-up
  4.268841    ruddy
Most of the words in the list above appear only once in the corpus, however. Such low-frequency words can be considered as noise. The following two lists show the highest mutual information between hair and the words whose frequencies are at least 2 and 4, respectively.

  I(w, w_i)   w_i (freq. ≥ 2)
  4.268841    ruddy
  4.420844    auburn
  4.016454    mousy
  4.157810    greying
  3.883187    scurf

  I(w, w_i)   w_i (freq. ≥ 4)
  3.371935    brushing
  3.321308    cropped, ribbons, thinning
  3.098916    coppery
  3.050006    colouring

After eliminating the noise, the mutual information works as an indicator of syntagmatic relations (or collocations) of words in actual texts.
5.2.3 Problems and Perspectives
of Corpus-based Analysis
The corpus-based analysis of word co-occurrence, like the ones described above, poses the following problems. One problem is the size of the corpus, and the other is the quality of the corpus. Both problems are deeply concerned with the nature of corpora.

The LOB corpus that has been used in the experiments above consists of 1,006,815 words. However, it is not large enough for extracting syntagmatic relations of words. As we have seen above, a large number of words appear only once in the corpus. The list below illustrates the relationship between the word frequency and the coverage in the corpus.
  freq.   coverage in vocabulary      coverage in words
          (total 47,888 types),       (total 1,006,815 words),
          with accumulation           with accumulation
  1       44.57% (44.57%)             2.12% (2.12%)
  2       14.48% (59.06%)             1.37% (3.50%)
  3        7.94% (66.99%)             1.13% (4.63%)
  4        5.02% (72.01%)             0.95% (5.58%)
  5        3.47% (75.48%)             0.83% (6.41%)
It is obvious that the more frequently a word appears in the corpus, the more accurate the statistical analysis of the word is. Most words of the vocabulary appear only a few times in the corpus, however.

Table 5.3 The composition of the LOB corpus. (The size and proportion of each text genre.)

  text categories                        size (words)   proportion (%)
  Press: reportage                        88,727          8.81
  Press: editorial                        54,293          5.39
  Press: reviews                          34,213          3.39
  Religion                                34,226          3.39
  Skills, trades, and hobbies             76,556          7.60
  Popular lore                            88,679          8.80
  Belles lettres, biography, essays      155,111         15.40
  Miscellaneous                           60,591          6.01
  Learned and scientific writings        161,215         16.01
  General fiction                         58,476          5.80
  Mystery and detective fiction           48,211          4.78
  Science fiction                         12,026          1.19
  Adventure and western fiction           58,274          5.78
  Romance and love story                  58,148          5.77
  Humour                                  18,069          1.79
A corpus is intended to be a representative sample of the real use of a language, and its quality is determined by sampling techniques. The texts in the LOB corpus were selected by stratified random sampling based on several bibliographical almanacs, where the texts are classified into categories according to the Dewey Decimal Classification on the subjects of the texts. The texts are then classified into 15 genres, as shown in Table 5.3, based on rhetorical properties of the text. There is no clear discussion about the amount of information in each text (or the number of its copies) exchanged by people. Also the relationship between the Dewey Decimal Classification and the genres of the corpus is unclear.
Important points about the corpus-based analysis are that (1) all corpora are limited in their size and quality, and (2) texts in corpora tend to be novel and impressive, so that the corpora no longer contain the whole common syntagmatic relations shared by people. The corpus-based analysis needs to be supplemented by data derived from the intuitions of informants through either introspection or experimentation, or of lexicographers, as the work of this thesis depends on them. The general approach of the corpus-based analysis is illuminating, with considerable research potential. By eliminating the noise appropriately, the corpus-based analysis will provide valuable information for natural language processing.
5.3 Future Research
The work described in this thesis has two major directions for further research. One is to go deeper: towards the interaction of paradigmatic association and syntagmatic prediction, and its application to metaphoric processing. The other is to go wider: towards the analysis of Japanese texts, which requires a corpus of Japanese, selecting a core vocabulary, and a well-structured dictionary.
5.3.1 Interaction between
Paradigme and Syntagme
Paradigmatic association is the lexical cohesiveness or similarity ρ(w, w') between words w, w', as described in Section 3.2. So, it can be considered as a mapping ρ: w ↦ w', where w' (≠ w) is the most similar word to w.

Syntagmatic prediction, on the other hand, is the recency τ(w, w'), i.e. the probability of observing w, w' together in the window (where w is followed by w') or of the n-gram prediction, like those described in Section 5.2. So, it can be considered as a mapping τ: w ↦ w', where w' has the highest probability of co-occurrence with w.
Most of the semantic and syntactic relations between words can be defined by combination or interaction of the paradigmatic association ρ and the syntagmatic prediction τ. The following example illustrates the relationship from supply to meat. (Note that there is no significant direct mapping from supply to meat.)

supply --ρ--> provide --τ--> food --ρ--> meat

The paradigmatic associations from supply to provide and from food to meat are computed respectively as follows:

ρ(supply, provide) = 0.174675,
ρ(food, meat) = 0.155881,

where provide is the most similar word to supply, and meat is the most similar word to food. And, if the syntagmatic prediction τ is defined as the trigram prediction (as shown in Table 5.2), the syntagmatic prediction from provide to food is computed as follows:

τ(provide, food) = 6.25%,

where the word food has the second-highest probability of co-occurrence with provide in the table.
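A minimal sketch of composing the two mappings; the ρ and τ tables are invented miniatures standing in for Paradigme and the trigram data (the listed values are taken from the example above).

    # A sketch of chaining paradigmatic association (rho) and
    # syntagmatic prediction (tau): supply -> provide -> food -> meat.

    rho = {  # most-similar-word mapping (paradigmatic association)
        "supply": ("provide", 0.174675),
        "food": ("meat", 0.155881),
    }
    tau = {  # most-probable-successor mapping (syntagmatic prediction)
        "provide": ("food", 0.0625),
    }

    def relate(w):
        """Chain rho and tau to relate two indirectly connected words."""
        w1, _ = rho[w]            # paradigmatic step
        w2, _ = tau[w1]           # syntagmatic step
        w3, _ = rho[w2]           # paradigmatic step
        return [w, w1, w2, w3]

    print(" -> ".join(relate("supply")))  # supply -> provide -> food -> meat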
The interaction of paradigmatic association and
syntagmatic prediction can be applied to process-
ing metaphoric expressions. For example, let
us consider the sentence "She is shining". The
definition of shine in LDOCE is as follows.
shine v 1 to produce light 2 to reflect light; be bright 3 to direct (a lamp, beam of light, etc.) 4 to polish; make bright by rubbing 5 ...

Figure 5.3 An example of metaphoric interpretation. (The paradigmatic association ρ provides possible meanings of each word: shine ↦ {bright, burn, polish, flash, cheerful}; the syntagmatic prediction τ selects relevant sequences: "She is bright." / "She is cheerful.")
However, the sentence does not mean that she is reflecting light, nor that she has been polished. Rather, it means that she is cheerful and lively.
The paradigmatic association ρ from shine provides the following words, which have the highest similarity to shine.

  w'         ρ(shine, w')
  bright     0.249966
  burn       0.190900
  polish     0.180333
  flash      0.145012
  cheerful   0.143962
Then, the syntagmatic prediction τ between words which can be used of a person and each of the words above is examined; it makes clear that bright and cheerful tend to co-occur with expressions used of a person. Finally, the sentence "She is shining" is interpreted as "She is bright" or "She is cheerful". (This scheme is illustrated in Figure 5.3.)
5.3.2 Constructing the System
of Japanese Language
In the following stage of the work described in this thesis, I intend to apply the scheme of computing lexical cohesion to Japanese language processing. So, I have to obtain the following information: (1) a list of core words and their definitions, and (2) the combinations which the words typically form. Such information should be extracted from a corpus, because it must have objectivity and completeness beyond the intuition of researchers.
Here, I should refer to a remarkable example:
The Collins COBUILD English Language Dictio-
nary [1987], which is an outcome of recent lex-
icographic work after LDOCE. The COBUILD
29
project is an ambitious lexicographic research pro-
gramme designed to construct a mono-lingual for-
eign learner's dictionary of English which is based
on naturally-occurring data extracted from the
Birmingham Corpus (20 million words). Note
that `COBUILD' stands for the Collins Birming-
ham University International Language Database.
[Carter and McCarthy, 1988]
At present, however, there are no Japanese cor-
pora with enough size and quality available for re-
searchers. I have to start with building a large and
objective Japanese corpus. From the corpus, the
following lexicographic data can be extracted.
� Core vocabulary
The smallest and complete core vocabulary
can be selected with respect to the word fre-
quency and the coverage of use in the corpus.
The size of the vocabulary should be on the
order of several thousands.
� Word patterns
Word patterns, namely syntagmatic relations
or collocation of words, can be extracted
from the corpus. The word patterns provide
example-based de�nitions of words and also
syntactic rules.
From these lexicographic data, a well-structur-
ed dictionary of the Japanese language can be
constructed. The core vocabulary, of course,
works as the de�ning vocabulary, like LDV in
LDOCE; the word pattern determines the range
of contexts or the correct meanings of the words,
and also provides a lot of examples for actual use.
A closed subset of the dictionary can be consid-
ered as a closed sub-system of the Japanese lan-
guage. Like Gloss�eme, it will consist of the dictio-
nary entries whose headwords are included in the
de�ning vocabulary. Such subset is quite useful for
research in computational linguistics and other re-
lated �elds, because the size is small enough to be
computationally feasible, while still covering most
of words in general use of the Japanese language.
6 Conclusion
This thesis described (1) an objective and computationally feasible method for measuring lexical cohesion between words and between texts of any size, (2) its application to the segmentation of narratives into coherent scenes, as an evaluation of the measurement of lexical cohesion, and (3) discussion of various aspects of this work and prospects for future research.
The lexical cohesiveness, namely the strength of lexical cohesion between words, is computed on the semantic network Paradigme. Paradigme is systematically constructed from Glossème, a subset of the English dictionary LDOCE (Longman Dictionary of Contemporary English). Glossème consists of every entry of LDOCE whose headword is included in the LDV (Longman Defining Vocabulary), so that Glossème is a closed subsystem of English in which each of its headwords is defined by a phrase composed of the headwords and their derivations. Spreading activation on the semantic network can directly compute the lexical cohesiveness σ(w, w′) ∈ [0, 1] between any two words w, w′ in the LDV and its derivations. It can also indirectly compute the lexical cohesiveness of all headwords of LDOCE and their derivations, as well as the lexical cohesiveness between texts. The lexical cohesiveness σ(w, w′) represents the strength of association from w to w′ and works as an indicator of lexical cohesion.
The text segmentation is based on the Lexical Cohesion Profile (LCP), which is a record of the mutual lexical cohesiveness of words in a window moving word by word over a text. The mutual lexical cohesiveness is defined as the density of the lexical cohesiveness of the words in the window, and it suggests the local coherence of the text. A graph of LCP has hills and valleys which suggest scene alternations, because (1) when the window is inside a scene, the words in the window tend to be cohesive, and (2) when the window is crossing a scene boundary, the words in the window tend to be incohesive. So, the minimum points of the LCP can be considered as marking the scene boundaries of the text. Comparison with the scene boundaries marked by human judgements proved that minimum points of LCP closely correlate with the dominant scene boundaries on which most of the subjects agreed. The proposed segmentation scheme works as a new tool for analysing text structure, resolving anaphora and ellipsis, information retrieval, etc.
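As a sketch of how minimum points of the LCP yield boundaries, the following toy C program computes a window density from a stand-in cohesiveness function and reports its local minima. It is a minimal illustration under assumed data; the actual experiments used a 51-word Hanning window and the cohesiveness computed on Paradigme.

    #include <stdio.h>

    #define N 20        /* text length (toy value)   */
    #define W 5         /* window radius (toy value) */

    /* Stand-in for the lexical cohesiveness sigma(w, w') in [0,1];
       a real implementation would run spreading activation on
       Paradigme.  Here: words with the same parity cohere. */
    static double sigma(int w1, int w2)
    { return (w1 % 2 == w2 % 2) ? 0.8 : 0.2; }

    int main(void)
    {
        int text[N];
        double lcp[N] = { 0 };
        /* Toy "words": first half even, second half odd, so that a
           cohesion break occurs in the middle of the text. */
        for (int i = 0; i < N; i++) text[i] = (i < 10) ? 2 * i : 2 * i + 1;

        /* LCP(i): density of cohesiveness over all word pairs in
           the window centred on position i. */
        for (int i = W; i < N - W; i++) {
            double sum = 0; int pairs = 0;
            for (int a = i - W; a <= i + W; a++)
                for (int b = a + 1; b <= i + W; b++) {
                    sum += sigma(text[a], text[b]); pairs++;
                }
            lcp[i] = sum / pairs;
        }

        /* Minimum points of the LCP mark candidate scene boundaries. */
        for (int i = W + 1; i < N - W - 1; i++)
            if (lcp[i] < lcp[i - 1] && lcp[i] <= lcp[i + 1])
                printf("candidate boundary at word %d (LCP=%.3f)\n", i, lcp[i]);
        return 0;
    }

Run on this toy text, the density dips exactly where the window straddles the parity change, which is the analogue of a window crossing a scene boundary.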
Conclusions of this proposal for the measurement of lexical cohesion and its evaluation by text segmentation are:

• Lexical cohesion of words in a text (or in a text portion) suggests coherence of the text (or local coherence of the text portion).

• Lexical cohesion can be computed as associative relations in the common knowledge described in the English dictionary.

And we may say that a dictionary contains information for detecting lexical cohesion. However, lexical cohesion cannot cover all aspects of the common knowledge shared by the people in a linguistic community. This is due to the difference between cohesion and coherence: lexical cohesion is the relationship between words in a text, and coherence is the whole structure of the text, made up mainly by lexical cohesion.
The work described in this thesis focuses on paradigmatic relations between lexemes, which represent how concepts are formed into the whole knowledge of the world. Future research will focus on syntagmatic relations, which represent how concepts and ideas are expressed as sequences found in actual texts. The syntagmatic relations should be objectively extracted from corpora, i.e. massive quantities of representative texts. So, the next stage of this work will deal mainly with English corpora. This work should also be extended to Japanese language processing, where constructing a Japanese corpus and selecting a core vocabulary will be required.
Acknowledgements
I thank Dr. Teiji Furugori, my thesis advisor, for
his thoughtful suggestions and comments on this
work. Throughout the long and painful evolution
of this study, he has guided me with his insight,
encouragement, and persistence. The work would
not have been possible without his supervision.
I am grateful to the other members of my the-
sis committee: Drs. Kohei Noshita, Kiyoshi
Hashimoto, Makoto Yasuhara, and Kazuhiko
Ozeki. They have given acute criticisms and
suggestions on the thesis. I am also indebted
to Dr. Ken Church (AT&T Bell Laboratories),
Dr. Graeme Hirst (University of Toronto), Dr. Pim
van der Eijk (Digital Equipment Corporation),
and Dr. Marti Hearst (University of California,
Berkeley), who made a number of contributions
to my work with their comments and suggestions.
And, discussions with the following people in UEC
produced many of the ideas that my work is based
upon: Prof. Mituo Kobayasi, Takuzi Suzuki, Ed-
uardo de Paiva Alves, Hidemi Nishiyama, and the
members of Furugori laboratory. Had I taken
their advice more thoroughly, the thesis would
have been improved substantially.
Finally, my thanks go to my parents and Takako. They have given me an infinite amount of moral support throughout this laborious research.
References
[Alshawi, 1987] H. Alshawi: Processing dictionary definitions with phrasal pattern hierarchies, Computational Linguistics, Vol.13, pp.195–202.
[Beaugrande and Dressler, 1981] R. de Beaugrande and W. U. Dressler: Introduction to Text Linguistics, Longman, Harlow, Essex.
[Brown et al., 1992] P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer: Class-based n-gram models of natural language, Computational Linguistics, Vol.18, pp.467–479.
[Carter and McCarthy, 1988] R. Carter and M. McCarthy: Vocabulary and Language Teaching, Longman, Harlow, Essex.
[Charniak, 1983] E. Charniak: Passing markers: a theory of contextual influence in language comprehension, Cognitive Science, Vol.7, pp.171–190.
[Church and Hanks, 1990] K. W. Church and P. Hanks: Word association norms, mutual information, and lexicography, Computational Linguistics, Vol.16, pp.22–29.
[Church and Mercer, 1993] K. W. Church and R. L. Mercer: Introduction to the special issue on computational linguistics using large corpora, Computational Linguistics, Vol.19, pp.1–24.
[Firth, 1957] J. R. Firth: A synopsis of linguistic theory 1930–1955, in Studies in Linguistic Analysis, Philological Society, Oxford. (Reprinted in F. Palmer (ed.), Selected Papers of J. R. Firth, Longman, Harlow, Essex, 1968.)
[Grosz and Sidner, 1986] B. J. Grosz and C. L. Sidner: Attention, intentions, and the structure of discourse, Computational Linguistics, Vol.12, pp.175–204.
[Hahn, 1992] U. Hahn: On text coherence parsing, in Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING-92, Nantes), pp.25–31.
[Halliday and Hasan, 1976] M. A. K. Halliday and R. Hasan: Cohesion in English, Longman, Harlow, Essex.
[Hearst and Plaunt, 1993] M. Hearst and C. Plaunt: Subtopic structuring for full-length document access, in Proceedings of ACM/SIGIR (Pittsburgh, PA).
[Hendler, 1989] J. A. Hendler: Marker-passing over microfeatures: towards a hybrid symbolic/connectionist model, Cognitive Science, Vol.13, pp.79–106.
[Hirst, 1988] G. Hirst: Resolving lexical ambiguity computationally with spreading activation and polaroid words, in S. Small et al. (eds.), Lexical Ambiguity Resolution, Morgan Kaufmann, San Mateo, California.
[Hjelmslev, 1943] L. Hjelmslev: Omkring Sprogteoriens Grundlæggelse, Akademisk Forlag, København.
[Hobbs, 1979] J. R. Hobbs: Coherence and coreference, Cognitive Science, Vol.3, pp.67–90.
[Kozima, 1993] H. Kozima: Text segmentation based on similarity between words, in Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL-93, Ohio), pp.286–288.
[Kozima and Furugori, 1991a] H. Kozima and T. Furugori: A computational model for text disambiguation using knowledge and context (in Japanese), in Proceedings of the 42nd Annual Convention IPS Japan, Vol.3, pp.43–44.
[Kozima and Furugori, 1991b] H. Kozima and T. Furugori: Building conceptual system under the adaptation to texts (in Japanese), in Proceedings of the 43rd Annual Convention IPS Japan, Vol.3, pp.219–220.
[Kozima and Furugori, 1991c] H. Kozima and T. Furugori: A disambiguation model for text interpretation using knowledge and context (in Japanese), Transactions of Information Processing Society of Japan, Vol.32, pp.1366–1373.
[Kozima and Furugori, 1993a] H. Kozima and T. Furugori: Semantic similarity between words (in Japanese), Technical Report of IEICE, AI92-100, pp.81–88.
[Kozima and Furugori, 1993b] H. Kozima and T. Furugori: Word similarity computed on an English dictionary (in Japanese), in Proceedings of the 46th Annual Convention IPS Japan, Vol.3, pp.93–94.
[Kozima and Furugori, 1993c] H. Kozima and T. Furugori: Similarity between words computed by spreading activation on an English dictionary, in Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL-93, Utrecht), pp.232–239.
[Kozima and Furugori, 1993d] H. Kozima and T. Furugori: Text segmentation based on lexical cohesion (in Japanese), IPSJ SIG Reports, NL95-7, pp.49–56.
[Kozima and Furugori, to appear] H. Kozima and T. Furugori: Segmenting narrative text into coherent scenes, Literary and Linguistic Computing, to appear.
[Leavitt, 1958] L. W. Leavitt: Great Men and Women, in Longman Structured Readers, Longman, Harlow, Essex.
[LDOCE, 1987] Longman Dictionary of Contemporary English, Longman, Harlow, Essex.
[Mann and Thompson, 1987] W. C. Mann and S. A. Thompson: Rhetorical structure theory: a theory of text organization, Technical Report of Information Science Institute (University of Southern California), ISI/RS-87-190.
[Markowitz, 1986] J. Markowitz: Semantically significant patterns in dictionary definitions, in Proceedings of the 24th Annual Meeting of the Association for Computational Linguistics (ACL-86), pp.112–119.
[Minsky, 1975] M. L. Minsky: A framework for representing knowledge, in P. H. Winston (ed.), The Psychology of Computer Vision, McGraw-Hill, New York.
[Minsky, 1980] M. L. Minsky: K-lines: a theory of memory, Cognitive Science, Vol.4, pp.117–133.
[Minsky, 1986] M. L. Minsky: Society of Mind, Simon and Schuster, New York.
[Morris and Hirst, 1991] J. Morris and G. Hirst: Lexical cohesion computed by thesaural relations as an indicator of the structure of text, Computational Linguistics, Vol.17, pp.21–48.
[Nakamura and Nagao, 1988] J. Nakamura and M. Nagao: Extraction of semantic information from an ordinary English dictionary and its evaluation, in Proceedings of the 12th International Conference on Computational Linguistics (COLING-88), pp.459–464.
[Ogden, 1968] C. K. Ogden: Basic English International Second Language: A Revised and Expanded Version of the System of Basic English, Harcourt, Brace and World, New York.
[Osgood, 1952] C. E. Osgood: The nature and measurement of meaning, Psychological Bulletin, Vol.49, pp.197–237.
[Reichman-Adar, 1984] R. Reichman-Adar: Extended person-machine interface, Artificial Intelligence, Vol.22, pp.157–218.
[Roget, 1911] P. M. Roget (ed.): Roget's Thesaurus of English Words and Phrases, Crowell.
[Rumelhart et al., 1986] D. E. Rumelhart, J. L. McClelland, and the PDP Research Group: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, Cambridge, Mass.
[Sapir, 1921] E. Sapir: Language: An Introduction to the Study of Speech, Harcourt, Brace and World, New York.
[Saussure, 1916] F. de Saussure: Cours de Linguistique Générale, Payot, Paris.
[Schank, 1980] R. C. Schank: Language and memory, Cognitive Science, Vol.4, pp.243–284.
[Schank, 1990] R. C. Schank: Tell Me a Story: A New Look at Real and Artificial Memory, Scribner, New York.
[Squire, 1986] L. R. Squire: Mechanisms of memory, Science, Vol.232, pp.1612–1619.
[Thornley, 1960] G. C. Thornley (edited and simplified): British and American Short Stories, in Longman Simplified English Series, Longman, Harlow, Essex.
[Tulving, 1972] E. Tulving: Episodic and semantic memory, in E. Tulving and W. Donaldson (eds.), Organization of Memory, Academic Press, New York.
[Veronis and Ide, 1990] J. Veronis and N. M. Ide: Word sense disambiguation with very large neural networks extracted from machine readable dictionaries, in Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), pp.389–394.
[Waltz and Pollack, 1985] D. L. Waltz and J. B. Pollack: Massively parallel parsing: a strongly interactive model of natural language interpretation, Cognitive Science, Vol.9, pp.51–74.
[Wilks et al., 1989] Y. Wilks, D. Fass, C. M. Guo, J. McDonald, T. Plate, and B. Slator: A tractable machine dictionary as a resource for computational semantics, in B. Boguraev and T. Briscoe (eds.), Computational Lexicography for Natural Language Processing, Longman, Harlow, Essex.
[West, 1953] M. West: A General Service List of English Words: with Semantic Frequencies and a Supplementary Word-list for the Writing of Popular Science and Technology, Longman, Harlow, Essex.
[Youmans, 1991] G. Youmans: A new tool for discourse analysis: the vocabulary-management profile, Language, Vol.67, pp.763–789.
Appendices
A. Structure of Paradigme: Mapping Glossème onto Paradigme

The semantic network Paradigme is systematically constructed from the small but closed English dictionary Glossème. Each entry of Glossème is mapped onto a node of Paradigme in the following way. (See also Figure 3.1 and Figure 3.2.)
Step 1. For each entry G_i of Glossème, make an empty node P_i in Paradigme and copy the headword and word-class from G_i. Add a suffix (like '_1' and '_2') to the headword in order to distinguish the same headword used in different entries of Glossème (e.g. red/adjective → red_1, red/noun → red_2).
Then, for each entry G_i, map each unit u_ij onto a subréférant s_ij of the corresponding node P_i. The mapping from a word w_ijn in u_ij to a link or links in s_ij is described as follows.

1. Let h_n be the reciprocal of the number of appearances of the root form of w_ijn in Glossème. (A morphological analysis on the LDV and affixes determines the root form.)
2. If w_ijn is in a head-part, let h_n be doubled. (This is because a head-part provides the basis of the meaning of the headword.)
3. Find the node or nodes {p_n1, p_n2, ...} which correspond to w_ijn. Then, divide h_n into {h_n1, h_n2, ...} in proportion to their frequency.
4. Add links l_n1, l_n2, ... to the subréférant s_ij, where l_nm is a link to the node p_nm with the weight h_nm.

Thus, s_ij becomes a set of links {l_ij1, l_ij2, ...}, where l_ijk is a link with a weight h_ijk. Then, normalize the weights of the links so that Σ_k h_ijk = 1 in each s_ij; namely, let h_ijk (of l_ijk ∈ s_ij) be h_ijk / Σ_k h_ijk.
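The weight assignment of Step 1 can be illustrated numerically. Below is a minimal C sketch with hypothetical frequencies; the actual construction program is the Common Lisp one mentioned at the end of this appendix.

    #include <stdio.h>

    /* One defining word w_ijn contributes links whose weight starts
       from the reciprocal of its frequency in Glosseme, is doubled
       when the word occurs in a head-part, and is split over the
       corresponding nodes (word senses) in proportion to their
       frequency.  All data here are hypothetical. */
    struct link { int node; double h; };

    int main(void)
    {
        int freq_in_glosseme = 4;     /* appearances of the root form */
        int in_head_part = 1;         /* 1 if w occurs in a head-part */
        double h = 1.0 / freq_in_glosseme;
        if (in_head_part) h *= 2.0;   /* head-part carries the basis
                                         of the meaning */

        /* Suppose the word corresponds to two nodes (e.g. red_1 and
           red_2) with sense frequencies 3 and 1: split h as 3:1. */
        double sense_freq[2] = { 3, 1 };
        struct link s[2];
        double total = sense_freq[0] + sense_freq[1], norm = 0;
        for (int m = 0; m < 2; m++) {
            s[m].node = m;
            s[m].h = h * sense_freq[m] / total;
            norm += s[m].h;
        }
        /* Finally the subreferant is normalized so its weights sum
           to 1 (here it contains just these two links). */
        for (int m = 0; m < 2; m++)
            printf("link to node %d: weight %.4f\n", m, s[m].h / norm);
        return 0;
    }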
Step 2. For each node P_i, compute the weight H_ij of each subréférant s_ij (which indicates the significance of s_ij) in the following way:

1. Let m be the number of subréférants of P_i (i.e. the number of units in the entry G_i of Glossème).
2. Let H_ij be 2m − 1 − j. For instance, if m = 3, then H_i1 : H_i2 : H_i3 = 4 : 3 : 2. Note that H_i1 : H_im = 2 : 1 (m ≥ 2).
3. Normalize the weights so that Σ_j H_ij = 1 in each P_i; namely, let H_ij be H_ij / Σ_j H_ij.

Thus, each node P_i obtains its référant.
Step 3. The final step is to generate référés (i.e. sets of reverse links). Map each link in the référants of all nodes in Paradigme onto a reverse link in their référés, in the following way.

1. For each node P_i, let its référé r_i be an empty set (of links).
2. For each P_i, for each subréférant s_ij of P_i, map each link l_ijk ∈ s_ij onto the corresponding reverse link, in the following way.
   2.1 Let p_ijk be the node referred to by l_ijk, and let h_ijk be the weight of l_ijk.
   2.2 Let l′ be a new link referring to P_i with the weight H_ij · h_ijk, where H_ij is the weight of s_ij. Then, add l′ to the référé of p_ijk. (The link l′ is the reverse link corresponding to l_ijk.)

Then, the référé r_i of each node P_i becomes a set of links {l′_i1, l′_i2, ...}, where l′_ij is a link with a weight h′_ij. Then, for each node P_i, normalize the weights of the links in its référé r_i so that Σ_j h′_ij = 1; namely, let h′_ij be h′_ij / Σ_j h′_ij.
Thus, each node P_i of Paradigme is mapped from the corresponding entry G_i of Glossème. A computer program (written in Common Lisp and executed on KCL) carries out the procedures described above.
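Viewed abstractly, Step 3 transposes the weighted link structure of the network and then renormalizes. A small C sketch over a hypothetical three-node network follows; the forward weights below are assumed to be already multiplied by the subréférant weights H_ij.

    #include <stdio.h>

    #define NODES 3

    /* referant[i][j] = weight of the link from node i to node j
       (already scaled by H_ij); 0 if there is no link.  The values
       are hypothetical. */
    static double referant[NODES][NODES] = {
        { 0.0, 0.6, 0.4 },
        { 0.5, 0.0, 0.5 },
        { 1.0, 0.0, 0.0 },
    };

    int main(void)
    {
        double refere[NODES][NODES] = { { 0 } };

        /* Each forward link i -> j with weight h becomes a reverse
           link j -> i with the same initial weight. */
        for (int i = 0; i < NODES; i++)
            for (int j = 0; j < NODES; j++)
                refere[j][i] = referant[i][j];

        /* Normalize each refere so that its weights sum to 1. */
        for (int i = 0; i < NODES; i++) {
            double sum = 0;
            for (int j = 0; j < NODES; j++) sum += refere[i][j];
            for (int j = 0; j < NODES; j++)
                if (sum > 0 && refere[i][j] > 0)
                    printf("refere of node %d: link to %d, weight %.3f\n",
                           i, j, refere[i][j] / sum);
        }
        return 0;
    }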
B. Function of Paradigme: Spreading Activation Rules

Each node P_i of the semantic network Paradigme computes its activity value v_i(T+1) at time T+1 as follows:

    v_i(T+1) = φ( (R_i(T) + R′_i(T)) / 2 + e_i(T) ),

where R_i(T) and R′_i(T) are the activity (at time T) collected from the nodes referred to in the référant and référé of P_i, respectively, and e_i(T) ∈ [0, 1] is the activity given from outside (at time T). The output function φ limits the value to [0, 1].
R_i(T) is the activity value of the most plausible subréférant in P_i, defined as follows:

    R_i(T) = A_im(T),   where m = argmax_j ( H_ij · A_ij(T) ),

where H_ij is the weight of s_ij (i.e. the j-th subréférant of P_i). And A_ij(T) is the sum of the weighted activity of the nodes referred to in s_ij, defined as follows:

    A_ij(T) = Σ_k h_ijk · a_ijk(T),

where h_ijk is the weight of l_ijk (i.e. the k-th link of s_ij), and a_ijk(T) is the activity (at time T) of the node referred to by l_ijk.
R′_i(T) is the sum of the weighted activity of the nodes referred to in the référé r_i of P_i:

    R′_i(T) = Σ_j h′_ij · a′_ij(T),

where h′_ij is the weight of l′_ij (i.e. the j-th link of r_i), and a′_ij(T) is the activity (at time T) of the node referred to by l′_ij.
In the experiments described in this thesis, I have used the output function defined as follows:

    φ(x) = 1       (x > 1)
           C · x   (0 ≤ C · x ≤ 1)
           0       (x < 0)

The constant C determines the decaying factor (in the experiments, C = 0.9).
As mentioned in Section 3.2, a computer program carries out this spreading activation procedure. The program is written in the C programming language and compiled on SunOS 4.1.3. It computes the transition of an activated pattern (from T to T+1) within 0.25 seconds on the workstation (SPARCstation 2 / SunOS 4.1.3).
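The update rule above can be restated compactly in code. The following is a minimal C sketch of the transition from T to T+1 on a hypothetical three-node, two-subréférant network; it follows the equations above but is not the program used in the experiments.

    #include <stdio.h>

    #define NODES 3
    #define SUBS  2   /* subreferants per node (toy value) */

    static double C = 0.9;                    /* decaying factor */
    static double phi(double x)               /* output function */
    {
        if (x > 1) return 1;
        if (x < 0) return 0;
        return C * x;
    }

    /* H[i][j]: weight of the j-th subreferant of node i.
       h[i][j][k]: weight of the link from subreferant s_ij to node k.
       hr[i][k]: weight of the reverse link in the refere of node i.
       All values are hypothetical but normalized as in Appendix A. */
    static double H[NODES][SUBS] = {{0.6,0.4},{0.7,0.3},{0.5,0.5}};
    static double h[NODES][SUBS][NODES] = {
        {{0,1,0},{0,0.5,0.5}}, {{1,0,0},{0,0,1}}, {{0.5,0.5,0},{1,0,0}} };
    static double hr[NODES][NODES] = {{0,0.8,0.2},{0.6,0,0.4},{0,1,0}};

    int main(void)
    {
        double v[NODES] = { 1.0, 0.0, 0.0 };  /* initial activity */
        double e[NODES] = { 1.0, 0.0, 0.0 };  /* external input   */

        for (int T = 0; T < 10; T++) {
            double next[NODES];
            for (int i = 0; i < NODES; i++) {
                /* R_i: activity of the most plausible subreferant,
                   chosen by argmax_j H_ij * A_ij(T). */
                double R = 0, best = -1;
                for (int j = 0; j < SUBS; j++) {
                    double A = 0;
                    for (int k = 0; k < NODES; k++) A += h[i][j][k] * v[k];
                    if (H[i][j] * A > best) { best = H[i][j] * A; R = A; }
                }
                /* R'_i: weighted activity collected via the refere. */
                double Rr = 0;
                for (int k = 0; k < NODES; k++) Rr += hr[i][k] * v[k];
                next[i] = phi((R + Rr) / 2 + e[i]);
            }
            for (int i = 0; i < NODES; i++) v[i] = next[i];
        }
        for (int i = 0; i < NODES; i++) printf("v[%d] = %.4f\n", i, v[i]);
        return 0;
    }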
C. Text Used in the Experiment: Springtime à la Carte

The following text is a simplified version of the short story Springtime à la Carte by O. Henry (adapted from the book British and American Short Stories, edited and simplified by Thornley [1960]). It is the text used in the experiment on scene segmentation described in Section 4.3. The two graphs of LCP below are computed with a Hanning window 51 words long. The vertical solid lines in the graphs show the histogram of the scene boundaries marked by the 16 subjects.

Springtime à la Carte
It was a day in March.
Never, never begin a story this way when you write one. No opening could possibly be worse. There is no imagination in it. It is flat and dry. But it is allowable here. For the following paragraph, which should have started the story, is too wild and impossible to be thrown in the face of the reader without preparation.
Sarah was crying over the bill of fare.
To explain this you may guess that oysters were not on the list, or that she had promised not to eat ice-cream just now. But your guesses are wrong, and you will please let the story continue.
The gentleman who said the world was an oyster which he would open with his sword became more famous than he deserved. It is not difficult to open an oyster with a sword. But did you ever notice anyone try to open it with a typewriter?
Sarah had managed to open the world a little with her typewriter. That was her work: typing. She did not type very quickly, and so she had to work alone, and not in a great office.
The most successful of Sarah's battles with the world was the arrangement that she made with Schulenberg's Home Restaurant. The restaurant was next door to the old red-brick building in which she had a room. One evening, after dining at Schulenberg's, Sarah took away with her the bill of fare. It was written in almost unreadable handwriting, neither English nor German, and was so difficult to understand that if you were not careful you began with the sweet and ended with the soup and the day of the week.
The next day Sarah showed Schulenberg a card on which the bill of fare had been beautifully typewritten with the food temptingly listed in the right and proper places, from the beginning to the words at the bottom: "not responsible for overcoats and umbrellas".
Schulenberg was delighted. Before Sarah left him he had willingly made an agreement. She was to provide typewritten bills of fare for the twenty-one tables in the restaurant -- a new bill for each day's dinner, and new ones for breakfast and lunch as often as there were changes in the food or as neatness made necessary.
In return for this Schulenberg was to send three meals a day to Sarah's room, and send her also each afternoon a list in pencil of the foods that Fate had in store for Schulenberg's visitors on the next day.
Both were satisfied with the agreement. Those who ate at Schulenberg's now knew what the food they were eating was called, even if its nature sometimes puzzled them. And Sarah had food during a cold dull winter, which was the main thing with her.
When the spring months arrived, it was not spring. Spring comes when it comes.
[Two graphs: LCP (0.3–0.7) plotted against word position (0–800 and 800–1600), with the histogram of scene boundaries marked by the 16 subjects (segs, 0–16) overlaid.]
The frozen snows of January still lay hard in the streets. Men in the streets with their musical instruments still played "In the Good Old Summertime", with their December activity and expression. The steam-heat in the houses was shut off. And when these things happen, one may know that the city is still in the power of winter.
One afternoon Sarah was shaking with cold in her bed-room. She had no work to do except Schulenberg's bills of fare. Sarah sat in her rocking-chair and looked out of the window. The month was a spring month and kept crying to her: "Springtime is here, Sarah -- springtime is here, I tell you. You've got a neat figure, Sarah -- a nice, springtime figure -- why do you look out of the window so sadly?"
Sarah's room was at the back of the house. Looking out of the window she could see the windowless brick wall of the box factory in the next street. But she thought of grassy walks and trees and bushes and roses.
In the summer of last year Sarah had gone into the country and fallen in love with a farmer.
(In writing a story, never go backwards like this. It is bad art and destroys interest. Let it go forwards.)
Sarah stayed two weeks at Sunnybrook Farm. There she learned to love old Farmer Franklin's son Walter. Farmers have been loved and married in less time. But young Walter Franklin was a modern agriculturist. He had a telephone in his cow-house, and he could calculate exactly what effect next year's Canada wheat crop would have on what he planted.
It was in this shady place that Walter had won her. And together they had sat and woven a crown of dandelions for her hair. He had praised the effect of the yellow flowers against her brown hair; and she had left the flowers there, and walked back to the house swinging her straw hat in her hands.
They were to marry in the spring -- at the very first signs of spring, Walter said. And Sarah came back to the city to hit the typewriter.
A knock at the door drove away Sarah's dreams of that happy day. A waiter had brought the rough pencil list of the Home Restaurant's next day's food written in old Schulenberg's ugly handwriting.
Sarah sat down to her typewriter and slipped a card beneath the rollers. She was a clever worker. Generally in an hour and a half the twenty-one cards were typed and ready.
Today there were more changes on the bill of fare than usual. The soups were lighter; there were changes in the meat dishes. The spirit of spring filled the entire list. Fried foods seemed to have gone.
Sarah's fingers danced over the typewriter like little flies above a summer stream. Down through the different foods she worked, giving the name of each dish its proper position according to its length with a watchful eye.
Just before she reached the fruit, Sarah was crying over the bill of fare. Tears from the depths of despair rose in her heart and gathered in her eyes. Down went her head on the little typewriter stand.
For she had received no letter from Walter in two weeks, and the next thing on the bill of fare was dandelions -- dandelions with some kind of egg -- but never mind the egg! -- dandelions, with whose golden flowers Walter had crowned her his queen of love and future wife -- dandelions, the messengers of spring, her sorrow's crown of sorrow -- reminder of her happiest days.
But what a wonderful thing spring is! Into the great cold city of stone and iron a message had to be sent. There was none to bring it except the little messenger of the fields with his rough green coat, the dandelion -- this lion's tooth, as the French call him. When he is in flower, he will help at love-making, twisted in my lady's nut-brown hair; when young, before he has his flowers, he goes into the boiling pot.
In a short time Sarah forced back her tears. The cards must be typed. But still in a dream she touched the typewriter without thinking of it, with her mind and heart in the country with her young farmer. But then she came back to the stones of Manhattan, and the typewriter began to jump.
At six o'clock the waiter brought her dinner and carried away the typewritten bills of fare. Sarah ate her dinner sadly. At 7.30 the two people in the next room began to quarrel; the man in the room above began to play something like music; the gas light went a little lower; someone started to unload coal; cats could be heard on the back fences. By these signs Sarah knew that it was time for her to read. She got out her book, settled her feet on her box, and began.
The front-door bell rang. The landlady answered it. Sarah left the book and listened. Oh, yes; you would, just as she did.
And then a strong voice was heard in the hall below, and Sarah jumped for her door, leaving the book on the floor.
You have guessed it. She reached the top of the stairs just as her farmer came up, three steps at a jump, and gathered her to him.
"Why haven't you written -- oh, why?" cried Sarah.
"New York is a rather large town," said Walter Franklin. "I came in, a week ago, to your old address. I found that you went away on a Thursday. I've hunted for you with the police and otherwise ever since!"
"I wrote to you," said Sarah, with force.
"Never got it!"
"Then how did you find me?"
The young farmer smiled a springtime smile.
"I went into the Home Restaurant next door this evening," said he. "I don't care who knows it; I like a dish of some kind of greens at this time of the year. I ran my eye down that nice typewritten bill of fare looking for something like that. But when I looked, I turned my chair over and shouted for the owner. He told me where you lived."
"Why?"
"I'd know that capital W away above the line that your typewriter makes anywhere in the world," said Franklin.
The young man drew a bill of fare from his pocket, and pointed to a line.
She recognized the first card she had typed that afternoon. There was still a mark in the upper right-hand corner where a tear had fallen. But over the spot where one should have read the name of a certain plant, the memory of the golden flowers had allowed her fingers to strike strange keys.
Between two dishes on the list was the description:
"DEAREST WALTER, WITH HARD-BOILED EGG."
D. Another Experiment on a Biography: Mahatma Gandhi

The following text is the biography Mahatma Gandhi (adapted from the book Great Men and Women, by Leavitt [1958]). The four graphs of LCP below are computed with a Hanning window 51 words long. The dotted lines in the graphs show scene boundaries of the text carefully marked by the intuition of the author of this thesis.
Mahatma Gandhi
On the evening of January 30, 1948, a little old man was slowly crossing the courtyard of his home on his way to prayers. Suddenly the sound of four gun-shots was heard, and the man fell to the ground. That night his great friend, Pandit Nehru, speaking on the radio to the people of India, said: "The light has gone out of our lives and everywhere it is dark." The life-story of this little, old and very great man, Mahatma Gandhi, is one which everyone should know.
Mohandas Gandhi was born in a city in the west part of India on October 2, 1869. Mohandas was his first name. The word Mahatma means "Great Soul" and is a title which was given him later. For many years members of the Gandhi family had held important government posts, and for a long time the father of Mohandas was chief officer in one of the states of India. The father was a fine and brave man, and very good at his work. The son loved his father very much, and also his mother. His mother was very serious in her religion and never thought of beginning a meal without prayer. At one time she felt that her religion demanded that she should not eat until she saw the sun. It was then the season of rain, and often the sun was not seen for a long time. Her children were much troubled and spent long hours looking up at the sky to be able to hurry to tell her that the sun was shining and that she could eat.
[Graph: LCP (0.2–0.8) plotted against word position 0–1200.]
In his later life Gandhi wrote a book which tells us many things about his early years. In this book he says that it was not easy for him to make friends with other boys in school, and that his only companions were his books and his lessons. He used to run home from school as soon as classes were over for fear that someone would talk to him or make fun of him. As a little boy he was very honest. One day a small event concerned with school games troubled him very much. Because he did not enjoy being with other boys and also because he wanted to help his father after school, he did not like to take part in school games. He thought they were a waste of time.
One day when they had had their classes in the morning only, Mohandas was supposed to return to school at four o'clock for school games. He had no watch and the cloudy weather deceived him. He arrived late; the games were over; the boys had gone home. The next day when he explained to the head of the school why he was late, he was not believed. He, Mohandas Gandhi, a liar? No! No! But how could he prove that he was telling the truth? At this early age he began to understand that a man of truth must also be a careful man. Carelessness often leads others to have wrong ideas about a person.
Later Mohandas changed his mind about the value of games in the playground. Fortunately he had read in books that walking was a valuable exercise, and while still a boy began to take long walks in the open air, a form of exercise which he enjoyed and carried on during all his life.
He also says in his book that his handwriting was very poor, and that he did nothing to improve it because he believed that it was not important. Later, when he was in South Africa, he saw the excellent handwriting of lawyers and young men of that country and became ashamed of his own. He saw that bad handwriting should be considered a weakness in a person. When he then tried to improve his own handwriting, he found it was too late.
Mohandas was married at the early age of thirteen, which in India at that time was not thought to be too young. The oldest son of the family was already married, and the father and mother decided that the second son and the third son, Mohandas, together with an uncle's son, should all be married at the same time. Marriages, with their presents, dinners, fine clothes and all the rest, cost the families a lot of money, and a marriage of all three together would save much. The young wife of Mohandas had never been to school. This early marriage did not help his lessons, and he lost a year in high school. Fortunately, by hard work he was later able to finish two classes in one year.
Among his few friends at school was a young man whose character was not very good.
[Graph: LCP (0.2–0.8) plotted against word position 1000–2200.]
Mohandas knew this, but refused to accept the advice of others and felt that he would be able to change the character of his friend. The family of Gandhi belonged to a religious group which did not believe in taking the life of any creature, and so the eating of meat was forbidden them. But Mohandas's friend set out to make him believe that the eating of meat was good for him. He explained it in this way: "We are a weak people. The English are able to rule over us because they eat meat. I myself am strong and a fine runner. It is because I am a meat-eater. You should eat meat, too. Eat some and see what strength it gives you." After a time the young Mohandas partly believed his companion. He himself was certainly not strong and could hardly jump or run. He was afraid of the dark, too, and always had a light burning in his bed-room at night. The desire to eat meat was great, even though he hated to deceive his father and mother. One day the two boys went off to a quiet place by the river alone, and there Mohandas tasted meat, goat meat, for the first time. It made him sick. For about a year after that, from time to time his friend arranged for him to eat meat. At last Mohandas stopped completely, believing that nothing was worse than deceiving his father and mother in this way. They never learned of what he had done, but from that time on through his whole life he never tasted meat again.
At about this time he and another young man began to smoke, not because they really liked it but because they thought that they got pleasure in blowing smoke from their mouths like grown-up men. They had little money to buy cigarettes, and the unsmoked ends of their uncle's cigarettes were not enough. So occasionally they stole a little money from the servants in the house. Mohandas soon gave up smoking, and came to feel that it was dirty and harmful.
These actions of his troubled the young man Mohandas because he had determined to build his life on truth, and he knew that in deceiving his father and mother and breaking the rules laid down by his religion he was not honest. There was one more event of the same kind. Once when fifteen years of age he stole a small piece of gold from his older brother, and the deed lay heavy on his mind. Finally he wrote out the story of what he had done, asking that he be punished and promising that he never again would steal. Feeling very much ashamed, he gave this letter to his good father, then a sick man. The father read it carefully, closed his eyes in thought, and the tears came. He slowly tore up the letter. The boy had expected angry words, and the sorrowful but loving feelings of the father were never to be forgotten by the son.
At the age of eighteen Gandhi went to a college, but remained for only part of the year. The lessons did not interest him and he did not do well. Soon after this he was advised to go to England to study to be a lawyer. This would not be easy. It was difficult for him to leave India and to go to a foreign land where he would have to eat and drink with foreigners. This was against his religion, and most leaders of his group were against his going. Yet, in spite of all difficulties, the young Mohandas, at the age of eighteen, sailed for England, leaving a wife and child behind.
On board ship he wore, for the first time, the new foreign clothes provided by his friends. He wore his black suit, carefully keeping his new white clothes until he reached England. This was at the end of autumn, and on landing he was much troubled to find he was the only person so dressed. To make matters worse, he could not get at his baggage to change his clothes. In his own account of his early days in London, we find two interesting events.
One of these was his difficulty in finding suitable food. Unlike most of the Indians in England, he followed the rule of his religion and would not eat meat. This was not easy, and he was often hungry at the end of a meal. What was his joy when he discovered a dining-place where no meat of any sort was served. He learned for the first time that there were many people in England who for health reasons ate no meat. It pleased him to find science giving support to his religious beliefs. Later he found it easier to prepare breakfasts and suppers in his own room, and to buy his meals in the middle of the day.
The other event is one which later gave him and his friends much amusement. The young Indian tried to "play the English gentleman". He decided that if he could not eat like an Englishman, he would dress like one and act like one in other ways. He bought new clothes and a tall silk hat, and asked his brother to send him a gold watch-chain. Then he spent some time each morning dressing with care and brushing his thick hair. Following the advice of friends, he took lessons in dancing, French, playing a musical instrument and speaking in public. But in these arts he did not do very well, and his money was rapidly disappearing. At the end of three months he saw that he was not making the best use of his time, and gave up all this. He began to study law.
At this time also he became more interested in religions. When friends asked him to help them in their understanding of the Gita, the holy book of his own Hindu religion, he began to see how beautiful it was. Before long it became for him the one book for the best knowledge of Truth. Someone gave him a Bible, and in it he found some teachings of Jesus which he liked very much because they were so like certain ideas in the Gita. Then from a reading of a book by the English writer Carlyle, he learned about the Prophet Muhammad and about his greatness and bravery and simple living. At this time he was beginning to learn that the truth he loved was not to be found in any one religion only.
After four years of study, young Gandhi passed his law examinations and in 1891 returned to India. When he landed he was met by friends who told him of his mother's death. This was an even greater shock to him than the death of his father before he went to England. The next few years were not happy ones. He found his work as a lawyer not at all interesting, and came to feel that he was not fitted for this kind of occupation. He had trouble on the one occasion when he was in court. He almost fainted, and when his turn came to speak he could not say a word.
[Graph: LCP (0.2–0.8) plotted against word position 2000–3200.]
He would welcome a change. This came when he was invited to go to South Africa to advise a rich Indian merchant who was trying to collect a large amount of money from a member of his family. We find him at the age of twenty-four in Durban, South Africa.
Gandhi soon found that conditions among the many Indians in South Africa were not at all right. He learned this first when he went to court wearing foreign clothes and a turban. He refused and left the court. This turban was soon to become famous all over South Africa. Most of the Indians who had left their own land to look for work in Africa were considered of a low rank and were known as "coolies". Gandhi was thus a "coolie" lawyer.
A few days after he arrived, Gandhi was sent off to another city on business for his employer, Abdullah Sheth. When a white man travelling in the same train discovered him in a first-class seat he called a railway guard, who ordered him to leave the first-class carriage. Gandhi replied that he had bought a first-class ticket and intended to use it. A policeman came and forced him to leave the train. The next day something even worse happened. While making a journey in a large public carriage, he was given a seat outside with the driver. During the journey the white man in charge wanted his seat. When Gandhi refused to move, the man struck him, but the other white people in the carriage made the man stop. When he reached the city he drove to the main hotel, and there received another shock. The hotel would not take him in. It was events like these which made Gandhi feel that someone was needed to help the Indians in Africa. He himself was not proud, and he was not dependent upon a comfortable way of living. Later he accepted for himself the simple living of the poorest Indians, and travelled third-class in trains at all times. But it hurt him to see the people of his country treated badly, and so he continued to work against all attempts to treat him and others in a way that was not fair and just.
After a time he came to feel that it would be unwise for the merchant who employed him to go to the courts to get back the money that was owed him. As a result of very hard work lasting months, he was able to get the two merchants to agree outside of court upon the amount of money to be paid and how it was to be paid. This success led him to believe that most quarrels between people could be, and should be, settled in a peaceful manner with the aid of friends.
During this year he met a number of Christians who were eager that he should become a Christian, and Moslems who hoped that he would become a Moslem. He read from the Bible and Koran and from books about both religions. But at the same time he was coming to enjoy and depend more and more upon the holy books of Hinduism and was coming to find for himself deep happiness and peace in them.
At the end of a year his work with Abdullah Sheth was finished and he planned to return to India. But at a good-bye dinner given him in Durban he learned that a law was being planned to take away from all Indians still more of their rights. During the talk at the dinner it was decided that Gandhi must remain in South Africa and work for the rights of the Indians. Thus began twenty years of hard work for the Indians of South Africa.
At the end of three years he returned to India for several months, and then came back by ship with his wife and two children. While in India he had tried to tell the people there how Indians were treated in South Africa, and news of what he had spoken and written had reached the white people living in Natal before he arrived. When he attempted to land he was recognized, and cries of "Gandhi, Gandhi!" quickly brought a crowd together. The crowd gathered around him, threw stones and eggs at him and struck him. He was saved by the courage of the wife of the English Chief of Police, who walked along with him until policemen came to his help. He was then able to escape from the angry crowd by dressing himself as an Indian policeman and slipping out of the back door while the Police Chief held the crowd's attention in front.
It is not possible to describe all the events of the years that Gandhi spent in South Africa serving his fellow Indians, and working to improve their conditions and to make the government treat them more justly. He gave up a position in which he was earning a lot of money in order to join with the poor people for whom he was working. In all his work his wife helped him, and believed in him and gave him courage to go on.
From the struggle in South Africa he gained a strong belief in certain ways of action which were to be so important later in his own country. More and more he came to believe in a "soul-force". This was a struggle against evil and force not by using hatred and force, but by love and by quietly refusing to obey unjust laws. Those who believed as he did and followed him would not work with the government or obey an unjust law. In the end there was little that the government could do about it. Gandhi was often put in prison, but his followers continued to carry on the work. When Gandhi left South Africa in 1914 very great improvements in the conditions of the Indians there had taken place.
Gandhi returned to India at the beginning of the First World War to find himself already recognized as a leader. His work in South Africa had been followed by the people, and he was now everywhere spoken of as "Mahatma" Gandhi. He settled down near Ahmedabad, where he started an Ashram, a religious group-home. People of any race or religion were invited to come and join him, if they were willing to make certain promises. These were: (1) always to speak the truth; (2) not to fight or hate other people; (3) to eat only what was necessary to keep them healthy; (4) not to own anything that was not necessary.
The Untouchables were the lowest rank in the Hindu religion; they were allowed to do only the lowest kind of work; but they were welcome in the Gandhi home. When a family of Untouchables did come to join the group, trouble arose. The neighbours threatened that they would have nothing to do with them, and the rich Hindus who were helping to support the home with money suddenly stopped giving.
[Graph: LCP (0.2–0.8) plotted against word position 3000–4200.]
Gandhi was not troubled, but started making plans to move the whole group into the part of the city where the Untouchables lived. He planned that they all would get their living by doing the low work that only Untouchables were allowed to do. While these plans were being made, the Mahatma was called aside by a Moslem merchant, who asked him if he would accept money from him for the help of the Ashram. The next day the man returned with a large amount of money, enough to keep the home going for a year. Gandhi said: "God has sent us help at the last moment." This event was the first of many which was to give the Untouchables a new place in Indian life. At this time, and for the rest of his life, the Mahatma was wearing the simple native clothing made of cotton cloth spun in a home.
Gandhi's great aim in life was to help to improve the conditions of poor and suffering people, and to aid his people in any way he could, but always without using force. He was against every sort of evil, no matter of what kind. When he tried to find out about the conditions among poor farm workers, the people crowded around him by the hundreds. A friend had come among them, someone who wanted to help them, and to them this was something new. When the police ordered Gandhi to leave the place, he refused, and in court he explained why he could not obey. Then he asked the court to punish him for breaking the law. The court did not know what to do with such a man, and so let him go free. This was the first step in what came to be an important and common event in many parts of India -- to refuse to obey a law considered to be unjust, and at the same time calmly to accept any punishment that might be given.
Little by little the people of India came to understand what the Mahatma meant by fighting force with love, instead of fighting force with force. In 1930 there was the famous Salt March. According to the law, no one was allowed to make salt from sea water, but must buy it through the government. Gandhi considered that this was a bad and unjust law and so should not be obeyed. He said publicly that he would lead his followers to the sea, two hundred miles away, and there disobey the law. For three weeks, while the whole world watched and while conditions in India were troubled, the little old man, dressed in the white cotton which he had spun himself, walked steadily on. Crowds followed him, the people changing from village to village, on and on, until they reached the sea. There he made a handful of salt. God had given the sea; no government of man could keep it from the people. He was put in prison for a time, but not for long.
The struggle of the Indian people for self-government had begun. Gandhi wanted self-government, but he knew that Indians must show that they were ready for it. "Even God," he said, "cannot grant it; we must work for it and win it ourselves." He began to attack the British government in his writings because it was unwilling to free India, but he still believed in love and not hatred, and he set his face against the use of force. He was sent to prison several times because of what he said and what he did. When his followers did not obey him and used force, he went without food, sometimes for so long that he almost died. His followers grew in number and in strength. Crowds gathered to see him pass and to hear him speak. All India read what he wrote. Important leaders of India and other parts of the world came to talk with him about their plans, and to listen to his message of peace and love for the world. The struggle for self-government was long, and in the end success came. After long years an Act was passed making India a free nation. Everyone knew that the man who had done more than anyone else to bring this about was Gandhi.
But Gandhi was troubled in spite of his success. Such terrible quarrels had arisen between the Moslems and the Hindus that India had had to be divided between them, and there were now two countries: India for the Hindus and Pakistan for the Moslems. Gandhi so loved his country and so hated quarrels that this division made him very unhappy.
Terrible things happened in many parts of India, especially where Hindus and Moslems lived side by side. Fighting between the two groups broke out, and men, women and children were killed. Hundreds of thousands of people were without homes and there was very great suffering. In the part of the country in which Gandhi was living, peace came sooner than in other parts of India, because Gandhi had said that he would refuse to eat until the fighting stopped. Both Hindus and Moslems respected him so much that they kept the peace. But Gandhi's life was coming to its end. On January 30, 1948, he was walking slowly from his home to attend a prayer meeting. A young Hindu thought that Gandhi had done harm to the Hindus because he was friendly with the Moslems; he pushed his way through the crowd and shot Gandhi in the stomach. Some minutes later a man came out of the house into which the body had been carried and said to the waiting crowd: "Gandhi is dead!"
Another great Indian leader, Pandit Nehru, speaking over the radio that night, said: "The light has gone out of our lives and everywhere it is dark. The father of the nation is no more. The best prayer we can offer is to give ourselves to Truth and carry on the noble work for which he lived and for which he died." A few days later, following the custom of the Hindu religion, Mahatma Gandhi's body was burned in the presence of a great crowd, and later the ashes were scattered over the waters of the sacred rivers. So ended the life, but not the spirit, of one of the great men of the world.