
Computing Lexical Cohesion as a Tool for Text Analysis

Hideki Kozima

Course in Computer Science and Information Mathematics

Graduate School of Electro-Communications

University of Electro-Communications

Doctoral Thesis, December 13, 1993

Abstract

Recognizing the coherent structure of a text is an essential task in natural language understanding. It is necessary, for example, to resolve anaphora, ellipsis, and ambiguity. One of the dominant factors of coherence of the text structure is lexical cohesion, namely the dependency relationship between words based on associative relations in common knowledge.

This thesis proposes an objective and computationally feasible method for measuring lexical cohesion, especially semantic relations, between words. Lexical cohesion between words is computed on a semantic network constructed systematically from a subset of an ordinary English dictionary. Spreading activation on the semantic network analyses the meaning of a word into a 2,851-dimensional semantic space and computes the strength of lexical cohesion between any two words in the dictionary.

As an evaluation of the measurement of lexical cohesion, this thesis then presents a quantitative indicator, Lexical Cohesion Profile (LCP), for segmenting narratives into scenes, the smallest domain in which text coherence can be defined. LCP is a record of the density of lexical cohesion of the words in a window (e.g. 51 words long) that moves forward word by word over the text. Hills and valleys in a graph of LCP plotted against word position indicate alternations of scenes in the text.

A psychological experiment shows that LCP correlates closely with human judgements. The evaluation through this text-level application reveals that the proposed measurement of lexical cohesion works well as an indicator of the coherent structure of a text.

The measurement of lexical cohesion provides semantic information for text analysis. The segmentation scheme provides the framework for recognizing coherent text structure. Both can be applied to various studies in a broad range of fields in natural language processing.

Contents

Chapter 1. Introduction
Chapter 2. Related Work and the Strategy of This Thesis
Chapter 3. Computing Lexical Cohesion
Chapter 4. Segmenting Narratives into Scenes
Chapter 5. Retrospects and Prospects
Chapter 6. Conclusion


1 Introduction

Words and phrases in a text display a kind of mutual dependence which creates a coherent texture: they do not occur at random. The texture is what distinguishes a text from something that is not a text. Let us refer to the texture under the heading of text structure, following recent studies of text understanding [Hobbs, 1979; Beaugrande and Dressler, 1981; Grosz and Sidner, 1986; Mann and Thompson, 1987; Morris and Hirst, 1991; Hahn, 1992].

Recognizing the coherent text structure is an essential task in text understanding [Grosz and Sidner, 1986; Mann and Thompson, 1987]. The specific meaning of a lexical item in a text, especially of a pronoun (e.g. she) or a definite noun phrase (e.g. the box), can only be determined when placed in the whole structure of the text. One needs to recognize the text structure, for instance, in resolving anaphora, ellipsis, and ambiguity.

The threads of the textual structure are called cohesion or cohesive relations [Halliday and Hasan, 1976]. Cohesive relations within a text are relationships between items of any size, from single words to lengthy passages, over gaps of any distance. They are established where the interpretation of some item in the text is dependent on that of another. Let us consider the following text.

    Molly came to a theatre. But she couldn't see
    Desmond. The film had already started. She decided
    to wait for the next one. For two hours!

Several types of cohesive factors can be seen in the text: conjunction (... But ...), coreference (Molly = she), substitution (film = one), ellipsis (^ For two hours), and lexical cohesion (theatre = film).

Lexical cohesion is the aspect on which this thesis focuses its effort. Lexical cohesion is the dependency relationship between words (or lexical items) based on associative relations in common knowledge. Lexical cohesion plays a dominant role in text structure, yet it has no clear computational definition. There have been several attempts to compute lexical cohesion, for example [Osgood, 1952; Morris and Hirst, 1991]. These attempts, however, face difficulties in managing the common knowledge objectively. (Details of lexical cohesion and the related work are described in Chapter 2.)

This thesis has two topics: (1) a proposal for an objective and computational measurement of lexical cohesion between words [Kozima and Furugori, 1993a, 1993b, 1993c] (described in Chapter 3), and (2) its application to analysing the text structure (described in Chapter 4), namely segmenting narratives into coherent scenes [Kozima, 1993; Kozima and Furugori, 1993d]. The latter, text segmentation, is intended as the evaluation of the proposed measurement. The rest of this chapter briefly outlines these two topics in turn.

1.1 Computing Lexical Cohesion -- An Outline

The first topic in this thesis is computing lexical cohesion. Lexical cohesion is a relationship between words which makes the words signify identical or semantically related concepts in common knowledge. In view of recognizing it, lexical cohesion is classified into two major types: reiteration (or repetition) and semantic relations.

• Reiteration

    Molly likes cats very much.
    She keeps a cat in her room.

• Semantic relations

    Desmond saw a cat in the street.
    It was Molly's pet.

    Molly goes to the north.
    Desmond goes to the east.

    Desmond often goes to a theatre.
    He likes films very much.

Reiteration of words is easy to capture by morphological analysis. Recognizing semantic relations is difficult for computers, since it requires dealing with large and objective common knowledge.

The strategy of this thesis is to use an English dictionary as the common knowledge for recognizing lexical cohesion. A dictionary is the lexical knowledge shared by people in a linguistic community. Each of its headwords is defined by a phrase which is composed of the headwords and their derivations. A dictionary is a closed paraphrasing system, or a tangled network of words.

Lexical cohesiveness σ(w, w′) ∈ [0, 1], namely the strength of lexical cohesion between words w and w′, is computed on a semantic network which is systematically constructed from the English dictionary. Each node of the semantic network represents a headword of the dictionary and has links to other nodes -- links to the words in the dictionary definition of the headword. As illustrated in Figure 1.1, spreading activation [Waltz and Pollack, 1985; Rumelhart et al., 1986] on the network computes lexical cohesion between any two words in the dictionary. The following examples suggest the behaviour of the lexical cohesiveness σ(w, w′).

Figure 1.1 Computing the lexical cohesiveness between words w and w′ by spreading activation on the semantic network. (Activate w, then observe the activity of w′.)

    w     w′     σ(w, w′)
    cat   pet    0.133722   (cohesive)
    cat   mat    0.002692   (incohesive)

The value of σ(w, w′) increases with the strength or tightness of the semantic relation between w and w′.

1.2 Segmenting Narratives into Scenes -- An Outline

The second topic in this thesis is text segmentation -- segmenting a text into coherent units of the text structure. Analysing the coherent text structure is the most important purpose of computing the lexical cohesion between words. This text-level evaluation will reveal the nature of the measurement of lexical cohesion.

Most studies on text structure assume that a text can be segmented into units that then form a hierarchical structure [Grosz and Sidner, 1986; Mann and Thompson, 1987]. It is also commonly agreed that each unit plays its own role (as introduction or conclusion, for instance) in the whole text. However, no clear account has been given of how to segment a text into such units computationally.

This thesis deals with scenes, namely contiguous and non-overlapping units of a narrative text. A scene, whether or not it is explicitly realized in a device like a paragraph, is defined as a sequence of sentences which displays local coherence. A scene describes, just as in a movie, certain objects (characters and properties) in a situation (time, place, and backgrounds). This suggests that a scene is the smallest domain in which text coherence can be defined.

Figure 1.2 Correlation between LCP (mutual lexical cohesiveness in the moving window) and a boundary of coherent scenes.

Lexical Cohesion Profile (LCP) is a quantitative indicator proposed here for marking scene boundaries in narrative texts. LCP is a record of the mutual lexical cohesiveness of the words in a window (e.g. 51 words long) that moves forward word by word over the text. Since a coherent portion of a text tends to be lexically cohesive [Halliday and Hasan, 1976; Morris and Hirst, 1991], the mutual lexical cohesiveness of a text portion suggests its local coherence.

A graph of LCP plots the local coherence estimated from the mutual lexical cohesion at every point of a text. Hills and valleys of the graph indicate alternations of scenes in the text, as illustrated in Figure 1.2. Here lies the basic idea of LCP:

• When the window is inside a scene, the words in the window tend to be cohesive, making LCP high.

• When the window is crossing a scene boundary, the words in the window tend to vary lexically, making LCP low.

So the valleys (or minimum points) of LCP can be considered as marking scene boundaries.
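To make the windowing concrete, here is a minimal sketch in Python. It assumes a word-level cohesiveness function sigma(w, w2) in [0, 1], such as the measure defined in Chapter 3; scoring a window by the average pairwise cohesiveness of its words is one plausible reading of "mutual lexical cohesiveness" (Chapter 4 gives the actual definition), and the function names are illustrative.

    # A sketch of LCP. `sigma` is an assumed word-level cohesiveness
    # function in [0, 1] (cf. Chapter 3); each window is scored by the
    # average pairwise cohesiveness of its words.
    def lcp(words, sigma, window=51):
        half = window // 2
        profile = []
        for i in range(len(words)):
            w = words[max(0, i - half):i + half + 1]
            pairs = [(a, b) for a in w for b in w if a != b]
            score = sum(sigma(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
            profile.append(score)
        return profile

    def valleys(profile):
        # local minima of the profile, i.e. candidate scene boundaries
        return [i for i in range(1, len(profile) - 1)
                if profile[i - 1] > profile[i] < profile[i + 1]]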

Comparison with the scene boundaries marked by a number of subjects shows that the valleys of LCP closely correlate with the dominant scene boundaries on which most subjects agreed. This also suggests the validity of the lexical cohesiveness, which is the most significant factor of scene coherence.

2 Related Work and the Strategy of This Thesis

The necessity for recognizing coherent text structure has been noticed in recent studies of text understanding. For example, Hobbs [1979] proposed a set of coherence relations (e.g. elaboration, parallel, and contrast) based on inferences between successive portions of a text. Mann and Thompson [1987] proposed rhetorical structure theory, which characterizes the hierarchical structure of a text in terms of unstated but inferred propositions (e.g. motivation, enablement, and solutionhood) between clauses in the text.

Grosz and Sidner [1986] proposed discourse structure theory, a general theory common to all discourses. It assumes that a discourse structure is composed of three separate but interrelated components: (1) linguistic structure -- segmentation of a discourse into segments, (2) intentional structure -- the purpose of each segment with respect to the overall discourse, and (3) attentional state -- a stack-based model of the topics to which participants of the discourse pay their attention.

The discourse structure theory and other related studies presuppose that a text being analysed has already been partitioned into segments, namely the linguistic structure, where each segment displays local coherence and plays its own role. While the need for text segmentation is generally agreed upon, there is little consensus on computational definitions of the local coherence of a segment or of how a text is partitioned into segments.

This chapter briefly reviews related work on cohesion, especially on lexical cohesion, and also describes the strategy of this thesis for computing lexical cohesion. Section 2.1 makes clear the nature of cohesion and the relationship between lexical cohesion and common knowledge. Section 2.2 reviews two major approaches to computing lexical cohesion: a psycholinguistic approach and a thesaurus-based approach. Section 2.3 describes the strategy of this thesis: a dictionary-based approach.

2.1 Cohesion and Lexical Cohesion

Cohesion is what makes a sequence of lexical items into a coherent texture. Cohesive relations are dependency relationships of interpretation of the lexical items. This section briefly reviews the function and structure of cohesion and also of lexical cohesion.

2.1.1 Major Types of Cohesion

Several types of cohesive factors have been recognized, as exemplified in the preceding chapter. Described below are five major types of cohesive factors [Halliday and Hasan, 1976], namely conjunction, coreference, substitution, ellipsis, and lexical cohesion.

Conjunction covers a cohesive bond between what has been said before and what is about to be said, expressed by a conjunction (e.g. but) or a conjunctive adverb (e.g. accordingly). For example:

• Wash and core six apples.
  Then put them into a bowl.

• Molly came to a theatre.
  However, she couldn't see Desmond.

This cohesion type includes additive, adversative, causal, and temporal relations between clauses.

Coreference is formed by features that cannot be semantically interpreted without reference to some other features in the text. For example:

• Molly came to a theatre.
  But the girl couldn't see Desmond.

• No one knows that.
  Desmond is getting married.

Two subtypes of coreference are recognized: anaphora (in the first example), referring backward, and cataphora (in the latter), referring forward. Yet another subtype is exophora (or deixis), referring to something outside the text (e.g. Look at that.), whose interpretation requires broader context.

Ellipsis and substitution are variants of the same type of cohesion; both require that the missing expression be grammatically appropriate for insertion in place. Substitution serves as a place-holding device, showing where something has been omitted:

• The film had already started.
  She waited for the next one.

• Desmond will come here on time.
  I think so.

Ellipsis, in contrast, is the complete omission of an expression which can be recovered by syntactic or semantic expectations from the preceding or succeeding text:

• Desmond ordered apple juice,
  and Molly ^ orange juice.

• Put the apples into a bowl.
  Now add some sugar ^.

Lexical cohesion semantically relates one word with another in the text; it is classified into two subtypes: reiteration and semantic relation. Reiteration is repetition of a word by the same word or its derivations:

• I saw a cat in the street.
  But I hate cats.

• Driving a car is interesting.
  But I can't drive by myself.

Semantic relation between words is the semantic relationship between the concepts referred to by the words:

• A cat was running along the street.
  It was Molly's pet.

• Molly often goes to a theatre.
  She likes films very much.

Note that lexical cohesion occurs not only between pairs of words but also over a succession of a number of related words, and thus forms a lexical chain (or a thread of texture) in a text.

2.1.2 Lexical Cohesion and Common Knowledge

Lexical cohesion, especially semantic relation, between words (or lexical items) is the relationship between the concepts referred to by the words; the conceptual relationship lies in the common knowledge shared by people in a linguistic community. In view of the traditional frame-based knowledge representation [Minsky, 1975; Schank, 1980], semantic relation is classified into two categories: systematic semantic relation and non-systematic semantic relation [Morris and Hirst, 1991].

Systematic semantic relation is the semantic relationship logically classifiable in the structure of common knowledge. For example:

• A cat was running along the street.
  It was Molly's pet.

• I saw a white cat.
  However, there were no black ones.

Such structural relationships can be analysed in terms of the following logical relationships: synonymy of close similarity (e.g. hear = listen), hyponymy of general and specific (e.g. animal = cat), metonymy of whole and part (e.g. room = window), and antonymy of opposites (e.g. weak = strong).

Non-systematic semantic relation is the other semantic relationship, which is not logically classifiable in the knowledge structure. For example:

• Molly often goes to a theatre.
  She likes films very much.

• Desmond is working at the restaurant.
  He is a good waiter.

Such non-structural relationships include collocation [Firth, 1957], i.e. the tendency of words to co-occur in similar situations.

Figure 2.1 An example of semantic differential: average ratings of the word polite on ten dimensions with opposed adjectives at both ends (angular-rounded, weak-strong, rough-smooth, active-passive, small-large, cold-hot, good-bad, tense-relaxed, wet-dry, fresh-stale).

Recent studies of knowledge representation and parallel distributed processing [Minsky, 1986; Waltz and Pollack, 1985; Rumelhart et al., 1986] have claimed that the conceptual relations in common knowledge do not have such names as synonymy, antonymy, etc. So both categories of semantic relation should be treated as unnamed associative relations between concepts in common knowledge.

2.2 Two Approaches to Lexical Cohesion

There have been two major approaches to computing lexical cohesion, especially associative relations, between words. One is a psycholinguistic approach, which quantifies the psychological distance between words. The other is a thesaurus-based approach, which regards thesauri as the common knowledge on which lexical cohesion is defined.

2.2.1 Semantic Differential

Psycholinguists have proposed methods for measuring associative relations between words. One of the pioneering studies is semantic differential [Osgood, 1952], which analyses the meaning of a word into a range of different dimensions. Subjects are asked to rate a word in terms of where it would fall on 50 dimensions with opposed adjectives at both ends. For example, if the subjects feel that the word polite is good, they place a mark towards the 'good' end of the 'good-or-bad' dimension. Figure 2.1 illustrates ten of the dimensions, giving the average responses of 40 subjects to the word polite (after [Osgood, 1952]).

Recent studies of knowledge representation, especially of distributed knowledge representation, are somewhat related to Osgood's semantic differential. Most of them describe meanings of words or sentences using special symbols like semantic primitives (e.g. ATRANS and MBUILD, in [Schank, 1980]) and microfeatures (e.g. animal and plant, in [Waltz and Pollack, 1985; Hendler, 1989]) that correspond to the semantic dimensions. Figure 2.2 illustrates the analysis of the words hunting, gambling, and dollar into patterns of microfeatures (after [Waltz and Pollack, 1985]).

Figure 2.2 Examples of word analysis into microfeatures: strong, mild, and negative associations of the words hunting, gambling, and dollar with microfeatures such as second, minute, hour, day, week, month, year, decade, inside, house, store, office, school, factory, casino, bar, restaurant, theatre, racetrack, street, park, rural, forest, lake, desert, mountain, canyon, and seashore.

The semantic differential procedure provides quantitative data which are presumably verifiable. However, the following problems arise when the semantic differential procedure is used as a measurement of word meaning and word association.

• Connotation vs denotation
  The procedure is not based on the denotative meaning of words, but only on the connotative emotions attached to the words.

• Coverage of meaning
  It is difficult to choose the relevant dimensions required for a semantic space sufficient for analysing any English word. The procedure selects the representative dimensions in terms of the frequency of their use rather than in terms of their logically exhaustive coverage, as given in thesauri.

For example, the procedure will draw out the very good and slightly strong connotations of the word mother, but it will not indicate the definition of mother: a female parent of a child or animal.

2.2.2 Thesaurus-based Analysis

A thesaurus is a book which classifies a large number of words into categories according to logical relations between their meanings, rather than arranging them in alphabetical order. Roget's thesaurus [1911] is composed of 1000 basic categories; each category, as shown in Figure 2.3, contains a series of paragraphs grouping closely related words. Within each paragraph, still finer groups are marked by semicolons; in addition, a semicolon group may have pointers, shown as '&c. ...', to other related categories or paragraphs.

    Word (#562)
    N. word, term, vocable; name &c. 564; phrase &c. 566;
    root, etymon; derivative; part of speech &c. (grammar)
    567; ideophone.
    dictionary, vocabulary, lexicon, index, glossary,
    thesaurus, gradus, delectus, concordance.
    etymology, derivation; glossology, terminology,
    orismology; paleology &c. (philology) 560.
    lexicography; glossography &c. (scholar) 492;
    lexicologist, verbarian.
    ...

Figure 2.3 A sample category in Roget's thesaurus [1911].

A thesaurus has an index, which allows retrieval of the categories related to a given word. For example, the word dictionary has the following index entry:

    dictionary: List (#86), School (#542), Word (#562)

which indicates that each of the categories List, School, and Word includes the word dictionary. (See also Figure 2.3.)

Morris and Hirst [1991] used Roget's thesaurus as the common knowledge for determining whether or not two words are associatively related. Their method captures several types of thesaural relations between words. The two major types are described as follows:

• car ∈ Vehicle (#272) ∋ truck
  (The two words have a category in common in their index entries.)

• drive ∈ Journey (#266) → Vehicle (#272) ∋ car
  (A category of one word contains a pointer to a category of the other word.)

Note that the examples above are computed on the machine-readable version of Roget's thesaurus [1911], not on the printed version used in [Morris and Hirst, 1991].

The thesaurus-based method is quite objective and computationally feasible, since it regards the thesaurus as the common knowledge shared by people. The method can capture almost all types of semantic relations between words: for example, in systematic semantic relations, polite = courteous (synonymy), plant = flower (hyponymy), hand = finger (metonymy), and good = bad (antonymy), and in non-systematic semantic relations, post = letter and drink = coffee.

However, thesauri are designed to help writers find the words that best express their ideas, not to provide the meanings of words. This nature of thesauri poses the following problems: (1) thesauri do not provide information about the semantic difference between words juxtaposed in a category, and (2) thesaural relations indicate only whether or not two words are semantically related, not the strength of the semantic relation. These points are crucial for computing lexical cohesion. The following section provides preliminary solutions to these problems.

2.3 Dictionary-based Analysis -- The Strategy of This Thesis

A method for computing lexical cohesion between words as an indicator of text coherence must satisfy the following requirements, recognized through the discussion of related studies above.

• Denotation
  The denotational meaning of words, not their connotational or emotional meaning, should be measured.

• Coverage and sensitivity
  The semantic difference between any two words should be computable, regardless of their categories in a thesaurus.

• Scalability
  The strength of lexical cohesion, not only its all-or-nothing existence, should be computable.

This section outlines the strategy of this thesis for coping with these requirements. In short, the strategy is to use a dictionary as the common knowledge in which lexical cohesion between words is defined.

2.3.1 Dictionaries and Common Knowledge

Recent studies of knowledge representation describe the meaning of texts in terms of artificial symbols like semantic primitives [Schank, 1980] or microfeatures [Waltz and Pollack, 1985], as we have seen in the preceding section. However, Hjelmslev [1943], the leading theoretician of the Copenhagen School of linguistics, claimed a theoretical limitation of artificial languages:

• Any text in any natural language can be described not by subjective artificial languages, but only by the natural language itself. On the other hand, any text in an artificial language can be translated into a natural language.

• Artificial languages are meta-languages dependent on the knowledge system of a natural language, while a natural language is dependent only on the knowledge system of the natural language itself.

Each natural language (e.g. English or Japanese) works as a self-contained and self-sufficient device for describing the meaning of texts written in any language. A natural language is a system of signs which can articulate the real world entirely; no other system of signs can. Any knowledge or ideas for certain purposes can be represented by texts written in a natural language, as opposed to artificial languages. Therefore, the common knowledge for text understanding can be represented by, and only by, texts in a natural language. One form of such texts is a dictionary, which provides the knowledge of words shared in the minds of individuals. One may draw a distinction between the knowledge of a natural language and the knowledge of the real world. However, they are not ultimately separable, just as dictionaries and encyclopedias are not separable.

A dictionary is a reference book that lists words, usually in alphabetical order, along with information about their spelling, pronunciation, grammatical status, meaning, and use. A mono-lingual dictionary can be considered as a paraphrasing system of a natural language: each of its headwords is paraphrased by a phrase which is composed of its headwords and their derivations. So a dictionary is a self-contained and self-sufficient system in which every element is defined in terms of its relationships with other elements.

In view of structural linguistics and semiology [Saussure, 1916; Sapir, 1921; Hjelmslev, 1943], any language is characterized as a system based entirely on the associative relations (or paradigmatic relations) between signs (i.e. words or lexical items). In other words, the meaning of a sign is defined only by its associative relations with other signs in the system, without being dependent on entities in the real world. A mono-lingual dictionary is an example of such a closed system of signs. Viewed as a whole, it looks like a cross-reference network of words.

2.3.2 Semantic Differential on a Dictionary

The strategy of this thesis for computing lexical cohesion is semantic differential on a dictionary (hereafter, SDD), which analyses the meaning of a word into the strengths of its associative relations with the headwords of a mono-lingual dictionary. SDD is somewhat similar to Osgood's semantic differential [Osgood, 1952]. However, it differs from his method in the following points.

• Source of linguistic data
  In SDD, the dictionary works both as a semantic space and as the source of linguistic data for semantic differential, whereas Osgood obtained his linguistic data from psychological experiments on native speakers of the language (i.e. informants).

• Semantic dimensions
  SDD uses the headwords of the dictionary as semantic dimensions. Osgood used 50 dimensions (with pairs of opposed adjectives) in his semantic differential procedure, while SDD uses all headwords as the semantic dimensions into which the meaning of a word is analysed.

These points guarantee the objectivity and completeness of the semantic space of SDD as a field for analysing the meanings of words.

SDD satisfies the requirements for computing lexical cohesion described at the beginning of this section. The first and second requirements are handled as follows:

• Denotation
  SDD deals with the denotational meaning of words as described in their dictionary definitions, not the connotations attached to them. Dictionary definitions are the common lexical knowledge shared by people.

• Coverage and sensitivity
  SDD maps each word in the dictionary onto a point in the semantic space spanned by the dimensions of all headwords in the dictionary. Different words are mapped onto different points; two words are mapped onto the same point only if their definitions are identical.

The third requirement, the scale or strength of associative relations, is treated in the following manner:

• Scalability
  In SDD, each dimension of the semantic space is a continuous scale (for instance, the interval [0, 1] of real numbers), not a discrete scale (for instance, all-or-nothing).

Each dimension represents the strength of the associative relation between the word w being analysed and the headword w′ of that dimension.

SDD thus analyses the meaning of a word w into an N-dimensional vector of continuous scales, where N is the number of headwords in the dictionary. The semantic vector represents the strengths of the associative relations between w and the headwords in the dictionary; in other words, the semantic vector represents the meaning of w. The following chapter describes the method for computing the semantic vector of a given word and the method for computing the strength of lexical cohesion between words on a continuous scale.

3 Computing Lexical Cohesion

A computational method for measuring lexical cohesiveness [Kozima and Furugori, 1993a, 1993b, 1993c] is described in this chapter. The lexical cohesiveness is computed on a semantic network, called Paradigme, which is systematically constructed from a subset of an English dictionary: the Longman Dictionary of Contemporary English (hereafter, LDOCE). Section 3.1 describes how the network Paradigme is constructed from LDOCE.

Spreading activation [Waltz and Pollack, 1985; Rumelhart et al., 1986] on the network can compute the lexical cohesiveness between any two words in LDOCE -- directly for the 2,851 core words and their derivations, and indirectly for all the other headwords of LDOCE and their derivations. Section 3.2 describes how to compute the lexical cohesiveness on Paradigme. As an application, Section 3.3 describes a measurement of lexical cohesiveness between texts.

The lexical cohesiveness σ(w, w′) ∈ [0, 1] between words w and w′ is an objective and computationally feasible measurement of lexical cohesion. Section 3.4 discusses the nature and the limits of Paradigme and of the lexical cohesiveness computed on it. Finally, Section 3.5 gives a brief conclusion of this chapter.

    red¹ /red/ adj -dd- 1 of the colour of blood
    or fire: a red rose/dress | We painted the door
    red. -- see also like a red rag to a bull (rag¹)
    2 (of human hair) of a bright brownish orange
    or copper colour 3 (of the human skin) pink,
    usu. for a short time: I turned red with
    embarrassment/anger. | The child's eyes (= the
    skin round the eyes) were red from crying.
    4 (of wine) of a dark pink to dark purple colour
    -- ~ness n [U]

    (red adj                 ; headword, word-class
     ((of the colour)        ; unit 1 - head-part
      (of blood or fire) )   ; rest-part
     ((of a bright brownish orange
       or copper colour )
      (of human hair) )
     (pink                   ; unit 3 - head-part
      (usu for a short time) ; rest-part 1
      (of the human skin) )  ; rest-part 2
     ((of a dark pink to dark purple colour)
      (of wine) ))

Figure 3.1 A sample entry (of red/adjective) of LDOCE and the corresponding entry of Glossème (in S-expression).

3.1 Paradigme: A Field for Measuring Lexical Cohesion

The semantic network Paradigme is a field for measuring the lexical cohesiveness. It provides a semantic space in which the meaning of a word is analysed. Paradigme is systematically constructed from a small English dictionary, called Glossème, that is a subset of LDOCE.

3.1.1 Glossème: A Closed Subsystem of English

LDOCE is an English dictionary with a unique feature: each of its 56,000 headwords is defined using the words of the Longman Defining Vocabulary (hereafter, LDV) and their derivations. The lexicographers' use of LDV is restricted in that only the most frequent senses of the words, self-explanatory compounds, and phrasal verbs are permitted [LDOCE, 1987; Carter and McCarthy, 1988].

LDV consists of 2,851 words (counted as headwords in LDOCE, distinguishing homographs like red = adjective and red = noun) and 48 affixes (10 prefixes and 38 suffixes) that make derivations from the 2,851 core words. LDV is originally based on a survey of word frequency and restricted vocabulary for English language teaching [West, 1953], and has been updated by Longman with reference to more recent frequency information [LDOCE, 1987].

Glossème is a reduced version of LDOCE. It consists of every entry of LDOCE whose headword is included in LDV, so each word in LDV is defined by Glossème. Obviously, all words in Glossème (its headwords and the words in their definitions) are included in LDV and its derivations. It is worth noting that Glossème is a closed subsystem of English: each of its headwords is paraphrased into a phrase which is composed of the headwords and their derivations.

Glossème has 2,851 entries (the same size as LDV) consisting of 101,861 words (35.73 words/entry on average). As shown in Figure 3.1, an entry of Glossème has a headword, a word-class, and one or more units corresponding to the numbered definitions in the entry of LDOCE. Note that Glossème is described in S-expression notation.

Each unit has one head-part and several rest-parts. For example, the first definition in the entry red = adjective of LDOCE:

    1 of the colour of blood or fire

is converted into the following unit. (This conversion is partly done by hand.)

    ((of the colour)
     (of blood or fire) )

A head-part (e.g. (of the colour)), which corresponds to the first phrase in a definition, provides the broader meaning of the headword; rest-parts (e.g. (of blood or fire)), which correspond to the succeeding subordinates, restrict the meaning of the head-part to the specific one for the headword.

The structure of a unit is based on the structure of definitions in the dictionary: (1) a definition first provides the broader meaning of the headword, and (2) then imposes several restrictions on that meaning. The following schemes illustrate the major types of structure of dictionary definitions; the first phrase of each scheme is the head-part, and the remaining phrases are rest-parts. (See [Markowitz, 1986; Alshawi, 1987; Nakamura and Nagao, 1988] for details of the structure of dictionary definitions.)

    noun      = noun-phrase + adjectival-phrase/clause ...
    verb      = verb-phrase + adverbial-phrase/clause ...
    adjective = adjectival-phrase + adverbial-phrase/clause ...

    (red_1 (adj) 0.000000  ;; headword, word-class, and activity-value
      ;; referant
      (+ ;; subreferant 1
         (0.333333         ;; weight of subreferant 1
           (* (0.001594 of_1) (0.001733 the_1) (0.001733 the_2) (0.042108 colour_1)
              (0.042108 colour_2) (0.000797 of_1) (0.539281 blood_1) (0.000529 or_1)
              (0.185058 fire_1) (0.185058 fire_2) ))
         ;; subreferant 2
         (0.277778
           (* (0.000278 of_1) (0.000196 a_1) (0.030997 bright_1) (0.065587 brown_1)
              (0.466411 orange_1) (0.000184 or_1) (0.385443 copper_1) (0.007330 colour_1)
              (0.007330 colour_2) (0.000139 of_1) (0.009868 human_1) (0.009868 human_2)
              (0.016372 hair_1) ))
         ;; subreferant 3
         (0.222222
           (* (0.410692 pink_1) (0.410692 pink_2) (0.003210 for_1) (0.000386 a_1)
              (0.028846 short_1) (0.006263 time_1) (0.000547 of_1) (0.000595 the_1)
              (0.000595 the_2) (0.038896 human_1) (0.038896 human_2) (0.060383 skin_1) ))
         ;; subreferant 4
         (0.166667
           (* (0.000328 of_1) (0.000232 a_1) (0.028368 dark_1) (0.028368 dark_2)
              (0.123290 pink_1) (0.123290 pink_2) (0.000273 to_1) (0.000273 to_2)
              (0.000273 to_3) (0.028368 dark_1) (0.028368 dark_2) (0.141273 purple_1)
              (0.141273 purple_2) (0.008673 colour_1) (0.008673 colour_2) (0.000164 of_1)
              (0.338512 wine_1) )))
      ;; refere
      (* (0.031058 apple_1) (0.029261 blood_1) (0.008678 colour_1) (0.009256 comb_1)
         (0.029140 copper_1) (0.009537 diamond_1) (0.003015 fire_1) (0.073762 flame_1)
         (0.005464 fox_1) (0.005152 heart_1) (0.098349 lake_2) (0.007025 lip_1)
         (0.029140 orange_1) (0.007714 pepper_1) (0.196698 pink_1) (0.012294 pink_2)
         (0.098349 pink_2) (0.018733 purple_2) (0.028100 purple_2) (0.098349 red_2)
         (0.196698 red_2) (0.004230 signal_1) ))

Figure 3.2 A sample node of Paradigme (in S-expression).

3.1.2 Paradigme: A Semantic Network

The closed sub-dictionary Glossème is then translated into the semantic network Paradigme. Each entry of Glossème is mapped onto a node of Paradigme. Paradigme has 2,851 nodes (the same size as Glossème) with 295,914 unnamed links between the nodes (103.79 links/node on average). Figure 3.2 shows the sample node red_1 (corresponding to the entry of Glossème shown in Figure 3.1). Each node consists of a headword, a word-class, an activity-value, and two structures: a référant and a référé.

A référant provides information about the intension (i.e. definition) of the headword. It consists of several subréférants, each containing a set of links mapped from the corresponding unit in the entry of Glossème. For example, the second unit in the entry red = adjective:

    ((of a bright brownish orange
      or copper colour )
     (of human hair) )

is mapped onto the following subréférant.

    (0.277778
      (* (0.000278 of_1) (0.000196 a_1)
         (0.030997 bright_1) (0.065587 brown_1)
         (0.466411 orange_1) (0.000184 or_1)
         (0.385443 copper_1) (0.007330 colour_1)
         (0.007330 colour_2) (0.000139 of_1)
         (0.009868 human_1) (0.009868 human_2)
         (0.016372 hair_1) ))

Each subréférant has a weight, e.g. 0.333333 or 0.277778, which is computed from its position in the sequence of units, arranged in order of their significance.

A morphological analysis of the affixes defined in LDV maps all the derivations of LDV onto their root forms (i.e. the headwords of the nodes in Paradigme). For example, the word brownish in the unit shown above is mapped onto a link to brown_1, and the word colour onto two links, to colour_1 = adjective and colour_2 = noun. So a word can be identified with the corresponding node or nodes, and vice versa.

Each link in a subréférant, e.g. (0.065587 brown_1), consists of a weight and the headword of the node to which the link refers. The weight h_k ∈ [0, 1] of a link to a node w_k is computed from the frequency of the word w_k in Glossème and other information (such as whether the word is in a head-part or a rest-part), and is normalized so that Σ_k h_k = 1 in each subréférant.

Figure 3.3 Computing the lexical cohesiveness σ(w, w′) on Paradigme: (1) start activating w; (2) produce a pattern; (3) observe the activity of w′.

A référé, in contrast, provides information about the extensions (i.e. examples) of the headword; it is the converse of the référant, which shows the intension. The référé of a node w has links to the nodes referring to w. For example, the référé of red_1 (shown in Figure 3.2) describes examples of red things. The link to apple_1 in the référé of red_1 means that apple_1 has a link to red_1 in its référant; in other words, the entry apple in Glossème contains the word red. Each link in a référé, e.g. (0.031058 apple_1), also has a weight, which is computed from the weight of the corresponding link (e.g. that from apple_1 to red_1).

Details of the structure of Paradigme, especially of the translation procedure from Glossème, are described in Appendix A.
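For concreteness, the node structure can be rendered as plain data. The following is an illustrative Python sketch, not the author's implementation; the field names follow the S-expression of Figure 3.2, and the values are abridged from it.

    from dataclasses import dataclass
    from typing import List, Tuple

    Link = Tuple[float, str]           # (weight h_k, headword of the referred node)

    @dataclass
    class Subreferant:
        weight: float                  # unit weight, e.g. 0.333333
        links: List[Link]              # weights normalized so that they sum to 1

    @dataclass
    class Node:
        headword: str                  # e.g. "red_1"
        word_class: str                # e.g. "adj"
        activity: float                # activity value, initially 0.0
        referant: List[Subreferant]    # intension: links from the definition units
        refere: List[Link]             # extension: links from nodes that refer here

    # abridged from Figure 3.2:
    red_1 = Node("red_1", "adj", 0.0,
                 [Subreferant(0.333333, [(0.539281, "blood_1"), (0.185058, "fire_1")])],
                 [(0.031058, "apple_1"), (0.029261, "blood_1")])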

3.2 Computing Lexical Cohesion between Words

The lexical cohesiveness between words is computed by spreading activation on the semantic network Paradigme. At each point T in time, each node w_i in Paradigme has an activity value v_i(T). The activity value can be seen as passing through a set of uni-directional links to other nodes in Paradigme. The weight of each link determines the amount of effect that the referred node has on the referring node. Note that the weights of the links are fixed for all time, since this thesis does not deal with learning or evolution of the network.

Each node w_i computes its activity value v_i(T) at every point T (of discrete steps) in time. The spreading activation rule is given by

    v_i(T) = φ( R_i(T−1), R′_i(T−1), e_i(T−1) ),

where R_i(T) is the sum of the weighted activity values (at time T) of the nodes referred to in the référant, and R′_i(T) is the sum of those referred to in the référé. And e_i(T) is the activity value given to the node w_i from outside (at time T); to activate a node is to let e_i(T) > 0. The function φ sums up the three activity values in appropriate proportion and limits the output value to [0, 1]. Appendix B describes the spreading activation rule in detail.
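A single update step can be sketched as follows, under stated assumptions: the network is reduced to per-node lists of weighted référant and référé links, and the mixing proportions inside φ are placeholders, since the actual coefficients are specified in Appendix B and not reproduced here.

    # One step of spreading activation. `network` maps each node name to a
    # pair (referant_links, refere_links), each a list of (weight, neighbour)
    # pairs; `v` holds v_i(T-1); `e` holds the external inputs e_i(T-1).
    def phi(r, r_prime, e, p=0.4, q=0.4, t=0.2):
        # placeholder proportions; the real coefficients are in Appendix B
        return min(1.0, max(0.0, p * r + q * r_prime + t * e))

    def step(network, v, e):
        new_v = {}
        for node, (referant, refere) in network.items():
            r = sum(h * v[n] for h, n in referant)      # R_i(T-1)
            r_prime = sum(h * v[n] for h, n in refere)  # R'_i(T-1)
            new_v[node] = phi(r, r_prime, e.get(node, 0.0))
        return new_v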

3.2.1 Computing the Lexical Cohesiveness

The lexical cohesiveness σ(w, w′) between words w and w′ is computed by spreading activation on the semantic network Paradigme. As illustrated in Figure 3.3, the computing procedure (1) activates the node w, (2) produces an activated pattern on Paradigme, and (3) observes the activity value of the word w′, which indicates the strength of association from w to w′.

Activating a node w for a certain period of time causes the activity to spread over Paradigme and produces an activated pattern on it. Figure 3.4 shows an activated pattern produced from the word red. The graph plots the activity values of the 10 dominant nodes at each time step. I found empirically that the activated pattern approximately reaches equilibrium after 10 steps, although it never reaches the actual equilibrium. The activated pattern thus produced can be considered as a 2,851-dimensional vector; each of its dimensions, i.e. the activity value of one node, represents the strength of association with the node w.

The procedure for computing the lexical cohesiveness σ(w, w′) ∈ [0, 1] between words w and w′ is as follows.

1. Activate the node w with strength s(w) for 10 steps of time, where s(w) is the significance of w (defined below).

2. Then (at T = 10), an activated pattern P(w) has been produced on Paradigme, as shown in Figure 3.4.

3. Observe a(P(w), w′) -- the activity value of the node w′ in P(w). The lexical cohesiveness σ(w, w′) is finally given by s(w′) · a(P(w), w′).

Note that each node has no activity at the beginning of this procedure, and that a word and the corresponding node or nodes can be identified with the help of the morphological analysis (cf. Section 3.1).

Figure 3.4 An activated pattern produced from the word red. (Changes in the activity values of the 10 nodes holding the highest activity at T = 10: red_2, red_1, orange_1, pink_1, pink_2, blood_1, copper_1, purple_1, purple_2, and rose_2.)

The significance s(w) ∈ [0, 1] is defined as the normalized information of the word w in West's corpus [West, 1953]. For example, the word red appears 2,308 times in the 5,487,056-word corpus, and the word and 106,064 times. So s(red) and s(and) are computed as follows.

    s(red) = −log(2308/5487056) / −log(1/5487056) ≈ 0.500955
    s(and) = −log(106064/5487056) / −log(1/5487056) ≈ 0.254294

Note that the estimation of the words excluded from West's word list [West, 1953] virtually enlarges the original 5,000,000-word corpus. The frequencies of these extra words (9.65% of LDV) are estimated as the average frequency of their word class.
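The significance is easy to reproduce; the following sketch (the function name is illustrative) uses the counts quoted above.

    from math import log

    N = 5_487_056                      # size of the virtually enlarged corpus

    def significance(freq, n=N):
        # normalized information: -log(freq/n) / -log(1/n), in [0, 1]
        return log(freq / n) / log(1 / n)

    print(round(significance(2308), 6))    # s(red) -> 0.500955
    print(round(significance(106064), 6))  # s(and) -> 0.254294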

For example, let us consider the lexical cohesiveness between red and orange. First, we produce an activated pattern P(red) on Paradigme (as shown in Figure 3.4); in this case, both of the nodes red_1 = adjective and red_2 = noun are activated with strength s(red) = 0.500955. Then we compute s(orange) = 0.676253 and observe a(P(red), orange) = 0.390774. Finally, we obtain the lexical cohesiveness σ(red, orange) as follows.

    σ(red, orange) = s(orange) · a(P(red), orange)
                   = 0.676253 × 0.390774
                   = 0.264262.

Note that the fractions are rounded off to six decimal places.
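Putting the pieces together, the whole procedure can be sketched as below, reusing `step` from the sketch above; `nodes_of` stands for the morphological lookup from a word to its Paradigme node(s), and observing the maximum activity over homograph nodes is an assumption of this sketch.

    def sigma(w, w_prime, network, nodes_of, s):
        # sigma(w, w') = s(w') * a(P(w), w'); cf. the three-step procedure above
        v = {node: 0.0 for node in network}           # no activity at the start
        e = {node: s(w) for node in nodes_of(w)}      # activate w's node(s) with s(w)
        for _ in range(10):                           # near-equilibrium at T = 10
            v = step(network, v, e)
        # a(P(w), w'): taking the maximum over homograph nodes (e.g. red_1
        # and red_2) is an assumption of this sketch
        a = max(v[node] for node in nodes_of(w_prime))
        return s(w_prime) * a

With the real Paradigme data and the coefficients of Appendix B, sigma('red', 'orange') would reproduce the value 0.264262 above.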

3.2.2 Examples of the Computation

The procedure described above can compute the lexical cohesiveness σ(w, w′) ∈ [0, 1] between any two words w, w′ in LDV and its derivations. Computer programs for the procedures -- spreading activation (written in the C programming language and compiled on SunOS 4.1.3), and morphological analysis and others (written in Common Lisp and executed on KCL) -- can compute σ(w, w′) within 2.5 seconds on a workstation (SPARCstation 2, SunOS 4.1.3). Note that most of the time is used for spreading activation.

The lexical cohesiveness σ(w, w′) increases with the strength of the systematic semantic relation between the words w and w′, as shown in the following examples.

    w      w′       σ(w, w′)
    wine   alcohol  0.118078
    wine   line     0.002040
    big    large    0.120587
    clean  large    0.004943
    buy    sell     0.135686
    buy    walk     0.007993

The lexical cohesiveness σ also increases with the strength of the non-systematic semantic relation between words, as shown in the following examples.

    w         w′          σ(w, w′)
    waiter    restaurant  0.175699
    computer  restaurant  0.003268
    red       blood       0.111443
    green     blood       0.002268
    dig       spade       0.116200
    fly       spade       0.003431

Note that σ(w, w′) has direction (from w to w′), so that σ(w, w′) may not be equal to σ(w′, w). For example:

    w       w′      σ(w, w′)
    cow     cattle  0.303977
    cattle  cow     0.379470

The lexical cohesiveness σ(w, w′) increases with the significances s(w) and s(w′), which represent the meaningfulness of w and w′. The reason is that σ suggests the strength of the associative relation between words, so meaningful words should have higher lexical cohesiveness, while meaningless words (especially function words) should have lower values. For example:

    w      w′       σ(w, w′)
    north  east     0.100482
    to     theatre  0.007259
    films  of       0.005914
    to     the      0.002240

Also, the reflective lexical cohesiveness σ(w, w), i.e. the lexical cohesiveness of a word with itself, depends on the significance s(w), so that σ(w, w) < 1. For example:

    w       w′      σ(w, w′)
    waiter  waiter  0.596803
    of      of      0.045256

Figure 3.5 Computing the lexical cohesiveness of extra words. (An extra word is treated as a list of the words in its definition.)

Figure 3.6 A pattern produced from the word list {red, alcoholic, drink}. (Changes in the activity values of the 10 nodes holding the highest activity at T = 10: alcohol_1, drink_1, red_2, drink_2, red_1, bottle_1, wine_1, poison_1, swallow_1, and spirit_1.)

3.2.3 Lexical Cohesiveness of Extra Words

The lexical cohesiveness of words in LDV and its derivations is directly computed on Paradigme, as we have seen above; the lexical cohesiveness of extra words (i.e. those excluded from LDV) is indirectly computed by treating an extra word as the list of the words in its LDOCE definition, as illustrated in Figure 3.5. Note that each word in the definition is included in LDV or its derivations.

The lexical cohesiveness between two word lists, W = {w_1, ..., w_n} and W′ = {w′_1, ..., w′_m}, is defined as follows:

    σ(W, W′) = ψ( Σ_{w′ ∈ W′} s(w′) · a(P(W), w′) ),

where P(W) is an activated pattern produced by activating each word w_i in W with strength s(w_i)² / Σ_k s(w_k), and ψ is a function which limits the output value to [0, 1].

Figure 3.6 illustrates the activated pattern P(W) produced from the word list W = {red, alcoholic, drink}. It is worth noting that the nodes bottle_1 and wine_1 are highly activated in the pattern P(W), whereas those nodes never get such high activity in any pattern produced from a single word of W. So we may say that the overlapped pattern implies a bottle of wine.

For example, the lexical cohesiveness between linguistics and stylistics -- both extra words -- is computed as follows.

    σ(linguistics, stylistics)
      = σ( {the, study, of, language, in, general, and, of,
            particular, languages, and, their, structure,
            and, grammar, and, history},
           {the, study, of, style, in, written, or, spoken,
            language} )
      = 0.140089.

Obviously, both σ(w, W) and σ(W, w), where w is included in LDV or its derivations and W is not, are also computable in the same scheme (by replacing w with the word list {w}). Therefore we can compute the lexical cohesiveness between any two headwords of LDOCE and their derivations.
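The word-list case differs from the single-word case only in how the pattern is produced and in the final limiting step; below is a sketch under the same assumptions as before, with the limiting function ψ stood in for by a simple clamp.

    def sigma_list(W, W_prime, network, nodes_of, s):
        # sigma(W, W'): activate each w_i in W with s(w_i)^2 / sum_k s(w_k),
        # then sum the weighted activities observed over W'
        total = sum(s(w) for w in W)
        v = {node: 0.0 for node in network}
        e = {}
        for w in W:
            for node in nodes_of(w):
                e[node] = s(w) ** 2 / total
        for _ in range(10):
            v = step(network, v, e)                   # overlapped pattern P(W)
        out = sum(s(w) * max(v[n] for n in nodes_of(w)) for w in W_prime)
        return min(1.0, out)                          # clamp stands in for psi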

3.3 Computing Lexical Cohesion between Texts

This section describes an application of the lexical cohesiveness between words: computing the lexical cohesiveness between texts. Let us assume that a text is a simple word list without any syntactic structure or punctuation. Then the lexical cohesiveness σ(X, X′) between two texts X = {w_1, ..., w_n} and X′ = {w′_1, ..., w′_m} can be computed as follows. (See also Figure 3.7.)

    σ(X, X′) = ψ( Σ_{w′ ∈ X′} s(w′) · a(P(X), w′) ).

The lexical cohesiveness between texts is thus computed in the very same way as the lexical cohesiveness of extra words described above.

Figure 3.7 Computing the lexical cohesiveness between texts. (An overlapped pattern makes implicit inferences.)
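Since a text is treated here as a plain word list, the `sigma_list` sketch above applies to texts unchanged; for instance, with the hypothetical `network`, `nodes_of`, and `s` of the earlier sketches:

    X = "i have a hammer".split()
    X_prime = "take some nails".split()
    # with the real Paradigme data this pair is reported below as 0.100611:
    # sigma_list(X, X_prime, network, nodes_of, s)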

3.3.1 Text Cohesiveness and Implicit Inferences

The lexical cohesiveness between portions of a text or discourse suggests their coherence -- how naturally and reasonably they are connected. The following examples suggest that σ(X, X′) indicates the strength of the coherence relation between text portions X and X′.

    X                     X′                       σ(X, X′)
    "I have a hammer."    "Take some nails."       0.100611
    "I have a hammer."    "Take some apples."      0.005295
    "I have a pen."       "Where is ink?"          0.113140
    "I have a pen."       "Where do you live?"     0.007676

Note that the lexical cohesiveness between texts has direction, so that σ(X, X′) may not be equal to σ(X′, X); and the reflective lexical cohesiveness σ(X, X) must be less than 1. Compare the following examples with the ones above.

    X                      X′                     σ(X, X′)
    "Where is ink?"        "I have a pen."        0.103681
    "Take some apples."    "Take some apples."    0.434443

The directly activated nodes interact with each other and produce an overlapped pattern which includes other nodes that are indirectly associated or implicitly inferred. For example, the phrase "red alcoholic drink" (whose activated pattern is shown in Figure 3.6) has strong coherence with "a bottle of wine" and weak coherence with "fresh orange juice", as follows.

    X                         X′                          σ(X, X′)
    "Red alcoholic drink."    "A bottle of wine."         0.280683
    "Red alcoholic drink."    "Fresh orange juice."       0.096469
    "Red alcoholic drink."    "An English dictionary."    0.008166

3.3.2 Text Cohesiveness and Word Significance

The lexical cohesiveness between texts reflects the significance of the words in the texts. Each word in a text has its own weight for activation and observation. As a result, the meaningless iteration of words (especially of function words) has little influence on the lexical cohesiveness between texts.

X

X

0

�(X;X

0

)

"It is a dog."

"That must be your dog."

0.252536

"It is a dog."

"It is a log."

0.053261

where, the signi�cance s of the words in the ex-

amples above are as follows.

w s(w) w s(w)

it 0.280136 that 0.253374

is 0.297779 must 0.421726

a 0.274085 be 0.297779

dog 0.589734 your 0.382722

dog 0.589734

w s(w) w s(w)

it 0.280136 it 0.280136

is 0.297779 is 0.297779

a 0.274085 a 0.274085

dog 0.589734 log 0.621410

The sentences in the �rst pair have only one word

(namely, dog) in common; those in the latter

13

Page 15: Computing Lexical Cohesion as a Tool for Text Analysis Hideki

have three words (namely, it, is, and a) in com-

mon. However, the signi�cant words or focuses

of the sentences (shown in bold-face) play dom-

inant role on computing the lexical cohesiveness

between them, so that the lexical cohesiveness be-

tween sentences re ect the semantic coherence be-

tween them.

3.4 Discussion

The lexical cohesiveness computed on Paradigme works as an indicator of lexical cohesion between words and also between texts, as we have seen in Section 3.2 and Section 3.3. This section discusses the nature of Paradigme, the limits of the lexical cohesiveness computed on it, and possible applications of the lexical cohesiveness.

3.4.1 Paradigme and the Semantic Space

Paradigme works as a field for the semantic differential of a word or a set of words. The set of activity values of the nodes of Paradigme spans a 2,851-dimensional semantic space, or a 2,851-dimensional hypercube, in which an activated pattern is represented as a point. Each edge of the hypercube corresponds to a word in the defining vocabulary LDV.

LDV is originally based on the survey of word frequency [West, 1953]. The frequency is a count of the occurrences of words in the 5,000,000-word corpus of written English, and has been updated by Longman with reference to more recent frequency information [LDOCE, 1987]. This criterion implies the objectivity of LDV. The following criteria also provide a basis for the selection of LDV.

• Necessity
  An indispensable word, which alone covers a certain range of meaning, should be adopted regardless of its frequency.

• Efficiency
  The semantic range of a word should be as wide as possible, so as to reduce the cost of learning. This criterion is the converse of the first one.

These criteria imply the completeness of LDV -- a potential for covering all the concepts commonly found in the world.

The objectivity and completeness of LDV as the defining vocabulary suggest the sufficiency of the semantic space. Osgood [1952] used 50 dimensions in his semantic differential procedure; SDD (semantic differential on a dictionary) uses 2,851 dimensions, with objectivity and completeness. Obviously, SDD could also be applied to construct a semantic network from an ordinary dictionary whose defining vocabulary is not restricted. However, such a network would be too large for computing spreading activation on ordinary sequential computers. Paradigme is a small but objective and complete network for analysing the meaning of words.

The lexical cohesiveness computed by SDD is not a distance or closeness between two activated patterns in the semantic space. Osgood [1952] measured similarity between words in terms of the distance between two vectors in the 50-dimensional semantic space. In SDD, the lexical cohesiveness between words w, w' is calculated from a(P(w), w'), the activity value of the word w' in the activated pattern P(w) produced from the word w. The reason for this is that the activated pattern P(w) directly represents associative relations from w to other words in LDV, i.e. the definition of lexical cohesion.

3.4.2 Limits of Paradigme

The proposed lexical cohesiveness is based only on the denotational and intensional definitions in the English dictionary LDOCE. The lack of connotational and extensional knowledge causes some unexpected effects on the lexical cohesiveness. For example, consider the following value:

    σ(tree, leaf) = 0.008693.

We can recognize the apparent relationship between tree and leaf. However, the lexical cohesiveness between them is much lower than our intuition would estimate.

The reason for this disagreement lies in the nature of dictionary definitions: they indicate only sufficient conditions of the headwords. For example, the definition of tree in LDOCE tells us nothing about leaves.

    tree n 1 a tall plant with a wooden trunk and branches, that lives for many years 2 a bush or other plant with a treelike form 3 a drawing with a branching form, esp. as used for showing family relationships

However, the definition is followed by pictures of leafy trees, which provide readers with connotational and extensional stereotypes of tree.

In SDD, each definition in LDOCE is treated as a list of words, though it is actually a phrase with syntactic structure. Let us consider the following definition of the verb lift.

[Figure 3.8 Text retrieval by the cohesiveness between texts. (Recalling the most similar episode in the memory to the given text.)]

    lift v 1 to bring from a lower to a higher level; raise 2 (of movable parts) to be able to be lifted 3 ...

Anyone can imagine that something is moving upwards. However, such a movement can be expressed neither by the corresponding word list nor by the activated pattern produced from the word list.

The measurement of the lexical cohesiveness between words is intended to provide bottom-up information for analysing the semantic and syntactic structure of a phrase, sentence, or text; however, the measurement itself presupposes such higher-level structure. As far as the lexical cohesiveness between words is concerned, I assume that an activated pattern on Paradigme approximates the meaning of a word w, just as a still picture can express a story.

3.4.3 Application to Text Retrieval

The lexical cohesiveness between texts computed on Paradigme can be applied to text retrieval, i.e. recalling the episode e most similar to the given text t:

    e = argmax_{e_i ∈ E} σ(t, e_i),

where E = {e_1, ..., e_n} is a set of episodes (i.e. texts) stored in the memory. Figure 3.8 illustrates this mechanism: once P(t) is produced on Paradigme, the cohesiveness values σ(t, e_1), ..., σ(t, e_n) can immediately be computed and compared. This text retrieval scheme is a mapping:

    t ↦ P(t) ↦ e,

in other words, a mapping from the given text t to another text e in the memory.
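A minimal sketch of this retrieval mapping, assuming a hypothetical two-argument helper `cohesiveness(t, e)` that computes σ(t, e) as sketched in Section 3.3:

    # A sketch of t -> P(t) -> e: recall the stored episode that maximizes
    # sigma(t, e_i); `cohesiveness` is the hypothetical helper assumed above.

    def retrieve(text_words, episodes, cohesiveness):
        """e = argmax over e_i in E of sigma(t, e_i)."""
        return max(episodes, key=lambda e: cohesiveness(text_words, e))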

A text set T = {t_1, ..., t_m} can also recall the episode e in the memory which is most similar to

[Figure 3.9 Context-sensitive text retrieval. (Recalling the most similar episode to the given text and context.)]

T. This association scheme can be written in the following form:

    T ↦ P(T) ↦ e,

where P(T) is an overlapped pattern of the t_i ∈ T. Pattern overlapping provides interaction between texts and produces an activated pattern which includes novel nodes indirectly associated or implicitly inferred, as we have seen in Section 3.3. If necessary, each text t_i ∈ T can be weighted according to its significance in T.

The mapping from a text set to an episode (T ↦ P(T) ↦ e) works as context-sensitive text retrieval. As illustrated in Figure 3.9, P(t_1) and P({t_2, ..., t_m}) are overlapped on Paradigme. The main key t_1 is strongly activated so as to produce the figure, and the other texts {t_2, ..., t_m} are weakly activated so as to produce the ground, or context, for text retrieval.
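The following sketch illustrates the figure-and-ground overlap. The key/context weights (1.0 and 0.3) are illustrative assumptions, not values given in the thesis, and `observe` stands for the significance-weighted observation of an episode's words in a pattern.

    # A sketch of figure-and-ground overlap for context-sensitive retrieval.
    # Patterns are dicts {node: activity}; the weights are assumptions.

    def overlap(weighted_patterns):
        """Overlap several activated patterns into a single pattern."""
        merged = {}
        for pattern, weight in weighted_patterns:
            for node, activity in pattern.items():
                merged[node] = merged.get(node, 0.0) + weight * activity
        return merged

    def retrieve_in_context(key_pattern, context_pattern, episodes, observe):
        """Strongly activate the key t_1, weakly activate the context
        {t_2, ..., t_m}, then recall the episode most active in the result."""
        ground = overlap([(key_pattern, 1.0), (context_pattern, 0.3)])
        return max(episodes, key=lambda e: observe(ground, e))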

This text retrieval scheme provides a new method for semantic retrieval which recalls the most semantically similar episodes in the memory, regardless of the typographic identity of the keywords. Moreover, it can be applied to automatic text classification, which determines the categories or genres of given texts. Each category is defined not by its intension (or attributes), but by its extension (or members). This suggests that the scheme can provide flexibility for EBR (example-based reasoning) and EBL (example-based learning) systems.

3.5 Summary

This chapter described the computation of the lexical cohesiveness between words, i.e. a measurement of the strength of lexical cohesion. The lexical cohesiveness between words is computed by spreading activation on the semantic network Paradigme, which is systematically constructed from a subset of the English dictionary LDOCE (Longman Dictionary of Contemporary English). Paradigme can directly compute the lexical cohesiveness between any two words in LDV (Longman Defining Vocabulary, consisting of 2,851 words) and their derivations, and indirectly the lexical cohesiveness of all other headwords of LDOCE and their derivations. The lexical cohesiveness provides a new method for analysing coherent text structure. It can be applied to capture coherence relations between sentences or text portions.

I regard Paradigme as a field for the interaction between texts and episodes in memory, i.e. the interaction between what one is reading or listening to and what one knows [Minsky, 1980, 1986; Schank, 1990]. The meaning of words, sentences, or even texts can be projected in a uniform way on Paradigme, as we have seen in Section 3.2 and Section 3.3. Similarly, we can overlap the figure and ground, and recall the most relevant episode for interpretation of the figure; the recalled episode will change the ground for the next step. A preliminary model for this episode association cycle is described in [Kozima and Furugori, 1991a, 1991b, 1991c].

In future research, I intend to deal with syntagmatic relations between words. The meaning of a text lies in the texture of paradigmatic and syntagmatic relations of lexical items [Hjelmslev, 1943]. Paradigme provides the former dimension, the associative system of words that works as a screen onto which the meaning of a word is projected like a still picture. The latter dimension, the syntactic process, will be treated as a pattern changing in time, like a film projected dynamically onto Paradigme. This will enable us to compute the coherence relation between texts as syntactic and semantic processes, not as static cohesiveness between lists of words.

The next chapter describes an application of the lexical cohesiveness to text segmentation [Grosz and Sidner, 1986; Youmans, 1991], as an evaluation of the lexical cohesiveness proposed here.

4 Segmenting Narratives into Scenes

This chapter describes a computationally feasible method for text segmentation. It is an application of the lexical cohesiveness proposed in Chapter 3, and also an evaluation of that lexical cohesiveness.

[Figure 4.1 Correlation between LCP (mutual lexical cohesiveness in the moving window) and a boundary of coherent scenes.]

Most studies on text structure assume that a text can be partitioned into units that form a coherent structure [Grosz and Sidner, 1986; Mann and Thompson, 1987], and recognizing the text structure is an essential task in text understanding, as we have seen in Chapter 1 and Chapter 2. However, there is no clear discussion of how to segment a text into such units computationally.

This thesis focuses its effort on scenes, i.e. contiguous and non-overlapping units of a narrative text. A scene is a sequence of sentences which displays local coherence, or semantic continuity, of objects (characters and properties) and situations (time, place, and backgrounds).

Lexical Cohesion Profile (LCP) is a quantitative indicator proposed here for marking scene boundaries in narratives. LCP is a record of the mutual lexical cohesiveness of the words in a window (of 51 words, for instance) that moves forward word by word over a text. Since a coherent text tends to be lexically cohesive [Halliday and Hasan, 1976; Morris and Hirst, 1991], LCP indicates local coherence and therefore the continuity of scenes in the text. Figure 4.1 (same as Figure 1.2) illustrates the basic idea of LCP.

Section 4.1 reviews related work on text segmentation. Section 4.2 describes how to compute LCP, the mutual lexical cohesiveness of the words in the moving window. Section 4.3 compares LCP with scene boundaries marked in a human experiment. Section 4.4 discusses the nature and limits of LCP, and Section 4.5 gives a summary of this chapter.

4.1 Related Work on Text Segmentation

A number of methods for segmenting a text into coherent units have been proposed in studies of text structure. One of the valuable indicators is the cue phrase [Grosz and Sidner, 1986] (or clue words [Reichman-Adar, 1984]). For example, "by the way" and "anyway" indicate the beginning of new units.

In narratives, several types of cue phrases that specify time or place at the beginning of sentences are recognized. For example, a new scene begins with the cue phrase "In the summer of last year" in the following text portion.

    ... she could see the windowless brick wall of the box factory in the next street. But she thought of grassy walks and trees and bushes and roses. In the summer of last year Sarah had gone into the country and fallen in love with a farmer. ...

Note that paragraph breaks explicit in the original text (O.Henry's Springtime à la Carte [Thornley, 1960]) are discarded in the examples here.

Scenes in narratives do not always begin with cue phrases, however. Let us consider the following text portion.

    ... Sarah knew that it was time for her to read. She got out her book, settled her feet on her box, and began. The front-door bell rang. The landlady answered it. Sarah left the book and listened. ...

Anyone can perceive the discontinuity of scenes at the sentence "The front-door bell rang". However, this is not a cue phrase; my assertion is that we need a stronger device to capture scene alternation like this.

LCP is a quantitative device for marking the continuity and discontinuity of the objects and situations described in a text. But before going on to define LCP, let us briefly review two related studies that have attempted to capture scene coherence.

4.1.1 Word Reiteration and Scene Coherence

Youmans [1991] has proposed the Vocabulary Management Profile (VMP) as a quantitative indicator of scene alternation in written texts. VMP is a record of the proportion of new words introduced in a window (of 35 words, for instance) moving word by word over a text. For example, the underlined words in the following text are new words.

    A new word is a word in the text which never appears in the preceding text. VMP counts the new words in the window moving on the text.

[Figure 4.2 An example of VMP [Youmans, 1991]. (Text: O.Henry's Springtime à la Carte [Thornley, 1960].)]

Figure 4.2 shows the VMP of O.Henry's short story, Springtime à la Carte [Thornley, 1960], plotted against window position (in words). Note that VMP takes the given text as a list of words without any punctuation or paragraph breaks.

The principle of VMP is based on the information flow in a text, which suggests the introduction and succession of scenes.

- Introduction
  At the beginning of a scene, new vocabulary (for objects and situations) is introduced into the scene.

- Succession
  Once a scene has been created by vocabulary introduction, the rest of the scene reuses the introduced vocabulary.

VMP presented as a graph has hills and valleys, as shown in Figure 4.2. They suggest scene alternation: (1) an ascending slope suggests the introduction of a new scene, and (2) a descending slope suggests the succession of the scene thus introduced.
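Since VMP is purely a matter of counting first occurrences, it can be sketched in a few lines of Python. This is a minimal sketch under the definition quoted above; the window width of 35 is the illustrative value mentioned in the text.

    # A minimal sketch of Youmans's VMP: the proportion of new words in a
    # window moving word by word over the text.

    def vmp(words, width=35):
        """For each window position, return the proportion of words in the
        window whose occurrence is their first in the whole text."""
        first_seen = {}
        for i, w in enumerate(words):
            first_seen.setdefault(w, i)
        profile = []
        for i in range(len(words) - width + 1):
            new = sum(1 for j in range(i, i + width) if first_seen[words[j]] == j)
            profile.append(new / width)
        return profile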

VMP is a neat but rather simple indicator for segmenting narrative texts. However, a method based on word reiteration runs into problems with various aspects of scene coherence. My experiments on VMP have revealed that it does not work well on high-density texts rich in vocabulary. The reason seems obvious: the words assumed to be reiterated in a scene are often restated (or paraphrased) using different words or phrases.

4.1.2 Lexical Cohesion and Scene Coherence

A better way to capture scene coherence is to use lexical cohesion, especially semantic relations, between the words of a text. Morris and Hirst [1991], as we have seen in Section 2.2, used Roget's thesaurus as the knowledge base for determining whether or not two words are semantically related. They also proposed lexical chains, i.e. chains of thesaural relations between words in a text, as an indicator of the text structure proposed by Grosz and Sidner [1986].

A text in general has several lexical chains. Each chain indicates a range of semantic continuity on certain objects and situations. So the density of the lexical chains suggests the local coherence of the text, and minimum points of the density can be considered scene boundaries. However, as Hearst and Plaunt [1993] have pointed out, the lexical chains of a lengthy text tend to overlap so often that it is not possible to place scene boundaries in the text: many chains would end at a particular scene boundary, while at the same time many other chains would cross it.

Hearst and Plaunt [1993] then incorporated the thesaural information into their segmentation scheme based on tf.idf (an information retrieval measure) between contiguous blocks of sentences. The tf.idf value of a word is the frequency of the word within a text divided by the frequency of the word throughout a large corpus. Words that are frequent in an individual text but relatively infrequent throughout the corpus are considered good indicators of the contents of the text.

Their segmentation scheme is a two-step process: (1) all pairs of adjacent blocks of a text (where each block is usually 3-5 sentences long) are compared and assigned a similarity value computed by tf.idf over the words in common and the words that have thesaural relations; then (2) the resulting sequence of similarity values, after being graphed and smoothed by some special algorithms, is examined for hills and valleys. The hills indicate that the adjacent blocks are coherent; the valleys indicate scene boundaries.

This method is a pioneering attempt at text segmentation using lexical cohesion (in a thesaurus) between words. However, there is still room for improvement: (1) the size of the block is defined arbitrarily (as 3-5 sentences), and (2) the smoothing algorithms are so complicated that they seem to have no psychological validity. These points are improved in LCP, described in the next section.

4.2 LCP: Lexical Cohesion Profile

I have devised a method to capture semantic continuity in a text and developed an objective and quantitative indicator of scene boundaries. This method segments a narrative text using only the following lexical information.

- The mutual lexical cohesiveness of the words that interact with each other in a portion of the text (i.e. the words in the window).

- The strength of each cohesive relation between words, each relation having its own strength of contribution to the coherence of a scene.

This section describes (1) the computation of the mutual lexical cohesiveness of a text portion, which estimates the strength of text coherence from the lexical cohesiveness between words, (2) the computation of LCP as an indicator of the local coherence of the text, and (3) the resulting scene boundaries in a graph of LCP.

4.2.1 Mutual Lexical Cohesiveness

The coherence of a text portion is estimated by the mutual lexical cohesiveness of the words in the text portion. The mutual lexical cohesiveness c(S) of the text portion S = {w_1, ..., w_n} is defined as the density of the lexical cohesiveness of the words in S:

    c(S) = φ( Σ_{w_i ∈ S} s(w_i) · a(P(S), w_i) ),

where P(S) is an activated pattern produced by activating each word w_i ∈ S with strength s(w_i)² / Σ_k s(w_k) at the same time, and a(P(S), w_i) is the activity value of the node w_i in the activated pattern P(S). The limiting function φ restricts the output value to [0, 1]. Note that c(S) = σ(S, S), cf. Section 3.3.

The activated pattern P(S) is the result of the interaction of the words w_i ∈ S; it represents the meaning of S as a whole. So the mutual lexical cohesiveness c(S) represents how cohesively each word w_i ∈ S is related to the whole meaning P(S). In other words, c(S) represents the semantic homogeneity of S, which is closely related to distortion in clustering techniques, since P(S) can be considered a centroid of the word cluster S.

The mutual lexical cohesiveness c(S) suggests how coherent S is. The following examples show the mutual lexical cohesiveness of a coherent text portion from a short story, and of an incoherent text portion consisting of three sentences randomly selected from an English dictionary.

    c("Molly saw a cat. It was her family pet. She wished to keep a lion.")
      = 0.403239 (coherent),

    c("There is no one but me. Put on your clothes. I can not walk more.")
      = 0.235462 (incoherent).

Thus the mutual lexical cohesiveness c(S) works as a quantitative indicator of the coherence of the text portion S.
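The following is a minimal sketch of c(S) under the same assumptions as the earlier sketches: `significance` and `spread_activation` are hypothetical helpers, and the final clamp stands in for the limiting function φ, whose exact form is not reproduced here.

    # A sketch of c(S): how cohesively each word of S relates to the
    # pattern P(S) produced by activating all the words of S together.

    def mutual_cohesiveness(words, significance, spread_activation):
        """c(S) = phi( sum over w_i in S of s(w_i) * a(P(S), w_i) )."""
        total = sum(significance[w] for w in words)
        seeds = {w: significance[w] ** 2 / total for w in words}
        pattern = spread_activation(seeds)             # P(S): node -> activity
        raw = sum(significance[w] * pattern.get(w, 0.0) for w in words)
        return min(1.0, max(0.0, raw))                 # clamp in place of phi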

4.2.2 Computing LCP and Estimating Scene Boundaries

LCP is a record of the mutual lexical cohesiveness c(S_i) of the local text S_i at every position i in a text. Let us assume the text T is a word list {w_1, ..., w_N} without any punctuation or paragraph breaks. Then the local text S_i at position i = 1, ..., N in the text T is defined as follows:

    S_i = {w_l, w_{l+1}, ..., w_{i-1}, w_i, w_{i+1}, ..., w_{r-1}, w_r},

where w_i is the i-th word of T. The indices l and r are defined as follows:

    l = i − Δ  (if i > Δ),        l = 1  (otherwise);
    r = i + Δ  (if i ≤ N − Δ),    r = N  (otherwise).

The local text S_i is the text portion which can be seen through a window whose center is w_i. The constant Δ determines the width of the window (as 2Δ + 1).
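A sketch of the profile computation follows directly from these definitions. It assumes a callable `cohesiveness_of` that computes c(S) for a list of words, as sketched above; positions are 0-based here, whereas the text uses 1-based indices.

    # A sketch of the LCP record: mutual cohesiveness of the local text in a
    # window of width 2*delta + 1 moving word by word over the text.

    def lcp(words, cohesiveness_of, delta=25):
        """Return c(S_i) for every position i (0-based)."""
        n = len(words)
        profile = []
        for i in range(n):
            l = i - delta if i > delta else 0
            r = i + delta if i <= n - 1 - delta else n - 1
            profile.append(cohesiveness_of(words[l:r + 1]))
        return profile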

Figure 4.3 shows a graph of the LCP computed on O.Henry's short story, Springtime à la Carte [Thornley, 1960]. The mutual lexical cohesiveness c(S_i) is plotted against the position i of the window. The graph has hills and valleys that suggest scene alternation in the text. Large valleys can be considered scene boundaries. However, the graph contains noise that makes it difficult to determine which minimum points should be considered scene boundaries.

[Figure 4.4 Various types of windows: rectangular, triangular, and Hanning.]

[Figure 4.5 LCP computed with the Hanning window (of 51 words long).]

In order to eliminate the noise from the graph of LCP, a window function is introduced into the pattern production of the text portion S_i = {w_l, ..., w_i, ..., w_r}. The window function W(i, j) defines the weight of w_j ∈ S_i. The activated pattern P(S_i) is produced by activating each w_j ∈ S_i with strength s'(w_j)² / Σ_k s'(w_k), where s'(w_j) is defined as s(w_j) · W(i, j) / W(i, i). Comparing various types of windows, such as those shown in Figure 4.4, I empirically found that the Hanning window,

    W(i, j) = (1/2) (1 + cos(|i − j| π / Δ)),

gives the most remarkable effect in eliminating the noise. Figure 4.5 shows that the Hanning window brings out the macroscopic features of LCP better than the rectangular window used for the LCP shown in Figure 4.3.
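A sketch of the Hanning weighting is given below; it implements the W(i, j) above, with weight 1 at the window center and 0 at distance Δ.

    import math

    # A sketch of the Hanning weighting applied to word significance inside
    # the window centered at position i.

    def hanning_weight(i, j, delta):
        """W(i, j) = 0.5 * (1 + cos(|i - j| * pi / delta))."""
        return 0.5 * (1.0 + math.cos(abs(i - j) * math.pi / delta))

    def weighted_significance(s, i, j, delta):
        """s'(w_j) = s(w_j) * W(i, j) / W(i, i); note W(i, i) = 1 here."""
        return s * hanning_weight(i, j, delta) / hanning_weight(i, i, delta)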

Window width is also an important factor in clarifying the macroscopic features of LCP. If the window is too wide, LCP cannot detect short scene alternations. If the window is too narrow, on the other hand, it puts much noise into LCP. Figure 4.6 compares the LCPs computed with Hanning windows of 25 and 75 words. Experimenting with 18 window widths (from 11 to 121 words), I empirically found that the Hanning window of 51 words (Δ = 25) gives the best correlation with the actual scene boundaries.

[Figure 4.6 LCP and window width. (Comparison between the Hanning windows of 25 and 75 words long.)]

4.3 Verification of LCP with Human Judgements

LCP seems to provide a reasonable measurement for segmenting a text into scenes. To examine this point, LCP has been compared with a human experiment in which subjects marked scene boundaries in O.Henry's short story, Springtime à la Carte [Thornley, 1960]. The whole text is given in Appendix C with its LCP graphs. Another experiment on a biography, Mahatma Gandhi [Leavitt, 1958], is described in Appendix D with its LCP graphs.

[Figure 4.7 Histogram of human judgements. (The solid bars represent the histogram of human judgements; the dotted lines represent the original paragraph breaks.)]

4.3.1 Human Judgements

In the human experiment, the text given to the subjects contains no original paragraph breaks. The sentences in the text are aligned line by line, as shown in the following example, the head part of the text given to the subjects.

    It was a day in March.
    Never, never begin a story this way
    when you write one.
    No opening could possibly be worse.
    There is no imagination in it.
    It is flat and dry.
    But it is allowable here.

The instruction to the subjects was: "Putting yourself in the position of a film director, place scene boundaries wherever you think there may be a cut".

Figure 4.7 shows the histogram of the scene boundaries marked by 16 subjects. The solid bars indicate the number of subjects who placed a scene boundary at text position i, and the dotted lines indicate the original paragraph breaks. The total number of scene boundaries is 214 (13.38 boundaries per subject on average); the number of distinct boundary positions is 50. The histogram suggests the following points.

- Agreement
  The subjects segmented the text in a similar way. 158 scene boundaries (at 16 distinct positions), 73.83% of the total, are dominant scene boundaries on which more than 1/3 of the subjects agreed.

- Correlation with paragraphs
  The reported scene boundaries closely correlate with the original paragraph breaks. 179 scene boundaries (at 29 distinct positions), 83.64% of the total, correspond with the original paragraph breaks.

[Figure 4.8 LCP and human judgements. (LCP is computed with the Hanning window of 51 words long.)]

4.3.2 Correlation between LCP and Human Judgements

The LCP computed with the Hanning window of 51 words (shown in Figure 4.5) and the histogram of human judgements (shown in Figure 4.7) are overlaid in Figure 4.8. It is clear that the minimum points of the LCP correspond mostly to the dominant scene boundaries reported by the subjects.

points of LCP, let us de�ne a break point as

a sentence break which is nearest to one of the

dominant minimum points of LCP. The dominant

minimum point of LCP is a text position i which

satis�es

8j 2 [i��; i+�]; L

j

> L

i

;

where � is a constant which determines the degree

of localization of minimum points, and L

i

is the

value of LCP, i.e. c(S

i

), at text position i. In case

of �=20, a set of dominant minimum points,

f 51, 112, 194, 235, 273, 374, 442, 501,

533, 664, 759, 833, 909, 975, 1016, 1080,

1152, 1208, 1263, 1302, 1360, 1432 g

is obtained. Then, coercing each of the dominant

minimum points into the nearest sentence break,

the set B

20

of break points is obtained as follows.

f 39, 110, 192, 242, 281, 381, 449, 511, 537,

652, 749, 834, 900, 974, 1012, 1076, 1155,

1210, 1275, 1301, 1350, 1433 g
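The detection of break points can be sketched directly from the definition; the neighborhood is clipped at the edges of the text, and sentence-break positions are assumed to be given in word offsets.

    # A sketch of break-point detection: strict minima over a +/- mu
    # neighborhood, snapped to the nearest sentence break.

    def dominant_minima(profile, mu=20):
        """Positions i with profile[j] > profile[i] for all j != i
        in [i - mu, i + mu]."""
        n = len(profile)
        return [i for i in range(n)
                if all(profile[j] > profile[i]
                       for j in range(max(0, i - mu), min(n, i + mu + 1))
                       if j != i)]

    def break_points(profile, sentence_breaks, mu=20):
        """Coerce each dominant minimum to the nearest sentence break."""
        return sorted({min(sentence_breaks, key=lambda b: abs(b - i))
                       for i in dominant_minima(profile, mu)})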

I have computed the sets of break points B_5, B_10, B_15, ..., B_100, and compared them with the dominant scene boundaries. As shown in Figure 4.9, the recall and precision rates of estimating the dominant scene boundaries by B_μ indicate that the sets of break points (especially B_20) closely correlate with the dominant scene boundaries. The recall and precision rates are defined as follows:

    Recall    = Hit / Human,
    Precision = Hit / Machine,

where Human is the number of dominant scene boundaries. In the human experiment, 16 dominant scene boundaries were observed:

    { 65, 110, 192, 227, 281, 465, 537, 652, 749, 834, 974, 1076, 1155, 1210, 1301, 1346 }.

Machine is the number of break points (i.e. |B_μ|), and Hit is the number of dominant scene boundaries correctly estimated by B_μ.

[Figure 4.9 Correlation between LCP and human judgements. (The recall rate and the precision rate of estimating the dominant scene boundaries by the break points of LCP.)]
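These definitions translate directly into code. Exact-position matching is used in this sketch; whatever matching tolerance the experiments actually used is not restated here, so treat it as an assumption.

    # A sketch of the Recall = Hit/Human and Precision = Hit/Machine rates.

    def recall_precision(break_pts, human_boundaries):
        hit = len(set(break_pts) & set(human_boundaries))
        recall = hit / len(human_boundaries)     # Hit / Human
        precision = hit / len(break_pts)         # Hit / Machine
        return recall, precision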

4.3.3 Comparing LCP and Human Judgements with the Text

Let us look concretely at the close relationship between (1) the graph of the LCP, (2) the human judgements (both are shown in Figure 4.8), and (3) the text used in the experiment (namely, Springtime à la Carte [Thornley, 1960]).

The clear valley at i = 192, for instance, exactly corresponds to a dominant scene boundary (and also to a paragraph break). The following is the portion of the original text from i = 157 to 227.

    Sarah had managed to [160] open the world a little with her typewriter. That was [170] her work --- typing. She did not type very quickly, and [180] so she had to work alone, and not in a [190] great office.
    The most successful of Sarah's battles with the [200] world was the arrangement that she made with Schulenberg's Home [210] Restaurant. The restaurant was next door to the old red-brick [220] building in which she had a room. ...

Note that the original paragraph breaks are shown in the examples here, and the bracketed numbers in the examples indicate the text position i (in words). We can see the discontinuity of scenes: the first part (before the paragraph break) focuses on Sarah's job, and the second (after the paragraph break) on Schulenberg's restaurant. (Sarah is the heroine of the story.)

It is worth noting that LCP can detect scene alternation irrespective of the paragraph breaks placed by the author of the story. For example, the paragraph break at i = 156 is not a minimum point of the LCP, and the continuation of a scene indicated by the LCP at that point is supported by the human judgements. The following is the text portion from i = 111 to 192.

    The gentleman who said the world was an oyster which [120] he would open with his sword became more famous than [130] he deserved. It is not difficult to open an oyster [140] with a sword. But did you ever notice anyone try [150] to open it with a typewriter?
    Sarah had managed to [160] open the world a little with her typewriter. That was [170] her work --- typing. She did not type very quickly, and [180] so she had to work alone, and not in a [190] great office.

On the other hand, the author of the story did not place a paragraph break at i = 228, but the LCP and half of the 16 subjects mark a scene boundary at that point. The following is the text portion from i = 193 to 281.

    The most successful of Sarah's battles with the [200] world was the arrangement that she made with Schulenberg's Home [210] Restaurant. The restaurant was next door to the old red-brick [220] building in which she had a room. One evening, after [230] dining at Schulenberg's, Sarah took away with her the bill [240] of fare. It was written in almost unreadable handwriting, neither [250] English nor German, and was so difficult to understand that [260] if you were not careful you began with the sweet [270] and ended with the soup and the day of the [280] week.

It is obvious that a new scene begins with the underlined cue phrase "One evening", which indicates a discontinuity of time.

There are some discrepancies between the LCP and the human judgements, however. For example, the minimum point at i = 450 disagrees with the dominant scene boundary at i = 465. The following is the portion in question.

    Both were satisfied with the agreement. Those who ate [430] at Schulenberg's now knew what the food they were eating [440] was called, even if its nature sometimes puzzled them. And [450] Sarah had food during a cold dull winter, which was [460] the main thing with her.
    When the spring months arrived [470], it was not spring. Spring comes when it comes. The [480] frozen snows of January still lay hard in the streets [490]. ...

The first part (before the paragraph break) focuses on the agreement made between Sarah and Schulenberg, and the second (after the paragraph break) on the severe winter weather. This disagreement between the LCP and the human judgements may be accounted for by the lexical similarity between the last part of the first paragraph and the first part of the second paragraph: the words used there are all related to the severe weather of winter.

4.4 Discussion

LCP is based on the hypothesis that a local text tends to be coherent when it is lexically cohesive [Halliday and Hasan, 1976; Morris and Hirst, 1991]. This section discusses (1) the relationship between the lexical cohesiveness and the coherence of a text, and (2) the width of the window used in computing LCP.

4.4.1 Lexical Cohesiveness and Text Coherence

LCP deals with the lexical cohesiveness of the words in a text, and leaves out any syntactic structure or punctuation in the text. The mutual lexical cohesiveness c(S) does not work well on an ill-structured (or incoherent) but lexically cohesive text. Compare the following example with those in Section 4.2.

    c("I saw cats. A lion belongs to the cat family. My family keeps a pet.")
      = 0.653580 (incoherent, but cohesive).

The reason for this lies in the shortcomings of the lexical cohesiveness between words defined on the English dictionary. For instance, it ignores the connotational and extensional meaning of words and any syntactic structure in the dictionary definitions.

[Figure 4.10 Role of syntagmatic and paradigmatic relations on text coherence. (Role on coherence plotted against word distance.)]

Syntagmatic relations between words can make up for the limits of LCP, which is based only on paradigmatic relations between words. As illustrated in Figure 4.10, the coherence of a scene is maintained by (1) syntagmatic relations between closely positioned words and (2) paradigmatic relations between distant words. Syntagmatic relations can be computed as the co-occurrence probability of words in corpora [Church and Hanks, 1990] or in dictionary definitions [Wilks, et al., 1989].

4.4.2 Adapting Window Width

The width of the window should be as narrow as possible, noise permitting, since a narrow window can capture the alternation of both short and long scenes. The experiments on various window widths revealed that the Hanning window of 51 words gives the best correlation with human judgements, as we have seen in Section 4.2. Obviously, however, this window width is applicable only to the text examined in the experiment. The best window width will depend on the genre and style of the text. For example, the following factors may affect the best window width.

- The average length of scenes, or the minimum/maximum length of scenes, of the text (however, these depend on the result of human judgements), or those computed on a large corpus.

- The lexical density of the text, i.e. the ratio of the number of types (the size of the vocabulary used) to the number of tokens (the length of the text).

At present, I have no effective method for adapting the window width to these data.

In the present stage of my research, I am trying to adapt the window width to the total significance of the words in the window. In this scheme, the window width is dynamically determined so as to make the total significance of the words w ∈ S_i in the window a certain constant value G. In other words, the scheme is to find the Δ which minimizes

    | G − Σ_{w ∈ S_i} s(w) |.

If this works, we can apply the window function to the computation described above. It seems that G can be derived from corpus analysis; however, this is an unsolved problem.
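The adaptation idea can be sketched as a simple search for the minimizing Δ. The exhaustive search strategy below is an assumption made for illustration; the thesis leaves the adaptation method open.

    # A sketch of the adaptive-width idea: choose the window half-width
    # delta around position i whose total word significance is closest to
    # the target G.

    def adapt_width(words, significance, i, target_g, max_delta=100):
        best_delta, best_gap = 1, float("inf")
        for delta in range(1, max_delta + 1):
            l, r = max(0, i - delta), min(len(words) - 1, i + delta)
            total = sum(significance[w] for w in words[l:r + 1])
            gap = abs(target_g - total)    # |G - sum of s(w) over the window|
            if gap < best_gap:
                best_delta, best_gap = delta, gap
        return best_delta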

4.4.3 Structure of Scenes

LCP partitions a text into scenes, i.e. contiguous and non-overlapping units of the text. However, LCP tells us nothing about the hierarchical structure of the scenes; it provides only push/pop clues for constructing such structure as discussed in [Grosz and Sidner, 1986; Mann and Thompson, 1987].

One may say that we could capture super-scenes of a higher level by using a wider window and then construct a tree-like structure of a text. However, this method faces some anticipated problems:

- Definition of super-scenes
  It seems difficult to define super-scenes in terms of local coherence or the lexical cohesiveness between words. (What is the linguistic definition of a super-scene?)

- Tree vs. network
  The structure of scenes would sometimes not fit into a tree-like structure; it may have a network-like structure. (Consider a text A B A' B' ..., where A and A' are scenes about a hero, and B and B' about a heroine.)

I take the position that the structure of a text should be considered a network of scenes. The network is based on coherence relations between scenes, i.e. the lexical cohesiveness between two texts described in Section 3.3. It is assumed here that:

- Cohesive scenes tend to share a topic or semantically related topics.

- Anaphora and ellipsis reaching beyond one scene can be resolved within a set of adjacent scenes.

In the next stage of my research, I intend to incorporate the idea of a scene network into the study of text segmentation and of text structure.

4.5 Summary

This chapter proposed the Lexical Cohesion Profile (LCP) as a quantitative indicator of scene boundaries in narratives. LCP is a record of the mutual lexical cohesiveness of the words in a window moving word by word over the text. The mutual lexical cohesiveness is computed by spreading activation on the semantic network built from the subset of LDOCE, as described in Chapter 3. The hills and valleys of a graph of LCP closely correlate with scene alternation: the hills indicate the continuity of each scene, and the valleys indicate scene boundaries.

LCP deals only with lexical cohesion between words; it ignores any grammatical information (even that of sentence or paragraph breaks) and other linguistic devices (such as cue phrases). The reason for this is the purpose of this thesis: to see the role of the lexical cohesiveness of words in the local coherence of a text. As examined in Section 4.3, LCP closely correlates with human judgements. This means that (1) the local coherence of a text is a valuable indicator of scene alternation, and (2) the local coherence can be estimated by the lexical cohesiveness between the words in the local text.

The text segmentation scheme described in this chapter is a bottom-up analysis of the coherent structure of a text. The information provided by this analysis works as top-down clues for further text analysis, for example:

- Resolving anaphora and ellipsis
  A scene is the smallest domain in which text coherence can be defined. In other words, a scene is a text portion which describes certain objects in a situation. This suggests that most of the referents of anaphoric or elliptic expressions can be found inside the scene.

- Information retrieval
  Each scene has a topic phrase (or sentence), i.e. the semantic center of the scene computed from an activated pattern produced from the scene. A set of topic phrases works as a key for text retrieval and also for text summarization.

Meanwhile, I have to clarify the relationship between the window width and the word significance discussed in Section 4.4, and to examine the validity of LCP on other genres and styles of texts. Also, to make the segmentation scheme more robust, it is necessary to incorporate syntagmatic relations (i.e. co-occurrence probability) into the computation of text coherence.

5 Retrospects and Prospects

The lexical cohesiveness, described in Chapter 3, objectively and computationally measures the strength of lexical cohesion between words in terms of their associative relations in the English dictionary. As evaluated through text segmentation, described in Chapter 4, the lexical cohesiveness works as an indicator of text coherence, and also provides valuable information for further analysis of text structure.

This chapter discusses various theoretical aspects of the proposed measurement of lexical cohesion, in view of the past, the present, and the future. Section 5.1 discusses the relationships with recent work in other fields. Section 5.2 describes how to capture syntagmatic relations between words, the converse of the paradigmatic relations captured by the method proposed here. Section 5.3 puts this thesis in perspective for future research.

5.1 Relations with Other Fields

The proposed method for measuring lexical cohesion has been constructed upon an evaluation of related studies in other fields. The idea of Glossème, the closed subsystem of English, is based on studies of core vocabulary and dictionaries. Knowledge and semantic representation on the semantic network Paradigme are based on recent developments in psychology.

5.1.1 Lexicology and Lexicography: Backgrounds of Glossème

Several methods for constructing a basic minimum language and its core vocabulary, like Glossème, have been proposed. The proposal of Basic English was first put forward in the early 1930s [Ogden, 1968]. Basic English is English as a secondary world language, simplified by restricting the vocabulary to 850 words and by reducing the rules for using them to the smallest number necessary to clearly state ideas.

Basic English is designed as a basis for learning general English; it is based on the minimum learning cost for communication, not on the frequency of word use in general English. However, the following points suggest that Basic English is not a subsystem of general English but an independent one.

- Vocabulary selection
  The criteria for vocabulary selection are subjective and unclear. For example, the 850-word basic vocabulary contains only 18 verbs, so that even common verbs of general English have to be paraphrased as follows.

      general English    Basic English
      ask                put a question
      walk               have a walk

[Figure 5.1 Coverage of frequent words in a corpus. (Accumulative frequency plotted against vocabulary size. Computed on the LOB corpus (1,006,815 words; 47,888 types).)]

- Sense selection
  Learning 850 words is not the same thing as learning 850 senses, since each of the words may have several senses. However, Basic English offers no guidance on this. (One calculation is that the 850 words have 12,425 meanings [Carter and McCarthy, 1988].)

Basic English is a consistent language system which works as a useful tool for communication. However, it is not the core of general English in everyday use.

The most remarkable proposal for a core vocabulary after Basic English is A General Service List of English Words [West, 1953] (hereafter, GSL), the outcome of the major studies of the 1930s on vocabulary selection for language teaching. GSL consists of 2,000 words drawn from a corpus of 5,000,000 words. The main criteria for the selection of GSL are: (1) the frequency of words (not only the occurrence of words but also the proportion of the different meanings of each word), and (2) the coverage and granularity of meaning, which determine semantic range and separability, respectively. GSL can be seen as the result of a mixture of objective frequency (as shown in Figure 5.1) and subjective criteria on meaning.

GSL has had the most lasting influence among core vocabulary proposals, and it is widely used today, forming the basis of the principles underlying the Longman Simplified English Series and Longman Structural Readers of simplified fiction, non-fiction, poems, and plays. The narrative texts used in text segmentation (described in Chapter 4), namely Springtime à la Carte [Thornley, 1960] and Mahatma Gandhi [Leavitt, 1958], are taken from these series.

GSL has also been applied to lexicography, i.e. techniques for compiling dictionaries. LDOCE (Longman Dictionary of Contemporary English) [1987, first ed. 1978] is one of the remarkable outcomes of GSL. All the definitions and examples in LDOCE are written in the restricted vocabulary LDV (Longman Defining Vocabulary), which is originally based on GSL and updated by Longman. LDV consists of 2,191 words (corresponding to 2,851 headwords of LDOCE, distinguishing homographs) and 48 affixes. LDV covers 83.07% of the 1,006,815 words in the Lancaster-Oslo/Bergen corpus (hereafter, the LOB corpus) with the help of morphological analysis.

The result of using LDV as the defining vocabulary is the fulfilment of the most basic lexicographic principle: the definitions of headwords are always written using simpler words than the headwords they describe. This principle provides the basis of the work described in this thesis: Glossème, being based on the defining vocabulary LDV and its definitions in LDOCE, works as a closed subsystem of English.

5.1.2 Psychology of Memory: Backgrounds of Paradigme

Psychological studies of the organization of human memory have revealed the functional distinction between semantic and episodic memory [Tulving, 1972]. Semantic memory is the knowledge shared by people, while episodic memory stores personal experiences. This distinction is summarized as follows.

    Semantic memory
      contents    socially shared codes
      elements    linguistic concepts
      relations   associative relations

    Episodic memory
      contents    personal experiences
      elements    episodes and events
      relations   temporal/spatial relations

These two functions of memory are implemented in biologically different ways, as has been shown through recent studies of amnesia and aphasia [Squire, 1986].

The work described in this thesis deals mainly with semantic memory. The reason is that the common knowledge on which lexical cohesion is defined corresponds to semantic memory. In view of structural linguistics [Saussure, 1916], semantic memory corresponds to langue, i.e. the knowledge for using one's first language (mother tongue), while episodic memory corresponds to parole, i.e. one's whole use of the language.

There have been a number of arguments regarding the way of representing semantic memory. Even recent work on network representation has two mutually exclusive but complementary approaches.

- Local representation
  Different concepts are embodied in different nodes. Each node is individual and self-explanatory as to the meaning or value of the concept it represents. (For example, see frame-based models [Minsky, 1975; Schank, 1980].)

- Distributed representation
  Different concepts correspond to different patterns of activity over the very same nodes. Each node is involved in representing a number of (almost all) concepts. (For example, see PDP models [Rumelhart et al., 1986].)

Both approaches have their own advantages: (1) local representation is an explicit and well-articulated representation which can perform logical and sequential inferences and reasoning, while (2) distributed representation can perform implicit analogical and metaphoric inferences, and it also has tolerance of noise in input texts and statistical learnability.

The semantic network Paradigme is the result of a mixture of local and distributed representation: (1) each node in Paradigme corresponds to one headword in the dictionary Glossème, and (2) the meaning of a headword is represented by an activated pattern distributed over the nodes (i.e. the headwords of Glossème). In other words, different headwords correspond to different nodes, while the meaning of a word is represented using all the nodes of Paradigme.

The essence of knowledge and semantic representation on Paradigme lies in one of the principles of structural linguistics and semiology [Saussure, 1916; Hjelmslev, 1943]: the value of a word is defined only by its relationships with other words in the language. Each word has no value or meaning by itself; rather, its structural relations with other words define its value or meaning. This means that the language is a system of signs (or of words).

[Figure 5.2 Syntagmatic and paradigmatic relations between words. (Syntagmatic relations run horizontally and paradigmatic relations vertically, e.g. "I + can + go", "We + must + walk", "Boys + will + run".)]

5.2 Syntagmatic Relations between Words

Words in a text display a mutual dependence which creates coherent textural structure, as outlined at the beginning of Chapter 1. The mutual dependence can be classified into two categories of relationships between lexical items, namely paradigmatic and syntagmatic relations. As illustrated in Figure 5.2, these two kinds of thread can be recognized in texts. Paradigmatic relations are based on association between concepts, while syntagmatic relations are based on the co-occurrence of lexical items in actual texts.

The focus of this thesis has been mainly on paradigmatic relations, not on syntagmatic relations. The reason is obvious: the common knowledge for measuring lexical cohesion is maintained mainly by paradigmatic relations. However, as we have seen in Section 4.4, paradigmatic relations are not enough to cover all the aspects of lexical cohesion and text coherence, and syntagmatic relations can make up for this limitation.

This section describes two experiments on extracting syntagmatic relations between words from a machine-readable corpus. A corpus is a representative sample of a language, consisting of massive quantities of texts. For example, the LOB corpus, one of the standard corpora, consists of about one million words of British English. (Cf. the Bible has approximately one million words.)

5.2.1 Extracting n-gram Data

The increasing availability of machine-readable corpora has suggested new statistical and probabilistic methods for capturing linguistic information [Church and Mercer, 1993], especially collocations. Collocation is the tendency of words to work together in predictable ways. This approach is summarized by the memorable line: 'You shall know a word by the company it keeps' [Firth, 1957].

    Table 5.1 The most frequent trigrams and tetragrams computed on the LOB corpus (total 1,006,815 words; 47,888 types). (Note that the total number of n-grams is 1,006,815 − n + 1.)

    trigram (n=3)
    w_i w_{i+1} w_{i+2}    frequency
    one of the             390
    there was a            204
    out of the             192
    the end of             185
    some of the            184
    part of the            182
    there is no            170
    it was a               167
    there is a             165
    the fact that          165

    tetragram (n=4)
    w_i w_{i+1} w_{i+2} w_{i+3}    frequency
    the end of the                 102
    at the same time               95
    on the other hand              94
    in the case of                 77
    at the end of                  74
    for the first time             72
    as a result of                 48
    in the form of                 41
    the fact that the              41
    the rest of the                38

One of the traditional indicators of the co-occurrence tendency of words is n-gram data [Brown, et al., 1992; Church and Mercer, 1993]. The n-gram analysis is quite similar to word-frequency analysis, which counts the occurrences of words; the n-gram analysis counts the occurrences of tuples of n adjacent words. For example, the text "We need to provide the solution" produces the following trigrams (n = 3), {t_i | t_i = ⟨w_i, w_{i+1}, w_{i+2}⟩}.

         w_i        w_{i+1}    w_{i+2}
    t_1  we         need       to
    t_2  need       to         provide
    t_3  to         provide    the
    t_4  provide    the        solution

As shown in Table 5.1, frequent n-grams display collocative relations between words, and they can be considered phrases or phrasal lexemes.
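The extraction itself is a direct count, as in the following sketch (a text of N words yields N − n + 1 n-grams, matching the note on Table 5.1).

    from collections import Counter

    # A sketch of n-gram extraction: count tuples of n adjacent words.

    def ngrams(words, n=3):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    # e.g. ngrams("we need to provide the solution".split(), 3).most_common()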

The n-gram analysis provides syntagmatic prediction, i.e. the probability of the occurrence of a word w immediately after two given contiguous words w_1, w_2. For example, when we observe the two adjacent words "we need", the trigram data computed on the LOB corpus can predict which words tend to follow: a (22.73%), not (18.18%), to (18.18%), etc. Table 5.2 shows that the n-gram prediction captures a number of important frequency-based relations between words. However, this method cannot capture long-range syntagmatic relations; it can only detect the co-occurrence of words within a window of n words.

    Table 5.2 The trigram prediction of the third word. (The most probable sequences with their probabilities.)

    w_1 w_2        w           probability
    we need        a           22.73%
                   not         18.18%
                   to          18.18%
                   more        4.55%
                   ...         ...
    need to        be          17.24%
                   provide     3.45%
                   make        3.45%
                   keep        3.45%
                   ...         ...
    to provide     a           18.18%
                   the         12.99%
                   for         6.49%
                   such        3.90%
                   ...         ...
    provide the    means       12.50%
                   solution    6.25%
                   money       6.25%
                   food        6.25%
                   ...         ...
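Given trigram counts such as those produced by the `ngrams` sketch above, the prediction is a simple conditional probability. This is a sketch of that step, not the exact procedure used to build Table 5.2.

    # A sketch of trigram prediction: Pr(w | w1, w2) from trigram counts.

    def trigram_prediction(trigram_counts, w1, w2):
        """Return {w: Pr(w | w1, w2)} sorted by decreasing probability."""
        followers = {t[2]: c for t, c in trigram_counts.items()
                     if t[0] == w1 and t[1] == w2}
        total = sum(followers.values())
        return dict(sorted(((w, c / total) for w, c in followers.items()),
                           key=lambda p: -p[1]))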

5.2.2 Mutual Information

A wide window might be able to capture long-range relationships between words. For example, mutual information [Church and Hanks, 1990] computed with a wider window works as a more efficient indicator. The mutual information I(w, w') between words w, w' is defined as follows:

    I(w, w') = log ( Pr(w, w') / (Pr(w) · Pr(w')) ).

It compares the probability Pr(w, w') of observing w and w' together in the window (i.e. the joint probability) with the probabilities Pr(w) and Pr(w') of observing w and w' independently (i.e. chance). If there is a strong relationship between w and w', then I(w, w') ≫ 0. If there is no interesting relationship between w and w', then I(w, w') ≈ 0. If w and w' are in complementary distribution, then I(w, w') ≪ 0.

The following list shows the words w_i that have the highest mutual information I(w, w_i), where w is the word hair, computed on the LOB corpus using a window of 16 words.

    I(w, w_i)    w_i
    4.513954     cuticle
    4.420844     coal-black, crinkly, fairish, flaxen, frizzled, iron-grey,
                 itched, red-gold, tufting, volos, waist-long, wavy
    4.321308     falkirk, hugs, imprudently, looped-up
    4.268841     ruddy

Most of the words in the list above appear only once in the corpus, however. Such low-frequency words can be considered noise. The following two lists show the highest mutual information between hair and the words whose frequencies are more than 2 and 4, respectively.

    I(w, w_i)    w_i  (freq. ≥ 2)
    4.268841     ruddy
    4.420844     auburn
    4.016454     mousy
    4.157810     greying
    3.883187     scurf

    I(w, w_i)    w_i  (freq. ≥ 4)
    3.371935     brushing
    3.321308     cropped, ribbons, thinning
    3.098916     coppery
    3.050006     colouring

After eliminating the noise in this way, the mutual information works as an indicator of the syntagmatic relation (or collocation) of words in actual texts.
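A minimal sketch of windowed mutual information follows. The exact pair-counting convention varies in the literature, so the one below (ordered pairs within a fixed-width window) is an assumption for illustration, not the precise procedure behind the lists above; it also assumes the queried pair was observed at least once.

    import math
    from collections import Counter

    # A sketch of windowed mutual information between word types.

    def mutual_information(words, width=16):
        n = len(words)
        word_freq = Counter(words)
        pair_freq = Counter()
        for i, w in enumerate(words):                  # count co-occurrences
            for j in range(i + 1, min(n, i + width)):
                pair_freq[(w, words[j])] += 1
        total_pairs = sum(pair_freq.values())

        def mi(w1, w2):
            """I(w, w') = log( Pr(w, w') / (Pr(w) * Pr(w')) )."""
            p_joint = pair_freq[(w1, w2)] / total_pairs
            return math.log(p_joint / ((word_freq[w1] / n) * (word_freq[w2] / n)))

        return mi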

5.2.3 Problems and Perspectives of Corpus-based Analysis

The corpus-based analysis of word co-occurrence, like the experiments described above, poses the following problems: one is the size of the corpus, and the other is the quality of the corpus. Both problems are deeply concerned with the nature of corpora.

The LOB corpus used in the experiments above consists of 1,006,815 words. However, it is not large enough for extracting syntagmatic relations between words. As we have seen above, a large number of words appear only once in the corpus. The list below illustrates the relationship between word frequency and coverage in the corpus.

    freq.    coverage in vocabulary       coverage in words
             (total 47,888 types)         (total 1,006,815 words)
             and its accumulation         and its accumulation
    1        44.57%  (44.57%)             2.12%  (2.12%)
    2        14.48%  (59.06%)             1.37%  (3.50%)
    3         7.94%  (66.99%)             1.13%  (4.63%)
    4         5.02%  (72.01%)             0.95%  (5.58%)
    5         3.47%  (75.48%)             0.83%  (6.41%)

It is obvious that the more frequently a word appears in the corpus, the more accurate the statistical analysis of the word is. Most words of the vocabulary appear only a few times in the corpus, however.

    Table 5.3 The composition of the LOB corpus. (The size and proportion of each text genre.)

    text categories                        size (words)    proportion (%)
    Press: reportage                       88727           8.81
    Press: editorial                       54293           5.39
    Press: reviews                         34213           3.39
    Religion                               34226           3.39
    Skills, trades, and hobbies            76556           7.60
    Popular lore                           88679           8.80
    Belles lettres, biography, essays      155111          15.40
    Miscellaneous                          60591           6.01
    Learned and scientific writings        161215          16.01
    General fiction                        58476           5.80
    Mystery and detective fiction          48211           4.78
    Science fiction                        12026           1.19
    Adventure and western fiction          58274           5.78
    Romance and love story                 58148           5.77
    Humour                                 18069           1.79

A corpus is intended to be a representative sample of the real use of a language, and its quality is determined by its sampling techniques. The texts in the LOB corpus were selected by stratified random sampling based on several bibliographical almanacs, where the texts are classified into categories according to the Dewey Decimal Classification of the subjects of the texts. The texts are then classified into 15 genres, as shown in Table 5.3, based on rhetorical properties of the texts. There is no clear discussion about the amount of information in each text (or the number of its copies) exchanged by people. Also, the relationship between the Dewey Decimal Classification and the genres of the corpus is unclear.

The important points about corpus-based analysis are that (1) all corpora are limited in their size and quality, and (2) texts in corpora tend to be novel and impressive, so that corpora do not contain the whole body of common syntagmatic relations shared by people. Corpus-based analysis therefore needs to be supplemented by data derived from the intuitions of informants, through either introspection or experimentation, or from those of lexicographers, as the work of this thesis depends on them. The general approach of corpus-based analysis is nevertheless illuminating, with considerable research potential. By eliminating the noise appropriately, corpus-based analysis will provide valuable information for natural language processing.

5.3 Future Research

The work described in this thesis has two major directions for further research. One is to go deeper: towards the interaction of paradigmatic association and syntagmatic prediction, and its application to metaphoric processing. The other is to go wider: towards the analysis of Japanese texts, which requires a corpus of Japanese, a selected core vocabulary, and a well-structured dictionary.

5.3.1 Interaction between Paradigme and Syntagme

Paradigmatic association is the lexical cohesiveness or similarity σ(w, w') between words w, w', as described in Section 3.2. So, it can be considered as a mapping σ : w ↦ w', where w' (≠ w) is the most similar word to w.

Syntagmatic prediction, on the other hand, is the recency τ(w, w'), i.e. the probability of observing w, w' together in the window (where w is followed by w'), or of the n-gram prediction, like those described in Section 5.2. So, it can be considered as a mapping τ : w ↦ w', where w' has the highest probability of co-occurrence with w.

Most semantic and syntactic relations between words can be defined by a combination or interaction of the paradigmatic association σ and the syntagmatic prediction τ. The following example illustrates the relationship from supply to meat. (Note that there is no significant direct mapping from supply to meat.)

    supply --σ--> provide --τ--> food --σ--> meat

The paradigmatic associations from supply to provide and from food to meat are computed respectively as follows:

    σ(supply, provide) = 0.174675,
    σ(food, meat) = 0.155881,

where provide is the most similar word to supply, and meat is the most similar word to food. And, if the syntagmatic prediction τ is defined as the trigram prediction (as shown in Table 5.2), the syntagmatic prediction from provide to food is computed as follows:

    τ(provide, food) = 6.25%,

where the word food has the second-highest probability of co-occurrence with provide in the table.
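The composition of the two mappings can be read as a small search procedure. In the sketch below, the sigma and tau tables are hypothetical stand-ins: the two σ values come from the text above, while the extra tau entry for money is invented so that food is the second-highest link, as in the example. This is not the thesis implementation.

# Chaining paradigmatic association (sigma) with syntagmatic
# prediction (tau) to relate two words with no direct link.
sigma = {"supply": [("provide", 0.174675)],
         "food":   [("meat", 0.155881)]}
tau   = {"provide": [("money", 0.0825), ("food", 0.0625)]}  # invented

def find_chain(src, dst):
    # Search for a path src --sigma--> x --tau--> y --sigma--> dst,
    # scoring each path by the product of its link strengths.
    best = None
    for x, s1 in sigma.get(src, []):
        for y, s2 in tau.get(x, []):
            for z, s3 in sigma.get(y, []):
                if z == dst:
                    cand = ([src, x, y, z], s1 * s2 * s3)
                    if best is None or cand[1] > best[1]:
                        best = cand
    return best

print(find_chain("supply", "meat"))
# (['supply', 'provide', 'food', 'meat'], ...)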

The interaction of paradigmatic association and syntagmatic prediction can be applied to processing metaphoric expressions. For example, let us consider the sentence "She is shining". The definition of shine in LDOCE is as follows:

    shine v 1 to produce light 2 to reflect light; be bright 3 to direct (a lamp, beam of light, etc.) 4 to polish; make bright by rubbing 5 ...

However, the sentence does not mean that she is reflecting light, nor that she has been polished. Rather, it means that she is cheerful and lively.

The paradigmatic association σ from shine provides the following words that have the highest similarity to shine:

    w'          σ(shine, w')
    bright      0.249966
    burn        0.190900
    polish      0.180333
    flash       0.145012
    cheerful    0.143962

Then, the syntagmatic prediction τ between words that can be used of a person and each of the words above is examined; this makes clear that bright and cheerful tend to co-occur with expressions used of a person. Finally, the sentence "She is shining" is interpreted as "She is bright" or "She is cheerful". This scheme is illustrated in Figure 5.3.

Figure 5.3  An example of metaphoric interpretation: "She is shining" → {bright, burn, polish, flash, cheerful} → "She is bright." / "She is cheerful." (The paradigmatic association σ provides possible meanings of each word; the syntagmatic prediction τ selects relevant sequences of the meanings.)
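A rough sketch of this interpretation scheme follows. Only the σ values come from the list above; the person-compatibility scores stand in for the τ check and are invented for illustration.

# Metaphor interpretation as in Figure 5.3: sigma proposes candidate
# meanings; a tau-like compatibility test selects those that fit the
# subject.  The person_compat scores below are hypothetical.
sigma_shine = [("bright", 0.249966), ("burn", 0.190900),
               ("polish", 0.180333), ("flash", 0.145012),
               ("cheerful", 0.143962)]
person_compat = {"bright": 0.8, "cheerful": 0.9,
                 "burn": 0.1, "polish": 0.05, "flash": 0.2}

def interpret(subject, candidates, threshold=0.5):
    # Keep paraphrases whose predicate both resembles the original
    # word (sigma) and co-occurs with person-like expressions (tau).
    return [f"{subject} is {w}."
            for w, s in candidates
            if person_compat.get(w, 0.0) >= threshold]

print(interpret("She", sigma_shine))
# ['She is bright.', 'She is cheerful.']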

5.3.2 Constructing the System of Japanese Language

In the following stage of the work described in this thesis, I intend to apply the scheme of computing lexical cohesion to Japanese language processing. So, I have to obtain the following information: (1) a list of core words and their definitions, and (2) the combinations which the words typically form. Such information should be extracted from a corpus, because it must have objectivity and completeness beyond the intuition of researchers.

Here, I should refer to a remarkable example: the Collins COBUILD English Language Dictionary [1987], which is an outcome of recent lexicographic work after LDOCE. The COBUILD project is an ambitious lexicographic research programme designed to construct a mono-lingual foreign learner's dictionary of English based on naturally-occurring data extracted from the Birmingham Corpus (20 million words). Note that 'COBUILD' stands for the Collins Birmingham University International Language Database. [Carter and McCarthy, 1988]

At present, however, there are no Japanese corpora of sufficient size and quality available for researchers. I have to start with building a large and objective Japanese corpus. From the corpus, the following lexicographic data can be extracted.

• Core vocabulary: The smallest complete core vocabulary can be selected with respect to word frequency and coverage of use in the corpus. The size of the vocabulary should be on the order of several thousand words.

• Word patterns: Word patterns, namely syntagmatic relations or collocations of words, can be extracted from the corpus. The word patterns provide example-based definitions of words and also syntactic rules.

From these lexicographic data, a well-structured dictionary of the Japanese language can be constructed. The core vocabulary, of course, works as the defining vocabulary, like the LDV in LDOCE; the word patterns determine the range of contexts or the correct meanings of the words, and also provide many examples of actual use. A closed subset of the dictionary can be considered as a closed sub-system of the Japanese language. Like Glossème, it will consist of the dictionary entries whose headwords are included in the defining vocabulary. Such a subset is quite useful for research in computational linguistics and other related fields, because its size is small enough to be computationally feasible, while it still covers most words in general use of the Japanese language.

6 Conclusion

This thesis described (1) an objective and computationally feasible method for measuring lexical cohesion between words and between texts of any size, (2) its application to the segmentation of narratives into coherent scenes, as an evaluation of the measurement of lexical cohesion, and (3) discussions of various aspects of this work and prospects for future research.

The lexical cohesiveness, namely the strength of lexical cohesion between words, is computed on the semantic network Paradigme. Paradigme is systematically constructed from Glossème, a subset of the English dictionary LDOCE (Longman Dictionary of Contemporary English). Glossème consists of every entry of LDOCE whose headword is included in the LDV (Longman Defining Vocabulary), so that Glossème is a closed subsystem of English in which each headword is defined by a phrase composed of the headwords and their derivations. Spreading activation on the semantic network can directly compute the lexical cohesiveness σ(w, w') ∈ [0, 1] between any two words w, w' in the LDV and its derivations. It can also indirectly compute the lexical cohesiveness of all the headwords of LDOCE and their derivations, as well as the lexical cohesiveness between texts. The lexical cohesiveness σ(w, w') represents the strength of association from w to w' and works as an indicator of lexical cohesion.

The text segmentation is based on the Lexical Cohesion Profile (LCP), which is a record of the mutual lexical cohesiveness of the words in a window moving word by word over a text. The mutual lexical cohesiveness is defined as the density of the lexical cohesiveness of the words in the window, and it suggests the local coherence of the text. A graph of LCP has hills and valleys which suggest scene alternations, because (1) when the window is inside a scene, the words in the window tend to be cohesive, and (2) when the window is crossing a scene boundary, the words in the window tend to be incohesive. So, the minimum points of the LCP can be considered as marking the scene boundaries of the text. Comparison with the scene boundaries marked by human judgements proved that the minimum points of LCP closely correlate with the dominant scene boundaries on which most of the subjects agreed. The proposed segmentation scheme works as a new tool for analysing text structure, resolving anaphora and ellipsis, information retrieval, etc.
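As a concrete reading of this definition, here is a minimal sketch of LCP computation and boundary detection. The thesis weights the words with a Hanning window 51 words long and computes cohesiveness by spreading activation on Paradigme; this sketch simplifies to a flat window and takes any pairwise cohesiveness function as a parameter.

def lcp(words, sigma, width=51):
    # LCP value at each window position: the density (here, the mean
    # pairwise cohesiveness) of the words inside the window.  The
    # flat window is a simplification of the thesis's Hanning window.
    profile = []
    for start in range(len(words) - width + 1):
        win = words[start : start + width]
        pairs = [(a, b) for i, a in enumerate(win) for b in win[i + 1 :]]
        profile.append(sum(sigma(a, b) for a, b in pairs) / len(pairs))
    return profile

def boundaries(profile):
    # Local minima of the profile, taken as candidate scene breaks.
    return [i for i in range(1, len(profile) - 1)
            if profile[i - 1] > profile[i] < profile[i + 1]]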

Conclusions of this proposal for the measurement of lexical cohesion and its evaluation by text segmentation are:

• Lexical cohesion of words in a text (or in a text portion) suggests coherence of the text (or local coherence of the text portion).

• Lexical cohesion can be computed as associative relations in the common knowledge described in the English dictionary.

And we may say that a dictionary contains information for detecting lexical cohesion. However, lexical cohesion cannot cover all aspects of the common knowledge shared by the people in a linguistic community. This is due to the difference between cohesion and coherence: lexical cohesion is the relationship between words in a text, and coherence is the whole structure of the text, made up mainly by lexical cohesion.

The work described in this thesis focuses on paradigmatic relations between lexemes, which represent how concepts are formed into the whole knowledge of the world. Future research will focus on syntagmatic relations, which represent how concepts and ideas are expressed as sequences found in actual texts. The syntagmatic relations should be objectively extracted from corpora, i.e. massive quantities of representative texts. So, the next stage of this work will deal mainly with English corpora. Also, this work should be extended to Japanese language processing, where constructing a Japanese corpus and selecting a core vocabulary will be required.

Acknowledgements

I thank Dr. Teiji Furugori, my thesis advisor, for his thoughtful suggestions and comments on this work. Throughout the long and painful evolution of this study, he has guided me with his insight, encouragement, and persistence. The work would not have been possible without his supervision.

I am grateful to the other members of my thesis committee: Drs. Kohei Noshita, Kiyoshi Hashimoto, Makoto Yasuhara, and Kazuhiko Ozeki. They have given acute criticisms and suggestions on the thesis. I am also indebted to Dr. Ken Church (AT&T Bell Laboratories), Dr. Graeme Hirst (University of Toronto), Dr. Pim van der Eijk (Digital Equipment Corporation), and Dr. Marti Hearst (University of California, Berkeley), who made a number of contributions to my work with their comments and suggestions. And discussions with the following people at UEC produced many of the ideas that my work is based upon: Prof. Mituo Kobayasi, Takuzi Suzuki, Eduardo de Paiva Alves, Hidemi Nishiyama, and the members of the Furugori laboratory. Had I taken their advice more thoroughly, the thesis would have been improved substantially.

Finally, my thanks go to my parents and Takako. They have given me an infinite amount of moral support throughout this laborious undertaking.

References

[Alshawi, 1987] H. Alshawi: Processing dictionary definitions with phrasal pattern hierarchies, Computational Linguistics, Vol.13, pp.195-202.

[Beaugrande and Dressler, 1981] R. de Beaugrande and W. U. Dressler: Introduction to Text Linguistics, Longman, Harlow, Essex.

[Brown et al., 1992] P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer: Class-based n-gram models of natural language, Computational Linguistics, Vol.18, pp.467-479.

[Carter and McCarthy, 1988] R. Carter and M. McCarthy: Vocabulary and Language Teaching, Longman, Harlow, Essex.

[Charniak, 1983] E. Charniak: Passing markers: a theory of contextual influence in language comprehension, Cognitive Science, Vol.7, pp.171-190.

[Church and Hanks, 1990] K. W. Church and P. Hanks: Word association norms, mutual information, and lexicography, Computational Linguistics, Vol.16, pp.22-29.

[Church and Mercer, 1993] K. W. Church and R. L. Mercer: Introduction to the special issue on computational linguistics using large corpora, Computational Linguistics, Vol.19, pp.1-24.

[Firth, 1957] J. R. Firth: A synopsis of linguistic theory 1930-1955, in Studies in Linguistic Analysis, Philological Society, Oxford. (Reprinted in F. Palmer (ed.), Selected Papers of J. R. Firth, Longman, Harlow, Essex, 1968.)

[Grosz and Sidner, 1986] B. J. Grosz and C. L. Sidner: Attention, intentions, and the structure of discourse, Computational Linguistics, Vol.12, pp.175-204.

[Hahn, 1992] U. Hahn: On text coherence parsing, in Proceedings of the Fifteenth International Conference on Computational Linguistics (COLING-92, Nantes), pp.25-31.

[Halliday and Hasan, 1976] M. A. K. Halliday and R. Hasan: Cohesion in English, Longman, Harlow, Essex.

[Hearst and Plaunt, 1993] M. Hearst and C. Plaunt: Subtopic structuring for full-length document access, in Proceedings of ACM/SIGIR (Pittsburgh, PA).

[Hendler, 1989] J. A. Hendler: Marker-passing over microfeatures: towards a hybrid symbolic/connectionist model, Cognitive Science, Vol.13, pp.79-106.

[Hirst, 1988] G. Hirst: Resolving lexical ambiguity computationally with spreading activation and polaroid words, in S. Small et al. (eds.), Lexical Ambiguity Resolution, Morgan Kaufmann, San Mateo, California.

[Hjelmslev, 1943] L. Hjelmslev: Omkring Sprogteoriens Grundlæggelse, Akademisk Forlag, København.

[Hobbs, 1979] J. R. Hobbs: Coherence and coreference, Cognitive Science, Vol.3, pp.67-90.

[Kozima, 1993] H. Kozima: Text segmentation based on similarity between words, in Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL-93, Ohio), pp.286-288.

[Kozima and Furugori, 1991a] H. Kozima and T. Furugori: A computational model for text disambiguation using knowledge and context (in Japanese), in Proceedings of the 42nd Annual Convention IPS Japan, Vol.3, pp.43-44.

[Kozima and Furugori, 1991b] H. Kozima and T. Furugori: Building conceptual system under the adaptation to texts (in Japanese), in Proceedings of the 43rd Annual Convention IPS Japan, Vol.3, pp.219-220.

[Kozima and Furugori, 1991c] H. Kozima and T. Furugori: A disambiguation model for text interpretation using knowledge and context (in Japanese), Transactions of Information Processing Society of Japan, Vol.32, pp.1366-1373.

[Kozima and Furugori, 1993a] H. Kozima and T. Furugori: Semantic similarity between words (in Japanese), Technical Report of IEICE, AI92-100, pp.81-88.

[Kozima and Furugori, 1993b] H. Kozima and T. Furugori: Word similarity computed on an English dictionary (in Japanese), in Proceedings of the 46th Annual Convention IPS Japan, Vol.3, pp.93-94.

[Kozima and Furugori, 1993c] H. Kozima and T. Furugori: Similarity between words computed by spreading activation on an English dictionary, in Proceedings of the 6th Conference of the European Chapter of the Association for Computational Linguistics (EACL-93, Utrecht), pp.232-239.

[Kozima and Furugori, 1993d] H. Kozima and T. Furugori: Text segmentation based on lexical cohesion (in Japanese), IPSJ SIG Reports, NL95-7, pp.49-56.

[Kozima and Furugori, to appear] H. Kozima and T. Furugori: Segmenting narrative text into coherent scenes, Literary and Linguistic Computing, to appear.

[Leavitt, 1958] L. W. Leavitt: Great Men and Women, in Longman Structured Readers, Longman, Harlow, Essex.

[LDOCE, 1987] Longman Dictionary of Contemporary English, Longman, Harlow, Essex.

[Mann and Thompson, 1987] W. C. Mann and S. A. Thompson: Rhetorical structure theory: a theory of text organization, Technical Report of Information Science Institute (University of Southern California), ISI/RS-87-190.

[Markowitz, 1986] J. Markowitz: Semantically significant patterns in dictionary definitions, in Proceedings of the 24th Annual Meeting of the Association for Computational Linguistics (ACL-86), pp.112-119.

[Minsky, 1975] M. L. Minsky: A framework for representing knowledge, in P. H. Winston (ed.), The Psychology of Computer Vision, McGraw-Hill, New York.

[Minsky, 1980] M. L. Minsky: K-lines: a theory of memory, Cognitive Science, Vol.4, pp.117-133.

[Minsky, 1986] M. L. Minsky: Society of Mind, Simon and Schuster, New York.

[Morris and Hirst, 1991] J. Morris and G. Hirst: Lexical cohesion computed by thesaural relations as an indicator of the structure of text, Computational Linguistics, Vol.17, pp.21-48.

[Nakamura and Nagao, 1988] J. Nakamura and M. Nagao: Extraction of semantic information from an ordinary English dictionary and its evaluation, in Proceedings of the 11th International Conference on Computational Linguistics (COLING-88), pp.459-464.

[Ogden, 1968] C. K. Ogden: Basic English International Second Language: A Revised and Expanded Version of the System of Basic English, Brace and World, New York.

[Osgood, 1952] C. E. Osgood: The nature and measurement of meaning, Psychological Bulletin, Vol.49, pp.197-237.

[Reichman-Adar, 1984] R. Reichman-Adar: Extended person-machine interface, Artificial Intelligence, Vol.22, pp.157-218.

[Roget, 1911] P. M. Roget (ed.): Roget's Thesaurus of English Words and Phrases, Crowell.

[Rumelhart et al., 1986] D. E. Rumelhart, J. L. McClelland, and the PDP Research Group: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, Cambridge, Mass.

[Sapir, 1921] E. Sapir: Language: An Introduction to the Study of Speech, Brace and World, New York.

[Saussure, 1916] F. de Saussure: Cours de Linguistique Générale, Payot, Paris.

[Schank, 1980] R. C. Schank: Language and memory, Cognitive Science, Vol.4, pp.243-284.

[Schank, 1990] R. C. Schank: Tell Me a Story: A New Look at Real and Artificial Memory, Scribner, New York.

[Squire, 1986] L. R. Squire: Mechanisms of memory, Science, Vol.232, pp.1612-1619.

[Thornley, 1960] G. C. Thornley (edited and simplified): British and American Short Stories, in Longman Simplified English Series, Longman, Harlow, Essex.

[Tulving, 1972] E. Tulving: Episodic and semantic memory, in E. Tulving and W. Donaldson (eds.), Organization of Memory, Academic Press, New York.

[Veronis and Ide, 1990] J. Veronis and N. M. Ide: Word sense disambiguation with very large neural networks extracted from machine readable dictionaries, in Proceedings of the 13th International Conference on Computational Linguistics (COLING-90), pp.389-394.

[Waltz and Pollack, 1985] D. L. Waltz and J. B. Pollack: Massively parallel parsing: a strongly interactive model of natural language interpretation, Cognitive Science, Vol.9, pp.51-74.

[Wilks et al., 1989] Y. Wilks, D. Fass, C. M. Guo, J. McDonald, T. Plate, and B. Slator: A tractable machine dictionary as a resource for computational semantics, in B. Boguraev and T. Briscoe (eds.), Computational Lexicography for Natural Language Processing, Longman, Harlow, Essex.

[West, 1953] M. West: A General Service List of English Words: with Semantic Frequencies and a Supplementary Word-list for the Writing of Popular Science and Technology, Longman, Harlow, Essex.

[Youmans, 1991] G. Youmans: A new tool for discourse analysis: the vocabulary-management profile, Language, Vol.67, pp.763-789.

Appendices

A. Structure of Paradigme -- Mapping Glossème onto Paradigme

The semantic network Paradigme is systematically constructed from the small but closed English dictionary Glossème. Each entry of Glossème is mapped onto a node of Paradigme in the following way. (See also Figure 3.1 and Figure 3.2.)

Step 1. For each entry G_i of Glossème, make an empty node P_i in Paradigme and copy the headword and word-class from G_i. Add a suffix (like '_1' and '_2') to the headword in order to distinguish the same headword used in different entries of Glossème (e.g. red/adjective → red_1, red/noun → red_2).

Then, for each entry G_i, map each unit u_ij onto a subréférant s_ij of the corresponding node P_i. The mapping from a word w_ijn in u_ij to a link or links in s_ij is described as follows.

1. Let h_n be the reciprocal of the number of appearances of the root form of w_ijn in Glossème. (A morphological analysis on the LDV and affixes determines the root form.)
2. If w_ijn is in a head-part, let h_n be doubled. (A head-part provides the basis of the meaning of the headword.)
3. Find the node or nodes {p_n1, p_n2, ...} which correspond to w_ijn. Then, divide h_n into {h_n1, h_n2, ...} in proportion to their frequency.
4. Add links l_n1, l_n2, ... to the subréférant s_ij, where l_nm is a link to the node p_nm with the weight h_nm.

Thus, s_ij becomes a set of links {l_ij1, l_ij2, ...}, where l_ijk is a link with a weight h_ijk. Then, normalize the weights of the links so that Σ_k h_ijk = 1 in each s_ij; namely, let h_ijk (of l_ijk ∈ s_ij) be h_ijk / Σ_k h_ijk.

Step 2. For each node P_i, compute the weight H_ij of each subréférant s_ij (which indicates the significance of s_ij) in the following way:

1. Let m be the number of subréférants of P_i (i.e. the number of units in the entry G_i of Glossème).
2. Let H_ij be 2m - 1 - j. For instance, if m = 3, then H_i1 : H_i2 : H_i3 = 4 : 3 : 2. Note that H_i1 : H_im = 2 : 1 (m ≥ 2).
3. Normalize the weights so that Σ_j H_ij = 1 in each P_i; namely, let H_ij be H_ij / Σ_j H_ij.

Thus, each node P_i obtains its référant.

Step 3. The final step is to generate référés (i.e. sets of reverse links). Map each link in the référants of all the nodes in Paradigme onto a reverse link in their référés, in the following way.

1. For each node P_i, let its référé r_i be an empty set (of links).
2. For each P_i, for each subréférant s_ij of P_i, map each link l_ijk ∈ s_ij onto the corresponding reverse link, in the following way.
   2.1 Let p_ijk be the node referred to by l_ijk, and let h_ijk be the weight of l_ijk.
   2.2 Let l' be a new link referring to P_i with the weight H_ij · h_ijk, where H_ij is the weight of s_ij. Then, add l' to the référé of p_ijk. (The link l' is the reverse link corresponding to l_ijk.)

Then, the référé r_i of each node P_i becomes a set of links {l'_i1, l'_i2, ...}, where l'_ij is a link with a weight h'_ij. Then, for each node P_i, normalize the weights of the links in its référé r_i so that Σ_j h'_ij = 1; namely, let h'_ij be h'_ij / Σ_j h'_ij.

Thus, each node P_i of Paradigme is mapped from the corresponding entry G_i of Glossème. A computer program (written in Common Lisp and executed on KCL) carries out the procedures described above.
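The three steps can be condensed into a short program. The sketch below is my own reconstruction over a toy dictionary, not the thesis's Common Lisp program: it omits the word-class suffixes, omits the splitting of a word's weight among several sense nodes (Step 1, item 3), and gives single-unit entries the weight 1 directly.

from collections import defaultdict

def build_paradigme(entries):
    # entries: {headword: [unit, ...]}, each unit a list of
    # (word, in_head_part) pairs -- a toy stand-in for Glossème.
    freq = defaultdict(int)  # appearances of each word (Step 1, item 1)
    for units in entries.values():
        for unit in units:
            for word, _ in unit:
                freq[word] += 1

    referant = {}
    refere = defaultdict(lambda: defaultdict(float))
    for head, units in entries.items():
        m = len(units)
        # Step 2: H_ij = 2m - 1 - j (j = 1..m), normalized to sum 1;
        # a single-unit entry simply gets weight 1.
        H = [2 * m - 1 - j for j in range(1, m + 1)] if m > 1 else [1.0]
        s = sum(H)
        H = [h / s for h in H]
        subs = []
        for unit in units:
            links = defaultdict(float)
            for word, in_head in unit:
                h = 1.0 / freq[word]                  # reciprocal frequency
                links[word] += 2 * h if in_head else h  # head-part doubled
            total = sum(links.values())
            subs.append({w: h / total for w, h in links.items()})
        referant[head] = (H, subs)
        # Step 3: reverse links, weighted by H_ij * h_ijk.
        for Hij, sub in zip(H, subs):
            for word, h in sub.items():
                refere[word][head] += Hij * h
    # Normalize each référé so its weights sum to 1.
    for word in refere:
        total = sum(refere[word].values())
        refere[word] = {w: h / total for w, h in refere[word].items()}
    return referant, dict(refere)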

B. Function of Paradigme -- Spreading Activation Rules

Each node P_i of the semantic network Paradigme computes its activity value v_i(T+1) at time T+1 as follows:

    v_i(T+1) = φ( (R_i(T) + R'_i(T)) / 2 + e_i(T) ),

where R_i(T) and R'_i(T) are the activity (at time T) collected from the nodes referred to in the référant and the référé of P_i, respectively, and e_i(T) ∈ [0, 1] is the activity given from outside (at time T). The output function φ limits the value to [0, 1].

R_i(T) is the activity value of the most plausible subréférant in P_i, defined as follows:

    R_i(T) = A_im(T),   m = argmax_j ( H_ij · A_ij(T) ),

where H_ij is the weight of s_ij (i.e. the j-th subréférant of P_i), and A_ij(T) is the sum of the weighted activity of the nodes referred to in s_ij, defined as follows:

    A_ij(T) = Σ_k h_ijk · a_ijk(T),

where h_ijk is the weight of l_ijk (i.e. the k-th link of s_ij), and a_ijk(T) is the activity (at time T) of the node referred to by l_ijk.

R'_i(T) is the sum of the weighted activity of the nodes referred to in the référé r_i of P_i:

    R'_i(T) = Σ_j h'_ij · a'_ij(T),

where h'_ij is the weight of l'_ij (i.e. the j-th link of r_i), and a'_ij(T) is the activity (at time T) of the node referred to by l'_ij.

In the experiments described in this thesis, I have used the output function defined as follows:

    φ(x) = 1        (x > 1)
    φ(x) = C · x    (0 ≤ C · x ≤ 1)
    φ(x) = 0        (x < 0)

The constant C determines the decaying factor (in the experiments, C = 0.9).

As mentioned in Section 3.2, a computer program carries out this spreading activation procedure. The program is written in the C programming language and compiled on SunOS 4.1.3. It computes the transition of an activated pattern (from T to T+1) within 0.25 seconds on the workstation (SPARCstation 2, SunOS 4.1.3).
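Read as code, one transition T → T+1 looks roughly as follows. The data layout is my own (matching the build_paradigme sketch in Appendix A); the thesis implementation is a C program, and each node is assumed to have at least one subréférant.

def phi(x, C=0.9):
    # Output function: limits C*x to [0, 1]; C is the decay factor.
    return 0.0 if x < 0 else min(1.0, C * x)

def step(network, activity, external, C=0.9):
    # network[i] = (H, subs, refere):
    #   H[j]    -- weight of the j-th subréférant,
    #   subs[j] -- {node: weight h_ijk} links of that subréférant,
    #   refere  -- {node: weight h'_ij} reverse links.
    # Returns the activity values at time T+1.
    new = {}
    for i, (H, subs, refere) in network.items():
        # A_ij(T): weighted activity collected through each subréférant.
        A = [sum(h * activity[p] for p, h in sub.items()) for sub in subs]
        # R_i(T): activity of the most plausible subréférant.
        m = max(range(len(subs)), key=lambda j: H[j] * A[j])
        R = A[m]
        # R'_i(T): weighted activity collected through reverse links.
        R2 = sum(h * activity[p] for p, h in refere.items())
        new[i] = phi((R + R2) / 2.0 + external.get(i, 0.0), C)
    return new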

C. Text Used in the Experiment: Springtime à la Carte

The following text is a simplified version of the short story Springtime à la Carte by O. Henry (adopted from the book British and American Short Stories, edited and simplified by Thornley [1960]). It is the text used in the experiment on scene segmentation described in Section 4.3. The two graphs of LCP below are computed with a Hanning window 51 words long. The vertical solid lines in the graphs show the histogram of the scene boundaries marked by 16 subjects.

[Two graphs: LCP (0.3-0.7) plotted against word position (0-800 and 800-1600), with the histogram of scene boundaries (0-16 subjects) overlaid.]

Springtime à la Carte

It was a day in March.

Never, never begin a story this way when you write one. No opening could possibly be worse. There is no imagination in it. It is flat and dry. But it is allowable here. For the following paragraph, which should have started the story, is too wild and impossible to be thrown in the face of the reader without preparation.

Sarah was crying over the bill of fare.

To explain this you may guess that oysters were not on the list, or that she had promised not to eat ice-cream just now. But your guesses are wrong, and you will please let the story continue.

The gentleman who said the world was an oyster which he would open with his sword became more famous than he deserved. It is not difficult to open an oyster with a sword. But did you ever notice anyone try to open it with a typewriter?

Sarah had managed to open the world a little with her typewriter. That was her work -- typing. She did not type very quickly, and so she had to work alone, and not in a great office.

The most successful of Sarah's battles with the world was the arrangement that she made with Schulenberg's Home Restaurant. The restaurant was next door to the old red-brick building in which she had a room. One evening, after dining at Schulenberg's, Sarah took away with her the bill of fare. It was written in almost unreadable handwriting, neither English nor German, and was so difficult to understand that if you were not careful you began with the sweet and ended with the soup and the day of the week.

The next day Sarah showed Schulenberg a card on which the bill of fare had been beautifully typewritten with the food temptingly listed in the right and proper places, from the beginning to the words at the bottom: "not responsible for overcoats and umbrellas".

Schulenberg was delighted. Before Sarah left him he had willingly made an agreement. She was to provide typewritten bills of fare for the twenty-one tables in the restaurant -- a new bill for each day's dinner, and new ones for breakfast and lunch as often as there were changes in the food or as neatness made necessary.

In return for this Schulenberg was to send three meals a day to Sarah's room, and send her also each afternoon a list in pencil of the foods that Fate had in store for Schulenberg's visitors on the next day.

Both were satisfied with the agreement. Those who ate at Schulenberg's now knew what the food they were eating was called, even if its nature sometimes puzzled them. And Sarah had food during a cold dull winter, which was the main thing with her.

When the spring months arrived, it was not spring. Spring comes when it comes. The frozen snows of January still lay hard in the streets. Men in the streets with their musical instruments still played "In the Good Old Summertime", with their December activity and expression. The steam-heat in the houses was shut off. And when these things happen, one may know that the city is still in the power of winter.

One afternoon Sarah was shaking with cold in her bed-room. She had no work to do except Schulenberg's bills of fare. Sarah sat in her rocking-chair and looked out of the window. The month was a spring month and kept crying to her: "Springtime is here, Sarah -- springtime is here, I tell you. You've got a neat figure, Sarah -- a nice, springtime figure -- why do you look out of the window so sadly?"

Sarah's room was at the back of the house. Looking out of the window she could see the windowless brick wall of the box factory in the next street. But she thought of grassy walks and trees and bushes and roses.

In the summer of last year Sarah had gone into the country and fallen in love with a farmer.

(In writing a story, never go backwards like this. It is bad art and destroys interest. Let it go forwards.)

Sarah stayed two weeks at Sunnybrook Farm. There she learned to love old Farmer Franklin's son Walter. Farmers have been loved and married in less time. But young Walter Franklin was a modern agriculturist. He had a telephone in his cow-house, and he could calculate exactly what effect next year's Canada wheat crop would have on what he planted.

It was in this shady place that Walter had won her. And together they had sat and woven a crown of dandelions for her hair. He had praised the effect of the yellow flowers against her brown hair; and she had left the flowers there, and walked back to the house swinging her straw hat in her hands.

They were to marry in the spring -- at the very first signs of spring, Walter said. And Sarah came back to the city to hit the typewriter.

A knock at the door drove away Sarah's dreams of that happy day. A waiter had brought the rough pencil list of the Home Restaurant's next day's food written in old Schulenberg's ugly handwriting.

Sarah sat down to her typewriter and slipped a card beneath the rollers. She was a clever worker. Generally in an hour and a half the twenty-one cards were typed and ready.

Today there were more changes on the bill of fare than usual. The soups were lighter; there were changes in the meat dishes. The spirit of spring filled the entire list. Fried foods seemed to have gone.

Sarah's fingers danced over the typewriter like little flies above a summer stream. Down through the different foods she worked, giving the name of each dish its proper position according to its length with a watchful eye.

Just before she reached the fruit, Sarah was crying over the bill of fare. Tears from the depths of despair rose in her heart and gathered in her eyes. Down went her head on the little typewriter stand.

For she had received no letter from Walter in two weeks, and the next thing on the bill of fare was dandelions -- dandelions with some kind of egg -- but never mind the egg! -- dandelions, with whose golden flowers Walter had crowned her his queen of love and future wife -- dandelions, the messengers of spring, her sorrow's crown of sorrow -- reminder of her happiest days.

But what a wonderful thing spring is! Into the great cold city of stone and iron a message had to be sent. There was none to bring it except the little messenger of the fields with his rough green coat, the dandelion -- this lion's tooth, as the French call him. When he is in flower, he will help at love-making, twisted in my lady's nut-brown hair; when young, before he has his flowers, he goes into the boiling pot.

In a short time Sarah forced back her tears. The cards must be typed. But still in a dream she touched the typewriter without thinking of it, with her mind and heart in the country with her young farmer. But then she came back to the stones of Manhattan, and the typewriter began to jump.

At six o'clock the waiter brought her dinner and carried away the typewritten bills of fare. Sarah ate her dinner sadly. At 7.30 the two people in the next room began to quarrel; the man in the room above began to play something like music; the gas light went a little lower; someone started to unload coal; cats could be heard on the back fences. By these signs Sarah knew that it was time for her to read. She got out her book, settled her feet on her box, and began.

The front-door bell rang. The landlady answered it. Sarah left the book and listened. Oh, yes; you would, just as she did.

And then a strong voice was heard in the hall below, and Sarah jumped for her door, leaving the book on the floor.

You have guessed it. She reached the top of the stairs just as her farmer came up, three steps at a jump, and gathered her to him.

"Why haven't you written -- oh, why?" cried Sarah.

"New York is a rather large town," said Walter Franklin. "I came in, a week ago, to your old address. I found that you went away on a Thursday. I've hunted for you with the police and otherwise ever since!"

"I wrote to you," said Sarah, with force.

"Never got it!"

"Then how did you find me?"

The young farmer smiled a springtime smile. "I went into the Home Restaurant next door this evening," said he. "I don't care who knows it; I like a dish of some kind of greens at this time of the year. I ran my eye down that nice typewritten bill of fare looking for something like that. But when I looked, I turned my chair over and shouted for the owner. He told me where you lived."

"Why?"

"I'd know that capital W away above the line that your typewriter makes anywhere in the world," said Franklin.

The young man drew a bill of fare from his pocket, and pointed to a line.

She recognized the first card she had typed that afternoon. There was still a mark in the upper right-hand corner where a tear had fallen. But over the spot where one should have read the name of a certain plant, the memory of the golden flowers had allowed her fingers to strike strange keys.

Between two dishes on the list was the description:

"DEAREST WALTER, WITH HARD-BOILED EGG."

D. Another Experiment on a Biography: Mahatma Gandhi

The following text is the biography Mahatma Gandhi (adopted from the book Great Men and Women by Leavitt [1958]). The four graphs of LCP below are computed with a Hanning window 51 words long. The dotted lines in the graphs show scene boundaries of the text carefully marked by the intuition of the author of this thesis.

Mahatma Gandhi

On the evening of January 30, 1948, a little old

10

man

was slowly crossing the courtyard of his home on

20

his way

to prayers. Suddenly the sound of four gun-shots

30

was

heard, and the man fell to the ground. That

40

night his

great friend, Pandit Nehru, speaking on the radio

50

to the

people of India, said: \The light has gone

60

out of our lives

and everywhere it is dark." The

70

life-story of this little,

old and very great man, Mahatma

80

Gandhi, is one which

everyone should know.

Mohandas Gandhi was

90

born in a city in the west part

of India

100

on October 2, 1869. Mohandas was his �rst

name. The

110

word Mahatma means \Great Soul" and is

a title which

120

was given him later. For many years mem-

bers of the

130

Gandhi family had held important govern-

ment posts, and for a

140

long time the father of Mohandas

was chief o�cer in

150

one of the states of India. The father

was �ne

160

and brave man, and very good at his work.

The

170

son loved his father very much, and also his

mother

180

. His mother was very serious in her religion

and never

190

thought of beginning a meal without prayer.

At one time

200

she felt that her religion demanded that she

should not

210

eat until she saw the sun. It was then the

220

season of rain, and often the sun was not seen

230

for a long

time. Her children were much troubled and

240

spent long

hours looking up at the sky to be

250

able to hurry to tell

her that the sun was

260

shining and that she could eat.

0 200 400 600 800 1000 1200

0.2

0.3

0.4

0.5

0.6

0.7

0.8

LCP

words

In his later life

270

Gandhi wrote a book which tells us

many things about

280

his early years. In this book he

says that it

290

was not easy for him to make friends with

other

300

boys in school, and that his only companions were

his

310

books and his lessons. He used to run home from

320

school as soon as classes were over for fear that

330

someone

would talk to him or make fun of him

340

. As a little boy

he was very honest. One day

350

a small event concerned

with school games troubled him very

360

much. Because

he did not enjoy being with other boys

370

and also because

he wanted to help his father after

380

school, he did not like

to take part in school

390

games. He thought they were a

waste of time.

One

400

day when they had had their classes in the

morning

410

only, Mohandas was supposed to return to

school at four

420

o'clock for school games. He had no watch

and the

430

cloudy weather deceived him. He arrived late;

the games were

440

over; the boys had gone home. The

next day when

450

he explained to the head of the school

why he

460

was late, he was not believed. He, Mohandas

Gandhi, a

470

liar? No! No! But how could he prove that

he

480

was telling the truth? At this early age he began

490

to understand that a man of truth must also be

500

a care-

ful man. Carelessness often leads others to have wrong

510

ideas about a person.

Later Mohandas changed his mind about

520

the value

of games in the playground. Fortunately he had

530

read

in books that walking was a valuable exercise, and

540

while

still a boy began to take long walks in

550

the open air, a

form of exercise which he enjoyed

560

and carried on during

all his life.

He also says

570

in his book that his handwriting was

very poor, and

580

that he did nothing to improve it because

he believed

590

that it was not important. Later, when he

was in

600

South Africa, he saw the excellent handwriting

of lawyers and

610

young men of that country and became

ashamed of his

620

own. He saw that bad handwriting

should be considered a

630

weakness in a person. When he

then tried to improve

640

his own handwriting, he found it

was too late.

Mohandas

650

was married at the early age of thir-

teen, which in

660

India at that time was not thought to

be too

670

young. The oldest son of the family was already

married

680

, and the father and mother decided that the

second son

690

and the third son, Mohandas, together with

an uncle's son

700

, should all be married at the same time.

Marriages, with

710

their presents, dinners, �ne clothes and

all the rest, cost

720

the families a lot of money, and a mar-

riage of

730

all three together would save much. The young

wife of

740

Mohandas had never been to school. This early

marriage did

750

not help his lessons, and he lost a year

in

760

high school. Fortunately, by hard work he was later

able

770

to �nish two classes in one year.

Among his few

780

friends at school was a young man

whose character was

790

not very good. Mohandas knew

36

Page 38: Computing Lexical Cohesion as a Tool for Text Analysis Hideki

1000 1200 1400 1600 1800 2000 2200

0.2

0.3

0.4

0.5

0.6

0.7

0.8

LCP

words

this, but refused to accept

800

the advice of others and felt

that he would be

810

able to change the character of his

friend. The family

820

of Gandhi belonged to a religious

group which did not

830

believe in taking the life of any

creature, and so

840

the eating of meat was forbidden them.

But Mohandas's friend

850

set out to make him believe that

the eating of

860

meat was good for him. He explained it

in this

870

way: \We are a weak people. The English are

able

880

to rule over us because they eat meat. I myself

890

am strong and a �ne runner. It is because I

900

am a meat-

eater. You should eat meat, too. Eat some

910

and see what

strength it gives you." After a time

920

the young Mohan-

das partly believed his companion. He himself was

930

cer-

tainly not strong and could hardly jump or run. He

940

was

afraid of the dark, too, and always had a

950

light burning

in his bed-room at night. The desire to

960

eat meat was

great, even though he hated to deceive

970

his father and

mother. One day the two boys went

980

o� to a quiet place

by the river alone, and

990

there Mohandas tasted meat,

goat meat, for the �rst time

1000

. It made him sick. For

about a year after that

1010

, from time to time his friend ar-

ranged for him to

1020

eat meat. At last Mohandas stopped

completely, believing that nothing

1030

was worse than de-

ceiving his father and mother in this

1040

way. They never

learned of what he had done, but

1050

from that time on

through his whole life he never

1060

tasted meat again.

At about this time he and another

1070

young man began

to smoke, not because they really liked

1080

it but because

they thought that they got pleasure in

1090

blowing smoke

from their mouths like grown-up men. They had

1100

little

money to buy cigarettes, and the unsmoked ends of

1110

their uncle's cigarettes were not enough. So occasionally

they stole

1120

a little money from the servants in the house.

Mohandas

1130

soon gave up smoking, and came to feel that

it

1140

was dirty and harmful.

These actions of his troubled the

1150

young man Mo-

handas because he had determined to build his

1160

life on

truth, and he knew that in deceiving his

1170

father and

mother and breaking the rules laid down by

1180

his reli-

gion he was not honest. There was one more

1190

event of

the same kind. Once when �fteen years of

1200

age he stole

a small piece of gold from his

1210

older brother, and the

deed lay heavy on his mind

1220

. Finally he wrote out the

story of what he had

1230

done asking that he be punished

and promising that he

1240

never again would steal. Feel-

ing very much ashamed, he gave

1250

this letter to his good

father, then a sick man

1260

. The father read it carefully,

closed his eyes in thought

1270

, and the tears came. He

slowly tore up the letter

1280

. The boy had expected an-

gry words, and the sorrowful but

1290

loving feelings of the

father were never to be forgotten

1300

by the son.

At the age of eighteen Gandhi went

1310

to a college,

but remained for only part of the

1320

year. The lessons

did not interest him and he did

1330

not do well. Soon after

this he was advised to

1340

go to England to study to be a

lawyer. This

1350

would not be easy. It was di�cult for

him to

1360

leave India and to go to a foreign land where

1370

he would have to eat and drink with foreigners. This

1380

was against his religion, and most leaders of his group

1390

were against his going. Yet, in spite of all di�culties

1400

,

the young Mohandas, at the age of eighteen, sailed for

1410

England, leaving a wife and child behind.

On board ship

1420

he wore, for the �rst time, the new

foreign clothes

1430

provided by his friends. He wore his

black suit, carefully

1440

keeping his new white clothes until

he reached England. This

1450

was at the end of autumn,

and on landing he

1460

was much troubled to �nd he was

the only person

1470

so dressed. To make matter worse, he

could not get

1480

at his baggage to change his clothes. In

his own

1490

account of his early days in London, we �nd

two

1500

interesting events.

One of these was his di�culty in �nding

1510

suitable

food. Unlike most of the Indians in England, he

1520

fol-

lowed the rule of his religion and would not eat

1530

meat.

This was not easy, and he was often hungry

1540

at the

end of a meal. What was his joy

1550

when he discovered a

dining-place where no meat of any

1560

sort was served. He

learned for the �rst time that

1570

there were many people

in England who for health reasons

1580

ate no meat. It

pleased him to �nd science giving

1590

support to his re-

ligious beliefs. Later he found it easier

1600

to prepare

breakfasts and suppers in his own room, and

1610

to buy

his meals in the middle of the day

1620

.

The other event is one which later gave him and

1630

his friends much amusement. The young Indian tried to

\play

1640

the English gentleman". He decided that if he

could not

1650

eat like an Englishman, he would dress like

one and

1660

act like one in other ways. He bought new

clothes

1670

and a tall silk hat, and asked his brother to

1680

send him a gold watch-chain. Then he spent some time

1690

each morning dressing with care and brushing his thick

hair

1700

. Following the advice of friends, he took lessons

in dancing

1710

, French, playing a musical instrument and

speaking in public. But

1720

in these arts he did not do very

well, and

1730

his money was rapidly disappearing. At the

end of three

1740

months he saw that he was not making

the best

1750

use of his time, and gave up all this. He

1760

began to study law.

At this time also he became

1770

more interested in re-

ligions. When friends asked him to help

1780

them in their

understanding of the Gita, the holy book

1790

of his own

Hindu religion, he began to see how

1800

beautiful it was.

Before long it became for him the

1810

one book for the best

knowledge of Truth. Someone gave

1820

him a Bible, and

in it he found some teachings

1830

of Jesus which he liked

very much because they were

1840

so like certain ideas in the

Gita. Then from a

1850

reading of a book by the English

writer Carlyle, he

1860

learned about the Prophet Muham-

mad and about his greatness and

1870

bravery and simple

living. At this time he was beginning

1880

to learn that the

truth he loved was not to

1890

be found in any one religion

only.

After four years

1900

of study, young Gandhi passed his

law examinations and in

1910

1891 returned to India. When

he landed he was met

1920

by friends who told him of his

mother's death. This

1930

was an even greater shock to him

than the death

1940

of his father before he went to England.

The next

1950

few years were not happy ones. He found his

work

1960

as a lawyer not at all interesting, and came to

1970

feel that he was not �tted for this kind of

1980

occupation.

He had trouble on the one occasion when he

1990

was in

court. He almost fainted, and when his turn

2000

came to

speak he could not say a word. He

2010

would welcome a

37

Page 39: Computing Lexical Cohesion as a Tool for Text Analysis Hideki

2000 2200 2400 2600 2800 3000 3200

0.2

0.3

0.4

0.5

0.6

0.7

0.8

LCP

words

change. This came when he was invited

2020

to go to South

Africa to advise a rich Indian

2030

merchant who was trying

to collect a large amount of

2040

money from a member of

his family. We �nd him

2050

at the age of twenty-four in

Durban, South Africa.

Gandhi

2060

soon found that conditions among the many

Indians in South

2070

Africa were not at all right. He

learned this �rst

2080

when he went to court wearing foreign

clothes and a

2090

turban. He refused and left the court.

This turban was

2100

soon to become famous all over South

Africa. Most of

2110

the Indians who had left their own

land to look

2120

for work in Africa were considered of a

low rank

2130

and were known as \coolies". Gandhi was

thus a \coolie

2140

" lawyer.

A few days after he arrived, Gandhi was sent

2150

o�

to another city on business for his employer, Abdullah

2160

Sheth. When a white man travelling in the same train

2170

discovered him in a �rst-class seat he called a railway

2180

guard who ordered him to leave the �rst-class carriage.

Gandhi

2190

replied that he had bought a �rst-class ticket

and intended

2200

to use it. A policeman came and forced

him to

2210

leave the train. The next day something even

worse happened

2220

. While making a journey in a large

public carriage, he

2230

was given a seat outside with the

driver. During the

2240

journey the white man in charge

wanted his seat. When

2250

Gandhi refused to move, the

man struck him, but the

2260

other white people in the car-

riage made the man stop

2270

. When he reached the city

he drove to the main

2280

hotel, and there received another

shock. The hotel would not

2290

take him in. It was

events like these which made

2300

Gandhi feel that someone

was needed to help the Indians

2310

in Africa. He himself

was not proud, and he was

2320

not dependent upon a com-

fortable way of living. Later he

2330

accepted for himself

the simple living of the poorest Indians

2340

, and travelled

third-class in trains at all times. But it

2350

hurt him to

see the people of his country treated

2360

badly, and so he

continued to work against all attempts

2370

to treat him and

others in a way that was

2380

not fair and just.

After a time he came to

2390

feel that it would be un-

wise for the merchant who

2400

employed him to go to the

courts to get back

2410

the money that was owed him. As a

result of

2420

very hard work lasting months, he was able to

get

2430

the two merchants to agree outside of court upon

the

2440

amount of money to be paid and how it was

2450

to be paid. This success led him to believe that

2460

most

quarrels between people could be, and should be settled

2470

in a peaceful manner with the aid of friends.

During

2480

this year he met a number of Christians

who were

2490

eager that he should become a Christian and

Moslems who

2500

hoped that he would become a Moslem.

He read from

2510

the Bible and Koran and from books

about both religions

2520

. But at the same time he was

coming to enjoy

2530

and depend more and more upon the

holy books of

2540

hinduism and was coming to �nd for him-

self deep happiness

2550

and peace in them.

At the end of a year

2560

his work with Abdullah Sheth

was �nished and he planned

2570

to return to India. But

at a good-bye dinner given

2580

him in Durban he learned

that a law was being

2590

planned to take away from all

Indians still more of

2600

their rights. During the talk at

the dinner it was

2610

decided that Gandhi must remain in

South Africa and work

2620

for the rights of the Indians.

Thus began twenty years

2630

of hard work for the Indians

of South Africa.

At

2640

the end of three years he returned to India for

2650

several months, and then came back by ship with his

2660

wife and two children. While in India he had tried

2670

to

tell the people there how Indians were treated in

2680

South

Africa, and news of what he had spoken and

2690

written

had reached the white people living in natal before

2700

he

arrived. When he attempted to land he was recognized

2710

and cries of \Gandhi, Gandhi!" quickly brought a crowd

together

2720

. The crowd gathered around him, threw

stones and eggs at

2730

him and struck him. He was saved

by the courage

2740

of the wife of the English Chief of Po-

lice, who

2750

walked along him until policemen came to

his help. He

2760

was then able to escape from the angry

crowd by

2770

dressing himself as an Indian policeman and

slipping out of

2780

the back door while the Police Chief

held the crowd's

2790

attention in front.

It is not possible to describe all

2800

the events of the

years that Gandhi spent in South

2810

Africa serving his fel-

low Indians, and working to improve their

2820

conditions

and to make the government treat them more justly

2830

.

He gave up a position in which he was earning

2840

a lot of

money in order to join with the

2850

poor people for whom

he was working. In all his

2860

work his wife helped him,

and believed in him and

2870

gave him courage to go on.

From the struggle in [2880] South Africa he gained a strong belief in certain ways [2890] of action which were to be so important later in [2900] his own country. More and more he came to believe [2910] in a "soul-force". This was a struggle against evil and [2920] force, not by using hatred and force, but by love [2930] and by quietly refusing to obey unjust laws. Those who [2940] believed as he did and followed him would not work [2950] with the government or obey an unjust law. In the [2960] end there was little that the government could do about [2970] it. Gandhi was often put in prison, but his followers [2980] continued to carry on the work. When Gandhi left South [2990] Africa in 1914 very great improvements in the conditions of [3000] the Indians there had taken place.

Gandhi returned to India [3010] at the beginning of the First World War to find [3020] himself already recognized as a leader. His work in South [3030] Africa had been followed by the people, and he now [3040] was everywhere spoken of as "Mahatma" Gandhi. He settled down near [3050] Ahmedabad, where he started an Ashram, a religious group-home. People [3060] of any race or religion were invited to come and [3070] join him, if they were willing to make certain promises. [3080] These were: (1) always to speak the truth; (2) not [3090] to fight or hate other people; (3) to eat only [3100] what was necessary to keep them healthy; (4) not to [3110] own anything that was not necessary.

The Untouchables were the [3120] lowest rank in the Hindu religion; they were allowed to [3130] do only the lowest kind of work; but they were [3140] welcome in the Gandhi home. When a family of Untouchables [3150] did come to join the group, trouble arose. The neighbours [3160] threatened that they would have nothing to do with them, [3170] and the rich Hindus who were helping to support the [3180] home with money suddenly stopped giving.

[Figure: LCP plotted against word position, words 3000-4200 (LCP values 0.2-0.8).]

Gandhi was not troubled, [3190] but started making plans to move the whole group into [3200] the part of the city where the Untouchables live. He [3210] planned that they all would get their living by doing [3220] the low work that only Untouchables were allowed to do. [3230] While these plans were being made, the Mahatma was called [3240] aside by a Moslem merchant, who asked him if he [3250] would accept money from him for the help of the [3260] Ashram. The next day the man returned with a large [3270] amount of money, enough to keep the home going for [3280] a year. Gandhi said: "God has sent us help at [3290] the last moment." This event was the first of many [3300] which were to give the Untouchables a new place in [3310] Indian life. At this time, and for the rest of [3320] his life, the Mahatma was wearing the simple native clothing [3330] made of cotton cloth spun in a home.

Gandhi's great [3340] aim in life was to help to improve the conditions [3350] of poor and suffering people, and to aid his people [3360] in any way he could, but always without using force. [3370] He was against every sort of evil, no matter of [3380] what kind. When he tried to find out about the [3390] conditions among poor farm workers, the people crowded around him [3400] by the hundreds. A friend had come among them, someone [3410] who wanted to help them, and to them this was [3420] something new. When the police ordered Gandhi to leave the [3430] place, he refused, and in court he explained why he [3440] could not obey. Then he asked the court to punish [3450] him for breaking the law. The court did not know [3460] what to do with such a man, and so let [3470] him go free. This was the first step in what [3480] came to be an important and common event in many [3490] parts of India: to refuse to obey a law considered [3500] to be unjust, and at the same time calmly to [3510] accept any punishment that might be given.

Little by little [3520] the people of India came to understand what the Mahatma [3530] meant by fighting force with love, instead of fighting force [3540] with force. In 1930 there was the famous Salt March. [3550] According to the law, no one was allowed to make [3560] salt from sea water, but must buy it through the [3570] government. Gandhi considered that this was a bad and unjust [3580] law and so should not be obeyed. He said publicly [3590] that he would lead his followers to the sea, two [3600] hundred miles away, and there disobey the law. For three [3610] weeks, while the whole world watched and while conditions of [3620] India were troubled, the little old man, dressed in the [3630] white cotton which he had spun himself, walked steadily on. [3640] Crowds followed him, the people changing from village to village, [3650] on and on, until they reached the sea. There he [3660] made a handful of salt. God had given the sea; [3670] no government of man could keep it from the people. [3680] He was put in prison for a time, but not [3690] for long.

The struggle of the Indian people for self-government [3700] had begun. Gandhi wanted self-government, but he knew that Indians [3710] must show that they were ready for it. "Even God," [3720] he said, "cannot grant it; we must work for it [3730] and win it ourselves." He began to attack the British [3740] government in his writings because it was unwilling to free [3750] India, but he still believed in love and not hatred, [3760] and he set his face against the use of force. [3770] He was sent to prison several times because of what [3780] he said and what he did. When his followers did [3790] not obey him and used force, he went without food, [3800] sometimes for so long that he almost died. His followers [3810] grew in number and in strength. Crowds gathered to see [3820] him pass and to hear him speak. All India read [3830] what he wrote. Important leaders of India and other parts [3840] of the world came to talk with him about their [3850] plans, and to listen to his message of peace and [3860] love for the world. The struggle for self-government was long, [3870] and in the end success came. After long years an [3880] Act was passed making India a free nation. Everyone knew [3890] that the man who had done more than anyone else [3900] to bring this about was Gandhi.

But Gandhi was troubled [3910] in spite of his success. Such terrible quarrels had arisen [3920] between the Moslems and the Hindus that India had had [3930] to be divided between them, and there were now two [3940] countries: India for the Hindus and Pakistan for the Moslems. [3950] Gandhi so loved his country and so hated quarrels that [3960] this division made him very unhappy. Terrible things happened in [3970] many parts of India, especially where Hindus and Moslems lived [3980] side by side. Fighting between the two groups broke out, [3990] and men, women and children were killed. Hundreds of thousands [4000] of people were without homes and there was very great [4010] suffering. In the part of the country in which Gandhi [4020] was living, peace came sooner than in other parts of [4030] India, because Gandhi had said that he would refuse to [4040] eat until the fighting stopped. Both Hindus and Moslems respected [4050] him so much that they kept the peace.

But Gandhi's [4060] life was coming to its end. On January 30, 1948, [4070] he was walking slowly from his home to attend a [4080] prayer meeting. A young Hindu thought that Gandhi had done [4090] harm to the Hindus because he was friendly with the [4100] Moslems; he pushed his way through the crowd and shot [4110] Gandhi in the stomach. Some minutes later a man came [4120] out of the house into which the body had been [4130] carried and said to the waiting crowd: "Gandhi is dead!" [4140]

Another great Indian leader, Pandit Nehru, speaking over the radio [4150] that night, said: "The light has gone out of our [4160] lives and everywhere it is dark. The father of the [4170] nation is no more. The best prayer we can offer [4180] is to give ourselves to Truth and carry on the [4190] noble work for which he lived and for which he [4200] died." A few days later, following the custom of the [4210] Hindu religion, Mahatma Gandhi's body was burned in the presence [4220] of a great crowd, and later the ashes were scattered [4230] over the waters of the sacred rivers. So ended the [4240] life, but not the spirit, of one of the great [4250] men of the world.
